-
Application of Data-Driven Model Predictive Control for Autonomous Vehicle Steering
Authors:
Jiarui Zhang,
Aijing Kong,
Yu Tang,
Zhichao Lv,
Lulu Guo,
Peng Hang
Abstract:
With the development of autonomous driving technology, there are increasing demands for vehicle control, and MPC has become a widely researched topic in both industry and academia. Existing MPC methods based on vehicle kinematics or dynamics face challenges such as difficult modeling, numerous parameters, strong nonlinearity, and high computational cost. To address these issues, this paper adapts an existing data-driven MPC method and applies it to autonomous vehicle steering control. This method avoids the need for complex vehicle system modeling and achieves trajectory tracking with relatively low computational time and small errors. We validate the control effectiveness of the algorithm in a specific scenario through CarSim-Simulink simulation and perform a comparative analysis with PID and vehicle kinematics MPC, confirming its feasibility and superiority for vehicle steering control.
Submitted 18 July, 2024; v1 submitted 11 July, 2024;
originally announced July 2024.
-
AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection
Authors:
Anbai Jiang,
Bing Han,
Zhiqiang Lv,
Yufeng Deng,
Wei-Qiang Zhang,
Xie Chen,
Yanmin Qian,
Jia Liu,
Pingyi Fan
Abstract:
Large pre-trained models have demonstrated dominant performances in multiple areas, where the consistency between pre-training and fine-tuning is the key to success. However, few works reported satisfactory results of pre-trained models for the machine anomalous sound detection (ASD) task. This may be caused by the inconsistency of the pre-trained model and the inductive bias of machine audio, resulting in inconsistency in data and architecture. Thus, we propose AnoPatch which utilizes a ViT backbone pre-trained on AudioSet and fine-tunes it on machine audio. It is believed that machine audio is more related to audio datasets than speech datasets, and modeling it from patch level suits the sparsity of machine audio. As a result, AnoPatch showcases state-of-the-art (SOTA) performances on the DCASE 2020 ASD dataset and the DCASE 2023 ASD dataset. We also compare multiple pre-trained models and empirically demonstrate that better consistency yields considerable improvement.
Submitted 17 June, 2024;
originally announced June 2024.
-
Frequency-mix Knowledge Distillation for Fake Speech Detection
Authors:
Cunhang Fan,
Shunbo Dong,
Jun Xue,
Yujie Chen,
Jiangyan Yi,
Zhao Lv
Abstract:
In telephony scenarios, the fake speech detection (FSD) task to combat speech spoofing attacks is challenging. Data augmentation (DA) methods are considered effective means to address the FSD task in telephony scenarios, typically divided into time domain and frequency domain stages. While each has its advantages, both can result in information loss. To tackle this issue, we propose a novel DA method, Frequency-mix (Freqmix), and introduce the Freqmix knowledge distillation (FKD) to enhance model information extraction and generalization abilities. Specifically, we use Freqmix-enhanced data as input for the teacher model, while the student model's input undergoes a time-domain DA method. We use a multi-level feature distillation approach to restore information and improve the model's generalization capabilities. Our approach achieves state-of-the-art results on the ASVspoof 2021 LA dataset, showing a 31% improvement over the baseline, and performs competitively on the ASVspoof 2021 DF dataset.
Submitted 13 June, 2024;
originally announced June 2024.
-
Unpaired MRI Super Resolution with Contrastive Learning
Authors:
Hao Li,
Quanwei Liu,
Jianan Liu,
Xiling Liu,
Yanni Dong,
Tao Huang,
Zhihan Lv
Abstract:
Magnetic resonance imaging (MRI) is crucial for enhancing diagnostic accuracy in clinical settings. However, the inherent long scan time of MRI restricts its widespread applicability. Deep learning-based image super-resolution (SR) methods exhibit promise in improving MRI resolution without additional cost. Because aligned high-resolution (HR) and low-resolution (LR) MRI image pairs are lacking, unsupervised approaches are widely adopted for SR reconstruction with unpaired MRI images. However, these methods still require a substantial number of HR MRI images for training, which can be difficult to acquire. To this end, we propose an unpaired MRI SR approach that employs contrastive learning to enhance SR performance with limited HR training data. Empirical results presented in this study underscore significant enhancements in the peak signal-to-noise ratio and structural similarity index, even when a paucity of HR images is available. These findings accentuate the potential of our approach in addressing the challenge of limited HR training data, thereby contributing to the advancement of MRI in clinical applications.
Submitted 16 February, 2024; v1 submitted 24 October, 2023;
originally announced October 2023.
-
How do the resting EEG preprocessing states affect the outcomes of postprocessing?
Authors:
Shiang Hu,
Jie Ruan,
Juan Hou,
Pedro Antonio Valdes-Sosa,
Zhao Lv
Abstract:
Plenty of artifact removal tools and pipelines have been developed to correct the EEG recordings and discover the values below the waveforms. Without visual inspection by experts, it is easy to end up with improper preprocessing states, such as insufficiently preprocessed EEG (IPE) and excessively preprocessed EEG (EPE). However, little is known about the impacts of IPE or EPE on the postprocessing in the frequency, spatial and temporal domains, particularly as to the spectra and the functional connectivity (FC) analysis. Here, the clean EEG (CE) was synthesized as the ground truth based on the New-York head model and the multivariate autoregressive model. Later, the IPE and the EPE were simulated by injecting Gaussian noise and losing brain activities, respectively. Then, the impacts on postprocessing were quantified by the deviation of the IPE or EPE from the CE in terms of the 4 temporal statistics, the multichannel power, the cross spectra, the dispersion of source imaging, and the properties of the scalp EEG network. Lastly, an association analysis was performed between the PaLOSi metric and the varying trends of postprocessing with the evolution of preprocessing states. This study sheds light on how the postprocessing outcomes are affected by the preprocessing states, and PaLOSi may be a potentially effective quality metric.
Submitted 12 December, 2023; v1 submitted 22 October, 2023;
originally announced October 2023.
-
Spectral homogeneity cross frequencies can be a quality metric for the large-scale resting EEG preprocessing
Authors:
Shiang Hu,
Jie Ruan,
Nicolas Langer,
Jorge Bosch-Bayard,
Zhao Lv,
Dezhong Yao,
Pedro Antonio Valdes-Sosa
Abstract:
The brain projects require the collection of massive electrophysiological data, aimed at longitudinal, sectional, or populational neuroscience studies. Quality metrics automatically label the data after centralized preprocessing. However, although the waveform-based metrics are partially useful, they may be unreliable because they neglect the spectral profiles. Here, we detected the phenomenon of parallel log spectra (PaLOS), in which the scalp EEG power spectra in the log scale were parallel to each other, in 10% of 2549 HBN EEG. This phenomenon was reproduced in 8% of 412 PMDT EEG from 4 databases. We designed the PaLOS index (PaLOSi) to indicate this phenomenon by decomposing the cross-spectra at different frequencies into the common principal component spaces. We found that the PaLOS biophysically implied a prominently dominant dipole in the source space, which was implausible for the resting EEG and may in practice result from excessive preprocessing. Compared with the 1966 normative EEG cross-spectra, the HBN and the PMDT EEG with PaLOS presented generally much higher electrode pairwise coherences and higher similarity of coherence-based network patterns, which went against the known frequency-dependent characteristic of coherence networks. We suggest the PaLOSi should lie in the range of 0.4-0.7 for large resting EEG quality assurance.
Submitted 4 December, 2023; v1 submitted 18 October, 2023;
originally announced October 2023.
-
Dual-Branch Knowledge Distillation for Noise-Robust Synthetic Speech Detection
Authors:
Cunhang Fan,
Mingming Ding,
Jianhua Tao,
Ruibo Fu,
Jiangyan Yi,
Zhengqi Wen,
Zhao Lv
Abstract:
Most research in synthetic speech detection (SSD) focuses on improving performance on standard noise-free datasets. However, in actual situations, noise interference is usually present, causing significant performance degradation in SSD systems. To improve noise robustness, this paper proposes a dual-branch knowledge distillation synthetic speech detection (DKDSSD) method. Specifically, a parallel data flow of the clean teacher branch and the noisy student branch is designed, and interactive fusion module and response-based teacher-student paradigms are proposed to guide the training of noisy data from both the data distribution and decision-making perspectives. In the noisy student branch, speech enhancement is introduced initially for denoising, aiming to reduce the interference of strong noise. The proposed interactive fusion combines denoised features and noisy features to mitigate the impact of speech distortion and ensure consistency with the data distribution of the clean branch. The teacher-student paradigm maps the student's decision space to the teacher's decision space, enabling noisy speech to behave similarly to clean speech. Additionally, a joint training method is employed to optimize both branches for achieving global optimality. Experimental results based on multiple datasets demonstrate that the proposed method performs effectively in noisy environments and maintains its performance in cross-dataset experiments. Source code is available at https://github.com/fchest/DKDSSD.
Submitted 16 April, 2024; v1 submitted 13 October, 2023;
originally announced October 2023.
-
DGSD: Dynamical Graph Self-Distillation for EEG-Based Auditory Spatial Attention Detection
Authors:
Cunhang Fan,
Hongyu Zhang,
Wei Huang,
Jun Xue,
Jianhua Tao,
Jiangyan Yi,
Zhao Lv,
Xiaopei Wu
Abstract:
Auditory Attention Detection (AAD) aims to detect the target speaker from brain signals in a multi-speaker environment. Although EEG-based AAD methods have shown promising results in recent years, current approaches primarily rely on traditional convolutional neural networks designed for processing Euclidean data like images. This makes it challenging to handle EEG signals, which possess non-Euclidean characteristics. In order to address this problem, this paper proposes a dynamical graph self-distillation (DGSD) approach for AAD, which does not require speech stimuli as input. Specifically, to effectively represent the non-Euclidean properties of EEG signals, dynamical graph convolutional networks are applied to represent the graph structure of EEG signals, which can also extract crucial features related to auditory spatial attention in EEG signals. In addition, to further improve AAD detection performance, self-distillation, consisting of feature distillation and hierarchical distillation strategies at each layer, is integrated. These strategies leverage features and classification results from the deepest network layers to guide the learning of shallow layers. Our experiments are conducted on two publicly available datasets, KUL and DTU. Under a 1-second time window, we achieve results of 90.0% and 79.6% accuracy on KUL and DTU, respectively. We compare our DGSD method with competitive baselines, and the experimental results indicate that our proposed DGSD method not only outperforms the best reproducible baseline in detection performance but also reduces the number of trainable parameters by a factor of approximately 100.
Submitted 7 September, 2023;
originally announced September 2023.
-
Implicit Neural Representation for MRI Parallel Imaging Reconstruction
Authors:
Hao Li,
Yusheng Zhou,
Jianan Liu,
Xiling Liu,
Tao Huang,
Zhihan Lv,
Weidong Cai
Abstract:
Magnetic resonance imaging (MRI) usually faces lengthy acquisition times, prompting the exploration of strategies such as parallel imaging (PI) to alleviate this problem by periodically skipping specific K-space lines and subsequently reconstructing high-quality images from the undersampled K-space. Implicit neural representation (INR) has recently emerged as a promising deep learning technique, characterizing objects as continuous functions of spatial coordinates typically parameterized by a multilayer perceptron (MLP). In this study, we propose a novel MRI PI reconstruction method that uses INR. Our approach represents reconstructed fully-sampled images as functions of voxel coordinates and prior feature vectors from undersampled images, addressing the generalization challenges of INR. Specifically, we introduce a scale-embedded encoder to generate scale-independent, voxel-specific features from MR images across various undersampling scales. These features are then concatenated with coordinate vectors to reconstruct fully-sampled MR images, facilitating multiple-scale reconstructions. To evaluate our method's performance, we conducted experiments using publicly available MRI datasets, comparing it with alternative reconstruction techniques. Our quantitative assessment demonstrates the superiority of our proposed method.
Submitted 10 April, 2024; v1 submitted 12 September, 2023;
originally announced September 2023.
-
Spatial Reconstructed Local Attention Res2Net with F0 Subband for Fake Speech Detection
Authors:
Cunhang Fan,
Jun Xue,
Jianhua Tao,
Jiangyan Yi,
Chenglong Wang,
Chengshi Zheng,
Zhao Lv
Abstract:
The rhythm of bonafide speech is often difficult to replicate, which causes the fundamental frequency (F0) of synthetic speech to differ significantly from that of real speech. It is expected that the F0 feature contains discriminative information for the fake speech detection (FSD) task. In this paper, we propose a novel F0 subband for FSD. In addition, to effectively model the F0 subband so as to improve the performance of FSD, the spatial reconstructed local attention Res2Net (SR-LA Res2Net) is proposed. Specifically, Res2Net is used as a backbone network to obtain multiscale information, and enhanced with a spatial reconstruction mechanism to avoid losing important information when the channel group is constantly superimposed. In addition, local attention is designed to make the model focus on the local information of the F0 subband. Experimental results on the ASVspoof 2019 LA dataset show that our proposed method obtains an equal error rate (EER) of 0.47% and a minimum tandem detection cost function (min t-DCF) of 0.0159, achieving state-of-the-art performance among all single systems.
Submitted 8 July, 2024; v1 submitted 19 August, 2023;
originally announced August 2023.
-
Graph Embedding Dynamic Feature-based Supervised Contrastive Learning of Transient Stability for Changing Power Grid Topologies
Authors:
Zijian Lv,
Xin Chen,
Zijian Feng
Abstract:
Accurate online transient stability prediction is critical for ensuring power system stability when facing disturbances. However, traditional transient stability analysis relies on time-domain simulations and cannot be quickly adapted to power grid topology changes. In order to vectorize high-dimensional power grid topological structure information into low-dimensional node-based graph embedding streaming data, the graph embedding dynamic feature (GEDF) has been proposed. The transient stability GEDF-based supervised contrastive learning (GEDF-SCL) model uses supervised contrastive learning to predict transient stability with GEDFs, considering power grid topology information. To evaluate the performance of the proposed GEDF-SCL model, power grids of varying topologies were generated based on the IEEE 39-bus system model. Transient operational data was obtained by simulating N-1 and N-$\bm{m}$-1 contingencies on these generated power system topologies. Test results demonstrated that the GEDF-SCL model can achieve high accuracy in transient stability prediction and adapt well to changing power grid topologies.
Submitted 1 August, 2023;
originally announced August 2023.
-
Multi-perspective Information Fusion Res2Net with RandomSpecmix for Fake Speech Detection
Authors:
Shunbo Dong,
Jun Xue,
Cunhang Fan,
Kang Zhu,
Yujie Chen,
Zhao Lv
Abstract:
In this paper, we propose the multi-perspective information fusion (MPIF) Res2Net with random Specmix for fake speech detection (FSD). The main purpose of this system is to improve the model's ability to learn precise forgery information for FSD task in low-quality scenarios. The task of random Specmix, a data augmentation, is to improve the generalization ability of the model and enhance the model's ability to locate discriminative information. Specmix cuts and pastes the frequency dimension information of the spectrogram in the same batch of samples without introducing other data, which helps the model to locate the really useful information. At the same time, we randomly select samples for augmentation to reduce the impact of data augmentation directly changing all the data. Once the purpose of helping the model to locate information is achieved, it is also important to reduce unnecessary information. The role of MPIF-Res2Net is to reduce redundant interference information. Deceptive information from a single perspective is always similar, so the model learning this similar information will produce redundant spoofing clues and interfere with truly discriminative information. The proposed MPIF-Res2Net fuses information from different perspectives, making the information learned by the model more diverse, thereby reducing the redundancy caused by similar information and avoiding interference with the learning of discriminative information. The results on the ASVspoof 2021 LA dataset demonstrate the effectiveness of our proposed method, achieving EER and min-tDCF of 3.29% and 0.2557, respectively.
Submitted 27 June, 2023;
originally announced June 2023.
-
Transferable Deep Learning Power System Short-Term Voltage Stability Assessment with Physics-Informed Topological Feature Engineering
Authors:
Zijian Feng,
Xin Chen,
Zijian Lv,
Peiyuan Sun,
Kai Wu
Abstract:
Deep learning (DL) algorithms have been widely applied to short-term voltage stability (STVS) assessment in power systems. However, transferring the knowledge learned in one power grid to other power grids with topology changes is still a challenging task. This paper proposes a transferable DL-based model for STVS assessment by constructing topology-aware voltage dynamic features from raw PMU data. Since the reactive power flow and grid topology are essential to voltage stability, the topology-aware and physics-informed voltage dynamic features are utilized to effectively represent the topological and temporal patterns from post-disturbance system dynamic trajectories. The proposed DL-based STVS assessment model is tested under random operating conditions on the New England 39-bus system. It achieves 99.99% classification accuracy for the short-term voltage stability status using the topology-aware and physics-informed voltage dynamic features. In addition to high accuracy, the experiments show good adaptability to PMU errors. Moreover, the proposed STVS assessment method has outstanding performance on new grid topologies after fine-tuning. In particular, the highest accuracy reaches 99.68% in evaluation, which demonstrates a good knowledge transfer ability of the proposed model for power grid topology change.
Submitted 13 March, 2023;
originally announced March 2023.
-
Learning From Yourself: A Self-Distillation Method for Fake Speech Detection
Authors:
Jun Xue,
Cunhang Fan,
Jiangyan Yi,
Chenglong Wang,
Zhengqi Wen,
Dan Zhang,
Zhao Lv
Abstract:
In this paper, we propose a novel self-distillation method for fake speech detection (FSD), which can significantly improve the performance of FSD without increasing the model complexity. For FSD, some fine-grained information is very important, such as spectrogram defects, mute segments, and so on, which are often perceived by shallow networks. However, shallow networks are noisy and cannot capture this information well. To address this problem, we propose using the deepest network to instruct the shallow networks in order to enhance them. Specifically, the networks of FSD are divided into several segments, the deepest network being used as the teacher model, and all shallow networks become multiple student models by adding classifiers. Meanwhile, the distillation path between the deepest network feature and shallow network features is used to reduce the feature difference. A series of experimental results on the ASVspoof 2019 LA and PA datasets show the effectiveness of the proposed method, with significant improvements compared to the baseline.
Submitted 2 March, 2023;
originally announced March 2023.
-
Explicit Abnormality Extraction for Unsupervised Motion Artifact Reduction in Magnetic Resonance Imaging
Authors:
Yusheng Zhou,
Hao Li,
Jianan Liu,
Zhengmin Kong,
Tao Huang,
Euijoon Ahn,
Zhihan Lv,
Jinman Kim,
David Dagan Feng
Abstract:
Motion artifacts compromise the quality of magnetic resonance imaging (MRI) and pose challenges to achieving diagnostic outcomes and image-guided therapies. In recent years, supervised deep learning approaches have emerged as successful solutions for motion artifact reduction (MAR). One disadvantage of these methods is their dependency on acquiring paired sets of motion artifact-corrupted (MA-corrupted) and motion artifact-free (MA-free) MR images for training purposes. Obtaining such image pairs is difficult and therefore limits the application of supervised training. In this paper, we propose a novel UNsupervised Abnormality Extraction Network (UNAEN) to alleviate this problem. Our network is capable of working with unpaired MA-corrupted and MA-free images. It converts the MA-corrupted images to MA-reduced images by extracting abnormalities from the MA-corrupted images using a proposed artifact extractor, which intercepts the residual artifact maps from the MA-corrupted MR images explicitly, and a reconstructor to restore the original input from the MA-reduced images. The performance of UNAEN was assessed by experimenting with various publicly available MRI datasets and comparing them with state-of-the-art methods. The quantitative evaluation demonstrates the superiority of UNAEN over alternative MAR methods and visually exhibits fewer residual artifacts. Our results substantiate the potential of UNAEN as a promising solution applicable in real-world clinical environments, with the capability to enhance diagnostic accuracy and facilitate image-guided therapies. Our codes are publicly available at https://github.com/YuSheng-Zhou/UNAEN.
Submitted 14 August, 2024; v1 submitted 4 January, 2023;
originally announced January 2023.
-
Audio Deepfake Detection Based on a Combination of F0 Information and Real Plus Imaginary Spectrogram Features
Authors:
Jun Xue,
Cunhang Fan,
Zhao Lv,
Jianhua Tao,
Jiangyan Yi,
Chengshi Zheng,
Zhengqi Wen,
Minmin Yuan,
Shegang Shao
Abstract:
Recently, pioneering research works have proposed a large number of acoustic features (log power spectrogram, linear frequency cepstral coefficients, constant Q cepstral coefficients, etc.) for audio deepfake detection, obtaining good performance, and showing that different subbands have different contributions to audio deepfake detection. However, this lacks an explanation of the specific information in the subband, and these features also lose information such as phase. Inspired by the mechanism of synthetic speech, the fundamental frequency (F0) information is used to improve the quality of synthetic speech, while the F0 of synthetic speech is still too average, which differs significantly from that of real speech. It is expected that F0 can be used as important information to discriminate between bonafide and fake speech, while this information cannot be used directly due to the irregular distribution of F0. Instead, the frequency band containing most of F0 is selected as the input feature. Meanwhile, to make full use of the phase and full-band information, we also propose to use real and imaginary spectrogram features as complementary input features and model the disjoint subbands separately. Finally, the results of F0, real and imaginary spectrogram features are fused. Experimental results on the ASVspoof 2019 LA dataset show that our proposed system is very effective for the audio deepfake detection task, achieving an equal error rate (EER) of 0.43%, which surpasses almost all systems.
Submitted 1 August, 2022;
originally announced August 2022.
-
Pervasive wireless channel modeling theory and applications to 6G GBSMs for all frequency bands and all scenarios
Authors:
Cheng-Xiang Wang,
Zhen Lv,
Xiqi Gao,
Xiaohu You,
Yang Hao,
Harald Haas
Abstract:
In this paper, a pervasive wireless channel modeling theory is first proposed, which uses a unified channel modeling method and a unified equation of channel impulse response (CIR), and can integrate important channel characteristics at different frequency bands and scenarios. Then, we apply the proposed theory to a three-dimensional (3D) space-time-frequency (STF) non-stationary geometry-based stochastic model (GBSM) for sixth generation (6G) wireless communication systems. The proposed 6G pervasive channel model (6GPCM) can characterize statistical properties of channels at all frequency bands from sub-6 GHz to visible light communication (VLC) bands and in all scenarios, such as unmanned aerial vehicle (UAV), maritime, (ultra-)massive multiple-input multiple-output (MIMO), reconfigurable intelligent surface (RIS), and industrial Internet of Things (IIoT) scenarios. By adjusting channel model parameters, the 6GPCM can be reduced to various simplified channel models for specific frequency bands and scenarios. It also includes standard fifth generation (5G) channel models as special cases. In addition, key statistical properties of the proposed 6GPCM are derived, simulated, and verified by various channel measurement results, which clearly demonstrates its accuracy, pervasiveness, and applicability.
Submitted 6 June, 2022;
originally announced June 2022.
-
MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification
Authors:
Yang Zhang,
Zhiqiang Lv,
Haibin Wu,
Shanshan Zhang,
Pengfei Hu,
Zhiyong Wu,
Hung-yi Lee,
Helen Meng
Abstract:
In this paper, we present Multi-scale Feature Aggregation Conformer (MFA-Conformer), an easy-to-implement, simple but effective backbone for automatic speaker verification based on the Convolution-augmented Transformer (Conformer). The architecture of the MFA-Conformer is inspired by recent state-of-the-art models in speech recognition and speaker verification. Firstly, we introduce a convolution subsampling layer to decrease the computational cost of the model. Secondly, we adopt Conformer blocks, which combine Transformers and convolutional neural networks (CNNs), to capture global and local features effectively. Finally, the output feature maps from all Conformer blocks are concatenated to aggregate multi-scale representations before final pooling. We evaluate the MFA-Conformer on widely used benchmarks. The best system obtains 0.64%, 1.29% and 1.63% EER on the VoxCeleb1-O, SITW.Dev, and SITW.Eval sets, respectively. MFA-Conformer significantly outperforms the popular ECAPA-TDNN systems in both recognition performance and inference speed. Last but not least, the ablation studies clearly demonstrate that the combination of global and local feature learning can lead to robust and accurate speaker embedding extraction. We have also released the code for future comparison.
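The aggregation step described in this abstract, concatenating the output feature maps of all Conformer blocks before pooling, can be illustrated with a small sketch. This is not the released MFA-Conformer code: the helper names `aggregate_multiscale` and `mean_pool` are hypothetical, plain nested lists stand in for tensors, and simple mean pooling over time is an assumed stand-in for whatever pooling layer the paper actually uses.

```python
def aggregate_multiscale(block_outputs):
    """Concatenate per-frame features from every block along the feature axis.

    block_outputs: list of L block outputs, each a list of T frames,
    each frame a list of D floats. Returns T frames of L * D floats.
    """
    num_frames = len(block_outputs[0])
    # All blocks must share the same time resolution to be concatenated.
    assert all(len(out) == num_frames for out in block_outputs)
    return [
        [value for out in block_outputs for value in out[t]]
        for t in range(num_frames)
    ]


def mean_pool(frames):
    """Average over time to obtain one utterance-level vector (assumed pooling)."""
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


# Two hypothetical block outputs with T = 2 frames and D = 2 features each.
block_1 = [[1.0, 2.0], [3.0, 4.0]]
block_2 = [[10.0, 20.0], [30.0, 40.0]]
aggregated = aggregate_multiscale([block_1, block_2])  # 2 frames x 4 features
embedding = mean_pool(aggregated)
```

Concatenation keeps the features from shallow and deep blocks side by side, so the pooling layer sees every scale at once instead of only the last block's output.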
Submitted 10 November, 2022; v1 submitted 29 March, 2022;
originally announced March 2022.
-
VRM-Phase I VKW system description of long-short video customizable keyword wakeup challenge
Authors:
Yougen Yuan,
Zhiqiang Lv,
Shen Huang,
Pengfei Hu
Abstract:
Keyword wakeup technology has always been a research hotspot in speech processing, but many related works were done on different datasets. We organized a Chinese long-short video keyword wakeup challenge (Video Keyword Wakeup Challenge, VKW) to test the ability of each participating team to build a keyword wakeup system on the public dataset. All submitted systems not only need to support the setting of multiple different keywords, but also need to support the wakeup of any customized keyword. This paper mainly describes the basic situation of the VKW challenge and the experimental results of some participating teams.
Submitted 18 October, 2021;
originally announced October 2021.
-
Stochastic Dispatch of Energy Storage in Microgrids: An Augmented Reinforcement Learning Approach
Authors:
Yuwei Shang,
Wenchuan Wu,
Jianbo Guo,
Zhe Lv,
Zhao Ma,
Wanxing Sheng,
Ran Chen
Abstract:
The dynamic dispatch (DD) of battery energy storage systems (BESSs) in microgrids integrated with volatile energy resources is essentially a multiperiod stochastic optimization problem (MSOP). Because the life span of a BESS is significantly affected by its charging and discharging behaviors, its lifecycle degradation costs should be incorporated into the DD model of BESSs, which makes the model non-convex. In general, this MSOP is intractable. To solve this problem, we propose a reinforcement learning (RL) solution augmented with Monte-Carlo tree search (MCTS) and domain knowledge expressed as dispatching rules. In this solution, Q-learning with function approximation is employed as the basic learning architecture, which allows multistep bootstrapping and continuous policy learning. To improve the computational efficiency of randomized multistep simulations, we employ MCTS to estimate the expected maximum action values. Moreover, we embed a few dispatching rules in RL as probabilistic logics to reduce infeasible action explorations, which can improve the quality of the data-driven solution. Numerical test results show that the proposed algorithm outperforms other baseline RL algorithms in all cases tested.
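As a rough illustration of the "Q-learning with function approximation" named in this abstract, the sketch below performs a single one-step semi-gradient update for a linear action-value function. It is deliberately simpler than the paper's method: it omits the multistep bootstrapping, the MCTS estimate of the maximum action value, and the probabilistic dispatching rules. The function names, the feature vectors, and the default `alpha` and `gamma` values are assumptions made for illustration.

```python
def q_value(weights, features):
    """Linear action value: Q(s, a) = w . phi(s, a)."""
    return sum(w * x for w, x in zip(weights, features))


def q_learning_update(weights, features, reward, next_features_by_action,
                      alpha=0.1, gamma=0.95):
    """One semi-gradient Q-learning step for a linear Q-function.

    next_features_by_action holds phi(s', a') for each candidate next action;
    an empty list marks a terminal state. In this sketch the caller is assumed
    to have already dropped infeasible actions (hypothetical rule filter).
    """
    # Bootstrapped target uses the best candidate next action (0 if terminal).
    next_best = max((q_value(weights, f) for f in next_features_by_action),
                    default=0.0)
    td_error = reward + gamma * next_best - q_value(weights, features)
    # For a linear model the gradient of Q with respect to w is phi(s, a).
    return [w + alpha * td_error * x for w, x in zip(weights, features)]


# Terminal transition with reward 1 and hypothetical two-dimensional features.
updated = q_learning_update([0.0, 0.0], [1.0, 0.0], 1.0, [])
```

Restricting `next_features_by_action` to feasible actions is one simple way to keep the bootstrapped target from being driven by actions the dispatch rules would never allow.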
Submitted 4 July, 2020; v1 submitted 10 October, 2019;
originally announced October 2019.