-
SVDD 2024: The Inaugural Singing Voice Deepfake Detection Challenge
Authors:
You Zhang,
Yongyi Zang,
Jiatong Shi,
Ryuichi Yamamoto,
Tomoki Toda,
Zhiyao Duan
Abstract:
With the advancements in singing voice generation and the growing presence of AI singers on media platforms, the inaugural Singing Voice Deepfake Detection (SVDD) Challenge aims to advance research in identifying AI-generated singing voices from authentic singers. This challenge features two tracks: a controlled setting track (CtrSVDD) and an in-the-wild scenario track (WildSVDD). The CtrSVDD track utilizes publicly available singing vocal data to generate deepfakes using state-of-the-art singing voice synthesis and conversion systems. Meanwhile, the WildSVDD track expands upon the existing SingFake dataset, which includes data sourced from popular user-generated content websites. For the CtrSVDD track, we received submissions from 47 teams, with 37 surpassing our baselines and the top team achieving a 1.65% equal error rate. For the WildSVDD track, we benchmarked the baselines. This paper reviews these results, discusses key findings, and outlines future directions for SVDD research.
Submitted 28 August, 2024;
originally announced August 2024.
-
Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition
Authors:
Samuele Cornell,
Jordan Darefsky,
Zhiyao Duan,
Shinji Watanabe
Abstract:
Currently, a common approach in many speech processing tasks is to leverage large scale pre-trained models by fine-tuning them on in-domain data for a particular application. Yet obtaining even a small amount of such data can be problematic, especially for sensitive domains and conversational speech scenarios, due to both privacy issues and annotation costs. To address this, synthetic data generation using single speaker datasets has been employed. Yet, for multi-speaker cases, such an approach often requires extensive manual effort and is prone to domain mismatches. In this work, we propose a synthetic data generation pipeline for multi-speaker conversational ASR, leveraging a large language model (LLM) for content creation and a conversational multi-speaker text-to-speech (TTS) model for speech synthesis. We conduct evaluation by fine-tuning the Whisper ASR model for telephone and distant conversational speech settings, using both in-domain data and generated synthetic data. Our results show that the proposed method is able to significantly outperform classical multi-speaker generation approaches that use external, non-conversational speech datasets.
Submitted 17 August, 2024;
originally announced August 2024.
-
Designing Consensus-Based Distributed Filtering over Directed Graphs
Authors:
Xiaoxu Lyu,
Guanghui Wen,
Yuezu Lv,
Zhisheng Duan,
Ling Shi
Abstract:
This paper proposes a novel consensus-on-only-measurement distributed filter over directed graphs under the collective observability condition. First, the distributed filter structure is designed with an augmented leader-following measurement fusion strategy. Subsequently, two parameter design methods are presented, and the consensus gain parameter is devised utilizing local information exclusively rather than global information. Additionally, the lower bound of the fusion step is derived to guarantee a uniform upper bound of the estimation error covariance. Moreover, the lower bounds of the convergence rates of the steady-state performance gap between the proposed algorithm and the centralized filter are provided with the fusion step approaching infinity. The analysis demonstrates that the convergence rate is at least as fast as exponential convergence under the spectral norm condition of the communication graph. The transient performance is also analyzed with the fusion step tending to infinity. The inherent trade-off between the communication cost and the filtering performance is revealed from the analysis of the steady-state performance and the transient performance. Finally, the theoretical results are validated through two simulation examples.
Submitted 13 August, 2024;
originally announced August 2024.
-
Performance Analysis of Distributed Filtering under Mismatched Noise Covariances
Authors:
Xiaoxu Lyu,
Guanghui Wen,
Ling Shi,
Peihu Duan,
Zhisheng Duan
Abstract:
This paper systematically investigates the performance of consensus-based distributed filtering under mismatched noise covariances. First, we introduce three performance evaluation indices for such filtering problems, namely the standard performance evaluation index, the nominal performance evaluation index, and the estimation error covariance. We derive difference expressions among these indices and establish one-step relations among them under various mismatched noise covariance scenarios. We particularly reveal the effect of the consensus fusion on these relations. Furthermore, the recursive relations are introduced by extending the results of the one-step relations. Subsequently, we demonstrate the convergence of these indices under the collective observability condition, and show that this convergence condition of the nominal performance evaluation index can guarantee the convergence of the estimation error covariance. Additionally, we prove that the estimation error covariance of the consensus-based distributed filter under mismatched noise covariances can be bounded by the Frobenius norms of the noise covariance deviations and the trace of the nominal performance evaluation index. Finally, the effectiveness of the theoretical results is verified by numerical simulations.
Submitted 13 August, 2024;
originally announced August 2024.
-
A Multi-Stream Fusion Approach with One-Class Learning for Audio-Visual Deepfake Detection
Authors:
Kyungbok Lee,
You Zhang,
Zhiyao Duan
Abstract:
This paper addresses the challenge of developing a robust audio-visual deepfake detection model. In practical use cases, new generation algorithms are continually emerging, and these algorithms are not encountered during the development of detection methods. This calls for the generalization ability of the method. Additionally, to ensure the credibility of detection methods, it is beneficial for the model to interpret which cues from the video indicate it is fake. Motivated by these considerations, we then propose a multi-stream fusion approach with one-class learning as a representation-level regularization technique. We study the generalization problem of audio-visual deepfake detection by creating a new benchmark by extending and re-splitting the existing FakeAVCeleb dataset. The benchmark contains four categories of fake videos (Real Audio-Fake Visual, Fake Audio-Fake Visual, Fake Audio-Real Visual, and Unsynchronized videos). The experimental results demonstrate that our approach surpasses the previous models by a large margin. Furthermore, our proposed framework offers interpretability, indicating which modality the model identifies as more likely to be fake. The source code is released at https://github.com/bok-bok/MSOC.
Submitted 19 August, 2024; v1 submitted 20 June, 2024;
originally announced June 2024.
-
GTR-Voice: Articulatory Phonetics Informed Controllable Expressive Speech Synthesis
Authors:
Zehua Kcriss Li,
Meiying Melissa Chen,
Yi Zhong,
Pinxin Liu,
Zhiyao Duan
Abstract:
Expressive speech synthesis aims to generate speech that captures a wide range of para-linguistic features, including emotion and articulation, though current research primarily emphasizes emotional aspects over the nuanced articulatory features mastered by professional voice actors. Inspired by this, we explore expressive speech synthesis through the lens of articulatory phonetics. Specifically, we define a framework with three dimensions: Glottalization, Tenseness, and Resonance (GTR), to guide the synthesis at the voice production level. With this framework, we record a high-quality speech dataset named GTR-Voice, featuring 20 Chinese sentences articulated by a professional voice actor across 125 distinct GTR combinations. We verify the framework and GTR annotations through automatic classification and listening tests, and demonstrate precise controllability along the GTR dimensions on two fine-tuned expressive TTS models. We open-source the dataset and TTS models.
Submitted 15 June, 2024;
originally announced June 2024.
-
On Efficient Neural Network Architectures for Image Compression
Authors:
Yichi Zhang,
Zhihao Duan,
Fengqing Zhu
Abstract:
Recent advances in learning-based image compression typically come at the cost of high complexity. Designing computationally efficient architectures remains an open challenge. In this paper, we empirically investigate the impact of different network designs in terms of rate-distortion performance and computational complexity. Our experiments involve testing various transforms, including convolutional neural networks and transformers, as well as various context models, including hierarchical, channel-wise, and space-channel context models. Based on the results, we present a series of efficient models, the final model of which has comparable performance to recent best-performing methods but with significantly lower complexity. Extensive experiments provide insights into the design of architectures for learned image compression and potential directions for future research. The code is available at https://gitlab.com/viper-purdue/efficient-compression.
Submitted 14 June, 2024;
originally announced June 2024.
-
CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection
Authors:
Yongyi Zang,
Jiatong Shi,
You Zhang,
Ryuichi Yamamoto,
Jionghao Han,
Yuxun Tang,
Shengyuan Xu,
Wenxiao Zhao,
Jing Guo,
Tomoki Toda,
Zhiyao Duan
Abstract:
Recent singing voice synthesis and conversion advancements necessitate robust singing voice deepfake detection (SVDD) models. Current SVDD datasets face challenges due to limited controllability, diversity in deepfake methods, and licensing restrictions. Addressing these gaps, we introduce CtrSVDD, a large-scale, diverse collection of bonafide and deepfake singing vocals. These vocals are synthesized using state-of-the-art methods from publicly accessible singing voice datasets. CtrSVDD includes 47.64 hours of bonafide and 260.34 hours of deepfake singing vocals, spanning 14 deepfake methods and involving 164 singer identities. We also present a baseline system with flexible front-end features, evaluated against a structured train/dev/eval split. The experiments show the importance of feature selection and highlight a need for generalization towards deepfake methods that deviate further from training distribution. The CtrSVDD dataset and baselines are publicly accessible.
Submitted 18 June, 2024; v1 submitted 4 June, 2024;
originally announced June 2024.
-
SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan
Authors:
You Zhang,
Yongyi Zang,
Jiatong Shi,
Ryuichi Yamamoto,
Jionghao Han,
Yuxun Tang,
Tomoki Toda,
Zhiyao Duan
Abstract:
The rapid advancement of AI-generated singing voices, which now closely mimic natural human singing and align seamlessly with musical scores, has led to heightened concerns for artists and the music industry. Unlike spoken voice, singing voice presents unique challenges due to its musical nature and the presence of strong background music, making singing voice deepfake detection (SVDD) a specialized field requiring focused attention. To promote SVDD research, we recently proposed the "SVDD Challenge," the very first research challenge focusing on SVDD for lab-controlled and in-the-wild bonafide and deepfake singing voice recordings. The challenge will be held in conjunction with the 2024 IEEE Spoken Language Technology Workshop (SLT 2024).
Submitted 8 May, 2024;
originally announced May 2024.
-
Scoring Time Intervals using Non-Hierarchical Transformer For Automatic Piano Transcription
Authors:
Yujia Yan,
Zhiyao Duan
Abstract:
The neural semi-Markov Conditional Random Field (semi-CRF) framework has demonstrated promise for event-based piano transcription. In this framework, all events (notes or pedals) are represented as closed time intervals tied to specific event types. The neural semi-CRF approach requires an interval scoring matrix that assigns a score for every candidate interval. However, designing an efficient and expressive architecture for scoring intervals is not trivial. This paper introduces a simple method for scoring intervals using scaled inner product operations that resemble how attention scoring is done in transformers. We show theoretically that, due to the special structure from encoding the non-overlapping intervals, under a mild condition, the inner product operations are expressive enough to represent an ideal scoring matrix that can yield the correct transcription result. We then demonstrate that an encoder-only structured non-hierarchical transformer backbone, operating only on a low-time-resolution feature map, is capable of transcribing piano notes and pedals with high accuracy and time precision. The experiment shows that our approach achieves the new state-of-the-art performance across all subtasks in terms of the F1 measure on the Maestro dataset.
Submitted 11 August, 2024; v1 submitted 15 April, 2024;
originally announced April 2024.
-
Learning to Classify New Foods Incrementally Via Compressed Exemplars
Authors:
Justin Yang,
Zhihao Duan,
Jiangpeng He,
Fengqing Zhu
Abstract:
Food image classification systems play a crucial role in health monitoring and diet tracking through image-based dietary assessment techniques. However, existing food recognition systems rely on static datasets characterized by a pre-defined fixed number of food classes. This contrasts drastically with the reality of food consumption, which features constantly changing data. Therefore, food image classification systems should adapt to and manage data that continuously evolves. This is where continual learning plays an important role. A challenge in continual learning is catastrophic forgetting, where ML models tend to discard old knowledge upon learning new information. While memory-replay algorithms have shown promise in mitigating this problem by storing old data as exemplars, they are hampered by the limited capacity of memory buffers, leading to an imbalance between new and previously learned data. To address this, our work explores the use of neural image compression to extend buffer size and enhance data diversity. We introduce the concept of continuously learning a neural compression model to adaptively improve the quality of compressed data and optimize the bitrates per pixel (bpp) to store more exemplars. Our extensive experiments, which include evaluations on the food-specific datasets Food-101 and VFN-74 as well as the general dataset ImageNet-100, demonstrate improvements in classification accuracy. This progress is pivotal in advancing more realistic food recognition systems that are capable of adapting to continually evolving data. Moreover, the principles and methodologies we have developed hold promise for broader applications, extending their benefits to other domains of continual machine learning systems.
Submitted 11 April, 2024;
originally announced April 2024.
-
Flexible Variable-Rate Image Feature Compression for Edge-Cloud Systems
Authors:
Md Adnan Faisal Hossain,
Zhihao Duan,
Yuning Huang,
Fengqing Zhu
Abstract:
Feature compression is a promising direction for coding for machines. Existing methods have made substantial progress, but they require designing and training separate neural network models to meet different specifications of compression rate, performance accuracy and computational complexity. In this paper, a flexible variable-rate feature compression method is presented that can operate on a range of rates by introducing a rate control parameter as an input to the neural network model. By compressing different intermediate features of a pre-trained vision task model, the proposed method can scale the encoding complexity without changing the overall size of the model. The proposed method is more flexible than existing baselines, at the same time outperforming them in terms of the three-way trade-off between feature compression rate, vision task accuracy, and encoding complexity. We have made the source code available at https://github.com/adnan-hossain/var_feat_comp.git.
Submitted 30 March, 2024;
originally announced April 2024.
-
Theoretical Bound-Guided Hierarchical VAE for Neural Image Codecs
Authors:
Yichi Zhang,
Zhihao Duan,
Yuning Huang,
Fengqing Zhu
Abstract:
Recent studies reveal a significant theoretical link between variational autoencoders (VAEs) and rate-distortion theory, notably in utilizing VAEs to estimate the theoretical upper bound of the information rate-distortion function of images. Such estimated theoretical bounds substantially exceed the performance of existing neural image codecs (NICs). To narrow this gap, we propose a theoretical bound-guided hierarchical VAE (BG-VAE) for NIC. The proposed BG-VAE leverages the theoretical bound to guide the NIC model towards enhanced performance. We implement the BG-VAE using Hierarchical VAEs and demonstrate its effectiveness through extensive experiments. Along with advanced neural network blocks, we provide a versatile, variable-rate NIC that outperforms existing methods when considering both rate-distortion performance and computational complexity. The code is available at BG-VAE.
Submitted 27 March, 2024;
originally announced March 2024.
-
MusicHiFi: Fast High-Fidelity Stereo Vocoding
Authors:
Ge Zhu,
Juan-Pablo Caceres,
Zhiyao Duan,
Nicholas J. Bryan
Abstract:
Diffusion-based audio and music generation models commonly perform generation by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16-24 kHz), which limits their usefulness. We propose MusicHiFi, an efficient high-fidelity stereophonic vocoder. Our method employs a cascade of three generative adversarial networks (GANs) that convert low-resolution mel-spectrograms to audio, upsample to high-resolution audio via bandwidth extension, and upmix to stereophonic audio. Compared to past work, we propose 1) a unified GAN-based generator and discriminator architecture and training procedure for each stage of our cascade, 2) a new fast, near downsampling-compatible bandwidth extension module, and 3) a new fast downmix-compatible mono-to-stereo upmixer that ensures the preservation of monophonic content in the output. We evaluate our approach using objective and subjective listening tests and find our approach yields comparable or better audio quality, better spatialization control, and significantly faster inference speed compared to past work. Sound examples are at https://MusicHiFi.github.io/web/.
Submitted 8 July, 2024; v1 submitted 15 March, 2024;
originally announced March 2024.
-
Towards Backward-Compatible Continual Learning of Image Compression
Authors:
Zhihao Duan,
Ming Lu,
Justin Yang,
Jiangpeng He,
Zhan Ma,
Fengqing Zhu
Abstract:
This paper explores the possibility of extending the capability of pre-trained neural image compressors (e.g., adapting to new data or target bitrates) without breaking backward compatibility, the ability to decode bitstreams encoded by the original model. We refer to this problem as continual learning of image compression. Our initial findings show that baseline solutions, such as end-to-end fine-tuning, do not preserve the desired backward compatibility. To tackle this, we propose a knowledge replay training strategy that effectively addresses this issue. We also design a new model architecture that enables more effective continual learning than existing baselines. Experiments are conducted for two scenarios: data-incremental learning and rate-incremental learning. The main conclusion of this paper is that neural image compressors can be fine-tuned to achieve better performance (compared to their pre-trained version) on new data and rates without compromising backward compatibility. Our code is available at https://gitlab.com/viper-purdue/continual-compression
Submitted 29 February, 2024;
originally announced February 2024.
-
Toward Fully Self-Supervised Multi-Pitch Estimation
Authors:
Frank Cwitkowitz,
Zhiyao Duan
Abstract:
Multi-pitch estimation is a decades-long research problem involving the detection of pitch activity associated with concurrent musical events within multi-instrument mixtures. Supervised learning techniques have demonstrated solid performance on more narrow characterizations of the task, but suffer from limitations concerning the shortage of large-scale and diverse polyphonic music datasets with multi-pitch annotations. We present a suite of self-supervised learning objectives for multi-pitch estimation, which encourage the concentration of support around harmonics, invariance to timbral transformations, and equivariance to geometric transformations. These objectives are sufficient to train an entirely convolutional autoencoder to produce multi-pitch salience-grams directly, without any fine-tuning. Despite training exclusively on a collection of synthetic single-note audio samples, our fully self-supervised framework generalizes to polyphonic music mixtures, and achieves performance comparable to supervised models trained on conventional multi-pitch datasets.
Submitted 23 February, 2024;
originally announced February 2024.
-
Cacophony: An Improved Contrastive Audio-Text Model
Authors:
Ge Zhu,
Jordan Darefsky,
Zhiyao Duan
Abstract:
Despite recent advancements in audio-text modeling, audio-text contrastive models still lag behind their image-text counterparts in scale and performance. We propose a method to improve both the scale and the training of audio-text contrastive models. Specifically, we craft a large-scale audio-text dataset containing 13,000 hours of text-labeled audio, using pretrained language models to process noisy text descriptions and automatic captioning to obtain text descriptions for unlabeled audio samples. We first train on audio-only data with a masked autoencoder (MAE) objective, which allows us to benefit from the scalability of unlabeled audio datasets. We then, initializing our audio encoder from the MAE model, train a contrastive model with an auxiliary captioning objective. Our final model, which we name Cacophony, achieves state-of-the-art performance on audio-text retrieval tasks, and exhibits competitive results on the HEAR benchmark and other downstream tasks such as zero-shot classification.
Submitted 29 April, 2024; v1 submitted 10 February, 2024;
originally announced February 2024.
-
Another Way to the Top: Exploit Contextual Clustering in Learned Image Coding
Authors:
Yichi Zhang,
Zhihao Duan,
Ming Lu,
Dandan Ding,
Fengqing Zhu,
Zhan Ma
Abstract:
While convolution and self-attention are extensively used in learned image compression (LIC) for transform coding, this paper proposes an alternative called Contextual Clustering based LIC (CLIC) which primarily relies on clustering operations and local attention for correlation characterization and compact representation of an image. As seen, CLIC expands the receptive field into the entire image for intra-cluster feature aggregation. Afterward, features are reordered to their original spatial positions to pass through the local attention units for inter-cluster embedding. Additionally, we introduce the Guided Post-Quantization Filtering (GuidedPQF) into CLIC, effectively mitigating the propagation and accumulation of quantization errors at the initial decoding stage. Extensive experiments demonstrate the superior performance of CLIC over state-of-the-art works: when optimized using MSE, it outperforms VVC by about 10% BD-Rate in three widely-used benchmark datasets; when optimized using MS-SSIM, it saves more than 50% BD-Rate over VVC. Our CLIC offers a new way to generate compact representations for image compression, which also provides a novel direction along the line of LIC development.
Submitted 21 January, 2024;
originally announced January 2024.
-
Data-driven Dynamic Event-triggered Control
Authors:
Tao Xu,
Zhiyong Sun,
Guanghui Wen,
Zhisheng Duan
Abstract:
This paper revisits the event-triggered control problem from a data-driven perspective, where unknown continuous-time linear systems subject to disturbances are taken into account. Using data information collected off-line instead of accurate system model information, a data-driven dynamic event-triggered control scheme is developed in this paper. The dynamic property is reflected in the fact that the designed event-triggering function embedded in the event-triggering mechanism (ETM) is dynamically updated as a whole. Thanks to this dynamic design, a strictly positive minimum inter-event time (MIET) is guaranteed without sacrificing control performance. Specifically, exponential input-to-state stability (ISS) of the closed-loop system with respect to disturbances is achieved in this paper, which is superior to some existing results that only guarantee a practical exponential ISS property. The dynamic ETM is easy to implement in practical operation since all designed parameters are determined only by a simple data-driven linear matrix inequality (LMI), without additional complicated conditions as required in relevant literature. As quantization is the most common signal constraint in practice, the developed control scheme is further extended to the case where state transmission is affected by a uniform or logarithmic quantization effect. Finally, adequate simulations are performed to show the validity and superiority of the proposed control schemes.
Submitted 6 January, 2024;
originally announced January 2024.
-
Battery-Care Resource Allocation and Task Offloading in Multi-Agent Post-Disaster MEC Environment
Authors:
Yiwei Tang,
Hualong Huang,
Wenhan Zhan,
Geyong Min,
Zhekai Duan,
Yuchuan Lei
Abstract:
As an up-and-coming application scenario of mobile edge computing (MEC), post-disaster rescue involves numerous computing-intensive tasks but has unreliable network connectivity. In rescue environments, quality of service (QoS), such as task execution delay, energy consumption and battery state of health (SoH), is of great importance. This paper studies a multi-user post-disaster MEC environment with unstable 5G communication, where device-to-device (D2D) link communication and dynamic voltage and frequency scaling (DVFS) are adopted to balance each user's requirement for task delay and energy consumption. A battery degradation evaluation approach to prolong battery lifetime is also presented. The distributed optimization problem is formulated into a mixed cooperative-competitive (MCC) multi-agent Markov decision process (MAMDP) and is tackled with recurrent multi-agent Proximal Policy Optimization (rMAPPO). Extensive simulations and comprehensive comparisons with other representative algorithms clearly demonstrate the effectiveness of the proposed rMAPPO-based offloading scheme.
Submitted 23 December, 2023;
originally announced December 2023.
-
Deep Hierarchical Video Compression
Authors:
Ming Lu,
Zhihao Duan,
Fengqing Zhu,
Zhan Ma
Abstract:
Recently, probabilistic predictive coding that directly models the conditional distribution of latent features across successive frames for temporal redundancy removal has yielded promising results. Existing methods using a single-scale Variational AutoEncoder (VAE) must devise complex networks for conditional probability estimation in latent space, neglecting multiscale characteristics of video frames. Instead, this work proposes hierarchical probabilistic predictive coding, for which hierarchical VAEs are carefully designed to characterize multiscale latent features as a family of flexible priors and posteriors to predict the probabilities of future frames. Under such a hierarchical structure, lightweight networks are sufficient for prediction. The proposed method outperforms representative learned video compression models on common testing videos and demonstrates computational friendliness with a much smaller memory footprint and faster encoding/decoding. Extensive experiments on adaptation to temporal patterns also indicate the better generalization of our hierarchical predictive mechanism. Furthermore, our solution is the first to enable progressive decoding that is favored in networked video applications with packet loss.
Submitted 12 December, 2023;
originally announced December 2023.
-
Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech
Authors:
Enting Zhou,
You Zhang,
Zhiyao Duan
Abstract:
Dimensional representations of speech emotions such as the arousal-valence (AV) representation provide a more continuous and fine-grained description and control than their categorical counterparts. They have wide applications in tasks such as dynamic emotion understanding and expressive text-to-speech synthesis. Existing methods that predict the dimensional emotion representation from speech cast it as a supervised regression task. These methods face data scarcity issues, as dimensional annotations are much harder to acquire than categorical labels. In this work, we propose to learn the AV representation from categorical emotion labels of speech. We start by learning a rich and emotion-relevant high-dimensional speech feature representation using self-supervised pre-training and emotion classification fine-tuning. This representation is then mapped to the 2D AV space according to psychological findings through anchored dimensionality reduction. Experiments show that our method achieves a Concordance Correlation Coefficient (CCC) performance comparable to state-of-the-art supervised regression methods on IEMOCAP without leveraging ground-truth AV annotations during training. This validates our proposed approach on AV prediction. Furthermore, visualization of AV predictions on MEAD and EmoDB datasets shows the interpretability of the learned AV representations.
Submitted 6 February, 2024; v1 submitted 24 November, 2023;
originally announced November 2023.
-
A Novel Dynamic Event-triggered Mechanism for Dynamic Average Consensus
Authors:
Tao Xu,
Zhisheng Duan,
Guanghui Wen,
Zhiyong Sun
Abstract:
This paper studies a challenging issue introduced in a recent survey, namely designing a distributed event-based scheme to solve the dynamic average consensus (DAC) problem. First, a robust adaptive distributed event-based DAC algorithm is designed without imposing specific initialization criteria to perform estimation task under intermittent communication. Second, a novel adaptive distributed dynamic event-triggered mechanism is proposed to determine the triggering time when neighboring agents broadcast information to each other. Compared to the existing event-triggered mechanisms, the novelty of the proposed dynamic event-triggered mechanism lies in that it guarantees the existence of a positive and uniform minimum inter-event interval without sacrificing any accuracy of the estimation, which is much more practical than only ensuring the exclusion of the Zeno behavior or the boundedness of the estimation error. Third, a composite adaptive law is developed to update the adaptive gain employed in the distributed event-based DAC algorithm and dynamic event-triggered mechanism. Using the composite adaptive update law, the distributed event-based solution proposed in our work is implemented without requiring any global information. Finally, numerical simulations are provided to illustrate the effectiveness of the theoretical results.
Submitted 22 November, 2023;
originally announced November 2023.
-
EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis
Authors:
Ge Zhu,
Yutong Wen,
Marc-André Carbonneau,
Zhiyao Duan
Abstract:
Audio diffusion models can synthesize a wide variety of sounds. Existing models often operate in the latent domain with cascaded phase recovery modules to reconstruct the waveform. This poses challenges when generating high-fidelity audio. In this paper, we propose EDMSound, a diffusion-based generative model in the spectrogram domain under the framework of elucidated diffusion models (EDM). Combined with an efficient deterministic sampler, we achieve a Fréchet audio distance (FAD) score similar to the top-ranked baseline with only 10 steps and reach state-of-the-art performance with 50 steps on the DCASE2023 foley sound generation benchmark. We also reveal a potential concern regarding diffusion-based audio generation models: they tend to generate samples with high perceptual similarity to the training data. Project page: https://agentcooper2002.github.io/EDMSound/
Submitted 18 November, 2023; v1 submitted 14 November, 2023;
originally announced November 2023.
-
SynthTab: Leveraging Synthesized Data for Guitar Tablature Transcription
Authors:
Yongyi Zang,
Yi Zhong,
Frank Cwitkowitz,
Zhiyao Duan
Abstract:
Guitar tablature is a form of music notation widely used among guitarists. It captures not only the musical content of a piece, but also its implementation and ornamentation on the instrument. Guitar Tablature Transcription (GTT) is an important task with broad applications in music education, composition, and entertainment. Existing GTT datasets are quite limited in size and scope, rendering models trained on them prone to overfitting and incapable of generalizing to out-of-domain data. In order to address this issue, we present a methodology for synthesizing large-scale GTT audio using commercial acoustic and electric guitar plugins. We procure SynthTab, a dataset derived from DadaGP, which is a vast and diverse collection of richly annotated symbolic tablature. The proposed synthesis pipeline produces audio which faithfully adheres to the original fingerings and a subset of techniques specified in the tablature, and covers multiple guitars and styles for each track. Experiments show that pre-training a baseline GTT model on SynthTab can improve transcription performance when fine-tuning and testing on an individual dataset. More importantly, cross-dataset experiments show that pre-training significantly mitigates issues with overfitting.
Submitted 24 January, 2024; v1 submitted 16 September, 2023;
originally announced September 2023.
-
SingFake: Singing Voice Deepfake Detection
Authors:
Yongyi Zang,
You Zhang,
Mojtaba Heydari,
Zhiyao Duan
Abstract:
The rise of singing voice synthesis presents critical challenges to artists and industry stakeholders over unauthorized voice usage. Unlike synthesized speech, synthesized singing voices are typically released in songs containing strong background music that may hide synthesis artifacts. Additionally, singing voices present different acoustic and linguistic characteristics from speech utterances. These unique properties make singing voice deepfake detection a relevant but significantly different problem from synthetic speech detection. In this work, we propose the singing voice deepfake detection task. We first present SingFake, the first curated in-the-wild dataset consisting of 28.93 hours of bonafide and 29.40 hours of deepfake song clips in five languages from 40 singers. We provide a train/validation/test split where the test sets include various scenarios. We then use SingFake to evaluate four state-of-the-art speech countermeasure systems trained on speech utterances. We find these systems lag significantly behind their performance on speech test data. When trained on SingFake, either using separated vocal tracks or song mixtures, these systems show substantial improvement. However, our evaluations also identify challenges associated with unseen singers, communication codecs, languages, and musical contexts, calling for dedicated research into singing voice deepfake detection. The SingFake dataset and related resources are available at https://www.singfake.org/.
Submitted 21 January, 2024; v1 submitted 14 September, 2023;
originally announced September 2023.
-
An Improved Upper Bound on the Rate-Distortion Function of Images
Authors:
Zhihao Duan,
Jack Ma,
Jiangpeng He,
Fengqing Zhu
Abstract:
Recent work has shown that Variational Autoencoders (VAEs) can be used to upper-bound the information rate-distortion (R-D) function of images, i.e., the fundamental limit of lossy image compression. In this paper, we report an improved upper bound on the R-D function of images implemented by (1) introducing a new VAE model architecture, (2) applying variable-rate compression techniques, and (3) proposing a novel \ourfunction{} to stabilize training. We demonstrate that at least 30% BD-rate reduction w.r.t. the intra prediction mode in VVC codec is achievable, suggesting that there is still great potential for improving lossy image compression. Code is made publicly available at https://github.com/duanzhiihao/lossy-vae.
Submitted 5 September, 2023;
originally announced September 2023.
-
Mitigating Cross-Database Differences for Learning Unified HRTF Representation
Authors:
Yutong Wen,
You Zhang,
Zhiyao Duan
Abstract:
Individualized head-related transfer functions (HRTFs) are crucial for accurate sound positioning in virtual auditory displays. As the acoustic measurement of HRTFs is resource-intensive, predicting individualized HRTFs using machine learning models is a promising approach at scale. Training such models requires a unified HRTF representation across multiple databases to utilize their respectively limited samples. However, in addition to differences in the spatial sampling locations, recent studies have shown that, even for the common location, HRTFs across databases manifest consistent differences that make it trivial to tell which databases they come from. This poses a significant challenge for learning a unified HRTF representation across databases. In this work, we first identify the possible causes of these cross-database differences, attributing them to variations in the measurement setup. Then, we propose a novel approach to normalize the frequency responses of HRTFs across databases. We show that HRTFs from different databases cannot be classified by their database after normalization. We further show that these normalized HRTFs can be used to learn a more unified HRTF representation across databases than the prior art. We believe that this normalization approach paves the road to many data-intensive tasks on HRTF modeling.
Submitted 26 July, 2023;
originally announced July 2023.
-
On the Effects and Optimal Design of Redundant Sensors in Collaborative State Estimation
Authors:
Yunxiao Ren,
Zhisheng Duan,
Peihu Duan,
Ling Shi
Abstract:
The existence of redundant sensors in collaborative state estimation is a common occurrence, yet their true significance remains elusive. This paper comprehensively investigates the effects and optimal design of redundant sensors in sensor networks that use Kalman filtering to estimate the state of a random process collaboratively. The paper presents two main results: a theoretical analysis of the effects of redundant sensors and an engineering-oriented optimal design of redundant sensors. In the theoretical analysis, the paper leverages Riccati equations and symplectic matrix theory to unveil the explicit role of redundant sensors in cooperative state estimation. The results unequivocally demonstrate that the addition of redundant sensors enhances the estimation performance of the sensor network, aligning with the principle of "more is better". Moreover, the paper establishes a precise necessary and sufficient condition to assess whether the inclusion of redundant sensors improves the overall estimation performance. Moving towards engineering-oriented design optimization, the paper proposes a novel algorithm to tackle the optimal design problem of redundant sensors, and the convergence of the proposed algorithm is guaranteed. Numerical simulations are provided to demonstrate the results.
Submitted 4 February, 2024; v1 submitted 15 June, 2023;
originally announced June 2023.
-
Phase perturbation improves channel robustness for speech spoofing countermeasures
Authors:
Yongyi Zang,
You Zhang,
Zhiyao Duan
Abstract:
In this paper, we aim to address the problem of channel robustness in speech countermeasure (CM) systems, which are used to distinguish synthetic speech from human natural speech. On the basis of two hypotheses, we suggest an approach for perturbing phase information during the training of time-domain CM systems. Communication networks often employ lossy compression codecs that encode only magnitude information, therefore heavily altering phase information. Also, state-of-the-art CM systems rely on phase information to identify spoofed speech. Thus, we believe the information loss in the phase domain induced by lossy compression codecs degrades performance on unseen channels. We first establish the dependence of time-domain CM systems on phase information by perturbing phase in evaluation, showing strong degradation. Then, we demonstrate that perturbing phase during training leads to a significant performance improvement, whereas perturbing magnitude leads to further degradation.
Submitted 6 October, 2023; v1 submitted 6 June, 2023;
originally announced June 2023.
-
SingNet: A Real-time Singing Voice Beat and Downbeat Tracking System
Authors:
Mojtaba Heydari,
Ju-Chiang Wang,
Zhiyao Duan
Abstract:
Singing voice beat and downbeat tracking has several applications in automatic music production, analysis and manipulation. Among them, some require real-time processing, such as live performance processing and auto-accompaniment for singing inputs. This task is challenging owing to the non-trivial rhythmic and harmonic patterns in singing signals. Real-time processing introduces further constraints, such as the inaccessibility of future data and the impossibility of correcting previous results that are inconsistent with later ones. In this paper, we introduce the first system that tracks the beats and downbeats of singing voices in real-time. Specifically, we propose a novel dynamic particle filtering approach that incorporates offline historical data to correct the online inference by using a variable number of particles. We evaluate the performance on two datasets: GTZAN with the separated vocal tracks, and an in-house dataset with the original vocal stems. Experimental results demonstrate that our proposed approach outperforms the baseline by 3-5%.
Submitted 4 June, 2023;
originally announced June 2023.
-
GNCformer Enhanced Self-attention for Automatic Speech Recognition
Authors:
J. Li,
Z. Duan,
S. Li,
X. Yu,
G. Yang
Abstract:
In this paper, an Enhanced Self-Attention (ESA) mechanism has been put forward for robust feature extraction. The proposed ESA is integrated with the recursive gated convolution and self-attention mechanism. In particular, the former is used to capture multi-order feature interaction and the latter is for global feature extraction. In addition, the location of interest that is suitable for inserting the ESA is also worth being explored. In this paper, the ESA is embedded into the encoder layer of the Transformer network for automatic speech recognition (ASR) tasks, and this newly proposed model is named GNCformer. The effectiveness of the GNCformer has been validated using two datasets, Aishell-1 and HKUST. Experimental results show that, compared with the Transformer network, CER improvements of 0.8% and 1.2% can be achieved on these two datasets, respectively. It is worth mentioning that only 1.4M additional parameters have been involved in our proposed GNCformer.
Submitted 22 May, 2023;
originally announced May 2023.
-
Sim-T: Simplify the Transformer Network by Multiplexing Technique for Speech Recognition
Authors:
Guangyong Wei,
Zhikui Duan,
Shiren Li,
Guangguang Yang,
Xinmei Yu,
Junhua Li
Abstract:
In recent years, a great deal of attention has been paid to the Transformer network for speech recognition tasks due to its excellent model performance. However, the Transformer network always involves heavy computation and a large number of parameters, causing serious deployment problems in devices with limited computation resources or storage memory. In this paper, a new lightweight model called Sim-T has been proposed to expand the generality of the Transformer model. With the help of the newly developed multiplexing technique, the Sim-T can efficiently compress the model with negligible sacrifice on its performance. To be more precise, the proposed technique includes two parts: module weight multiplexing and attention score multiplexing. Moreover, a novel decoder structure has been proposed to facilitate the attention score multiplexing. Extensive experiments have been conducted to validate the effectiveness of Sim-T. On the Aishell-1 dataset, when the proposed Sim-T has 48% fewer parameters than the baseline Transformer, a 0.4% CER improvement can be obtained. Alternatively, a 69% parameter reduction can be achieved if the Sim-T gives the same performance as the baseline Transformer. With regard to the HKUST and WSJ eval92 datasets, CER and WER are improved by 0.3% and 0.2%, respectively, when Sim-T has 40% fewer parameters than the baseline Transformer.
Submitted 11 April, 2023;
originally announced April 2023.
-
Observation of Periodic Systems: Bridge Centralized Kalman Filtering and Consensus-Based Distributed Filtering
Authors:
Jiachen Qian,
Zhisheng Duan,
Peihu Duan,
Zhongkui Li
Abstract:
Compared with linear time-invariant systems, linear periodic systems can describe the periodic processes arising from nature and engineering more precisely. However, the time-varying system parameters increase the difficulty of research on periodic systems, such as stabilization and observation. This paper aims to consider the observation problem of periodic systems by bridging two fundamental filtering algorithms for periodic systems with a sensor network: consensus-on-measurement-based distributed filtering (CMDF) and centralized Kalman filtering (CKF). Firstly, one mild convergence condition based on uniformly collective observability is established for CMDF, under which the filtering performance of CMDF can be formulated as a symmetric periodic positive semidefinite (SPPS) solution to a discrete-time periodic Lyapunov equation. Then, the closed form of the performance gap between CMDF and CKF is presented in terms of the information fusion steps and the consensus weights of the network. Moreover, it is pointed out that the estimation error covariance of CMDF exponentially converges to the centralized one with the fusion steps tending to infinity. Altogether, these new results establish a concise and specific relationship between distributed and centralized filtering, and formulate the trade-off between the communication cost and distributed filtering performance on periodic systems. Finally, the theoretical results are verified with numerical experiments.
Submitted 15 March, 2023;
originally announced March 2023.
-
Transcription free filler word detection with Neural semi-CRFs
Authors:
Ge Zhu,
Yujia Yan,
Juan-Pablo Caceres,
Zhiyao Duan
Abstract:
Non-linguistic filler words, such as "uh" or "um", are prevalent in spontaneous speech and serve as indicators for expressing hesitation or uncertainty. Previous works for detecting certain non-linguistic filler words are highly dependent on transcriptions from a well-established commercial automatic speech recognition (ASR) system. However, certain ASR systems are not universally accessible from many aspects, e.g., budget, target languages, and computational power. In this work, we investigate a filler word detection system that does not depend on ASR systems. We show that, by using the structured state space sequence model (S4) and neural semi-Markov conditional random fields (semi-CRFs), we achieve an absolute F1 improvement of 6.4% (segment level) and 3.1% (event level) on the PodcastFillers dataset. We also conduct a qualitative analysis on the detected results to analyze the limitations of our proposed system.
Submitted 11 March, 2023;
originally announced March 2023.
-
QARV: Quantization-Aware ResNet VAE for Lossy Image Compression
Authors:
Zhihao Duan,
Ming Lu,
Jack Ma,
Yuning Huang,
Zhan Ma,
Fengqing Zhu
Abstract:
This paper addresses the problem of lossy image compression, a fundamental problem in image processing and information theory that is involved in many real-world applications. We start by reviewing the framework of variational autoencoders (VAEs), a powerful class of generative probabilistic models that has a deep connection to lossy compression. Based on VAEs, we develop a novel scheme for lossy image compression, which we name quantization-aware ResNet VAE (QARV). Our method incorporates a hierarchical VAE architecture integrated with test-time quantization and quantization-aware training, without which efficient entropy coding would not be possible. In addition, we design the neural network architecture of QARV specifically for fast decoding and propose an adaptive normalization operation for variable-rate compression. Extensive experiments are conducted, and results show that QARV achieves variable-rate compression, high-speed decoding, and a better rate-distortion performance than existing baseline methods. The code of our method is publicly accessible at https://github.com/duanzhiihao/lossy-vae
Submitted 1 December, 2023; v1 submitted 16 February, 2023;
originally announced February 2023.
-
Harmonic-Coupled Riccati Equations and its Applications in Distributed Filtering
Authors:
Jiachen Qian,
Peihu Duan,
Zhisheng Duan,
Ling Shi
Abstract:
The coupled Riccati equations consist of multiple Riccati-like equations with solutions coupled with each other, which can be applied to depict the properties of more complex systems such as Markovian systems or multi-agent systems. This paper formulates and investigates a new kind of coupled Riccati equations, called harmonic-coupled Riccati equations (HCRE), from the matrix iterative law of the consensus on information-based distributed filtering (CIDF) algorithm proposed in [1], where the solutions of the equations are coupled with harmonic means. Firstly, mild conditions for the existence and uniqueness of the solution to HCRE are derived with collective observability and primitivity of the weighting matrix. Then, it is proved that the matrix iterative law of CIDF will converge to the unique solution of the corresponding HCRE, and hence can be used to obtain the solution to HCRE. Moreover, by applying the novel theory of HCRE, it is shown that the real estimation error covariance of CIDF will also reach a steady state, and the convergent value is simplified as the solution to a discrete-time Lyapunov equation (DLE). Altogether, these new results develop the theory of the coupled Riccati equations and provide a novel perspective on the performance analysis of the CIDF algorithm, which sufficiently reduces the conservativeness of the evaluation techniques in the literature. Finally, the theoretical results are verified with numerical experiments.
Submitted 12 July, 2023; v1 submitted 21 November, 2022;
originally announced November 2022.
-
Efficient Feature Compression for Edge-Cloud Systems
Authors:
Zhihao Duan,
Fengqing Zhu
Abstract:
Optimizing computation in an edge-cloud system is an important yet challenging problem. In this paper, we consider a three-way trade-off between bit rate, classification accuracy, and encoding complexity in an edge-cloud image classification system. Our method includes a new training strategy and an efficient encoder architecture to improve the rate-accuracy performance. Our design can also be easily scaled according to different computation resources on the edge device, taking a step towards achieving a rate-accuracy-complexity (RAC) trade-off. Under various settings, our feature coding system consistently outperforms previous methods in terms of the RAC performance.
Submitted 17 November, 2022;
originally announced November 2022.
-
SAMO: Speaker Attractor Multi-Center One-Class Learning for Voice Anti-Spoofing
Authors:
Siwen Ding,
You Zhang,
Zhiyao Duan
Abstract:
Voice anti-spoofing systems are crucial auxiliaries for automatic speaker verification (ASV) systems. A major challenge is caused by unseen attacks empowered by advanced speech synthesis technologies. Our previous research on one-class learning has improved the generalization ability to unseen attacks by compacting the bona fide speech in the embedding space. However, such compactness lacks consideration of the diversity of speakers. In this work, we propose speaker attractor multi-center one-class learning (SAMO), which clusters bona fide speech around a number of speaker attractors and pushes away spoofing attacks from all the attractors in a high-dimensional embedding space. For training, we propose an algorithm for the co-optimization of bona fide speech clustering and bona fide/spoof classification. For inference, we propose strategies to enable anti-spoofing for speakers without enrollment. Our proposed system outperforms existing state-of-the-art single systems with a relative improvement of 38% on equal error rate (EER) on the ASVspoof2019 LA evaluation set.
Submitted 4 November, 2022;
originally announced November 2022.
-
Discrete Communication and Control Updating in Event-Triggered Consensus
Authors:
Bin Cheng,
Yuezu Lv,
Zhongkui Li,
Zhisheng Duan
Abstract:
This paper studies the consensus control problem faced with three essential demands, namely, discrete control updating for each agent, discrete-time communications among neighboring agents, and the fully distributed fashion of the controller implementation without requiring any global information of the whole network topology. Noting that existing related results, which meet at most one or two of these demands, are essentially not applicable, in this paper we establish a novel framework to solve the problem of fully distributed consensus with discrete communication and control. The first key point in this framework is the design of controllers that are only updated at discrete event instants and do not depend on global information, by introducing time-varying gains inspired by the adaptive control technique. Another key point is the invention of novel dynamic triggering functions that are independent of relative information among neighboring agents. Under the established framework, we propose fully distributed state-feedback event-triggered protocols for undirected graphs and further study the more complex cases of output-feedback control and directed graphs. Finally, numerical examples are provided to verify the effectiveness of the proposed event-triggered protocols.
Submitted 26 October, 2022;
originally announced October 2022.
-
HRTF Field: Unifying Measured HRTF Magnitude Representation with Neural Fields
Authors:
You Zhang,
Yuxiang Wang,
Zhiyao Duan
Abstract:
Head-related transfer functions (HRTFs) are a set of functions describing the spatial filtering effect of the outer ear (i.e., torso, head, and pinnae) onto sound sources at different azimuth and elevation angles. They are widely used in spatial audio rendering. While the azimuth and elevation angles are intrinsically continuous, measured HRTFs in existing datasets employ different spatial sampling schemes, making it difficult to model HRTFs across datasets. In this work, we propose to use neural fields, a differentiable representation of functions through neural networks, to model HRTFs with arbitrary spatial sampling schemes. Such representation is unified across datasets with different spatial sampling schemes. HRTFs for arbitrary azimuth and elevation angles can be derived from this representation. We further introduce a generative model named HRTF field to learn the latent space of the HRTF neural fields across subjects. We demonstrate promising performance on HRTF interpolation and generation tasks and point out potential future work.
Submitted 23 February, 2023; v1 submitted 27 October, 2022;
originally announced October 2022.
-
CPSAA: Accelerating Sparse Attention using Crossbar-based Processing-In-Memory Architecture
Authors:
Huize Li,
Hai Jin,
Long Zheng,
Yu Huang,
Xiaofei Liao,
Dan Chen,
Zhuohui Duan,
Cong Liu,
Jiahong Xu,
Chuanyi Gui
Abstract:
The attention mechanism requires huge computational efforts to process unnecessary calculations, significantly limiting the system's performance. Researchers propose sparse attention to convert some DDMM operations to SDDMM and SpMM operations. However, current sparse attention solutions introduce massive off-chip random memory access. We propose CPSAA, a novel crossbar-based PIM-featured sparse attention accelerator. First, we present a novel attention calculation mode. Second, we design a novel PIM-based sparsity pruning architecture. Finally, we present novel crossbar-based methods. Experimental results show that CPSAA achieves an average performance improvement of 89.6X, 32.2X, 17.8X, 3.39X, and 3.84X and an energy saving of 755.6X, 55.3X, 21.3X, 5.7X, and 4.9X when compared with GPU, FPGA, SANGER, ReBERT, and ReTransformer, respectively.
Submitted 7 October, 2023; v1 submitted 12 October, 2022;
originally announced October 2022.
-
Minimal-order Appointed-time Unknown Input Observers: Design and Applications
Authors:
Yuezu Lv,
Zhongkui Li,
Zhisheng Duan
Abstract:
This paper presents a framework for minimal-order appointed-time unknown input observers for linear systems based on the pairwise observer structure. A minimal-order appointed-time observer is first proposed for the linear system without the unknown input, which can estimate the state exactly at the preset time by solving for the unique solution of a system of linear equations. To further reduce the computational burden, another form of the appointed-time observer is designed. For the general linear system with the unknown input acting on both the system dynamics and the measured output, a model reconfiguration is performed to decouple the effect of the unknown input, and the gap between the existing reduced-order appointed-time unknown input observer and the possible minimal-order appointed-time observer is revealed. Based on the reconstructed model, the minimal-order appointed-time unknown input observer is presented to realize state estimation of linear systems with the unknown input at an arbitrarily small preset time. The minimal-order appointed-time unknown input observer is then applied to the design of fully distributed adaptive output-feedback attack-free consensus protocols for linear multi-agent systems.
Submitted 6 October, 2022;
originally announced October 2022.
-
ControlVC: Zero-Shot Voice Conversion with Time-Varying Controls on Pitch and Speed
Authors:
Meiying Chen,
Zhiyao Duan
Abstract:
Recent developments in neural speech synthesis and vocoding have sparked a renewed interest in voice conversion (VC). Beyond timbre transfer, achieving controllability on para-linguistic parameters such as pitch and speed is critical in deploying VC systems in many application scenarios. Existing studies, however, either only provide utterance-level global control or lack interpretability on the controls. In this paper, we propose ControlVC, the first neural voice conversion system that achieves time-varying controls on pitch and speed. ControlVC uses pre-trained encoders to compute pitch and linguistic embeddings from the source utterance and speaker embeddings from the target utterance. These embeddings are then concatenated and converted to speech using a vocoder. It achieves speed control through TD-PSOLA pre-processing on the source utterance, and achieves pitch control by manipulating the pitch contour before feeding it to the pitch encoder. Systematic subjective and objective evaluations are conducted to assess the speech quality and controllability. Results show that, on non-parallel and zero-shot conversion tasks, ControlVC significantly outperforms two other self-constructed baselines on speech quality, and it can successfully achieve time-varying pitch and speed control.
Submitted 11 January, 2024; v1 submitted 23 September, 2022;
originally announced September 2022.
-
Singing Beat Tracking With Self-supervised Front-end and Linear Transformers
Authors:
Mojtaba Heydari,
Zhiyao Duan
Abstract:
Tracking beats of singing voices without the presence of musical accompaniment can find many applications in music production, automatic song arrangement, and social media interaction. Its main challenge is the lack of strong rhythmic and harmonic patterns that are important for music rhythmic analysis in general. Even for human listeners, this can be a challenging task. As a result, existing music beat tracking systems fail to deliver satisfactory performance on singing voices. In this paper, we propose singing beat tracking as a novel task, and propose the first approach to solving this task. Our approach leverages semantic information of singing voices by employing pre-trained self-supervised WavLM and DistilHuBERT speech representations as the front-end and uses a self-attention encoder layer to predict beats. To train and test the system, we obtain separated singing voices and their beat annotations using source separation and beat tracking on complete songs, followed by manual corrections. Experiments on the 741 separated vocal tracks of the GTZAN dataset show that the proposed system outperforms several state-of-the-art music beat tracking methods by a large margin in terms of beat tracking accuracy. Ablation studies also confirm the advantages of pre-trained self-supervised speech representations over generic spectral features.
Submitted 30 August, 2022;
originally announced August 2022.
-
Lossy Image Compression with Quantized Hierarchical VAEs
Authors:
Zhihao Duan,
Ming Lu,
Zhan Ma,
Fengqing Zhu
Abstract:
Recent research has shown a strong theoretical connection between variational autoencoders (VAEs) and the rate-distortion theory. Motivated by this, we consider the problem of lossy image compression from the perspective of generative modeling. Starting with ResNet VAEs, which are originally designed for data (image) distribution modeling, we redesign their latent variable model using a quantization-aware posterior and prior, enabling easy quantization and entropy coding at test time. Along with improved neural network architecture, we present a powerful and efficient model that outperforms previous methods on natural image lossy compression. Our model compresses images in a coarse-to-fine fashion and supports parallel encoding and decoding, leading to fast execution on GPUs. Code is available at https://github.com/duanzhiihao/lossy-vae.
Submitted 25 March, 2023; v1 submitted 27 August, 2022;
originally announced August 2022.
-
Predicting Global Head-Related Transfer Functions From Scanned Head Geometry Using Deep Learning and Compact Representations
Authors:
Yuxiang Wang,
You Zhang,
Zhiyao Duan,
Mark Bocko
Abstract:
In the growing field of virtual auditory display, personalized head-related transfer functions (HRTFs) play a vital role in establishing an accurate sound image. In this work, we propose an HRTF personalization method employing convolutional neural networks (CNN) to predict a subject's HRTFs for all directions from their scanned head geometry. To ease the training of the CNN models, we propose novel pre-processing methods for both the head scans and HRTF data to achieve compact representations. For the head scan, we use truncated spherical cap harmonic (SCH) coefficients to represent the pinna area, which is important in the acoustic scattering process. For the HRTF data, we use truncated spherical harmonic (SH) coefficients to represent the HRTF magnitudes and onsets. One CNN model is trained to predict the SH coefficients of the HRTF magnitudes from the SCH coefficients of the scanned ear geometry and other anthropometric measurements of the head. The other CNN model is trained to predict SH coefficients of the HRTF onsets from only the anthropometric measurements of the ear, head, and torso. Combining the magnitude and onset predictions, our method is able to predict the complete and global HRTF data. A leave-one-out validation with the log-spectral distortion (LSD) metric is used for objective evaluation. The results show a decent LSD level in both spatial and temporal dimensions compared to the ground-truth HRTFs and a lower LSD than the boundary element method (BEM) simulation of HRTFs that the database provides. The localization simulation results with an auditory model are also consistent with the objective evaluation metrics, showing that the localization responses with our predicted HRTFs are significantly better than with the BEM-calculated ones.
Submitted 28 July, 2022;
originally announced July 2022.
-
Rethinking Audio-visual Synchronization for Active Speaker Detection
Authors:
Abudukelimu Wuerkaixi,
You Zhang,
Zhiyao Duan,
Changshui Zhang
Abstract:
Active speaker detection (ASD) systems are important modules for analyzing multi-talker conversations. They aim to detect which speakers, if any, are talking in a visual scene at any given time. Existing research on ASD does not agree on the definition of active speakers. We clarify the definition in this work and require synchronization between the audio and visual speaking activities. This clarification of the definition is motivated by our extensive experiments, through which we discover that existing ASD methods fail to model the audio-visual synchronization and often classify unsynchronized videos as active speaking. To address this problem, we propose a cross-modal contrastive learning strategy and apply positional encoding in attention modules for supervised ASD models to leverage the synchronization cue. Experimental results suggest that our model can successfully detect unsynchronized speaking as not speaking, addressing the limitation of current models.
Submitted 10 July, 2022; v1 submitted 21 June, 2022;
originally announced June 2022.
-
Stochastic Event-triggered Variational Bayesian Filtering
Authors:
Xiaoxu Lv,
Peihu Duan,
Zhisheng Duan,
Guanrong Chen,
Ling Shi
Abstract:
This paper proposes an event-triggered variational Bayesian filter for remote state estimation with unknown and time-varying noise covariances. After presetting multiple nominal process noise covariances and an initial measurement noise covariance, a variational Bayesian method and a fixed-point iteration method are utilized to jointly estimate the posterior state vector and the unknown noise covariances under a stochastic event-triggered mechanism. The proposed algorithm ensures low communication loads and excellent estimation performance for a wide range of unknown noise covariances. Finally, the performance of the proposed algorithm is demonstrated by tracking simulations of a vehicle.
Submitted 14 June, 2022;
originally announced June 2022.
-
Smart City Intersections: Intelligence Nodes for Future Metropolises
Authors:
Zoran Kostić,
Alex Angus,
Zhengye Yang,
Zhuoxu Duan,
Ivan Seskar,
Gil Zussman,
Dipankar Raychaudhuri
Abstract:
Traffic intersections are the most suitable locations for the deployment of computing, communications, and intelligence services for smart cities of the future. The abundance of data to be collected and processed, in combination with privacy and security concerns, motivates the use of the edge-computing paradigm which aligns well with physical intersections in metropolises. This paper focuses on high-bandwidth, low-latency applications, and in that context it describes: (i) system design considerations for smart city intersection intelligence nodes; (ii) key technological components including sensors, networking, edge computing, low latency design, and AI-based intelligence; and (iii) applications such as privacy preservation, cloud-connected vehicles, a real-time "radar-screen", traffic management, and monitoring of pedestrian behavior during pandemics. The results of the experimental studies performed on the COSMOS testbed located in New York City are illustrated. Future challenges in designing human-centered smart city intersections are summarized.
Submitted 13 May, 2022; v1 submitted 3 May, 2022;
originally announced May 2022.