-
Channel Estimation, Interpolation and Extrapolation in Doubly-dispersive Channels
Authors:
Zijun Gong,
Fan Jiang,
Yuhui Song,
Cheng Li,
Xiaofeng Tao
Abstract:
The OTFS (Orthogonal Time Frequency Space) is widely acknowledged for its ability to combat Doppler spread in time-varying channels. In this paper, another advantage of OTFS over OFDM (Orthogonal Frequency Division Multiplexing) will be demonstrated: much reduced channel training overhead. Specifically, the sparsity of the channel in delay-Doppler (D-D) domain implies strong correlation of channel…
▽ More
The OTFS (Orthogonal Time Frequency Space) is widely acknowledged for its ability to combat Doppler spread in time-varying channels. In this paper, another advantage of OTFS over OFDM (Orthogonal Frequency Division Multiplexing) will be demonstrated: much reduced channel training overhead. Specifically, the sparsity of the channel in delay-Doppler (D-D) domain implies strong correlation of channel gains in time-frequency (T-F) domain, which can be harnessed to reduce channel training overhead through interpolation. An immediate question is how much training overhead is needed in doubly-dispersive channels? A conventional belief is that the overhead is only dependent on the product of delay and Doppler spreads, but we will show that it's also dependent on the T-F window size. The finite T-F window leads to infinite spreading in D-D domain, and aliasing will be inevitable after sampling in T-F domain. Two direct consequences of the aliasing are increased channel training overhead and interference. Another factor contributing to channel estimation error is the inter-symbol-carrier-interference (ISCI), resulting from the uncertainty principle. Both aliasing and ISCI are considered in channel modelling, a low-complexity algorithm is proposed for channel estimation and interpolation through FFT. A large T-F window is necessary for reduced channel training overhead and aliasing, but increases processing delay. Fortunately, we show that the proposed algorithm can be implemented in a pipeline fashion. Further more, we showed that data-aided channel tracking is possible in D-D domain to further reduce the channel estimation frequency, i.e., channel extrapolation. The impacts of aliasing and ISCI on channel interpolation error are analyzed.
△ Less
Submitted 18 August, 2024;
originally announced August 2024.
-
MAT-SED: A Masked Audio Transformer with Masked-Reconstruction Based Pre-training for Sound Event Detection
Authors:
Pengfei Cai,
Yan Song,
Kang Li,
Haoyu Song,
Ian McLoughlin
Abstract:
Sound event detection (SED) methods that leverage a large pre-trained Transformer encoder network have shown promising performance in recent DCASE challenges. However, they still rely on an RNN-based context network to model temporal dependencies, largely due to the scarcity of labeled data. In this work, we propose a pure Transformer-based SED model with masked-reconstruction based pre-training,…
▽ More
Sound event detection (SED) methods that leverage a large pre-trained Transformer encoder network have shown promising performance in recent DCASE challenges. However, they still rely on an RNN-based context network to model temporal dependencies, largely due to the scarcity of labeled data. In this work, we propose a pure Transformer-based SED model with masked-reconstruction based pre-training, termed MAT-SED. Specifically, a Transformer with relative positional encoding is first designed as the context network, pre-trained by the masked-reconstruction task on all available target data in a self-supervised way. Both the encoder and the context network are jointly fine-tuned in a semi-supervised manner. Furthermore, a global-local feature fusion strategy is proposed to enhance the localization capability. Evaluation of MAT-SED on DCASE2023 task4 surpasses state-of-the-art performance, achieving 0.587/0.896 PSDS1/PSDS2 respectively.
△ Less
Submitted 19 August, 2024; v1 submitted 16 August, 2024;
originally announced August 2024.
-
Iterative Equalization of CPM With Unitary Approximate Message Passing
Authors:
Zilong Liu,
Yi Song,
Qinghua Guo,
Peng Sun,
Kexian Gong,
Zhongyong Wang
Abstract:
Continuous phase modulation (CPM) has extensive applications in wireless communications due to its high spectral and power efficiency. However, its nonlinear characteristics pose significant challenges for detection in frequency selective fading channels. This paper proposes an iterative receiver tailored for the detection of CPM signals over frequency selective fading channels. This design levera…
▽ More
Continuous phase modulation (CPM) has extensive applications in wireless communications due to its high spectral and power efficiency. However, its nonlinear characteristics pose significant challenges for detection in frequency selective fading channels. This paper proposes an iterative receiver tailored for the detection of CPM signals over frequency selective fading channels. This design leverages the factor graph framework to integrate equalization, demodulation, and decoding functions. The equalizer employs the unitary approximate message passing (UAMP) algorithm, while the unitary transformation is implemented using the fast Fourier transform (FFT) with the aid of a cyclic prefix (CP), thereby achieving low computational complexity while with high performance. For CPM demodulation and channel decoding, with belief propagation (BP), we design a message passing-based maximum a posteriori (MAP) algorithm, and the message exchange between the demodulator, decoder and equalizer is elaborated. With proper message passing schedules, the receiver can achieve fast convergence. Simulation results show that compared with existing turbo receivers, the proposed receiver delivers significant performance enhancement with low computational complexity.
△ Less
Submitted 14 August, 2024;
originally announced August 2024.
-
On Noise Resiliency of Neuromorphic Inferential Communication in Microgrids
Authors:
Yubo Song,
Subham Sahoo,
Xiaoguang Diao
Abstract:
Neuromorphic computing leveraging spiking neural network has emerged as a promising solution to tackle the security and reliability challenges with the conventional cyber-physical infrastructure of microgrids. Its event-driven paradigm facilitates promising prospect in resilient and energy-efficient coordination among power electronic converters. However, different from biological neurons that are…
▽ More
Neuromorphic computing leveraging spiking neural network has emerged as a promising solution to tackle the security and reliability challenges with the conventional cyber-physical infrastructure of microgrids. Its event-driven paradigm facilitates promising prospect in resilient and energy-efficient coordination among power electronic converters. However, different from biological neurons that are focused in the literature, microgrids exhibit distinct architectures and features, implying potentially diverse adaptability in its capabilities to dismiss information transfer, which remains largely unrevealed. One of the biggest drawbacks in the information transfer theory is the impact of noise in the signaling accuracy. Hence, this article hereby explores the noise resiliency of neuromorphic inferential communication in microgrids through case studies and underlines potential challenges and solutions as extensions beyond the results, thus offering insights for its implementation in real-world scenarios.
△ Less
Submitted 25 July, 2024;
originally announced August 2024.
-
Language Model Can Listen While Speaking
Authors:
Ziyang Ma,
Yakun Song,
Chenpeng Du,
Jian Cong,
Zhuo Chen,
Yuping Wang,
Yuxuan Wang,
Xie Chen
Abstract:
Dialogue serves as the most natural manner of human-computer interaction (HCI). Recent advancements in speech language models (SLM) have significantly enhanced speech-based conversational AI. However, these models are limited to turn-based conversation, lacking the ability to interact with humans in real-time spoken scenarios, for example, being interrupted when the generated content is not satisf…
▽ More
Dialogue serves as the most natural manner of human-computer interaction (HCI). Recent advancements in speech language models (SLM) have significantly enhanced speech-based conversational AI. However, these models are limited to turn-based conversation, lacking the ability to interact with humans in real-time spoken scenarios, for example, being interrupted when the generated content is not satisfactory. To address these limitations, we explore full duplex modeling (FDM) in interactive speech language models (iSLM), focusing on enhancing real-time interaction and, more explicitly, exploring the quintessential ability of interruption. We introduce a novel model design, namely listening-while-speaking language model (LSLM), an end-to-end system equipped with both listening and speaking channels. Our LSLM employs a token-based decoder-only TTS for speech generation and a streaming self-supervised learning (SSL) encoder for real-time audio input. LSLM fuses both channels for autoregressive generation and detects turn-taking in real time. Three fusion strategies -- early fusion, middle fusion, and late fusion -- are explored, with middle fusion achieving an optimal balance between speech generation and real-time interaction. Two experimental settings, command-based FDM and voice-based FDM, demonstrate LSLM's robustness to noise and sensitivity to diverse instructions. Our results highlight LSLM's capability to achieve duplex communication with minimal impact on existing systems. This study aims to advance the development of interactive speech dialogue systems, enhancing their applicability in real-world contexts.
△ Less
Submitted 5 August, 2024;
originally announced August 2024.
-
A Novel Edge Laplacian-based Approach for Adaptive Formation Control of Uncertain Multi-agent Systems with Unified Relative Error Performance
Authors:
Kun Li,
Kai Zhao,
Yongduan Song,
Lihua Xie
Abstract:
For most existing prescribed performance formation control methods, performance requirements are not directly imposed on the relative states between agents but on the consensus error, which lacks a clear physical interpretation of their solution. In this paper, we propose a novel adaptive prescribed performance formation control strategy, capable of guaranteeing prescribed performance on the relat…
▽ More
For most existing prescribed performance formation control methods, performance requirements are not directly imposed on the relative states between agents but on the consensus error, which lacks a clear physical interpretation of their solution. In this paper, we propose a novel adaptive prescribed performance formation control strategy, capable of guaranteeing prescribed performance on the relative errors, for uncertain high-order multi-agent systems under a class of directed graphs. Due to the consideration of performance constraints for relative errors, a coupled nonlinear interaction term that contains global graphic information among agents is involved in the error dynamics, leading to a fully distributed control design more difficult and challenging. Here by proposing a series of nonlinear mappings and utilizing the edge Laplacian along with Lyapunov stability theory, the presented formation control scheme exhibits the following appealing features when compared to existing results: 1) different performance requirements can be guaranteed in a unified way by solely tuning the design parameters a priori, without the need for control redesign and stability reanalysis under the proposed fixed control protocol, making the design more user-friendly and the implementation less demanding; 2) the complex and burdensome verification process for the initial constraint, often encountered in existing prescribed performance controls, is completely obviated if the performance requirements are global; and 3) nonlinear interaction is completely decoupled and the asymptotic stability of the formation manifold is ensured via using the adaptive parameter estimate technique. Finally, simulations of various performance behaviors are performed to show the efficiency of the theoretical results.
△ Less
Submitted 1 August, 2024;
originally announced August 2024.
-
An Interface Method for Co-simulation of EMT Model and Shifted Frequency EMT Model Based on Rotational Invariance Techniques
Authors:
Shilin Gao,
Ying Chen,
Zhitong Yu,
Wensheng Chen,
Yankan Song
Abstract:
The shifted frequency-based electromagnetic transient (SFEMT) simulation has greatly improved the computational efficiency of traditional electromagnetic transient (EMT) simulation for the ac grid. This letter proposes a novel interface for the co-simulation of the SFEMT model and the traditional EMT model. The general form of SFEMT modeling and the principle of analytical signal construction are…
▽ More
The shifted frequency-based electromagnetic transient (SFEMT) simulation has greatly improved the computational efficiency of traditional electromagnetic transient (EMT) simulation for the ac grid. This letter proposes a novel interface for the co-simulation of the SFEMT model and the traditional EMT model. The general form of SFEMT modeling and the principle of analytical signal construction are first derived. Then, an interface for the co-simulation of EMT and SFEMT simulation is proposed based on rotational invariance techniques. Theoretical analyses and test results demonstrate the effectiveness of the proposed method.
△ Less
Submitted 27 August, 2024; v1 submitted 21 July, 2024;
originally announced July 2024.
-
Inferring Ingrained Remote Information in AC Power Flows Using Neuromorphic Modality Regime
Authors:
Xiaoguang Diao,
Yubo Song,
Subham Sahoo
Abstract:
In this paper, we infer remote measurements such as remote voltages and currents online with change in AC power flows using spiking neural network (SNN) as grid-edge technology for efficient coordination of power electronic converters. This work unifies power and information as a means of data normalization using a multi-modal regime in the form of spikes using energy-efficient neuromorphic learni…
▽ More
In this paper, we infer remote measurements such as remote voltages and currents online with change in AC power flows using spiking neural network (SNN) as grid-edge technology for efficient coordination of power electronic converters. This work unifies power and information as a means of data normalization using a multi-modal regime in the form of spikes using energy-efficient neuromorphic learning and event-driven asynchronous data collection. Firstly, we organize the synchronous real-valued measurements at each edge and translate them into asynchronous spike-based events to collect sparse data for training of SNN at each edge. Instead of relying on error-dependent supervised data-driven learning theory, we exploit the latency-driven unsupervised Hebbian learning rule to obtain modulation pulses for switching of power electronic converters that can now comprehend grid disturbances locally and adapt their operation without requiring explicit infrastructure for global coordination. Not only does this philosophy block exogenous path arrival for cyber attackers by dismissing the cyber layer, it also entails converter adaptation to system reconfiguration and parameter mismatch issues. We conclude this work by validating its energy-efficient and effective online learning performance under various scenarios in different system sizes, including modified IEEE 14-bus system and under experimental conditions.
△ Less
Submitted 9 August, 2024; v1 submitted 20 July, 2024;
originally announced July 2024.
-
Hidden Convexity-Based Distributed Operation of Integrated Electricity-Gas Systems
Authors:
Rong-Peng Liu,
Yue Song,
Junhong Liu,
Xiaozhe Wang,
Jinpeng Guo,
Yunhe Hou
Abstract:
We propose a hidden convexity-based method to address distributed optimal energy flow (OEF) problems for transmission-level integrated electricity-gas systems. First, we develop a node-wise decoupling method to de-compose an OEF problem into multiple OEF subproblems. Then, we propose a hidden convexity-based method to equivalently reformulate nonconvex OEF subproblems as semi-definite programs. Th…
▽ More
We propose a hidden convexity-based method to address distributed optimal energy flow (OEF) problems for transmission-level integrated electricity-gas systems. First, we develop a node-wise decoupling method to de-compose an OEF problem into multiple OEF subproblems. Then, we propose a hidden convexity-based method to equivalently reformulate nonconvex OEF subproblems as semi-definite programs. This method differs from any ap-proximation and convexification methods that may incur infeasible solutions. Since all OEF subproblems are origi-nally convex or equivalently convexified, we adopt an ADMM to solve the hidden convexity-based distributed OEF problem with convergence analysis. Test results validate the effectiveness of the proposed method, especially in handling a large number of agents.
△ Less
Submitted 7 July, 2024;
originally announced July 2024.
-
Compressed Sensing Inspired User Acquisition for Downlink Integrated Sensing and Communication Transmissions
Authors:
Yi Song,
Fernando Pedraza,
Shuangyang Li,
Siyao Li,
Han Yu,
Giuseppe Caire
Abstract:
This paper investigates radar-assisted user acquisition for downlink multi-user multiple-input multiple-output (MIMO) transmission using Orthogonal Frequency Division Multiplexing (OFDM) signals. Specifically, we formulate a concise mathematical model for the user acquisition problem, where each user is characterized by its delay and beamspace response. Therefore, we propose a two-stage method for…
▽ More
This paper investigates radar-assisted user acquisition for downlink multi-user multiple-input multiple-output (MIMO) transmission using Orthogonal Frequency Division Multiplexing (OFDM) signals. Specifically, we formulate a concise mathematical model for the user acquisition problem, where each user is characterized by its delay and beamspace response. Therefore, we propose a two-stage method for user acquisition, where the Multiple Signal Classification (MUSIC) algorithm is adopted for delay estimation, and then a least absolute shrinkage and selection operator (LASSO) is applied for estimating the user response in the beamspace. Furthermore, we also provide a comprehensive performance analysis of the considered problem based on the pair-wise error probability (PEP). Particularly, we show that the rank and the geometric mean of non-zero eigenvalues of the squared beamspace difference matrix determines the user acquisition performance. More importantly, we reveal that simultaneously probing multiple beams outperforms concentrating power on a specific beam direction in each time slot under the power constraint, when only limited OFDM symbols are transmitted. Our numerical results confirm our conclusions and also demonstrate a promising acquisition performance of the proposed two-stage method.
△ Less
Submitted 1 July, 2024;
originally announced July 2024.
-
Safe Reinforcement Learning for Power System Control: A Review
Authors:
Peipei Yu,
Zhenyi Wang,
Hongcai Zhang,
Yonghua Song
Abstract:
The large-scale integration of intermittent renewable energy resources introduces increased uncertainty and volatility to the supply side of power systems, thereby complicating system operation and control. Recently, data-driven approaches, particularly reinforcement learning (RL), have shown significant promise in addressing complex control challenges in power systems, because RL can learn from i…
▽ More
The large-scale integration of intermittent renewable energy resources introduces increased uncertainty and volatility to the supply side of power systems, thereby complicating system operation and control. Recently, data-driven approaches, particularly reinforcement learning (RL), have shown significant promise in addressing complex control challenges in power systems, because RL can learn from interactive feedback without needing prior knowledge of the system model. However, the training process of model-free RL methods relies heavily on random decisions for exploration, which may result in ``bad" decisions that violate critical safety constraints and lead to catastrophic control outcomes. Due to the inability of RL methods to theoretically ensure decision safety in power systems, directly deploying traditional RL algorithms in the real world is deemed unacceptable. Consequently, the safety issue in RL applications, known as safe RL, has garnered considerable attention in recent years, leading to numerous important developments. This paper provides a comprehensive review of the state-of-the-art safe RL techniques and discusses how these techniques can be applied to power system control problems such as frequency regulation, voltage control, and energy management. We then present discussions on key challenges and future research directions, related to convergence and optimality, training efficiency, universality, and real-world deployment.
△ Less
Submitted 30 June, 2024;
originally announced July 2024.
-
Cost-efficient Active Illumination Camera For Hyper-spectral Reconstruction
Authors:
Yuxuan Zhang,
T. M. Sazzad,
Yangyang Song,
Spencer J. Chang,
Ritesh Chowdhry,
Tomas Mejia,
Anna Hampton,
Shelby Kucharski,
Stefan Gerber,
Barry Tillman,
Marcio F. R. Resende,
William M. Hammond,
Chris H. Wilson,
Alina Zare,
Sanjeev J. Koppal
Abstract:
Hyper-spectral imaging has recently gained increasing attention for use in different applications, including agricultural investigation, ground tracking, remote sensing and many other. However, the high cost, large physical size and complicated operation process stop hyperspectral cameras from being employed for various applications and research fields. In this paper, we introduce a cost-efficient…
▽ More
Hyper-spectral imaging has recently gained increasing attention for use in different applications, including agricultural investigation, ground tracking, remote sensing and many other. However, the high cost, large physical size and complicated operation process stop hyperspectral cameras from being employed for various applications and research fields. In this paper, we introduce a cost-efficient, compact and easy to use active illumination camera that may benefit many applications. We developed a fully functional prototype of such camera. With the hope of helping with agricultural research, we tested our camera for plant root imaging. In addition, a U-Net model for spectral reconstruction was trained by using a reference hyperspectral camera's data as ground truth and our camera's data as input. We demonstrated our camera's ability to obtain additional information over a typical RGB camera. In addition, the ability to reconstruct hyperspectral data from multi-spectral input makes our device compatible to models and algorithms developed for hyperspectral applications with no modifications required.
△ Less
Submitted 27 June, 2024;
originally announced June 2024.
-
TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers
Authors:
Yakun Song,
Zhuo Chen,
Xiaofei Wang,
Ziyang Ma,
Guanrou Yang,
Xie Chen
Abstract:
Neural codec language model (LM) has demonstrated strong capability in zero-shot text-to-speech (TTS) synthesis. However, the codec LM often suffers from limitations in inference speed and stability, due to its auto-regressive nature and implicit alignment between text and audio. In this work, to handle these challenges, we introduce a new variant of neural codec LM, namely TacoLM. Specifically, T…
▽ More
Neural codec language model (LM) has demonstrated strong capability in zero-shot text-to-speech (TTS) synthesis. However, the codec LM often suffers from limitations in inference speed and stability, due to its auto-regressive nature and implicit alignment between text and audio. In this work, to handle these challenges, we introduce a new variant of neural codec LM, namely TacoLM. Specifically, TacoLM introduces a gated attention mechanism to improve the training and inference efficiency and reduce the model size. Meanwhile, an additional gated cross-attention layer is included for each decoder layer, which improves the efficiency and content accuracy of the synthesized speech. In the evaluation of the Librispeech corpus, the proposed TacoLM achieves a better word error rate, speaker similarity, and mean opinion score, with 90% fewer parameters and 5.2 times speed up, compared with VALL-E. Demo and code is available at https://ereboas.github.io/TacoLM/.
△ Less
Submitted 22 June, 2024;
originally announced June 2024.
-
Semi-supervised variational autoencoder for cell feature extraction in multiplexed immunofluorescence images
Authors:
Piumi Sandarenu,
Julia Chen,
Iveta Slapetova,
Lois Browne,
Peter H. Graham,
Alexander Swarbrick,
Ewan K. A. Millar,
Yang Song,
Erik Meijering
Abstract:
Advancements in digital imaging technologies have sparked increased interest in using multiplexed immunofluorescence (mIF) images to visualise and identify the interactions between specific immunophenotypes with the tumour microenvironment at the cellular level. Current state-of-the-art multiplexed immunofluorescence image analysis pipelines depend on cell feature representations characterised by…
▽ More
Advancements in digital imaging technologies have sparked increased interest in using multiplexed immunofluorescence (mIF) images to visualise and identify the interactions between specific immunophenotypes with the tumour microenvironment at the cellular level. Current state-of-the-art multiplexed immunofluorescence image analysis pipelines depend on cell feature representations characterised by morphological and stain intensity-based metrics generated using simple statistical and machine learning-based tools. However, these methods are not capable of generating complex representations of cells. We propose a deep learning-based cell feature extraction model using a variational autoencoder with supervision using a latent subspace to extract cell features in mIF images. We perform cell phenotype classification using a cohort of more than 44,000 multiplexed immunofluorescence cell image patches extracted across 1,093 tissue microarray cores of breast cancer patients, to demonstrate the success of our model against current and alternative methods.
△ Less
Submitted 27 June, 2024; v1 submitted 22 June, 2024;
originally announced June 2024.
-
GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech
Authors:
Wenbin Wang,
Yang Song,
Sanjay Jha
Abstract:
This paper introduces GLOBE, a high-quality English corpus with worldwide accents, specifically designed to address the limitations of current zero-shot speaker adaptive Text-to-Speech (TTS) systems that exhibit poor generalizability in adapting to speakers with accents. Compared to commonly used English corpora, such as LibriTTS and VCTK, GLOBE is unique in its inclusion of utterances from 23,519…
▽ More
This paper introduces GLOBE, a high-quality English corpus with worldwide accents, specifically designed to address the limitations of current zero-shot speaker adaptive Text-to-Speech (TTS) systems that exhibit poor generalizability in adapting to speakers with accents. Compared to commonly used English corpora, such as LibriTTS and VCTK, GLOBE is unique in its inclusion of utterances from 23,519 speakers and covers 164 accents worldwide, along with detailed metadata for these speakers. Compared to its original corpus, i.e., Common Voice, GLOBE significantly improves the quality of the speech data through rigorous filtering and enhancement processes, while also populating all missing speaker metadata. The final curated GLOBE corpus includes 535 hours of speech data at a 24 kHz sampling rate. Our benchmark results indicate that the speaker adaptive TTS model trained on the GLOBE corpus can synthesize speech with better speaker similarity and comparable naturalness than that trained on other popular corpora. We will release GLOBE publicly after acceptance. The GLOBE dataset is available at https://globecorpus.github.io/.
△ Less
Submitted 21 June, 2024;
originally announced June 2024.
-
Ring-LWE based encrypted controller with unlimited number of recursive multiplications and effect of error growth
Authors:
Yeongjun Jang,
Joowon Lee,
Seonhong Min,
Hyesun Kwak,
Junsoo Kim,
Yongsoo Song
Abstract:
In this paper, we propose a method to encrypt linear dynamic controllers that enables an unlimited number of recursive homomorphic multiplications on a Ring Learning With Errors (Ring-LWE) based cryptosystem without bootstrapping. Unlike LWE based schemes, where a scalar error is injected during encryption for security, Ring-LWE based schemes are based on polynomial rings and inject error as a pol…
▽ More
In this paper, we propose a method to encrypt linear dynamic controllers that enables an unlimited number of recursive homomorphic multiplications on a Ring Learning With Errors (Ring-LWE) based cryptosystem without bootstrapping. Unlike LWE based schemes, where a scalar error is injected during encryption for security, Ring-LWE based schemes are based on polynomial rings and inject error as a polynomial having multiple error coefficients. Such errors accumulate under recursive homomorphic operations, and it has been studied that their effect can be suppressed by the closed-loop stability when dynamic controllers are encrypted using LWE based schemes. We show that this also holds for the proposed controller encrypted using a Ring-LWE based scheme. Specifically, only the constant terms of the error polynomials affect the control performance, and their effect can be arbitrarily bounded even when the noneffective terms diverge. Furthermore, a novel packing algorithm is applied, resulting in reduced computation time and enhanced memory efficiency. Simulation results demonstrate the effectiveness of the proposed method.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.
-
Performance Improvement of Language-Queried Audio Source Separation Based on Caption Augmentation From Large Language Models for DCASE Challenge 2024 Task 9
Authors:
Do Hyun Lee,
Yoonah Song,
Hong Kook Kim
Abstract:
We present a prompt-engineering-based text-augmentation approach applied to a language-queried audio source separation (LASS) task. To enhance the performance of LASS, the proposed approach utilizes large language models (LLMs) to generate multiple captions corresponding to each sentence of the training dataset. To this end, we first perform experiments to identify the most effective prompts for c…
▽ More
We present a prompt-engineering-based text-augmentation approach applied to a language-queried audio source separation (LASS) task. To enhance the performance of LASS, the proposed approach utilizes large language models (LLMs) to generate multiple captions corresponding to each sentence of the training dataset. To this end, we first perform experiments to identify the most effective prompts for caption augmentation with a smaller number of captions. A LASS model trained with these augmented captions demonstrates improved performance on the DCASE 2024 Task 9 validation set compared to that trained without augmentation. This study highlights the effectiveness of LLM-based caption augmentation in advancing language-queried audio source separation.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
Machine learning-based Near-field Emitter Localization via Grouped Hybrid Analog and Digital Massive MIMO Receive Array
Authors:
Yifan Li,
Feng Shu,
Jiatong Bai,
Cunhua Pan,
Yongpeng Wu,
Yaoliang Song,
Jiangzhou Wang
Abstract:
A fully-digital massive MIMO receive array is promising to meet the high-resolution requirement of near-field (NF) emitter localization, but it also results in the significantly increasing of hardware costs and algorithm complexity. In order to meet the future demand for green communication while maintaining high performance, the grouped hybrid analog and digital (HAD) structure is proposed for NF…
▽ More
A fully-digital massive MIMO receive array is promising to meet the high-resolution requirement of near-field (NF) emitter localization, but it also results in the significantly increasing of hardware costs and algorithm complexity. In order to meet the future demand for green communication while maintaining high performance, the grouped hybrid analog and digital (HAD) structure is proposed for NF DOA estimation, which divides the large-scale receive array into small-scale groups and each group contains several subarrays. Thus the NF direction-of-arrival (DOA) estimation problem is viewed as far-field (FF) within each group, and some existing methods such as MUSIC, Root-MUSIC, ESPRIT, etc., can be adopted. Then by angle calibration, a candidate position set is generated. To eliminate the phase ambiguity arising from the HAD structure and obtain the emitter position, two low-complexity clustering-based methods, minimum sample distance clustering (MSDC) and range scatter diagram (RSD) - angle scatter diagram (ASD)-based DBSCAN (RSD-ASD-DBSCAN), are proposed based on the distribution features of samples in the candidate position set. Then to further improve the localization accuracy, a model-driven regression network (RegNet) is designed, which consists of a multi-layer neural network (MLNN) for false solution elimination and a perceptron for angle fusion. Finally, the Cramer-Rao lower bound (CRLB) of NF emitter localization for the proposed grouped HAD structure is also derived. The simulation results show that the proposed methods can achieve CRLB at different SNR regions, the RegNet has great performance advantages at low SNR regions and the clustering-based methods have much lower complexity.
△ Less
Submitted 13 June, 2024;
originally announced June 2024.
-
WMAdapter: Adding WaterMark Control to Latent Diffusion Models
Authors:
Hai Ci,
Yiren Song,
Pei Yang,
Jinheng Xie,
Mike Zheng Shou
Abstract:
Watermarking is crucial for protecting the copyright of AI-generated images. We propose WMAdapter, a diffusion model watermark plugin that takes user-specified watermark information and allows for seamless watermark imprinting during the diffusion generation process. WMAdapter is efficient and robust, with a strong emphasis on high generation quality. To achieve this, we make two key designs: (1)…
▽ More
Watermarking is crucial for protecting the copyright of AI-generated images. We propose WMAdapter, a diffusion model watermark plugin that takes user-specified watermark information and allows for seamless watermark imprinting during the diffusion generation process. WMAdapter is efficient and robust, with a strong emphasis on high generation quality. To achieve this, we make two key designs: (1) We develop a contextual adapter structure that is lightweight and enables effective knowledge transfer from heavily pretrained post-hoc watermarking models. (2) We introduce an extra finetuning step and design a hybrid finetuning strategy to further improve image quality and eliminate tiny artifacts. Empirical results demonstrate that WMAdapter offers strong flexibility, exceptional image generation quality and competitive watermark robustness.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
Aligning Large Language Models with Representation Editing: A Control Perspective
Authors:
Lingkai Kong,
Haorui Wang,
Wenhao Mu,
Yuanqi Du,
Yuchen Zhuang,
Yifei Zhou,
Yue Song,
Rongzhi Zhang,
Kai Wang,
Chao Zhang
Abstract:
Aligning large language models (LLMs) with human objectives is crucial for real-world applications. However, fine-tuning LLMs for alignment often suffers from unstable training and requires substantial computing resources. Test-time alignment techniques, such as prompting and guided decoding, do not modify the underlying model, and their performance remains dependent on the original model's capabi…
▽ More
Aligning large language models (LLMs) with human objectives is crucial for real-world applications. However, fine-tuning LLMs for alignment often suffers from unstable training and requires substantial computing resources. Test-time alignment techniques, such as prompting and guided decoding, do not modify the underlying model, and their performance remains dependent on the original model's capabilities. To address these challenges, we propose aligning LLMs through representation editing. The core of our method is to view a pre-trained autoregressive LLM as a discrete-time stochastic dynamical system. To achieve alignment for specific objectives, we introduce external control signals into the state space of this language dynamical system. We train a value function directly on the hidden states according to the Bellman equation, enabling gradient-based optimization to obtain the optimal control signals at test time. Our experiments demonstrate that our method outperforms existing test-time alignment techniques while requiring significantly fewer resources compared to fine-tuning methods.
△ Less
Submitted 11 June, 2024; v1 submitted 9 June, 2024;
originally announced June 2024.
-
S-CycleGAN: Semantic Segmentation Enhanced CT-Ultrasound Image-to-Image Translation for Robotic Ultrasonography
Authors:
Yuhan Song,
Nak Young Chong
Abstract:
Ultrasound imaging is pivotal in various medical diagnoses due to its non-invasive nature and safety. In clinical practice, the accuracy and precision of ultrasound image analysis are critical. Recent advancements in deep learning are showing great capacity of processing medical images. However, the data hungry nature of deep learning and the shortage of high-quality ultrasound image training data…
▽ More
Ultrasound imaging is pivotal in various medical diagnoses due to its non-invasive nature and safety. In clinical practice, the accuracy and precision of ultrasound image analysis are critical. Recent advancements in deep learning are showing great capacity of processing medical images. However, the data hungry nature of deep learning and the shortage of high-quality ultrasound image training data suppress the development of deep learning based ultrasound analysis methods. To address these challenges, we introduce an advanced deep learning model, dubbed S-CycleGAN, which generates high-quality synthetic ultrasound images from computed tomography (CT) data. This model incorporates semantic discriminators within a CycleGAN framework to ensure that critical anatomical details are preserved during the style transfer process. The synthetic images are utilized to enhance various aspects of our development of the robot-assisted ultrasound scanning system. The data and code will be available at https://github.com/yhsong98/ct-us-i2i-translation.
△ Less
Submitted 22 August, 2024; v1 submitted 3 June, 2024;
originally announced June 2024.
-
Automatic diagnosis of cardiac magnetic resonance images based on semi-supervised learning
Authors:
Hejun Huang,
Zuguo Chen,
Yi Huang,
Guangqiang Luo,
Chaoyang Chen,
Youzhi Song
Abstract:
Cardiac magnetic resonance imaging (MRI) is a pivotal tool for assessing cardiac function. Precise segmentation of cardiac structures is imperative for accurate cardiac functional evaluation. This paper introduces a semi-supervised model for automatic segmentation of cardiac images and auxiliary diagnosis. By harnessing cardiac MRI images and necessitating only a small portion of annotated image d…
▽ More
Cardiac magnetic resonance imaging (MRI) is a pivotal tool for assessing cardiac function. Precise segmentation of cardiac structures is imperative for accurate cardiac functional evaluation. This paper introduces a semi-supervised model for automatic segmentation of cardiac images and auxiliary diagnosis. By harnessing cardiac MRI images and necessitating only a small portion of annotated image data, the model achieves fully automated, high-precision segmentation of cardiac images, extraction of features, calculation of clinical indices, and prediction of diseases. The provided segmentation results, clinical indices, and prediction outcomes can aid physicians in diagnosis, thereby serving as auxiliary diagnostic tools. Experimental results showcase that this semi-supervised model for automatic segmentation of cardiac images and auxiliary diagnosis attains high accuracy in segmentation and correctness in prediction, demonstrating substantial practical guidance and application value.
△ Less
Submitted 23 May, 2024;
originally announced May 2024.
-
USAT: A Universal Speaker-Adaptive Text-to-Speech Approach
Authors:
Wenbin Wang,
Yang Song,
Sanjay Jha
Abstract:
Conventional text-to-speech (TTS) research has predominantly focused on enhancing the quality of synthesized speech for speakers in the training dataset. The challenge of synthesizing lifelike speech for unseen, out-of-dataset speakers, especially those with limited reference data, remains a significant and unresolved problem. While zero-shot or few-shot speaker-adaptive TTS approaches have been e…
▽ More
Conventional text-to-speech (TTS) research has predominantly focused on enhancing the quality of synthesized speech for speakers in the training dataset. The challenge of synthesizing lifelike speech for unseen, out-of-dataset speakers, especially those with limited reference data, remains a significant and unresolved problem. While zero-shot or few-shot speaker-adaptive TTS approaches have been explored, they have many limitations. Zero-shot approaches tend to suffer from insufficient generalization performance to reproduce the voice of speakers with heavy accents. While few-shot methods can reproduce highly varying accents, they bring a significant storage burden and the risk of overfitting and catastrophic forgetting. In addition, prior approaches only provide either zero-shot or few-shot adaptation, constraining their utility across varied real-world scenarios with different demands. Besides, most current evaluations of speaker-adaptive TTS are conducted only on datasets of native speakers, inadvertently neglecting a vast portion of non-native speakers with diverse accents. Our proposed framework unifies both zero-shot and few-shot speaker adaptation strategies, which we term as "instant" and "fine-grained" adaptations based on their merits. To alleviate the insufficient generalization performance observed in zero-shot speaker adaptation, we designed two innovative discriminators and introduced a memory mechanism for the speech decoder. To prevent catastrophic forgetting and reduce storage implications for few-shot speaker adaptation, we designed two adapters and a unique adaptation procedure.
△ Less
Submitted 28 April, 2024;
originally announced April 2024.
-
Distributed Matrix Pencil Formulations for Prescribed-Time Leader-Following Consensus of MASs with Unknown Sensor Sensitivity
Authors:
Hefu Ye,
Changyun Wen,
Yongduan Song
Abstract:
In this paper, we address the problem of prescribed-time leader-following consensus of heterogeneous multi-agent systems (MASs) in the presence of unknown sensor sensitivity. Under a connected undirected topology, we propose a time-varying dual observer/controller design framework that makes use of regular local and inaccurate feedback to achieve consensus tracking within a prescribed time. In par…
▽ More
In this paper, we address the problem of prescribed-time leader-following consensus of heterogeneous multi-agent systems (MASs) in the presence of unknown sensor sensitivity. Under a connected undirected topology, we propose a time-varying dual observer/controller design framework that makes use of regular local and inaccurate feedback to achieve consensus tracking within a prescribed time. In particular, the developed analysis framework is applicable to MASs equipped with sensors of different sensitivities. One of the design innovations involves constructing a distributed matrix pencil formulation based on worst-case sensors, yielding control parameters with sufficient robustness yet relatively low conservatism. Another novelty is the construction of the control gains, which consists of the product of a proportional coefficient obtained from the matrix pencil formulation and a classic time-varying function that grows to infinity or a novel bounded time-varying function. Furthermore, it is possible to extend the prescribed-time distributed protocol to infinite time domain by introducing the bounded time-varying gain technique without sacrificing the ultimate control accuracy, and the corresponding technical proof is comprehensive. The effectiveness of the method is demonstrated through a group of 5 single-link robot manipulators.
△ Less
Submitted 25 April, 2024;
originally announced April 2024.
-
Self-Adjusting Prescribed Performance Control for Nonlinear Systems with Input Saturation
Authors:
Zhuwu Shao,
Yujuan Wang,
Huanyu Yang,
Yongduan Song
Abstract:
Among the existing works on enhancing system performance via prescribed performance functions (PPFs), the decay rates of PPFs need to be predetermined by the designer, directly affecting the convergence time of the closed-loop system. However, if only considering accelerating the system convergence by selecting a big decay rate of the performance function, it may lead to the severe consequence of…
▽ More
Among the existing works on enhancing system performance via prescribed performance functions (PPFs), the decay rates of PPFs need to be predetermined by the designer, directly affecting the convergence time of the closed-loop system. However, if only considering accelerating the system convergence by selecting a big decay rate of the performance function, it may lead to the severe consequence of closed-loop system failure when considering the prevalent actuator saturation in practical scenarios. To address this issue, this work proposes a control scheme that can flexibly self-adjust the convergence rates of the performance functions (PFs), aiming to achieve faster steady-state convergence while avoiding the risk of error violation beyond the PFs' envelopes, which may arise from input saturation and improper decay rate selection in traditional prescribed performance control (PPC) methods. Specifically, a performance index function (PIF) is introduced as a reference criterion, based on which the self-adjusting rates of the PFs are designed for different cases, exhibiting the following appealing features: 1) It can eliminate the need to prespecify the initial values of the PFs. In addition, it can also accommodate arbitrary magnitudes of initial errors while avoiding excessive initial control efforts. 2) Considering actuator saturation, this method can not only reduce the decay rates of the PFs when necessary to avoid violation of the PFs, but also increase the decay rates to accelerate system convergence when there is remaining control capacity. Finally, several numerical simulations are conducted to confirm the effectiveness and superiority of the proposed method.
△ Less
Submitted 21 April, 2024;
originally announced April 2024.
-
BERT: Accelerating Vital Signs Measurement for Bioradar with An Efficient Recursive Technique
Authors:
Chengyao Tang,
Yongpeng Dai,
Zhi Li,
Yongping Song,
Fulai Liang,
Tian Jin
Abstract:
Recent years have witnessed the great advance of bioradar system in smart sensing of vital signs (VS) for human healthcare monitoring. As an important part of VS sensing process, VS measurement aims to capture the chest wall micromotion induced by the human respiratory and cardiac activities. Unfortunately, the existing VS measurement methods using bioradar have encountered bottlenecks in making a…
▽ More
Recent years have witnessed the great advance of bioradar system in smart sensing of vital signs (VS) for human healthcare monitoring. As an important part of VS sensing process, VS measurement aims to capture the chest wall micromotion induced by the human respiratory and cardiac activities. Unfortunately, the existing VS measurement methods using bioradar have encountered bottlenecks in making a trade-off between time cost and measurement accuracy. To break this bottleneck, this letter proposes an efficient recursive technique (BERT) heuristically, based on the observation that the features of bioradar VS meet the conditions of Markov model. Extensive experimental results validate that BERT measurement yields lower time costs, competitive estimates of heart rate, breathing rate, and heart rate variability. Our BERT method is promising us a new and superior option to measure VS for bioradar. This work seeks not only to solve the current issue of how to accelerate VS measurement with an acceptable accuracy, but also to inspire creative new ideas that spur further advances in this promising field in the future.
△ Less
Submitted 20 April, 2024;
originally announced April 2024.
-
A Gray-Box Stability Analysis Mechanism for Power Electronic Converters
Authors:
Rui Kong,
Subham Sahoo,
Yubo Song,
Frede Blaabjerg
Abstract:
This paper proposes a gray-box stability analysis mechanism based on data-driven dynamic mode decomposition (DMD) for commercial grid-tied power electronics converters with limited information on its control parameters and topology. By fusing the underlying physical constraints of the state equations into data snapshots, the system dynamic state matrix and input matrix are simultaneously approxima…
▽ More
This paper proposes a gray-box stability analysis mechanism based on data-driven dynamic mode decomposition (DMD) for commercial grid-tied power electronics converters with limited information on its control parameters and topology. By fusing the underlying physical constraints of the state equations into data snapshots, the system dynamic state matrix and input matrix are simultaneously approximated to identify the dominant system dynamic modes and eigenvalues using the DMD with control (DMDc) algorithm. While retaining the advantages of eliminating the need for intrinsic controller information, the proposed gray-box method establishes higher accuracy and interpretable outcomes over the conventional DMD method. Finally, under experimental conditions of a low-frequency oscillation scenario in electrified railways featuring a single-phase converter, the proposed gray-box DMDc is verified to identify the dominant eigenvalues more accurately.
△ Less
Submitted 15 April, 2024;
originally announced April 2024.
-
Net 835-Gb/s/λ Carrier- and LO-Free 100-km Transmission Using Channel-Aware Phase Retrieval Reception
Authors:
Hanzi Huang,
Haoshuo Chen,
Qian Hu,
Di Che,
Yetian Huang,
Brian Stern,
Nicolas K. Fontaine,
Mikael Mazur,
Lauren Dallachiesa,
Roland Ryf,
Zhengxuan Li,
Yingxiong Song
Abstract:
We experimentally demonstrate the first carrier- and LO-free 800G/λ receiver enabling direct compatibility with standard coherent transmitters via phase retrieval, achieving net 835-Gb/s transmission over 100-km SMF and record 8.27-b/s/Hz net optical spectral efficiency.
We experimentally demonstrate the first carrier- and LO-free 800G/λ receiver enabling direct compatibility with standard coherent transmitters via phase retrieval, achieving net 835-Gb/s transmission over 100-km SMF and record 8.27-b/s/Hz net optical spectral efficiency.
△ Less
Submitted 10 April, 2024;
originally announced April 2024.
-
IDF-CR: Iterative Diffusion Process for Divide-and-Conquer Cloud Removal in Remote-sensing Images
Authors:
Meilin Wang,
Yexing Song,
Pengxu Wei,
Xiaoyu Xian,
Yukai Shi,
Liang Lin
Abstract:
Deep learning technologies have demonstrated their effectiveness in removing cloud cover from optical remote-sensing images. Convolutional Neural Networks (CNNs) exert dominance in the cloud removal tasks. However, constrained by the inherent limitations of convolutional operations, CNNs can address only a modest fraction of cloud occlusion. In recent years, diffusion models have achieved state-of…
▽ More
Deep learning technologies have demonstrated their effectiveness in removing cloud cover from optical remote-sensing images. Convolutional Neural Networks (CNNs) exert dominance in the cloud removal tasks. However, constrained by the inherent limitations of convolutional operations, CNNs can address only a modest fraction of cloud occlusion. In recent years, diffusion models have achieved state-of-the-art (SOTA) proficiency in image generation and reconstruction due to their formidable generative capabilities. Inspired by the rapid development of diffusion models, we first present an iterative diffusion process for cloud removal (IDF-CR), which exhibits a strong generative capabilities to achieve component divide-and-conquer cloud removal. IDF-CR consists of a pixel space cloud removal module (Pixel-CR) and a latent space iterative noise diffusion network (IND). Specifically, IDF-CR is divided into two-stage models that address pixel space and latent space. The two-stage model facilitates a strategic transition from preliminary cloud reduction to meticulous detail refinement. In the pixel space stage, Pixel-CR initiates the processing of cloudy images, yielding a suboptimal cloud removal prior to providing the diffusion model with prior cloud removal knowledge. In the latent space stage, the diffusion model transforms low-quality cloud removal into high-quality clean output. We refine the Stable Diffusion by implementing ControlNet. In addition, an unsupervised iterative noise refinement (INR) module is introduced for diffusion model to optimize the distribution of the predicted noise, thereby enhancing advanced detail recovery. Our model performs best with other SOTA methods, including image reconstruction and optical remote-sensing cloud removal on the optical remote-sensing datasets.
△ Less
Submitted 18 March, 2024;
originally announced March 2024.
-
Learning at the Speed of Wireless: Online Real-Time Learning for AI-Enabled MIMO in NextG
Authors:
Jiarui Xu,
Shashank Jere,
Yifei Song,
Yi-Hung Kao,
Lizhong Zheng,
Lingjia Liu
Abstract:
Integration of artificial intelligence (AI) and machine learning (ML) into the air interface has been envisioned as a key technology for next-generation (NextG) cellular networks. At the air interface, multiple-input multiple-output (MIMO) and its variants such as multi-user MIMO (MU-MIMO) and massive/full-dimension MIMO have been key enablers across successive generations of cellular networks wit…
▽ More
Integration of artificial intelligence (AI) and machine learning (ML) into the air interface has been envisioned as a key technology for next-generation (NextG) cellular networks. At the air interface, multiple-input multiple-output (MIMO) and its variants such as multi-user MIMO (MU-MIMO) and massive/full-dimension MIMO have been key enablers across successive generations of cellular networks with evolving complexity and design challenges. Initiating active investigation into leveraging AI/ML tools to address these challenges for MIMO becomes a critical step towards an AI-enabled NextG air interface. At the NextG air interface, the underlying wireless environment will be extremely dynamic with operation adaptations performed on a sub-millisecond basis by MIMO operations such as MU-MIMO scheduling and rank/link adaptation. Given the enormously large number of operation adaptation possibilities, we contend that online real-time AI/ML-based approaches constitute a promising paradigm. To this end, we outline the inherent challenges and offer insights into the design of such online real-time AI/ML-based solutions for MIMO operations. An online real-time AI/ML-based method for MIMO-OFDM channel estimation is then presented, serving as a potential roadmap for developing similar techniques across various MIMO operations in NextG.
△ Less
Submitted 4 March, 2024;
originally announced March 2024.
-
A Holistic Power Optimization Approach for Microgrid Control Based on Deep Reinforcement Learning
Authors:
Fulong Yao,
Wanqing Zhao,
Matthew Forshaw,
Yang Song
Abstract:
The global energy landscape is undergoing a transformation towards decarbonization, sustainability, and cost-efficiency. In this transition, microgrid systems integrated with renewable energy sources (RES) and energy storage systems (ESS) have emerged as a crucial component. However, optimizing the operational control of such an integrated energy system lacks a holistic view of multiple environmen…
▽ More
The global energy landscape is undergoing a transformation towards decarbonization, sustainability, and cost-efficiency. In this transition, microgrid systems integrated with renewable energy sources (RES) and energy storage systems (ESS) have emerged as a crucial component. However, optimizing the operational control of such an integrated energy system lacks a holistic view of multiple environmental, infrastructural and economic considerations, not to mention the need to factor in the uncertainties from both the supply and demand. This paper presents a holistic datadriven power optimization approach based on deep reinforcement learning (DRL) for microgrid control considering the multiple needs of decarbonization, sustainability and cost-efficiency. First, two data-driven control schemes, namely the prediction-based (PB) and prediction-free (PF) schemes, are devised to formulate the control problem within a Markov decision process (MDP). Second, a multivariate objective (reward) function is designed to account for the market profits, carbon emissions, peak load, and battery degradation of the microgrid system. Third, we develop a Double Dueling Deep Q Network (D3QN) architecture to optimize the power flows for real-time energy management and determine charging/discharging strategies of ESS. Finally, extensive simulations are conducted to demonstrate the effectiveness and superiority of the proposed approach through a comparative analysis. The results and analysis also suggest the respective circumstances for using the two control schemes in practical implementations with uncertainties.
△ Less
Submitted 1 March, 2024;
originally announced March 2024.
-
Neuromorphic Event-Driven Semantic Communication in Microgrids
Authors:
Xiaoguang Diao,
Yubo Song,
Subham Sahoo,
Yuan Li
Abstract:
Synergies between advanced communications, computing and artificial intelligence are unraveling new directions of coordinated operation and resiliency in microgrids. On one hand, coordination among sources is facilitated by distributed, privacy-minded processing at multiple locations, whereas on the other hand, it also creates exogenous data arrival paths for adversaries that can lead to cyber-phy…
▽ More
Synergies between advanced communications, computing and artificial intelligence are unraveling new directions of coordinated operation and resiliency in microgrids. On one hand, coordination among sources is facilitated by distributed, privacy-minded processing at multiple locations, whereas on the other hand, it also creates exogenous data arrival paths for adversaries that can lead to cyber-physical attacks amongst other reliability issues in the communication layer. This long-standing problem necessitates new intrinsic ways of exchanging information between converters through power lines to optimize the system's control performance. Going beyond the existing power and data co-transfer technologies that are limited by efficiency and scalability concerns, this paper proposes neuromorphic learning to implant communicative features using spiking neural networks (SNNs) at each node, which is trained collaboratively in an online manner simply using the power exchanges between the nodes. As opposed to the conventional neuromorphic sensors that operate with spiking signals, we employ an event-driven selective process to collect sparse data for training of SNNs. Finally, its multi-fold effectiveness and reliable performance is validated under simulation conditions with different microgrid topologies and components to establish a new direction in the sense-actuate-compute cycle for power electronic dominated grids and microgrids.
△ Less
Submitted 28 February, 2024;
originally announced February 2024.
-
Troublemaker Learning for Low-Light Image Enhancement
Authors:
Yinghao Song,
Zhiyuan Cao,
Wanhong Xiang,
Sifan Long,
Bo Yang,
Hongwei Ge,
Yanchun Liang,
Chunguo Wu
Abstract:
Low-light image enhancement (LLIE) restores the color and brightness of underexposed images. Supervised methods suffer from high costs in collecting low/normal-light image pairs. Unsupervised methods invest substantial effort in crafting complex loss functions. We address these two challenges through the proposed TroubleMaker Learning (TML) strategy, which employs normal-light images as inputs for…
▽ More
Low-light image enhancement (LLIE) restores the color and brightness of underexposed images. Supervised methods suffer from high costs in collecting low/normal-light image pairs. Unsupervised methods invest substantial effort in crafting complex loss functions. We address these two challenges through the proposed TroubleMaker Learning (TML) strategy, which employs normal-light images as inputs for training. TML is simple: we first dim the input and then increase its brightness. TML is based on two core components. First, the troublemaker model (TM) constructs pseudo low-light images from normal images to relieve the cost of pairwise data. Second, the predicting model (PM) enhances the brightness of pseudo low-light images. Additionally, we incorporate an enhancing model (EM) to further improve the visual performance of PM outputs. Moreover, in LLIE tasks, characterizing global element correlations is important because more information on the same object can be captured. CNN cannot achieve this well, and self-attention has high time complexity. Accordingly, we propose Global Dynamic Convolution (GDC) with O(n) time complexity, which essentially imitates the partial calculation process of self-attention to formulate elementwise correlations. Based on the GDC module, we build the UGDC model. Extensive quantitative and qualitative experiments demonstrate that UGDC trained with TML can achieve competitive performance against state-of-the-art approaches on public datasets. The code is available at https://github.com/Rainbowman0/TML_LLIE.
△ Less
Submitted 2 March, 2024; v1 submitted 6 February, 2024;
originally announced February 2024.
-
ToMoBrush: Exploring Dental Health Sensing using a Sonic Toothbrush
Authors:
Kuang Yuan,
Mohamed Ibrahim,
Yiwen Song,
Guoxiang Deng,
Suvendra Vijayan,
Robert Nerone,
Akshay Gadre,
Swarun Kumar
Abstract:
Early detection of dental disease is crucial to prevent adverse outcomes. Today, dental X-rays are currently the most accurate gold standard for dental disease detection. Unfortunately, regular X-ray exam is still a privilege for billions of people around the world. In this paper, we ask: "Can we develop a low-cost sensing system that enables dental self-examination in the comfort of one's home?"…
▽ More
Early detection of dental disease is crucial to prevent adverse outcomes. Today, dental X-rays are currently the most accurate gold standard for dental disease detection. Unfortunately, regular X-ray exam is still a privilege for billions of people around the world. In this paper, we ask: "Can we develop a low-cost sensing system that enables dental self-examination in the comfort of one's home?"
This paper presents ToMoBrush, a dental health sensing system that explores using off-the-shelf sonic toothbrushes for dental condition detection. Our solution leverages the fact that a sonic toothbrush produces rich acoustic signals when in contact with teeth, which contain important information about each tooth's status. ToMoBrush extracts tooth resonance signatures from the acoustic signals to characterize varied dental health conditions of the teeth. We evaluate ToMoBrush on 19 participants and dental-standard models for detecting common dental problems including caries, calculus, and food impaction, achieving a detection ROC-AUC of 0.90, 0.83, and 0.88 respectively. Interviews with dental experts validate ToMoBrush's potential in enhancing at-home dental healthcare.
△ Less
Submitted 2 February, 2024;
originally announced February 2024.
-
GarchingSim: An Autonomous Driving Simulator with Photorealistic Scenes and Minimalist Workflow
Authors:
Liguo Zhou,
Yinglei Song,
Yichao Gao,
Zhou Yu,
Michael Sodamin,
Hongshen Liu,
Liang Ma,
Lian Liu,
Hao Liu,
Yang Liu,
Haichuan Li,
Guang Chen,
Alois Knoll
Abstract:
Conducting real road testing for autonomous driving algorithms can be expensive and sometimes impractical, particularly for small startups and research institutes. Thus, simulation becomes an important method for evaluating these algorithms. However, the availability of free and open-source simulators is limited, and the installation and configuration process can be daunting for beginners and inte…
▽ More
Conducting real road testing for autonomous driving algorithms can be expensive and sometimes impractical, particularly for small startups and research institutes. Thus, simulation becomes an important method for evaluating these algorithms. However, the availability of free and open-source simulators is limited, and the installation and configuration process can be daunting for beginners and interdisciplinary researchers. We introduce an autonomous driving simulator with photorealistic scenes, meanwhile keeping a user-friendly workflow. The simulator is able to communicate with external algorithms through ROS2 or Socket.IO, making it compatible with existing software stacks. Furthermore, we implement a highly accurate vehicle dynamics model within the simulator to enhance the realism of the vehicle's physical effects. The simulator is able to serve various functions, including generating synthetic data and driving with machine learning-based algorithms. Moreover, we prioritize simplicity in the deployment process, ensuring that beginners find it approachable and user-friendly.
△ Less
Submitted 30 January, 2024; v1 submitted 28 January, 2024;
originally announced January 2024.
-
Hybrid Message Passing-Based Detectors for Uplink Grant-Free NOMA Systems
Authors:
Yi Song,
Yiwen Zhu,
Kun Chen-Hu,
Xinhua Lu,
Peng Sun,
Zhongyong Wang
Abstract:
This paper studies improving the detector performance which considers the activity state (AS) temporal correlation of the user equipments (UEs) in the time domain under the uplink grant-free non-orthogonal multiple access (GF-NOMA) system. The Bernoulli Gaussian-Markov chain (BG-MC) probability model is used for exploiting both the sparsity and slow change characteristic of the AS of the UE. The G…
▽ More
This paper studies improving the detector performance which considers the activity state (AS) temporal correlation of the user equipments (UEs) in the time domain under the uplink grant-free non-orthogonal multiple access (GF-NOMA) system. The Bernoulli Gaussian-Markov chain (BG-MC) probability model is used for exploiting both the sparsity and slow change characteristic of the AS of the UE. The GAMP Bernoulli Gaussian-Markov chain (GAMP-BG-MC) algorithm is proposed to improve the detector performance, which can utilize the bidirectional message passing between the neighboring time slots to fully exploit the temporally-correlated AS of the UE. Furthermore, the parameters of the BG-MC model can be updated adaptively during the estimation procedure with unknown system statistics. Simulation results show that the proposed algorithm can improve the detection accuracy compared with the existing methods while keeping the same order complexity.
△ Less
Submitted 25 January, 2024;
originally announced January 2024.
-
ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering
Authors:
Yakun Song,
Zhuo Chen,
Xiaofei Wang,
Ziyang Ma,
Xie Chen
Abstract:
The language model (LM) approach based on acoustic and linguistic prompts, such as VALL-E, has achieved remarkable progress in the field of zero-shot audio generation. However, existing methods still have some limitations: 1) repetitions, transpositions, and omissions in the output synthesized speech due to limited alignment constraints between audio and phoneme tokens; 2) challenges of fine-grain…
▽ More
The language model (LM) approach based on acoustic and linguistic prompts, such as VALL-E, has achieved remarkable progress in the field of zero-shot audio generation. However, existing methods still have some limitations: 1) repetitions, transpositions, and omissions in the output synthesized speech due to limited alignment constraints between audio and phoneme tokens; 2) challenges of fine-grained control over the synthesized speech with autoregressive (AR) language model; 3) infinite silence generation due to the nature of AR-based decoding, especially under the greedy strategy. To alleviate these issues, we propose ELLA-V, a simple but efficient LM-based zero-shot text-to-speech (TTS) framework, which enables fine-grained control over synthesized audio at the phoneme level. The key to ELLA-V is interleaving sequences of acoustic and phoneme tokens, where phoneme tokens appear ahead of the corresponding acoustic tokens. The experimental findings reveal that our model outperforms VALL-E in terms of accuracy and delivers more stable results using both greedy and sampling-based decoding strategies. The code of ELLA-V will be open-sourced after cleanups. Audio samples are available at https://ereboas.github.io/ELLAV/.
△ Less
Submitted 14 January, 2024;
originally announced January 2024.
-
Multi-level graph learning for audio event classification and human-perceived annoyance rating prediction
Authors:
Yuanbo Hou,
Qiaoqiao Ren,
Siyang Song,
Yuxin Song,
Wenwu Wang,
Dick Botteldooren
Abstract:
WHO's report on environmental noise estimates that 22 M people suffer from chronic annoyance related to noise caused by audio events (AEs) from various sources. Annoyance may lead to health issues and adverse effects on metabolic and cognitive systems. In cities, monitoring noise levels does not provide insights into noticeable AEs, let alone their relations to annoyance. To create annoyance-relat…
▽ More
WHO's report on environmental noise estimates that 22 M people suffer from chronic annoyance related to noise caused by audio events (AEs) from various sources. Annoyance may lead to health issues and adverse effects on metabolic and cognitive systems. In cities, monitoring noise levels does not provide insights into noticeable AEs, let alone their relations to annoyance. To create annoyance-related monitoring, this paper proposes a graph-based model to identify AEs in a soundscape, and explore relations between diverse AEs and human-perceived annoyance rating (AR). Specifically, this paper proposes a lightweight multi-level graph learning (MLGL) based on local and global semantic graphs to simultaneously perform audio event classification (AEC) and human annoyance rating prediction (ARP). Experiments show that: 1) MLGL with 4.1 M parameters improves AEC and ARP results by using semantic node information in local and global context aware graphs; 2) MLGL captures relations between coarse and fine-grained AEs and AR well; 3) Statistical analysis of MLGL results shows that some AEs from different sources significantly correlate with AR, which is consistent with previous research on human perception of these sound sources.
△ Less
Submitted 15 December, 2023;
originally announced December 2023.
-
Implementing Digital Twin in Field-Deployed Optical Networks: Uncertain Factors, Operational Guidance, and Field-Trial Demonstration
Authors:
Yuchen Song,
Min Zhang,
Yao Zhang,
Yan Shi,
Shikui Shen,
Bingli Guo,
Shanguo Huang,
Danshi Wang
Abstract:
Digital twin has revolutionized optical communication networks by enabling their full life-cycle management, including design, troubleshooting, optimization, upgrade, and prediction. While extensive literature exists on frameworks, standards, and applications of digital twin, there is a pressing need in implementing digital twin in field-deployed optical networks operating in real-world environmen…
▽ More
Digital twin has revolutionized optical communication networks by enabling their full life-cycle management, including design, troubleshooting, optimization, upgrade, and prediction. While extensive literature exists on frameworks, standards, and applications of digital twin, there is a pressing need in implementing digital twin in field-deployed optical networks operating in real-world environments, as opposed to controlled laboratory settings. This paper addresses this challenge by examining the uncertain factors behind the inaccuracy of digital twin in field-deployed optical networks from three main challenges and proposing operational guidance for implementing accurate digital twin in field-deployed optical networks. Through the proposed guidance, we demonstrate the effective implementation of digital twin in a field-trial C+L-band optical transmission link, showcasing its capabilities in performance recovery in a fiber cut scenario.
△ Less
Submitted 6 December, 2023;
originally announced December 2023.
-
High-performance cVEP-BCI under minimal calibration
Authors:
Yining Miao,
Nanlin Shi,
Changxing Huang,
Yonghao Song,
Xiaogang Chen,
Yijun Wang,
Xiaorong Gao
Abstract:
The ultimate goal of brain-computer interfaces (BCIs) based on visual modulation paradigms is to achieve high-speed performance without the burden of extensive calibration. Code-modulated visual evoked potential-based BCIs (cVEP-BCIs) modulated by broadband white noise (WN) offer various advantages, including increased communication speed, expanded encoding target capabilities, and enhanced coding…
▽ More
The ultimate goal of brain-computer interfaces (BCIs) based on visual modulation paradigms is to achieve high-speed performance without the burden of extensive calibration. Code-modulated visual evoked potential-based BCIs (cVEP-BCIs) modulated by broadband white noise (WN) offer various advantages, including increased communication speed, expanded encoding target capabilities, and enhanced coding flexibility. However, the complexity of the spatial-temporal patterns under broadband stimuli necessitates extensive calibration for effective target identification in cVEP-BCIs. Consequently, the information transfer rate (ITR) of cVEP-BCI under limited calibration usually stays around 100 bits per minute (bpm), significantly lagging behind state-of-the-art steady-state visual evoked potential-based BCIs (SSVEP-BCIs), which achieve rates above 200 bpm. To enhance the performance of cVEP-BCIs with minimal calibration, we devised an efficient calibration stage involving a brief single-target flickering, lasting less than a minute, to extract generalizable spatial-temporal patterns. Leveraging the calibration data, we developed two complementary methods to construct cVEP temporal patterns: the linear modeling method based on the stimulus sequence and the transfer learning techniques using cross-subject data. As a result, we achieved the highest ITR of 250 bpm under a minute of calibration, which has been shown to be comparable to the state-of-the-art SSVEP paradigms. In summary, our work significantly improved the cVEP performance under few-shot learning, which is expected to expand the practicality and usability of cVEP-BCIs.
△ Less
Submitted 20 November, 2023;
originally announced November 2023.
-
CV-Attention UNet: Attention-based UNet for 3D Cerebrovascular Segmentation of Enhanced TOF-MRA Images
Authors:
Syed Farhan Abbas,
Nguyen Thanh Duc,
Yoonguu Song,
Kyungwon Kim,
Ekta Srivastava,
Boreom Lee
Abstract:
Due to the lack of automated methods, to diagnose cerebrovascular disease, time-of-flight magnetic resonance angiography (TOF-MRA) is assessed visually, making it time-consuming. The commonly used encoder-decoder architectures for cerebrovascular segmentation utilize redundant features, eventually leading to the extraction of low-level features multiple times. Additionally, convolutional neural ne…
▽ More
Due to the lack of automated methods, to diagnose cerebrovascular disease, time-of-flight magnetic resonance angiography (TOF-MRA) is assessed visually, making it time-consuming. The commonly used encoder-decoder architectures for cerebrovascular segmentation utilize redundant features, eventually leading to the extraction of low-level features multiple times. Additionally, convolutional neural networks (CNNs) suffer from performance degradation when the batch size is small, and deeper networks experience the vanishing gradient problem. Methods: In this paper, we attempt to solve these limitations and propose the 3D cerebrovascular attention UNet method, named CV-AttentionUNet, for precise extraction of brain vessel images. We proposed a sequence of preprocessing techniques followed by deeply supervised UNet to improve the accuracy of segmentation of the brain vessels leading to a stroke. To combine the low and high semantics, we applied the attention mechanism. This mechanism focuses on relevant associations and neglects irrelevant anatomical information. Furthermore, the inclusion of deep supervision incorporates different levels of features that prove to be beneficial for network convergence. Results: We demonstrate the efficiency of the proposed method by cross-validating with an unlabeled dataset, which was further labeled by us. We believe that the novelty of this algorithm lies in its ability to perform well on both labeled and unlabeled data with image processing-based enhancement. The results indicate that our method performed better than the existing state-of-the-art methods on the TubeTK dataset. Conclusion: The proposed method will help in accurate segmentation of cerebrovascular structure leading to stroke
△ Less
Submitted 19 June, 2024; v1 submitted 16 November, 2023;
originally announced November 2023.
-
A Novel Tree Model-based DNN to Achieve a High-Resolution DOA Estimation via Massive MIMO receive array
Authors:
Yifan Li,
Feng Shu,
Jun Zou,
Wei Gao,
Yaoliang Song,
Jiangzhou Wang
Abstract:
To satisfy the high-resolution requirements of direction-of-arrival (DOA) estimation, conventional deep neural network (DNN)-based methods using grid idea need to significantly increase the number of output classifications and also produce a huge high model complexity. To address this problem, a multi-level tree-based DNN model (TDNN) is proposed as an alternative, where each level takes small-sca…
▽ More
To satisfy the high-resolution requirements of direction-of-arrival (DOA) estimation, conventional deep neural network (DNN)-based methods using grid idea need to significantly increase the number of output classifications and also produce a huge high model complexity. To address this problem, a multi-level tree-based DNN model (TDNN) is proposed as an alternative, where each level takes small-scale multi-layer neural networks (MLNNs) as nodes to divide the target angular interval into multiple sub-intervals, and each output class is associated to a MLNN at the next level. Then the number of MLNNs is gradually increasing from the first level to the last level, and so increasing the depth of tree will dramatically raise the number of output classes to improve the estimation accuracy. More importantly, this network is extended to make a multi-emitter DOA estimation. Simulation results show that the proposed TDNN performs much better than conventional DNN and root-MUSIC at extremely low signal-to-noise ratio (SNR), and can achieve Cramer-Rao lower bound (CRLB). Additionally, in the multi-emitter scenario, the proposed Q-TDNN has also made a substantial performance enhancement over DNN and Root-MUSIC, and this gain grows as the number of emitters increases.
△ Less
Submitted 12 March, 2024; v1 submitted 15 November, 2023;
originally announced November 2023.
-
SponTTS: modeling and transferring spontaneous style for TTS
Authors:
Hanzhao Li,
Xinfa Zhu,
Liumeng Xue,
Yang Song,
Yunlin Chen,
Lei Xie
Abstract:
Spontaneous speaking style exhibits notable differences from other speaking styles due to various spontaneous phenomena (e.g., filled pauses, prolongation) and substantial prosody variation (e.g., diverse pitch and duration variation, occasional non-verbal speech like a smile), posing challenges to modeling and prediction of spontaneous style. Moreover, the limitation of high-quality spontaneous d…
▽ More
Spontaneous speaking style exhibits notable differences from other speaking styles due to various spontaneous phenomena (e.g., filled pauses, prolongation) and substantial prosody variation (e.g., diverse pitch and duration variation, occasional non-verbal speech like a smile), posing challenges to modeling and prediction of spontaneous style. Moreover, the limitation of high-quality spontaneous data constrains spontaneous speech generation for speakers without spontaneous data. To address these problems, we propose SponTTS, a two-stage approach based on neural bottleneck (BN) features to model and transfer spontaneous style for TTS. In the first stage, we adopt a Conditional Variational Autoencoder (CVAE) to capture spontaneous prosody from a BN feature and involve the spontaneous phenomena by the constraint of spontaneous phenomena embedding prediction loss. Besides, we introduce a flow-based predictor to predict a latent spontaneous style representation from the text, which enriches the prosody and context-specific spontaneous phenomena during inference. In the second stage, we adopt a VITS-like module to transfer the spontaneous style learned in the first stage to the target speakers. Experiments demonstrate that SponTTS is effective in modeling spontaneous style and transferring the style to the target speakers, generating spontaneous speech with high naturalness, expressiveness, and speaker similarity. The zero-shot spontaneous style TTS test further verifies the generalization and robustness of SponTTS in generating spontaneous speech for unseen speakers.
△ Less
Submitted 8 January, 2024; v1 submitted 13 November, 2023;
originally announced November 2023.
-
Causality-based Cost Allocation for Peer-to-Peer Energy Trading in Distribution System
Authors:
Hyun Joong Kim,
Yong Hyun Song,
Jip Kim
Abstract:
While peer-to-peer energy trading has the potential to harness the capabilities of small-scale energy resources, a peer-matching process often overlooks power grid conditions, yielding increased losses, line congestion, and voltage problems. This imposes a great challenge on the distribution system operator (DSO), which can eventually limit peer-to-peer energy trading. To align the peer-matching p…
▽ More
While peer-to-peer energy trading has the potential to harness the capabilities of small-scale energy resources, a peer-matching process often overlooks power grid conditions, yielding increased losses, line congestion, and voltage problems. This imposes a great challenge on the distribution system operator (DSO), which can eventually limit peer-to-peer energy trading. To align the peer-matching process with the physical grid conditions, this paper proposes a cost causality-based network cost allocation method and the grid-aware peer-matching process. Building on the cost causality principle, the proposed model utilizes the network cost (loss, congestion, and voltage) as a signal to encourage peers to adjust their preferences ensuring that matches are more in line with grid conditions, leading to enhanced social welfare. Additionally, this paper presents mathematical proof showing the superiority of the causality-based cost allocation over existing methods.
△ Less
Submitted 20 February, 2024; v1 submitted 11 October, 2023;
originally announced October 2023.
-
Distortion-Aware Phase Retrieval Receiver for High-Order QAM Transmission with Carrierless Intensity-Only Measurements
Authors:
Hanzi Huang,
Haoshuo Chen,
Qi Gao,
Yetian Huang,
Nicolas K. Fontaine,
Mikael Mazur,
Lauren Dallachiesa,
Roland Ryf,
Zhengxuan Li,
Yingxiong Song
Abstract:
We experimentally investigate transmitting high-order quadrature amplitude modulation (QAM) signals with carrierless and intensity-only measurements with phase retrieval (PR) receiving techniques. The intensity errors during measurement, including noise and distortions, are found to be a limiting factor for the precise convergence of the PR algorithm. To improve the PR reconstruction accuracy, we…
▽ More
We experimentally investigate transmitting high-order quadrature amplitude modulation (QAM) signals with carrierless and intensity-only measurements with phase retrieval (PR) receiving techniques. The intensity errors during measurement, including noise and distortions, are found to be a limiting factor for the precise convergence of the PR algorithm. To improve the PR reconstruction accuracy, we propose a distortion-aware PR scheme comprising both training and reconstruction stages. By estimating and emulating the distortion caused by various channel impairments, the proposed scheme enables enhanced agreement between the estimated and measured amplitudes throughout the PR iteration, thus resulting in improved reconstruction performance to support high-order QAM transmission. With the aid of proposed techniques, we experimentally demonstrate 50-GBaud 16QAM and 32QAM signals transmitting through a standard single-mode optical fiber (SSMF) span of 40 and 80 km, and achieve bit error rates (BERs) below the 6.25% hard decision (HD)-forward error correction (FEC) and 25% soft decision (SD)-FEC thresholds for the two modulation formats, respectively. By tuning the pilot symbol ratio and applying concatenated coding, we also demonstrate that a post-FEC data rate of up to 140 Gb/s can be achieved for both distances at an optimal pilot symbol ratio of 20%.
△ Less
Submitted 8 October, 2023;
originally announced October 2023.
-
Safe Non-Stochastic Control of Control-Affine Systems: An Online Convex Optimization Approach
Authors:
Hongyu Zhou,
Yichen Song,
Vasileios Tzoumas
Abstract:
We study how to safely control nonlinear control-affine systems that are corrupted with bounded non-stochastic noise, i.e., noise that is unknown a priori and that is not necessarily governed by a stochastic model. We focus on safety constraints that take the form of time-varying convex constraints such as collision-avoidance and control-effort constraints. We provide an algorithm with bounded dyn…
▽ More
We study how to safely control nonlinear control-affine systems that are corrupted with bounded non-stochastic noise, i.e., noise that is unknown a priori and that is not necessarily governed by a stochastic model. We focus on safety constraints that take the form of time-varying convex constraints such as collision-avoidance and control-effort constraints. We provide an algorithm with bounded dynamic regret, i.e., bounded suboptimality against an optimal clairvoyant controller that knows the realization of the noise a prior. We are motivated by the future of autonomy where robots will autonomously perform complex tasks despite real-world unpredictable disturbances such as wind gusts. To develop the algorithm, we capture our problem as a sequential game between a controller and an adversary, where the controller plays first, choosing the control input, whereas the adversary plays second, choosing the noise's realization. The controller aims to minimize its cumulative tracking error despite being unable to know the noise's realization a prior. We validate our algorithm in simulated scenarios of (i) an inverted pendulum aiming to stay upright, and (ii) a quadrotor aiming to fly to a goal location through an unknown cluttered environment.
△ Less
Submitted 28 September, 2023;
originally announced September 2023.
-
Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training
Authors:
Jiangliu Wang,
Jianbo Jiao,
Yibing Song,
Stephen James,
Zhan Tong,
Chongjian Ge,
Pieter Abbeel,
Yun-hui Liu
Abstract:
This work aims to improve unsupervised audio-visual pre-training. Inspired by the efficacy of data augmentation in visual contrastive learning, we propose a novel speed co-augmentation method that randomly changes the playback speeds of both audio and video data. Despite its simplicity, the speed co-augmentation method possesses two compelling attributes: (1) it increases the diversity of audio-vi…
▽ More
This work aims to improve unsupervised audio-visual pre-training. Inspired by the efficacy of data augmentation in visual contrastive learning, we propose a novel speed co-augmentation method that randomly changes the playback speeds of both audio and video data. Despite its simplicity, the speed co-augmentation method possesses two compelling attributes: (1) it increases the diversity of audio-visual pairs and doubles the size of negative pairs, resulting in a significant enhancement in the learned representations, and (2) it changes the strict correlation between audio-visual pairs but introduces a partial relationship between the augmented pairs, which is modeled by our proposed SoftInfoNCE loss to further boost the performance. Experimental results show that the proposed method significantly improves the learned representations when compared to vanilla audio-visual contrastive learning.
△ Less
Submitted 25 September, 2023;
originally announced September 2023.
-
Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning
Authors:
Guanrou Yang,
Ziyang Ma,
Zhisheng Zheng,
Yakun Song,
Zhikang Niu,
Xie Chen
Abstract:
Recent years have witnessed significant advancements in self-supervised learning (SSL) methods for speech-processing tasks. Various speech-based SSL models have been developed and present promising performance on a range of downstream tasks including speech recognition. However, existing speech-based SSL models face a common dilemma in terms of computational cost, which might hinder their potentia…
▽ More
Recent years have witnessed significant advancements in self-supervised learning (SSL) methods for speech-processing tasks. Various speech-based SSL models have been developed and present promising performance on a range of downstream tasks including speech recognition. However, existing speech-based SSL models face a common dilemma in terms of computational cost, which might hinder their potential application and in-depth academic research. To address this issue, we first analyze the computational cost of different modules during HuBERT pre-training and then introduce a stack of efficiency optimizations, which is named Fast-HuBERT in this paper. The proposed Fast-HuBERT can be trained in 1.1 days with 8 V100 GPUs on the Librispeech 960h benchmark, without performance degradation, resulting in a 5.2x speedup, compared to the original implementation. Moreover, we explore two well-studied techniques in the Fast-HuBERT and demonstrate consistent improvements as reported in previous work.
△ Less
Submitted 29 September, 2023; v1 submitted 25 September, 2023;
originally announced September 2023.
-
Deshadow-Anything: When Segment Anything Model Meets Zero-shot shadow removal
Authors:
Xiao Feng Zhang,
Tian Yi Song,
Jia Wei Yao
Abstract:
Segment Anything (SAM), an advanced universal image segmentation model trained on an expansive visual dataset, has set a new benchmark in image segmentation and computer vision. However, it faced challenges when it came to distinguishing between shadows and their backgrounds. To address this, we developed Deshadow-Anything, considering the generalization of large-scale datasets, and we performed F…
▽ More
Segment Anything (SAM), an advanced universal image segmentation model trained on an expansive visual dataset, has set a new benchmark in image segmentation and computer vision. However, it faced challenges when it came to distinguishing between shadows and their backgrounds. To address this, we developed Deshadow-Anything, considering the generalization of large-scale datasets, and we performed Fine-tuning on large-scale datasets to achieve image shadow removal. The diffusion model can diffuse along the edges and textures of an image, helping to remove shadows while preserving the details of the image. Furthermore, we design Multi-Self-Attention Guidance (MSAG) and adaptive input perturbation (DDPM-AIP) to accelerate the iterative training speed of diffusion. Experiments on shadow removal tasks demonstrate that these methods can effectively improve image restoration performance.
△ Less
Submitted 2 January, 2024; v1 submitted 20 September, 2023;
originally announced September 2023.
-
Preserving Tumor Volumes for Unsupervised Medical Image Registration
Authors:
Qihua Dong,
Hao Du,
Ying Song,
Yan Xu,
Jing Liao
Abstract:
Medical image registration is a critical task that estimates the spatial correspondence between pairs of images. However, current traditional and deep-learning-based methods rely on similarity measures to generate a deforming field, which often results in disproportionate volume changes in dissimilar regions, especially in tumor regions. These changes can significantly alter the tumor size and und…
▽ More
Medical image registration is a critical task that estimates the spatial correspondence between pairs of images. However, current traditional and deep-learning-based methods rely on similarity measures to generate a deforming field, which often results in disproportionate volume changes in dissimilar regions, especially in tumor regions. These changes can significantly alter the tumor size and underlying anatomy, which limits the practical use of image registration in clinical diagnosis. To address this issue, we have formulated image registration with tumors as a constraint problem that preserves tumor volumes while maximizing image similarity in other normal regions. Our proposed strategy involves a two-stage process. In the first stage, we use similarity-based registration to identify potential tumor regions by their volume change, generating a soft tumor mask accordingly. In the second stage, we propose a volume-preserving registration with a novel adaptive volume-preserving loss that penalizes the change in size adaptively based on the masks calculated from the previous stage. Our approach balances image similarity and volume preservation in different regions, i.e., normal and tumor regions, by using soft tumor masks to adjust the imposition of volume-preserving loss on each one. This ensures that the tumor volume is preserved during the registration process. We have evaluated our strategy on various datasets and network architectures, demonstrating that our method successfully preserves the tumor volume while achieving comparable registration results with state-of-the-art methods. Our codes is available at: \url{https://dddraxxx.github.io/Volume-Preserving-Registration/}.
△ Less
Submitted 9 May, 2024; v1 submitted 18 September, 2023;
originally announced September 2023.