-
Unsupervised and Interpretable Synthesizing for Electrical Time Series Based on Information Maximizing Generative Adversarial Nets
Authors:
Zhenghao Zhou,
Yiyan Li,
Runlong Liu,
Zheng Yan,
Mo-Yuen Chow
Abstract:
Generating synthetic data has become a popular alternative solution to deal with the difficulties in accessing and sharing field measurement data in power systems. However, to make the generation results controllable, existing methods (e.g. Conditional Generative Adversarial Nets, cGAN) require labeled dataset to train the model, which is demanding in practice because many field measurement data l…
▽ More
Generating synthetic data has become a popular alternative solution to deal with the difficulties in accessing and sharing field measurement data in power systems. However, to make the generation results controllable, existing methods (e.g. Conditional Generative Adversarial Nets, cGAN) require labeled dataset to train the model, which is demanding in practice because many field measurement data lacks descriptive labels. In this paper, we introduce the Information Maximizing Generative Adversarial Nets (infoGAN) to achieve interpretable feature extraction and controllable synthetic data generation based on the unlabeled electrical time series dataset. Features with clear physical meanings can be automatically extracted by maximizing the mutual information between the input latent code and the classifier output of infoGAN. Then the extracted features are used to control the generation results similar to a vanilla cGAN framework. Case study is based on the time series datasets of power load and renewable energy output. Results demonstrate that infoGAN can extract both discrete and continuous features with clear physical meanings, as well as generating realistic synthetic time series that satisfy given features.
△ Less
Submitted 18 July, 2024;
originally announced July 2024.
-
High Frequency Matters: Uncertainty Guided Image Compression with Wavelet Diffusion
Authors:
Juan Song,
Jiaxiang He,
Mingtao Feng,
Keyan Wang,
Yunsong Li,
Ajmal Mian
Abstract:
Diffusion probabilistic models have recently achieved remarkable success in generating high-quality images. However, balancing high perceptual quality and low distortion remains challenging in image compression applications. To address this issue, we propose an efficient Uncertainty-Guided image compression approach with wavelet Diffusion (UGDiff). Our approach focuses on high frequency compressio…
▽ More
Diffusion probabilistic models have recently achieved remarkable success in generating high-quality images. However, balancing high perceptual quality and low distortion remains challenging in image compression applications. To address this issue, we propose an efficient Uncertainty-Guided image compression approach with wavelet Diffusion (UGDiff). Our approach focuses on high frequency compression via the wavelet transform, since high frequency components are crucial for reconstructing image details. We introduce a wavelet conditional diffusion model for high frequency prediction, followed by a residual codec that compresses and transmits prediction residuals to the decoder. This diffusion prediction-then-residual compression paradigm effectively addresses the low fidelity issue common in direct reconstructions by existing diffusion models. Considering the uncertainty from the random sampling of the diffusion model, we further design an uncertainty-weighted rate-distortion (R-D) loss tailored for residual compression, providing a more rational trade-off between rate and distortion. Comprehensive experiments on two benchmark datasets validate the effectiveness of UGDiff, surpassing state-of-the-art image compression methods in R-D performance, perceptual quality, subjective quality, and inference time. Our code is available at: https://github.com/hejiaxiang1/Wavelet-Diffusion/tree/main
△ Less
Submitted 17 July, 2024;
originally announced July 2024.
-
ICAGC 2024: Inspirational and Convincing Audio Generation Challenge 2024
Authors:
Ruibo Fu,
Rui Liu,
Chunyu Qiang,
Yingming Gao,
Yi Lu,
Tao Wang,
Ya Li,
Zhengqi Wen,
Chen Zhang,
Hui Bu,
Yukun Liu,
Shuchen Shi,
Xin Qi,
Guanjun Li
Abstract:
The Inspirational and Convincing Audio Generation Challenge 2024 (ICAGC 2024) is part of the ISCSLP 2024 Competitions and Challenges track. While current text-to-speech (TTS) technology can generate high-quality audio, its ability to convey complex emotions and controlled detail content remains limited. This constraint leads to a discrepancy between the generated audio and human subjective percept…
▽ More
The Inspirational and Convincing Audio Generation Challenge 2024 (ICAGC 2024) is part of the ISCSLP 2024 Competitions and Challenges track. While current text-to-speech (TTS) technology can generate high-quality audio, its ability to convey complex emotions and controlled detail content remains limited. This constraint leads to a discrepancy between the generated audio and human subjective perception in practical applications like companion robots for children and marketing bots. The core issue lies in the inconsistency between high-quality audio generation and the ultimate human subjective experience. Therefore, this challenge aims to enhance the persuasiveness and acceptability of synthesized audio, focusing on human alignment convincing and inspirational audio generation.
△ Less
Submitted 1 July, 2024;
originally announced July 2024.
-
Machine Learning in Communications: A Road to Intelligent Transmission and Processing
Authors:
Shixiong Wang,
Geoffrey Ye Li
Abstract:
Prior to the era of artificial intelligence and big data, wireless communications primarily followed a conventional research route involving problem analysis, model building and calibration, algorithm design and tuning, and holistic and empirical verification. However, this methodology often encountered limitations when dealing with large-scale and complex problems and managing dynamic and massive…
▽ More
Prior to the era of artificial intelligence and big data, wireless communications primarily followed a conventional research route involving problem analysis, model building and calibration, algorithm design and tuning, and holistic and empirical verification. However, this methodology often encountered limitations when dealing with large-scale and complex problems and managing dynamic and massive data, resulting in inefficiencies and limited performance of traditional communication systems and methods. As such, wireless communications have embraced the revolutionary impact of artificial intelligence and machine learning, giving birth to more adaptive, efficient, and intelligent systems and algorithms. This technological shift opens a road to intelligent information transmission and processing. This overview article discusses the typical roles of machine learning in intelligent wireless communications, as well as its features, challenges, and practical considerations.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
Uniformly Accelerated Motion Model for Inter Prediction
Authors:
Zhuoyuan Li,
Yao Li,
Chuanbo Tang,
Li Li,
Dong Liu,
Feng Wu
Abstract:
Inter prediction is a key technology to reduce the temporal redundancy in video coding. In natural videos, there are usually multiple moving objects with variable velocity, resulting in complex motion fields that are difficult to represent compactly. In Versatile Video Coding (VVC), existing inter prediction methods usually assume uniform speed motion between consecutive frames and use the linear…
▽ More
Inter prediction is a key technology to reduce the temporal redundancy in video coding. In natural videos, there are usually multiple moving objects with variable velocity, resulting in complex motion fields that are difficult to represent compactly. In Versatile Video Coding (VVC), existing inter prediction methods usually assume uniform speed motion between consecutive frames and use the linear models for motion estimation (ME) and motion compensation (MC), which may not well handle the complex motion fields in the real world. To address these issues, we introduce a uniformly accelerated motion model (UAMM) to exploit motion-related elements (velocity, acceleration) of moving objects between the video frames, and further combine them to assist the inter prediction methods to handle the variable motion in the temporal domain. Specifically, first, the theory of UAMM is mentioned. Second, based on that, we propose the UAMM-based parameter derivation and extrapolation schemes in the coding process. Third, we integrate the UAMM into existing inter prediction modes (Merge, MMVD, CIIP) to achieve higher prediction accuracy. The proposed method is implemented into the VVC reference software, VTM version 12.0. Experimental results show that the proposed method achieves up to 0.38% and on average 0.13% BD-rate reduction compared to the VTM anchor, under the Low-delay P configuration, with a slight increase of time complexity on the encoding/decoding side.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
In-Loop Filtering via Trained Look-Up Tables
Authors:
Zhuoyuan Li,
Jiacheng Li,
Yao Li,
Li Li,
Dong Liu,
Feng Wu
Abstract:
In-loop filtering (ILF) is a key technology for removing the artifacts in image/video coding standards. Recently, neural network-based in-loop filtering methods achieve remarkable coding gains beyond the capability of advanced video coding standards, which becomes a powerful coding tool candidate for future video coding standards. However, the utilization of deep neural networks brings heavy time…
▽ More
In-loop filtering (ILF) is a key technology for removing the artifacts in image/video coding standards. Recently, neural network-based in-loop filtering methods achieve remarkable coding gains beyond the capability of advanced video coding standards, which becomes a powerful coding tool candidate for future video coding standards. However, the utilization of deep neural networks brings heavy time and computational complexity, and high demands of high-performance hardware, which is challenging to apply to the general uses of coding scene. To address this limitation, inspired by explorations in image restoration, we propose an efficient and practical in-loop filtering scheme by adopting the Look-up Table (LUT). We train the DNN of in-loop filtering within a fixed filtering reference range, and cache the output values of the DNN into a LUT via traversing all possible inputs. At testing time in the coding process, the filtered pixel is generated by locating input pixels (to-be-filtered pixel with reference pixels) and interpolating cached filtered pixel values. To further enable the large filtering reference range with the limited storage cost of LUT, we introduce the enhanced indexing mechanism in the filtering process, and clipping/finetuning mechanism in the training. The proposed method is implemented into the Versatile Video Coding (VVC) reference software, VTM-11.0. Experimental results show that the ultrafast, very fast, and fast mode of the proposed method achieves on average 0.13%/0.34%/0.51%, and 0.10%/0.27%/0.39% BD-rate reduction, under the all intra (AI) and random access (RA) configurations. Especially, our method has friendly time and computational complexity, only 101%/102%-104%/108% time increase with 0.13-0.93 kMACs/pixel, and only 164-1148 KB storage cost for a single model. Our solution may shed light on the journey of practical neural network-based coding tool evolution.
△ Less
Submitted 15 July, 2024;
originally announced July 2024.
-
GROOT: Generating Robust Watermark for Diffusion-Model-Based Audio Synthesis
Authors:
Weizhi Liu,
Yue Li,
Dongdong Lin,
Hui Tian,
Haizhou Li
Abstract:
Amid the burgeoning development of generative models like diffusion models, the task of differentiating synthesized audio from its natural counterpart grows more daunting. Deepfake detection offers a viable solution to combat this challenge. Yet, this defensive measure unintentionally fuels the continued refinement of generative models. Watermarking emerges as a proactive and sustainable tactic, p…
▽ More
Amid the burgeoning development of generative models like diffusion models, the task of differentiating synthesized audio from its natural counterpart grows more daunting. Deepfake detection offers a viable solution to combat this challenge. Yet, this defensive measure unintentionally fuels the continued refinement of generative models. Watermarking emerges as a proactive and sustainable tactic, preemptively regulating the creation and dissemination of synthesized content. Thus, this paper, as a pioneer, proposes the generative robust audio watermarking method (Groot), presenting a paradigm for proactively supervising the synthesized audio and its source diffusion models. In this paradigm, the processes of watermark generation and audio synthesis occur simultaneously, facilitated by parameter-fixed diffusion models equipped with a dedicated encoder. The watermark embedded within the audio can subsequently be retrieved by a lightweight decoder. The experimental results highlight Groot's outstanding performance, particularly in terms of robustness, surpassing that of the leading state-of-the-art methods. Beyond its impressive resilience against individual post-processing attacks, Groot exhibits exceptional robustness when facing compound attacks, maintaining an average watermark extraction accuracy of around 95%.
△ Less
Submitted 17 July, 2024; v1 submitted 15 July, 2024;
originally announced July 2024.
-
Distributed Charging Coordination for Electric Trucks under Limited Facilities and Travel Uncertainties
Authors:
Ting Bai,
Yuchao Li,
Karl Henrik Johansson,
Jonas Mårtensson
Abstract:
In this work, we address the problem of charging coordination between electric trucks and charging stations. The problem arises from the tension between the trucks' nontrivial charging times and the stations' limited charging facilities. Our goal is to reduce the trucks' waiting times at the stations while minimizing individual trucks' operational costs. We propose a distributed coordination frame…
▽ More
In this work, we address the problem of charging coordination between electric trucks and charging stations. The problem arises from the tension between the trucks' nontrivial charging times and the stations' limited charging facilities. Our goal is to reduce the trucks' waiting times at the stations while minimizing individual trucks' operational costs. We propose a distributed coordination framework that relies on computation and communication between the stations and the trucks, and handles uncertainties in travel times and energy consumption. Within the framework, the stations assign a limited number of charging ports to trucks according to the first-come, first-served rule. In addition, each station constructs a waiting time forecast model based on its historical data and provides its estimated waiting times to trucks upon request. When approaching a station, a truck sends its arrival time and estimated arrival-time windows to the nearby station and the distant stations, respectively. The truck then receives the estimated waiting times from these stations in response, and updates its charging plan accordingly while accounting for travel uncertainties. We performed simulation studies for $1,000$ trucks traversing the Swedish road network for $40$ days, using realistic traffic data with travel uncertainties. The results show that our method reduces the average waiting time of the trucks by $46.1\%$ compared to offline charging plans computed by the trucks without coordination and update, and by $33.8\%$ compared to the coordination scheme assuming zero waiting times at distant stations.
△ Less
Submitted 14 July, 2024;
originally announced July 2024.
-
Whisper-SV: Adapting Whisper for Low-data-resource Speaker Verification
Authors:
Li Zhang,
Ning Jiang,
Qing Wang,
Yue Li,
Quan Lu,
Lei Xie
Abstract:
Trained on 680,000 hours of massive speech data, Whisper is a multitasking, multilingual speech foundation model demonstrating superior performance in automatic speech recognition, translation, and language identification. However, its applicability in speaker verification (SV) tasks remains unexplored, particularly in low-data-resource scenarios where labeled speaker data in specific domains are…
▽ More
Trained on 680,000 hours of massive speech data, Whisper is a multitasking, multilingual speech foundation model demonstrating superior performance in automatic speech recognition, translation, and language identification. However, its applicability in speaker verification (SV) tasks remains unexplored, particularly in low-data-resource scenarios where labeled speaker data in specific domains are limited. To fill this gap, we propose a lightweight adaptor framework to boost SV with Whisper, namely Whisper-SV. Given that Whisper is not specifically optimized for SV tasks, we introduce a representation selection module to quantify the speaker-specific characteristics contained in each layer of Whisper and select the top-k layers with prominent discriminative speaker features. To aggregate pivotal speaker-related features while diminishing non-speaker redundancies across the selected top-k distinct layers of Whisper, we design a multi-layer aggregation module in Whisper-SV to integrate multi-layer representations into a singular, compacted representation for SV. In the multi-layer aggregation module, we employ convolutional layers with shortcut connections among different layers to refine speaker characteristics derived from multi-layer representations from Whisper. In addition, an attention aggregation layer is used to reduce non-speaker interference and amplify speaker-specific cues for SV tasks. Finally, a simple classification module is used for speaker classification. Experiments on VoxCeleb1, FFSVC, and IMSV datasets demonstrate that Whisper-SV achieves EER/minDCF of 2.22%/0.307, 6.14%/0.488, and 7.50%/0.582, respectively, showing superior performance in low-data-resource SV scenarios.
△ Less
Submitted 13 July, 2024;
originally announced July 2024.
-
Speech Slytherin: Examining the Performance and Efficiency of Mamba for Speech Separation, Recognition, and Synthesis
Authors:
Xilin Jiang,
Yinghao Aaron Li,
Adrian Nicolas Florea,
Cong Han,
Nima Mesgarani
Abstract:
It is too early to conclude that Mamba is a better alternative to transformers for speech before comparing Mamba with transformers in terms of both performance and efficiency in multiple speech-related tasks. To reach this conclusion, we propose and evaluate three models for three tasks: Mamba-TasNet for speech separation, ConMamba for speech recognition, and VALL-M for speech synthesis. We compar…
▽ More
It is too early to conclude that Mamba is a better alternative to transformers for speech before comparing Mamba with transformers in terms of both performance and efficiency in multiple speech-related tasks. To reach this conclusion, we propose and evaluate three models for three tasks: Mamba-TasNet for speech separation, ConMamba for speech recognition, and VALL-M for speech synthesis. We compare them with transformers of similar sizes in performance, memory, and speed. Our Mamba or Mamba-transformer hybrid models show comparable or higher performance than their transformer counterparts: Sepformer, Conformer, and VALL-E. They are more efficient than transformers in memory and speed for speech longer than a threshold duration, inversely related to the resolution of a speech token. Mamba for separation is the most efficient, and Mamba for recognition is the least. Further, we show that Mamba is not more efficient than transformer for speech shorter than the threshold duration and performs worse in models that require joint modeling of text and speech, such as cross or masked attention of two inputs. Therefore, we argue that the superiority of Mamba or transformer depends on particular problems and models. Code available at https://github.com/xi-j/Mamba-TasNet and https://github.com/xi-j/Mamba-ASR.
△ Less
Submitted 12 July, 2024;
originally announced July 2024.
-
Deep Adversarial Defense Against Multilevel-Lp Attacks
Authors:
Ren Wang,
Yuxuan Li,
Alfred Hero
Abstract:
Deep learning models have shown considerable vulnerability to adversarial attacks, particularly as attacker strategies become more sophisticated. While traditional adversarial training (AT) techniques offer some resilience, they often focus on defending against a single type of attack, e.g., the $\ell_\infty$-norm attack, which can fail for other types. This paper introduces a computationally effi…
▽ More
Deep learning models have shown considerable vulnerability to adversarial attacks, particularly as attacker strategies become more sophisticated. While traditional adversarial training (AT) techniques offer some resilience, they often focus on defending against a single type of attack, e.g., the $\ell_\infty$-norm attack, which can fail for other types. This paper introduces a computationally efficient multilevel $\ell_p$ defense, called the Efficient Robust Mode Connectivity (EMRC) method, which aims to enhance a deep learning model's resilience against multiple $\ell_p$-norm attacks. Similar to analytical continuation approaches used in continuous optimization, the method blends two $p$-specific adversarially optimal models, the $\ell_1$- and $\ell_\infty$-norm AT solutions, to provide good adversarial robustness for a range of $p$. We present experiments demonstrating that our approach performs better on various attacks as compared to AT-$\ell_\infty$, E-AT, and MSD, for datasets/architectures including: CIFAR-10, CIFAR-100 / PreResNet110, WideResNet, ViT-Base.
△ Less
Submitted 12 July, 2024;
originally announced July 2024.
-
Evolving Network Modeling Driven by the Degree Increase and Decrease Mechanism
Authors:
Yuhan Li,
Minyu Feng,
Jürgen Kurths
Abstract:
Ever since the Barabási-Albert (BA) scale-free network has been proposed, network modeling has been studied intensively in light of the network growth and the preferential attachment (PA). However, numerous real systems are featured with a dynamic evolution including network reduction in addition to network growth. In this paper, we propose a novel mechanism for evolving networks from the perspect…
▽ More
Ever since the Barabási-Albert (BA) scale-free network has been proposed, network modeling has been studied intensively in light of the network growth and the preferential attachment (PA). However, numerous real systems are featured with a dynamic evolution including network reduction in addition to network growth. In this paper, we propose a novel mechanism for evolving networks from the perspective of vertex degree. We construct a queueing system to describe the increase and decrease of vertex degree, which drives the network evolution. In our mechanism, the degree increase rate is regarded as a function positively correlated to the degree of a vertex, ensuring the preferential attachment in a new way. Degree distributions are investigated under two expressions of the degree increase rate, one of which manifests a ``long tail'', and another one varies with different values of parameters. In simulations, we compare our theoretical distributions with simulation results and also apply them to real networks, which presents the validity and applicability of our model.
△ Less
Submitted 11 July, 2024;
originally announced July 2024.
-
Handling Distance Constraint in Movable Antenna Aided Systems: A General Optimization Framework
Authors:
Yichen Jin,
Qingfeng Lin,
Yang Li,
Yik-Chung Wu
Abstract:
The movable antenna (MA) is a promising technology to exploit more spatial degrees of freedom for enhancing wireless system performance. However, the MA-aided system introduces the non-convex antenna distance constraints, which poses challenges in the underlying optimization problems. To fill this gap, this paper proposes a general framework for optimizing the MA-aided system under the antenna dis…
▽ More
The movable antenna (MA) is a promising technology to exploit more spatial degrees of freedom for enhancing wireless system performance. However, the MA-aided system introduces the non-convex antenna distance constraints, which poses challenges in the underlying optimization problems. To fill this gap, this paper proposes a general framework for optimizing the MA-aided system under the antenna distance constraints. Specifically, we separate the non-convex antenna distance constraints from the objective function by introducing auxiliary variables. Then, the resulting problem can be efficiently solved under the alternating optimization framework. For the subproblems with respect to the antenna position variables and auxiliary variables, the proposed algorithms are able to obtain at least stationary points without any approximations. To verify the effectiveness of the proposed optimization framework, we present two case studies: capacity maximization and regularized zero-forcing precoding. Simulation results demonstrate the proposed optimization framework outperforms the existing baseline schemes under both cases.
△ Less
Submitted 11 July, 2024;
originally announced July 2024.
-
Pilot-Based SFO Estimation for Bistatic Integrated Sensing and Communication
Authors:
Lucas Giroto de Oliveira,
Yueheng Li,
Silvio Mandelli,
David Brunner,
Marcus Henninger,
Xiang Wan,
Tie Jun Cui,
Thomas Zwick,
Benjamin Nuss
Abstract:
Enabling bistatic radar sensing within the context of integrated sensing and communication (ISAC) for future sixth generation mobile networks demands strict synchronization accuracy, which is particularly challenging to be achieved with over-the-air synchronization. Existing algorithms handle time and frequency offsets adequately, but provide insufficiently accurate sampling frequency offset (SFO)…
▽ More
Enabling bistatic radar sensing within the context of integrated sensing and communication (ISAC) for future sixth generation mobile networks demands strict synchronization accuracy, which is particularly challenging to be achieved with over-the-air synchronization. Existing algorithms handle time and frequency offsets adequately, but provide insufficiently accurate sampling frequency offset (SFO) estimates that result in degradation of obtained radar images in the form of signal-to-noise ratio loss and migration of range and Doppler shift. This article introduces an SFO estimation algorithm named tilt inference of time offset (TITO) for orthogonal frequency-division multiplexing (OFDM)-based ISAC. Using available pilot subcarriers, TITO obtains channel impulse response estimates and extracts information on the SFO-induced delay migration to a dominant reference path with constant range, Doppler shift, and angle between transmit and receive ISAC nodes. TITO then adaptively selects the delay estimates that are only negligibly impaired by SFO-induced intersymbol interference, ultimately employing them to estimate the SFO. Assuming a scenario without a direct line-of-sight (LoS) between the aforementioned transmitting and receiving ISAC nodes, a system concept with a relay reflective intelligent surface (RIS) is used to create the aforementioned reference path is proposed. Besides a mathematical derivation of accuracy bounds, simulation and measurements at 26.2 GHz are presented to demonstrate TITO's superiority over existing methods in terms of SFO estimation accuracy and robustness.
△ Less
Submitted 10 July, 2024;
originally announced July 2024.
-
Trustworthy Contrast-enhanced Brain MRI Synthesis
Authors:
Jiyao Liu,
Yuxin Li,
Shangqi Gao,
Yuncheng Zhou,
Xin Gao,
Ningsheng Xu,
Xiao-Yong Zhang,
Xiahai Zhuang
Abstract:
Contrast-enhanced brain MRI (CE-MRI) is a valuable diagnostic technique but may pose health risks and incur high costs. To create safer alternatives, multi-modality medical image translation aims to synthesize CE-MRI images from other available modalities. Although existing methods can generate promising predictions, they still face two challenges, i.e., exhibiting over-confidence and lacking inte…
▽ More
Contrast-enhanced brain MRI (CE-MRI) is a valuable diagnostic technique but may pose health risks and incur high costs. To create safer alternatives, multi-modality medical image translation aims to synthesize CE-MRI images from other available modalities. Although existing methods can generate promising predictions, they still face two challenges, i.e., exhibiting over-confidence and lacking interpretability on predictions. To address the above challenges, this paper introduces TrustI2I, a novel trustworthy method that reformulates multi-to-one medical image translation problem as a multimodal regression problem, aiming to build an uncertainty-aware and reliable system. Specifically, our method leverages deep evidential regression to estimate prediction uncertainties and employs an explicit intermediate and late fusion strategy based on the Mixture of Normal Inverse Gamma (MoNIG) distribution, enhancing both synthesis quality and interpretability. Additionally, we incorporate uncertainty calibration to improve the reliability of uncertainty. Validation on the BraTS2018 dataset demonstrates that our approach surpasses current methods, producing higher-quality images with rational uncertainty estimation.
△ Less
Submitted 10 July, 2024;
originally announced July 2024.
-
In-Orbit Processing or Not? Sunlight-Aware Task Scheduling for Energy-Efficient Space Edge Computing Networks
Authors:
Weisen Liu,
Zeqi Lai,
Qian Wu,
Hewu Li,
Qi Zhang,
Zonglun Li,
Yuanjie Li,
Jun Liu
Abstract:
With the rapid evolution of space-borne capabilities, space edge computing (SEC) is becoming a new computation paradigm for future integrated space and terrestrial networks. Satellite edges adopt advanced on-board hardware, which not only enables new opportunities to perform complex intelligent tasks in orbit, but also involves new challenges due to the additional energy consumption in power-const…
▽ More
With the rapid evolution of space-borne capabilities, space edge computing (SEC) is becoming a new computation paradigm for future integrated space and terrestrial networks. Satellite edges adopt advanced on-board hardware, which not only enables new opportunities to perform complex intelligent tasks in orbit, but also involves new challenges due to the additional energy consumption in power-constrained space environment. In this paper, we present PHOENIX, an energy-efficient task scheduling framework for emerging SEC networks. PHOENIX exploits a key insight that in the SEC network, there always exist a number of sunlit edges which are illuminated during the entire orbital period and have sufficient energy supplement from the sun. PHOENIX accomplishes energy-efficient in-orbit computing by judiciously offloading space tasks to "sunlight-sufficient" edges or to the ground. Specifically, PHOENIX first formulates the SEC battery energy optimizing (SBEO) problem which aims at minimizing the average battery energy consumption while satisfying various task completion constraints. Then PHOENIX incorporates a sunlight-aware scheduling mechanism to solve the SBEO problem and schedule SEC tasks efficiently. Finally, we implement a PHOENIX prototype and build an SEC testbed. Extensive data-driven evaluations demonstrate that as compared to other state-of-the-art solutions, PHOENIX can effectively reduce up to 54.8% SEC battery energy consumption and prolong battery lifetime to 2.9$\times$ while still completing tasks on time.
△ Less
Submitted 9 July, 2024;
originally announced July 2024.
-
Variational Zero-shot Multispectral Pansharpening
Authors:
Xiangyu Rui,
Xiangyong Cao,
Yining Li,
Deyu Meng
Abstract:
Pansharpening aims to generate a high spatial resolution multispectral image (HRMS) by fusing a low spatial resolution multispectral image (LRMS) and a panchromatic image (PAN). The most challenging issue for this task is that only the to-be-fused LRMS and PAN are available, and the existing deep learning-based methods are unsuitable since they rely on many training pairs. Traditional variational…
▽ More
Pansharpening aims to generate a high spatial resolution multispectral image (HRMS) by fusing a low spatial resolution multispectral image (LRMS) and a panchromatic image (PAN). The most challenging issue for this task is that only the to-be-fused LRMS and PAN are available, and the existing deep learning-based methods are unsuitable since they rely on many training pairs. Traditional variational optimization (VO) based methods are well-suited for addressing such a problem. They focus on carefully designing explicit fusion rules as well as regularizations for an optimization problem, which are based on the researcher's discovery of the image relationships and image structures. Unlike previous VO-based methods, in this work, we explore such complex relationships by a parameterized term rather than a manually designed one. Specifically, we propose a zero-shot pansharpening method by introducing a neural network into the optimization objective. This network estimates a representation component of HRMS, which mainly describes the relationship between HRMS and PAN. In this way, the network achieves a similar goal to the so-called deep image prior because it implicitly regulates the relationship between the HRMS and PAN images through its inherent structure. We directly minimize this optimization objective via network parameters and the expected HRMS image through iterative updating. Extensive experiments on various benchmark datasets demonstrate that our proposed method can achieve better performance compared with other state-of-the-art methods. The codes are available at https://github.com/xyrui/PSDip.
△ Less
Submitted 9 July, 2024;
originally announced July 2024.
-
SCDM: Unified Representation Learning for EEG-to-fNIRS Cross-Modal Generation in MI-BCIs
Authors:
Yisheng Li,
Shuqiang Wang
Abstract:
Hybrid motor imagery brain-computer interfaces (MI-BCIs), which integrate both electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS) signals, outperform those based solely on EEG. However, simultaneously recording EEG and fNIRS signals is highly challenging due to the difficulty of colocating both types of sensors on the same scalp surface. This physical constraint complic…
▽ More
Hybrid motor imagery brain-computer interfaces (MI-BCIs), which integrate both electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS) signals, outperform those based solely on EEG. However, simultaneously recording EEG and fNIRS signals is highly challenging due to the difficulty of colocating both types of sensors on the same scalp surface. This physical constraint complicates the acquisition of high-quality hybrid signals, thereby limiting the widespread application of hybrid MI-BCIs. To facilitate the acquisition of hybrid EEG-fNIRS signals, this study proposes the spatio-temporal controlled diffusion model (SCDM) as a framework for cross-modal generation from EEG to fNIRS. The model utilizes two core modules, the spatial cross-modal generation (SCG) module and the multi-scale temporal representation (MTR) module, which adaptively learn the respective latent temporal and spatial representations of both signals in a unified representation space. The SCG module further maps EEG representations to fNIRS representations by leveraging their spatial relationships. Experimental results show high similarity between synthetic and real fNIRS signals. The joint classification performance of EEG and synthetic fNIRS signals is comparable to or even better than that of EEG with real fNIRS signals. Furthermore, the synthetic signals exhibit similar spatio-temporal features to real signals while preserving spatial relationships with EEG signals. Experimental results suggest that the SCDM may represent a promising paradigm for the acquisition of hybrid EEG-fNIRS signals in MI-BCI systems.
△ Less
Submitted 1 July, 2024;
originally announced July 2024.
-
Segmenting Medical Images: From UNet to Res-UNet and nnUNet
Authors:
Lina Huang,
Alina Miron,
Kate Hone,
Yongmin Li
Abstract:
This study provides a comparative analysis of deep learning models including UNet, Res-UNet, Attention Res-UNet, and nnUNet, and evaluates their performance in brain tumour, polyp, and multi-class heart segmentation tasks. The analysis focuses on precision, accuracy, recall, Dice Similarity Coefficient (DSC), and Intersection over Union (IoU) to assess their clinical applicability. In brain tumour…
▽ More
This study provides a comparative analysis of deep learning models including UNet, Res-UNet, Attention Res-UNet, and nnUNet, and evaluates their performance in brain tumour, polyp, and multi-class heart segmentation tasks. The analysis focuses on precision, accuracy, recall, Dice Similarity Coefficient (DSC), and Intersection over Union (IoU) to assess their clinical applicability. In brain tumour segmentation, Res-UNet and nnUNet significantly outperformed UNet, with Res-UNet leading in DSC and IoU scores, indicating superior accuracy in tumour delineation. Meanwhile, nnUNet excelled in recall and accuracy, which are crucial for reliable tumour detection in clinical diagnosis and planning. In polyp detection, nnUNet was the most effective, achieving the highest metrics across all categories and proving itself as a reliable diagnostic tool in endoscopy. In the complex task of heart segmentation, Res-UNet and Attention Res-UNet were outstanding in delineating the left ventricle, with Res-UNet also leading in right ventricle segmentation. nnUNet was unmatched in myocardium segmentation, achieving top scores in precision, recall, DSC, and IoU. The conclusion notes that although Res-UNet occasionally outperforms nnUNet in specific metrics, the differences are quite small. Moreover, nnUNet consistently shows superior overall performance across the experiments. Particularly noted for its high recall and accuracy, which are crucial in clinical settings to minimize misdiagnosis and ensure timely treatment, nnUNet's robust performance in crucial metrics across all tested categories establishes it as the most effective model for these varied and complex segmentation tasks.
△ Less
Submitted 5 July, 2024;
originally announced July 2024.
-
Gemini: Integrating Full-fledged Sensing upon Millimeter Wave Communications
Authors:
Yilong Li,
Zhe Chen,
Jun Luo,
Suman Banerjee
Abstract:
Integrating millimeter wave (mmWave)technology in both communication and sensing is promising as it enables the reuse of existing spectrum and infrastructure without draining resources. Most existing systems piggyback sensing onto conventional communication modes without fully exploiting the potential of integrated sensing and communication (ISAC) in mmWave radios (not full-fledged). In this paper…
▽ More
Integrating millimeter wave (mmWave)technology in both communication and sensing is promising as it enables the reuse of existing spectrum and infrastructure without draining resources. Most existing systems piggyback sensing onto conventional communication modes without fully exploiting the potential of integrated sensing and communication (ISAC) in mmWave radios (not full-fledged). In this paper, we design and implement a full-fledged mmWave ISAC system Gemini; it delivers raw channel states to serve a broad category of sensing applications. We first propose the mmWave self-interference cancellation approach to extract the weak reflected signals for near-field sensing purposes. Then, we develop a joint optimization scheduling framework that can be utilized in accurate radar sensing while maximizing the communication throughput. Finally, we design a united fusion sensing algorithm to offer a better sensing performance via combining monostatic and bistatic modes. We evaluate our system in extensive experiments to demonstrate Gemini's capability of simultaneously operating sensing and communication, enabling mmWave ISAC to perform better than the commercial off-the-shelf mmWave radar for 5G cellular networks.
△ Less
Submitted 4 July, 2024;
originally announced July 2024.
-
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
Authors:
Keyu An,
Qian Chen,
Chong Deng,
Zhihao Du,
Changfeng Gao,
Zhifu Gao,
Yue Gu,
Ting He,
Hangrui Hu,
Kai Hu,
Shengpeng Ji,
Yabin Li,
Zerui Li,
Heng Lu,
Haoneng Luo,
Xiang Lv,
Bin Ma,
Ziyang Ma,
Chongjia Ni,
Changhe Song,
Jiaqi Shi,
Xian Shi,
Hao Wang,
Wen Wang,
Yuxuan Wang
, et al. (8 additional authors not shown)
Abstract:
This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, sp…
▽ More
This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels in multi-lingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning codes released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology. Demos are available at https://fun-audio-llm.github.io, and the code can be accessed at https://github.com/FunAudioLLM.
△ Less
Submitted 10 July, 2024; v1 submitted 4 July, 2024;
originally announced July 2024.
-
Spatio-temporal cooperative control Method of Highway Ramp Merge Based on Vehicle-road Coordination
Authors:
Xiaoxue Xu,
Maokai Lai,
Haitao Zhang,
Xiang Dong,
Tao Li,
Jie Wu,
Yuan Li,
Ting Peng
Abstract:
The merging area of highway ramps faces multiple challenges, including traffic congestion, collision risks, speed mismatches, driver behavior uncertainties, limited visibility, and bottleneck effects. However, autonomous vehicles engaging in depth coordination between vehicle and road in merging zones, by pre-planning and uploading travel trajectories, can significantly enhance the safety and effi…
▽ More
The merging area of highway ramps faces multiple challenges, including traffic congestion, collision risks, speed mismatches, driver behavior uncertainties, limited visibility, and bottleneck effects. However, autonomous vehicles engaging in depth coordination between vehicle and road in merging zones, by pre-planning and uploading travel trajectories, can significantly enhance the safety and efficiency of merging zones.In this paper,we mainly introduce mainline priority cooperation method to achieve the time and space cooperative control of highway merge.Vehicle-mounted intelligent units share real-time vehicle status and driving intentions with Road Section Management Units, which pre-plan the spatiotemporal trajectories of vehicle travel. After receiving these trajectories, Vehicle Intelligent Units strictly adhere to them. Through this deep collaboration between vehicles and roads, conflicts in time and space during vehicle travel are eliminated in advance.
△ Less
Submitted 4 July, 2024;
originally announced July 2024.
-
Protection Degree and Migration in the Stochastic SIRS Model: A Queueing System Perspective
Authors:
Yuhan Li,
Ziyan Zeng,
Minyu Feng,
Jürgen Kurths
Abstract:
With the prevalence of COVID-19, the modeling of epidemic propagation and its analyses have played a significant role in controlling epidemics. However, individual behaviors, in particular the self-protection and migration, which have a strong influence on epidemic propagation, were always neglected in previous studies. In this paper, we mainly propose two models from the individual and population…
▽ More
With the prevalence of COVID-19, the modeling of epidemic propagation and its analyses have played a significant role in controlling epidemics. However, individual behaviors, in particular the self-protection and migration, which have a strong influence on epidemic propagation, were always neglected in previous studies. In this paper, we mainly propose two models from the individual and population perspectives. In the first individual model, we introduce the individual protection degree that effectively suppresses the epidemic level as a stochastic variable to the SIRS model. In the alternative population model, an open Markov queueing network is constructed to investigate the individual number of each epidemic state, and we present an evolving population network via the migration of people. Besides, stochastic methods are applied to analyze both models. In various simulations, the infected probability, the number of individuals in each state and its limited distribution are demonstrated.
△ Less
Submitted 3 July, 2024;
originally announced July 2024.
-
White-Box 3D-OMP-Transformer for ISAC
Authors:
Bowen Zhang,
Geoffrey Ye Li
Abstract:
Transformers have found broad applications for their great ability to capture long-range dependency among the inputs using attention mechanisms. The recent success of transformers increases the need for mathematical interpretation of their underlying working mechanisms, leading to the development of a family of white-box transformer-like deep network architectures. However, designing white-box tra…
▽ More
Transformers have found broad applications for their great ability to capture long-range dependency among the inputs using attention mechanisms. The recent success of transformers increases the need for mathematical interpretation of their underlying working mechanisms, leading to the development of a family of white-box transformer-like deep network architectures. However, designing white-box transformers with efficient three-dimensional (3D) attention is still an open challenge. In this work, we revisit the 3D-orthogonal matching pursuit (OMP) algorithm and demonstrate that the operation of 3D-OMP is analogous to a specific kind of transformer with 3D attention. Therefore, we build a white-box 3D-OMP-transformer by introducing additional learnable parameters to 3D-OMP. As a transformer, its 3D-attention can be mathematically interpreted from 3D-OMP; while as a variant of OMP, it can learn to improve the matching pursuit process from data. Besides, a transformer's performance can be improved by stacking more transformer blocks. To simulate this process, we design a cascaded 3D-OMP-Transformer with dynamic small-scale dictionaries, which can improve the performance of the 3D-OMP-Transformer with low costs. We evaluate the designed 3D-OMP-transformer in the multi-target detection task of integrated sensing and communications (ISAC). Experimental results show that the designed 3D-OMP-Transformer can outperform current baselines.
△ Less
Submitted 2 July, 2024;
originally announced July 2024.
-
The USTC-NERCSLIP Systems for The ICMC-ASR Challenge
Authors:
Minghui Wu,
Luzhen Xu,
Jie Zhang,
Haitao Tang,
Yanyan Yue,
Ruizhi Liao,
Jintao Zhao,
Zhengzhe Zhang,
Yichi Wang,
Haoyin Yan,
Hongliang Yu,
Tongle Ma,
Jiachen Liu,
Chongliang Wu,
Yongchao Li,
Yanyong Zhang,
Xin Fang,
Yue Zhang
Abstract:
This report describes the submitted system to the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) challenge, which considers the ASR task with multi-speaker overlapping and Mandarin accent dynamics in the ICMC case. We implement the front-end speaker diarization using the self-supervised learning representation based multi-speaker embedding and beamforming using the speaker position,…
▽ More
This report describes the submitted system to the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) challenge, which considers the ASR task with multi-speaker overlapping and Mandarin accent dynamics in the ICMC case. We implement the front-end speaker diarization using the self-supervised learning representation based multi-speaker embedding and beamforming using the speaker position, respectively. For ASR, we employ an iterative pseudo-label generation method based on fusion model to obtain text labels of unsupervised data. To mitigate the impact of accent, an Accent-ASR framework is proposed, which captures pronunciation-related accent features at a fine-grained level and linguistic information at a coarse-grained level. On the ICMC-ASR eval set, the proposed system achieves a CER of 13.16% on track 1 and a cpCER of 21.48% on track 2, which significantly outperforms the official baseline system and obtains the first rank on both tracks.
△ Less
Submitted 2 July, 2024;
originally announced July 2024.
-
Efficient Stochastic Differential Equation for DEM Super Resolution with Void Filling
Authors:
Tongtong Zhang,
Zongcheng Zuo,
Yuanxiang Li
Abstract:
Digital Elevation Model (DEM) plays a fundamental role in remote sensing and photogrammetry. Enhancing the quality of DEM is crucial for various applications. Although multiple types of defects may appear simultaneously in the same DEM, they are commonly addressed separately. Most existing approaches only aim to fill the DEM voids, or apply super-resolution to the intact DEM. This paper introduces…
▽ More
Digital Elevation Model (DEM) plays a fundamental role in remote sensing and photogrammetry. Enhancing the quality of DEM is crucial for various applications. Although multiple types of defects may appear simultaneously in the same DEM, they are commonly addressed separately. Most existing approaches only aim to fill the DEM voids, or apply super-resolution to the intact DEM. This paper introduces a unified generative model that simultaneously addresses voids and low-resolution problems, rather than taking two separate measures. The proposed approach presents the DEM Stochastic Differential Equation (DEM-SDE) for unified DEM quality enhancement. The DEM degradation of downsampling and random voids adding is modeled as the SDE forwarding, and the restoration is achieved by simulating the corresponding revert process. Conditioned on the terrain feature, and adopting efficient submodules with lightweight channel attention, DEM-SDE simultaneously enhances the DEM quality with an efficient process for training. The experiments show that DEM-SDE method achieves highly competitive performance in simultaneous super-resolution and void filling compared to the state-of-the-art work. DEM-SDE also manifests robustness for larger DEM patches.
△ Less
Submitted 1 July, 2024;
originally announced July 2024.
-
Channel Modeling Aided Dataset Generation for AI-Enabled CSI Feedback: Advances, Challenges, and Solutions
Authors:
Yupeng Li,
Gang Li,
Zirui Wen,
Shuangfeng Han,
Shijian Gao,
Guangyi Liu,
Jiangzhou Wang
Abstract:
The AI-enabled autoencoder has demonstrated great potential in channel state information (CSI) feedback in frequency division duplex (FDD) multiple input multiple output (MIMO) systems. However, this method completely changes the existing feedback strategies, making it impractical to deploy in recent years. To address this issue, this paper proposes a channel modeling aided data augmentation metho…
▽ More
The AI-enabled autoencoder has demonstrated great potential in channel state information (CSI) feedback in frequency division duplex (FDD) multiple input multiple output (MIMO) systems. However, this method completely changes the existing feedback strategies, making it impractical to deploy in recent years. To address this issue, this paper proposes a channel modeling aided data augmentation method based on a limited number of field channel data. Specifically, the user equipment (UE) extracts the primary stochastic parameters of the field channel data and transmits them to the base station (BS). The BS then updates the typical TR 38.901 model parameters with the extracted parameters. In this way, the updated channel model is used to generate the dataset. This strategy comprehensively considers the dataset collection, model generalization, model monitoring, and so on. Simulations verify that our proposed strategy can significantly improve performance compared to the benchmarks.
△ Less
Submitted 30 June, 2024;
originally announced July 2024.
-
IVCA: Inter-Relation-Aware Video Complexity Analyzer
Authors:
Junqi Liao,
Yao Li,
Zhuoyuan Li,
Li Li,
Dong Liu
Abstract:
To meet the real-time analysis requirements of video streaming applications, we propose an inter-relation-aware video complexity analyzer (IVCA) as an extension to VCA. The IVCA addresses the limitation of VCA by considering inter-frame relations, namely motion and reference structure. First, we enhance the accuracy of temporal features by introducing feature-domain motion estimation into the IVCA…
▽ More
To meet the real-time analysis requirements of video streaming applications, we propose an inter-relation-aware video complexity analyzer (IVCA) as an extension to VCA. The IVCA addresses the limitation of VCA by considering inter-frame relations, namely motion and reference structure. First, we enhance the accuracy of temporal features by introducing feature-domain motion estimation into the IVCA. Next, drawing inspiration from the hierarchical reference structure in codecs, we design layer-aware weights to adjust the majorities of frame complexity in different layers. Additionally, we expand the scope of temporal features by considering frames that be referred to, rather than relying solely on the previous frame. Experimental results show the significant improvement in complexity estimation accuracy achieved by IVCA, with minimal time complexity increase.
△ Less
Submitted 28 June, 2024;
originally announced July 2024.
-
Localization in Multipath Environments via Active Sensing with Reconfigurable Intelligent Surfaces
Authors:
Yinghan Li,
Wei Yu
Abstract:
This letter investigates an uplink pilot-based wireless indoor localization problem in a multipath environment for a single-input single-output (SISO) narrowband communication system aided by reconfigurable intelligent surface (RIS). The indoor localization problem is challenging because the uplink channel consists of multiple overlapping propagation paths with varying amplitudes and phases, which…
▽ More
This letter investigates an uplink pilot-based wireless indoor localization problem in a multipath environment for a single-input single-output (SISO) narrowband communication system aided by reconfigurable intelligent surface (RIS). The indoor localization problem is challenging because the uplink channel consists of multiple overlapping propagation paths with varying amplitudes and phases, which are not easy to differentiate. This letter proposes the use of RIS capable of adaptively changing its reflection pattern to sense such a multiple-path environment. Toward this end, we train a long-short-term-memory (LSTM) based controller to perform adaptive sequential reconfigurations of the RIS over multiple stages and propose to group multiple pilots as input in each stage. Information from the multiple paths is captured by training the LSTM to generate multiple RIS configurations to align to the different paths within each stage. Experimental results show that the proposed approach is effective in significantly reducing training complexity while maintaining localization performance at fixed number of pilots.
△ Less
Submitted 8 July, 2024; v1 submitted 27 June, 2024;
originally announced June 2024.
-
CMRxRecon2024: A Multi-Modality, Multi-View K-Space Dataset Boosting Universal Machine Learning for Accelerated Cardiac MRI
Authors:
Zi Wang,
Fanwen Wang,
Chen Qin,
Jun Lyu,
Ouyang Cheng,
Shuo Wang,
Yan Li,
Mengyao Yu,
Haoyu Zhang,
Kunyuan Guo,
Zhang Shi,
Qirong Li,
Ziqiang Xu,
Yajing Zhang,
Hao Li,
Sha Hua,
Binghua Chen,
Longyu Sun,
Mengting Sun,
Qin Li,
Ying-Hua Chu,
Wenjia Bai,
Jing Qin,
Xiahai Zhuang,
Claudia Prieto
, et al. (7 additional authors not shown)
Abstract:
Cardiac magnetic resonance imaging (MRI) has emerged as a clinically gold-standard technique for diagnosing cardiac diseases, thanks to its ability to provide diverse information with multiple modalities and anatomical views. Accelerated cardiac MRI is highly expected to achieve time-efficient and patient-friendly imaging, and then advanced image reconstruction approaches are required to recover h…
▽ More
Cardiac magnetic resonance imaging (MRI) has emerged as a clinically gold-standard technique for diagnosing cardiac diseases, thanks to its ability to provide diverse information with multiple modalities and anatomical views. Accelerated cardiac MRI is highly expected to achieve time-efficient and patient-friendly imaging, and then advanced image reconstruction approaches are required to recover high-quality, clinically interpretable images from undersampled measurements. However, the lack of publicly available cardiac MRI k-space dataset in terms of both quantity and diversity has severely hindered substantial technological progress, particularly for data-driven artificial intelligence. Here, we provide a standardized, diverse, and high-quality CMRxRecon2024 dataset to facilitate the technical development, fair evaluation, and clinical transfer of cardiac MRI reconstruction approaches, towards promoting the universal frameworks that enable fast and robust reconstructions across different cardiac MRI protocols in clinical practice. To the best of our knowledge, the CMRxRecon2024 dataset is the largest and most diverse publicly available cardiac k-space dataset. It is acquired from 330 healthy volunteers, covering commonly used modalities, anatomical views, and acquisition trajectories in clinical cardiac MRI workflows. Besides, an open platform with tutorials, benchmarks, and data processing tools is provided to facilitate data usage, advanced method development, and fair performance evaluation.
△ Less
Submitted 27 June, 2024;
originally announced June 2024.
-
EmT: A Novel Transformer for Generalized Cross-subject EEG Emotion Recognition
Authors:
Yi Ding,
Chengxuan Tong,
Shuailei Zhang,
Muyun Jiang,
Yong Li,
Kevin Lim Jun Liang,
Cuntai Guan
Abstract:
Integrating prior knowledge of neurophysiology into neural network architecture enhances the performance of emotion decoding. While numerous techniques emphasize learning spatial and short-term temporal patterns, there has been limited emphasis on capturing the vital long-term contextual information associated with emotional cognitive processes. In order to address this discrepancy, we introduce a…
▽ More
Integrating prior knowledge of neurophysiology into neural network architecture enhances the performance of emotion decoding. While numerous techniques emphasize learning spatial and short-term temporal patterns, there has been limited emphasis on capturing the vital long-term contextual information associated with emotional cognitive processes. In order to address this discrepancy, we introduce a novel transformer model called emotion transformer (EmT). EmT is designed to excel in both generalized cross-subject EEG emotion classification and regression tasks. In EmT, EEG signals are transformed into a temporal graph format, creating a sequence of EEG feature graphs using a temporal graph construction module (TGC). A novel residual multi-view pyramid GCN module (RMPG) is then proposed to learn dynamic graph representations for each EEG feature graph within the series, and the learned representations of each graph are fused into one token. Furthermore, we design a temporal contextual transformer module (TCT) with two types of token mixers to learn the temporal contextual information. Finally, the task-specific output module (TSO) generates the desired outputs. Experiments on four publicly available datasets show that EmT achieves higher results than the baseline methods for both EEG emotion classification and regression tasks. The code is available at https://github.com/yi-ding-cs/EmT.
△ Less
Submitted 26 June, 2024;
originally announced June 2024.
-
Large Language Models for Cuffless Blood Pressure Measurement From Wearable Biosignals
Authors:
Zengding Liu,
Chen Chen,
Jiannong Cao,
Minglei Pan,
Jikui Liu,
Nan Li,
Fen Miao,
Ye Li
Abstract:
Large language models (LLMs) have captured significant interest from both academia and industry due to their impressive performance across various textual tasks. However, the potential of LLMs to analyze physiological time-series data remains an emerging research field. Particularly, there is a notable gap in the utilization of LLMs for analyzing wearable biosignals to achieve cuffless blood press…
▽ More
Large language models (LLMs) have captured significant interest from both academia and industry due to their impressive performance across various textual tasks. However, the potential of LLMs to analyze physiological time-series data remains an emerging research field. Particularly, there is a notable gap in the utilization of LLMs for analyzing wearable biosignals to achieve cuffless blood pressure (BP) measurement, which is critical for the management of cardiovascular diseases. This paper presents the first work to explore the capacity of LLMs to perform cuffless BP estimation based on wearable biosignals. We extracted physiological features from electrocardiogram (ECG) and photoplethysmogram (PPG) signals and designed context-enhanced prompts by combining these features with BP domain knowledge and user information. Subsequently, we adapted LLMs to BP estimation tasks through fine-tuning. To evaluate the proposed approach, we conducted assessments of ten advanced LLMs using a comprehensive public dataset of wearable biosignals from 1,272 participants. The experimental results demonstrate that the optimally fine-tuned LLM significantly surpasses conventional task-specific baselines, achieving an estimation error of 0.00 $\pm$ 9.25 mmHg for systolic BP and 1.29 $\pm$ 6.37 mmHg for diastolic BP. Notably, the ablation studies highlight the benefits of our context enhancement strategy, leading to an 8.9% reduction in mean absolute error for systolic BP estimation. This paper pioneers the exploration of LLMs for cuffless BP measurement, providing a potential solution to enhance the accuracy of cuffless BP measurement.
△ Less
Submitted 4 July, 2024; v1 submitted 26 June, 2024;
originally announced June 2024.
-
Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation
Authors:
Yingting Li,
Ambuj Mehrish,
Bryan Chew,
Bo Cheng,
Soujanya Poria
Abstract:
Different languages have distinct phonetic systems and vary in their prosodic features making it challenging to develop a Text-to-Speech (TTS) model that can effectively synthesise speech in multilingual settings. Furthermore, TTS architecture needs to be both efficient enough to capture nuances in multiple languages and efficient enough to be practical for deployment. The standard approach is to…
▽ More
Different languages have distinct phonetic systems and vary in their prosodic features making it challenging to develop a Text-to-Speech (TTS) model that can effectively synthesise speech in multilingual settings. Furthermore, TTS architecture needs to be both efficient enough to capture nuances in multiple languages and efficient enough to be practical for deployment. The standard approach is to build transformer based model such as SpeechT5 and train it on large multilingual dataset. As the size of these models grow the conventional fine-tuning for adapting these model becomes impractical due to heavy computational cost. In this paper, we proposes to integrate parameter-efficient transfer learning (PETL) methods such as adapters and hypernetwork with TTS architecture for multilingual speech synthesis. Notably, in our experiments PETL methods able to achieve comparable or even better performance compared to full fine-tuning with only $\sim$2.5\% tunable parameters.The code and samples are available at: https://anonymous.4open.science/r/multilingualTTS-BA4C.
△ Less
Submitted 24 June, 2024;
originally announced June 2024.
-
Experimental Validation of Cooperative RSS-based Localization with Unknown Transmit Power, Path Loss Exponent, and Precise Anchor Location
Authors:
Yingquan Li,
Bodhibrata Mukhopadhyay,
Jiajie Xu,
Mohamed-Slim Alouini
Abstract:
Received signal strength (RSS)--based cooperative localization has gained significant attention due to its straightforward system architectures and cost-effectiveness. In this paper, we propose Cooperative Localization Techniques (with Unknown Parameters), referred to as CTUP(s), which consider uncertainty in anchor nodes' locations and assume the transmit power and \textcolor{blue}{path loss expo…
▽ More
Received signal strength (RSS)--based cooperative localization has gained significant attention due to its straightforward system architectures and cost-effectiveness. In this paper, we propose Cooperative Localization Techniques (with Unknown Parameters), referred to as CTUP(s), which consider uncertainty in anchor nodes' locations and assume the transmit power and \textcolor{blue}{path loss exponent (PLE)} to be unknown. Unlike prior studies, CTUP(s) address unknowns by estimating these parameters, along with the location of target nodes. The non-convex and non-linear nature of the maximum likelihood (ML) estimator of the problem is addressed through relaxation techniques, employing Taylor series expansion, semidefinite relaxation (SDR), and the epigraph method. The resulting problem is solved using semidefinite second-order cone programming (SDP-SOCP), leveraging the precision of SDP and the simplicity of SOCP. We deployed an extensive network comprising 50 BLE nodes covering an area of 640~m $\times$ 180~m to gather RSS data. The precise location of the nodes is obtained using real-time kinematics global positioning system (RTK-GPS), which is treated as the ground truth. Furthermore, to replicate real-world scenarios, we recorded the positions of the anchor nodes using a standard GPS, thereby introducing uncertainty into the anchor node locations. Extensive simulation and hardware experimentation demonstrate the superior performance of CTUP compared to existing techniques.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.
-
EndoUIC: Promptable Diffusion Transformer for Unified Illumination Correction in Capsule Endoscopy
Authors:
Long Bai,
Tong Chen,
Qiaozhi Tan,
Wan Jun Nah,
Yanheng Li,
Zhicheng He,
Sishen Yuan,
Zhen Chen,
Jinlin Wu,
Mobarakol Islam,
Zhen Li,
Hongbin Liu,
Hongliang Ren
Abstract:
Wireless Capsule Endoscopy (WCE) is highly valued for its non-invasive and painless approach, though its effectiveness is compromised by uneven illumination from hardware constraints and complex internal dynamics, leading to overexposed or underexposed images. While researchers have discussed the challenges of low-light enhancement in WCE, the issue of correcting for different exposure levels rema…
▽ More
Wireless Capsule Endoscopy (WCE) is highly valued for its non-invasive and painless approach, though its effectiveness is compromised by uneven illumination from hardware constraints and complex internal dynamics, leading to overexposed or underexposed images. While researchers have discussed the challenges of low-light enhancement in WCE, the issue of correcting for different exposure levels remains underexplored. To tackle this, we introduce EndoUIC, a WCE unified illumination correction solution using an end-to-end promptable diffusion transformer (DiT) model. In our work, the illumination prompt module shall navigate the model to adapt to different exposure levels and perform targeted image enhancement, in which the Adaptive Prompt Integration (API) and Global Prompt Scanner (GPS) modules shall further boost the concurrent representation learning between the prompt parameters and features. Besides, the U-shaped restoration DiT model shall capture the long-range dependencies and contextual information for unified illumination restoration. Moreover, we present a novel Capsule-endoscopy Exposure Correction (CEC) dataset, including ground-truth and corrupted image pairs annotated by expert photographers. Extensive experiments against a variety of state-of-the-art (SOTA) methods on four datasets showcase the effectiveness of our proposed method and components in WCE illumination restoration, and the additional downstream experiments further demonstrate its utility for clinical diagnosis and surgical assistance.
△ Less
Submitted 8 July, 2024; v1 submitted 19 June, 2024;
originally announced June 2024.
-
Bridging the Gap: Integrating Pre-trained Speech Enhancement and Recognition Models for Robust Speech Recognition
Authors:
Kuan-Chen Wang,
You-Jin Li,
Wei-Lun Chen,
Yu-Wen Chen,
Yi-Ching Wang,
Ping-Cheng Yeh,
Chao Zhang,
Yu Tsao
Abstract:
Noise robustness is critical when applying automatic speech recognition (ASR) in real-world scenarios. One solution involves the used of speech enhancement (SE) models as the front end of ASR. However, neural network-based (NN-based) SE often introduces artifacts into the enhanced signals and harms ASR performance, particularly when SE and ASR are independently trained. Therefore, this study intro…
▽ More
Noise robustness is critical when applying automatic speech recognition (ASR) in real-world scenarios. One solution involves the used of speech enhancement (SE) models as the front end of ASR. However, neural network-based (NN-based) SE often introduces artifacts into the enhanced signals and harms ASR performance, particularly when SE and ASR are independently trained. Therefore, this study introduces a simple yet effective SE post-processing technique to address the gap between various pre-trained SE and ASR models. A bridge module, which is a lightweight NN, is proposed to evaluate the signal-level information of the speech signal. Subsequently, using the signal-level information, the observation addition technique is applied to effectively reduce the shortcomings of SE. The experimental results demonstrate the success of our method in integrating diverse pre-trained SE and ASR models, considerably boosting the ASR robustness. Crucially, no prior knowledge of the ASR or speech contents is required during the training or inference stages. Moreover, the effectiveness of this approach extends to different datasets without necessitating the fine-tuning of the bridge module, ensuring efficiency and improved generalization.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
An Empirical Study on the Fairness of Foundation Models for Multi-Organ Image Segmentation
Authors:
Qin Li,
Yizhe Zhang,
Yan Li,
Jun Lyu,
Meng Liu,
Longyu Sun,
Mengting Sun,
Qirong Li,
Wenyue Mao,
Xinran Wu,
Yajing Zhang,
Yinghua Chu,
Shuo Wang,
Chengyan Wang
Abstract:
The segmentation foundation model, e.g., Segment Anything Model (SAM), has attracted increasing interest in the medical image community. Early pioneering studies primarily concentrated on assessing and improving SAM's performance from the perspectives of overall accuracy and efficiency, yet little attention was given to the fairness considerations. This oversight raises questions about the potenti…
▽ More
The segmentation foundation model, e.g., Segment Anything Model (SAM), has attracted increasing interest in the medical image community. Early pioneering studies primarily concentrated on assessing and improving SAM's performance from the perspectives of overall accuracy and efficiency, yet little attention was given to the fairness considerations. This oversight raises questions about the potential for performance biases that could mirror those found in task-specific deep learning models like nnU-Net. In this paper, we explored the fairness dilemma concerning large segmentation foundation models. We prospectively curate a benchmark dataset of 3D MRI and CT scans of the organs including liver, kidney, spleen, lung and aorta from a total of 1056 healthy subjects with expert segmentations. Crucially, we document demographic details such as gender, age, and body mass index (BMI) for each subject to facilitate a nuanced fairness analysis. We test state-of-the-art foundation models for medical image segmentation, including the original SAM, medical SAM and SAT models, to evaluate segmentation efficacy across different demographic groups and identify disparities. Our comprehensive analysis, which accounts for various confounding factors, reveals significant fairness concerns within these foundational models. Moreover, our findings highlight not only disparities in overall segmentation metrics, such as the Dice Similarity Coefficient but also significant variations in the spatial distribution of segmentation errors, offering empirical evidence of the nuanced challenges in ensuring fairness in medical image segmentation.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
A Swift and Omnidirectional Formation Approach based on Hierarchical Reorganization
Authors:
Yuzhu Li,
Wei Dong
Abstract:
Current formations commonly rely on invariant hierarchical structures, such as predetermined leaders or enumerated formation shapes. These structures could be unidirectional and sluggish, constraining their adaptability and agility when encountering cluttered environments. To surmount these constraints, this work proposes an omnidirectional affine formation approach with hierarchical reorganizatio…
▽ More
Current formations commonly rely on invariant hierarchical structures, such as predetermined leaders or enumerated formation shapes. These structures could be unidirectional and sluggish, constraining their adaptability and agility when encountering cluttered environments. To surmount these constraints, this work proposes an omnidirectional affine formation approach with hierarchical reorganizations. We first delineate the critical conditions requisite for facilitating hierarchical reorganizations within formations, which informs the development of the omnidirectional affine criterion. Central to our approach is the fluid leadership and authority redistribution, for which we develop a minimum time-driven leadership evaluation algorithm and a power transition control algorithm. These algorithms facilitate autonomous leader selection and ensure smooth power transitions, enabling the swarm to adapt hierarchically in alignment with the external environment. Furthermore, we deploy a power-centric topology switching mechanism tailored for the dynamic reorganization of in-team connections. Finally, simulations and experiments demonstrate the performance of the proposed method. The formation successfully performs several hierarchical reorganizations, with the longest reorganization taking only 0.047s. This swift adaptability allows five aerial robots to carry out complex tasks, including executing swerving movements and navigating through hoops at velocities up to 1.9m/s.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
Enhancing Resilience of Power Systems against Typhoon Threats: A Hybrid Data-Model Driven Approach
Authors:
Yang Li
Abstract:
This chapter addresses the increasing vulnerability of coastal regions to typhoons and the consequent power outages, emphasizing the critical role of power transmission systems in disaster resilience. It introduces a framework for assessing and enhancing the resilience of these systems against typhoon impacts. The approach integrates a hybrid-driven model for system failure analysis and resilience…
▽ More
This chapter addresses the increasing vulnerability of coastal regions to typhoons and the consequent power outages, emphasizing the critical role of power transmission systems in disaster resilience. It introduces a framework for assessing and enhancing the resilience of these systems against typhoon impacts. The approach integrates a hybrid-driven model for system failure analysis and resilience assessment, employing both data-driven and model-driven techniques. It includes a unique method to identify system vulnerabilities and optimal strategies for resilience enhancement, considering cost-effectiveness. The efficacy of this method is demonstrated through simulations on the IEEE RTS-79 system under realistic typhoon scenarios, showcasing its potential to guide planners in making informed decisions for disaster resilience.
△ Less
Submitted 13 June, 2024;
originally announced June 2024.
-
Lightening Anything in Medical Images
Authors:
Ben Fei,
Yixuan Li,
Weidong Yang,
Hengjun Gao,
Jingyi Xu,
Lipeng Ma,
Yatian Yang,
Pinghong Zhou
Abstract:
The development of medical imaging techniques has made a significant contribution to clinical decision-making. However, the existence of suboptimal imaging quality, as indicated by irregular illumination or imbalanced intensity, presents significant obstacles in automating disease screening, analysis, and diagnosis. Existing approaches for natural image enhancement are mostly trained with numerous…
▽ More
The development of medical imaging techniques has made a significant contribution to clinical decision-making. However, the existence of suboptimal imaging quality, as indicated by irregular illumination or imbalanced intensity, presents significant obstacles in automating disease screening, analysis, and diagnosis. Existing approaches for natural image enhancement are mostly trained with numerous paired images, presenting challenges in data collection and training costs, all while lacking the ability to generalize effectively. Here, we introduce a pioneering training-free Diffusion Model for Universal Medical Image Enhancement, named UniMIE. UniMIE demonstrates its unsupervised enhancement capabilities across various medical image modalities without the need for any fine-tuning. It accomplishes this by relying solely on a single pre-trained model from ImageNet. We conduct a comprehensive evaluation on 13 imaging modalities and over 15 medical types, demonstrating better qualities, robustness, and accuracy than other modality-specific and data-inefficient models. By delivering high-quality enhancement and corresponding accuracy downstream tasks across a wide range of tasks, UniMIE exhibits considerable potential to accelerate the advancement of diagnostic tools and customized treatment plans.
△ Less
Submitted 1 June, 2024;
originally announced June 2024.
-
Machine learning-based Near-field Emitter Localization via Grouped Hybrid Analog and Digital Massive MIMO Receive Array
Authors:
Yifan Li,
Feng Shu,
Jiatong Bai,
Cunhua Pan,
Yongpeng Wu,
Yaoliang Song,
Jiangzhou Wang
Abstract:
A fully-digital massive MIMO receive array is promising to meet the high-resolution requirement of near-field (NF) emitter localization, but it also results in the significantly increasing of hardware costs and algorithm complexity. In order to meet the future demand for green communication while maintaining high performance, the grouped hybrid analog and digital (HAD) structure is proposed for NF…
▽ More
A fully-digital massive MIMO receive array is promising to meet the high-resolution requirement of near-field (NF) emitter localization, but it also results in the significantly increasing of hardware costs and algorithm complexity. In order to meet the future demand for green communication while maintaining high performance, the grouped hybrid analog and digital (HAD) structure is proposed for NF DOA estimation, which divides the large-scale receive array into small-scale groups and each group contains several subarrays. Thus the NF direction-of-arrival (DOA) estimation problem is viewed as far-field (FF) within each group, and some existing methods such as MUSIC, Root-MUSIC, ESPRIT, etc., can be adopted. Then by angle calibration, a candidate position set is generated. To eliminate the phase ambiguity arising from the HAD structure and obtain the emitter position, two low-complexity clustering-based methods, minimum sample distance clustering (MSDC) and range scatter diagram (RSD) - angle scatter diagram (ASD)-based DBSCAN (RSD-ASD-DBSCAN), are proposed based on the distribution features of samples in the candidate position set. Then to further improve the localization accuracy, a model-driven regression network (RegNet) is designed, which consists of a multi-layer neural network (MLNN) for false solution elimination and a perceptron for angle fusion. Finally, the Cramer-Rao lower bound (CRLB) of NF emitter localization for the proposed grouped HAD structure is also derived. The simulation results show that the proposed methods can achieve CRLB at different SNR regions, the RegNet has great performance advantages at low SNR regions and the clustering-based methods have much lower complexity.
△ Less
Submitted 13 June, 2024;
originally announced June 2024.
-
Near-Field Multiuser Communications based on Sparse Arrays
Authors:
Kangjian Chen,
Chenhao Qi,
Geoffrey Ye Li,
Octavia A. Dobre
Abstract:
This paper considers near-field multiuser communications based on sparse arrays (SAs). First, for the uniform SAs (USAs), we analyze the beam gains of channel steering vectors, which shows that increasing the antenna spacings can effectively improve the spatial resolution of the antenna arrays to enhance the sum rate of multiuser communications. Then, we investigate nonuniform SAs (NSAs) to mitiga…
▽ More
This paper considers near-field multiuser communications based on sparse arrays (SAs). First, for the uniform SAs (USAs), we analyze the beam gains of channel steering vectors, which shows that increasing the antenna spacings can effectively improve the spatial resolution of the antenna arrays to enhance the sum rate of multiuser communications. Then, we investigate nonuniform SAs (NSAs) to mitigate the high multiuser interference from the grating lobes of the USAs. To maximize the sum rate of near-field multiuser communications, we optimize the antenna positions of the NSAs, where a successive convex approximation-based antenna position optimization algorithm is proposed. Moreover, we find that the channels of both the USAs and the NSAs show uniform sparsity in the defined surrogate distance-angle (SD-A) domain. Based on the channel sparsity, an on-grid SD-A-domain orthogonal matching pursuit (SDA-OMP) algorithm is developed to estimate multiuser channels. To further improve the resolution of the SDA-OMP, we also design an off-grid SD-A-domain iterative super-resolution channel estimation algorithm. Simulation results demonstrate the superior performance of the proposed methods.
△ Less
Submitted 13 June, 2024;
originally announced June 2024.
-
SCDNet: Self-supervised Learning Feature-based Speaker Change Detection
Authors:
Yue Li,
Xinsheng Wang,
Li Zhang,
Lei Xie
Abstract:
Speaker Change Detection (SCD) is to identify boundaries among speakers in a conversation. Motivated by the success of fine-tuning wav2vec 2.0 models for the SCD task, a further investigation of self-supervised learning (SSL) features for SCD is conducted in this work. Specifically, an SCD model, named SCDNet, is proposed. With this model, various state-of-the-art SSL models, including Hubert, wav…
▽ More
Speaker Change Detection (SCD) is to identify boundaries among speakers in a conversation. Motivated by the success of fine-tuning wav2vec 2.0 models for the SCD task, a further investigation of self-supervised learning (SSL) features for SCD is conducted in this work. Specifically, an SCD model, named SCDNet, is proposed. With this model, various state-of-the-art SSL models, including Hubert, wav2vec 2.0, and WavLm are investigated. To discern the most potent layer of SSL models for SCD, a learnable weighting method is employed to analyze the effectiveness of intermediate representations. Additionally, a fine-tuning-based approach is also implemented to further compare the characteristics of SSL models in the SCD task. Furthermore, a contrastive learning method is proposed to mitigate the overfitting tendencies in the training of both the fine-tuning-based method and SCDNet. Experiments showcase the superiority of WavLm in the SCD task and also demonstrate the good design of SCDNet.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques
Authors:
Yuanchao Li,
Peter Bell,
Catherine Lai
Abstract:
Text data is commonly utilized as a primary input to enhance Speech Emotion Recognition (SER) performance and reliability. However, the reliance on human-transcribed text in most studies impedes the development of practical SER systems, creating a gap between in-lab research and real-world scenarios where Automatic Speech Recognition (ASR) serves as the text source. Hence, this study benchmarks SE…
▽ More
Text data is commonly utilized as a primary input to enhance Speech Emotion Recognition (SER) performance and reliability. However, the reliance on human-transcribed text in most studies impedes the development of practical SER systems, creating a gap between in-lab research and real-world scenarios where Automatic Speech Recognition (ASR) serves as the text source. Hence, this study benchmarks SER performance using ASR transcripts with varying Word Error Rates (WERs) on well-known corpora: IEMOCAP, CMU-MOSI, and MSP-Podcast. Our evaluation includes text-only and bimodal SER with diverse fusion techniques, aiming for a comprehensive analysis that uncovers novel findings and challenges faced by current SER research. Additionally, we propose a unified ASR error-robust framework integrating ASR error correction and modality-gated fusion, achieving lower WER and higher SER results compared to the best-performing ASR transcript. This research is expected to provide insights into SER with ASR assistance, especially for real-world applications.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
Fully Few-shot Class-incremental Audio Classification Using Expandable Dual-embedding Extractor
Authors:
Yongjie Si,
Yanxiong Li,
Jialong Li,
Jiaxin Tan,
Qianhua He
Abstract:
It's assumed that training data is sufficient in base session of few-shot class-incremental audio classification. However, it's difficult to collect abundant samples for model training in base session in some practical scenarios due to the data scarcity of some classes. This paper explores a new problem of fully few-shot class-incremental audio classification with few training samples in all sessi…
▽ More
It's assumed that training data is sufficient in base session of few-shot class-incremental audio classification. However, it's difficult to collect abundant samples for model training in base session in some practical scenarios due to the data scarcity of some classes. This paper explores a new problem of fully few-shot class-incremental audio classification with few training samples in all sessions. Moreover, we propose a method using expandable dual-embedding extractor to solve it. The proposed model consists of an embedding extractor and an expandable classifier. The embedding extractor consists of a pretrained Audio Spectrogram Transformer (AST) and a finetuned AST. The expandable classifier consists of prototypes and each prototype represents a class. Experiments are conducted on three datasets (LS-100, NSynth-100 and FSC-89). Results show that our method exceeds seven baseline ones in average accuracy with statistical significance. Code is at: https://github.com/YongjieSi/EDE.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
Low-Complexity Acoustic Scene Classification Using Parallel Attention-Convolution Network
Authors:
Yanxiong Li,
Jiaxin Tan,
Guoqing Chen,
Jialong Li,
Yongjie Si,
Qianhua He
Abstract:
This work is an improved system that we submitted to task 1 of DCASE2023 challenge. We propose a method of low-complexity acoustic scene classification by a parallel attention-convolution network which consists of four modules, including pre-processing, fusion, global and local contextual information extraction. The proposed network is computationally efficient to capture global and local contextu…
▽ More
This work is an improved system that we submitted to task 1 of DCASE2023 challenge. We propose a method of low-complexity acoustic scene classification by a parallel attention-convolution network which consists of four modules, including pre-processing, fusion, global and local contextual information extraction. The proposed network is computationally efficient to capture global and local contextual information from each audio clip. In addition, we integrate other techniques into our method, such as knowledge distillation, data augmentation, and adaptive residual normalization. When evaluated on the official dataset of DCASE2023 challenge, our method obtains the highest accuracy of 56.10% with parameter number of 5.21 kilo and multiply-accumulate operations of 1.44 million. It exceeds the top two systems of DCASE2023 challenge in accuracy and complexity, and obtains state-of-the-art result. Code is at: https://github.com/Jessytan/Low-complexity-ASC.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
Codecfake: An Initial Dataset for Detecting LLM-based Deepfake Audio
Authors:
Yi Lu,
Yuankun Xie,
Ruibo Fu,
Zhengqi Wen,
Jianhua Tao,
Zhiyong Wang,
Xin Qi,
Xuefei Liu,
Yongwei Li,
Yukun Liu,
Xiaopeng Wang,
Shuchen Shi
Abstract:
With the proliferation of Large Language Model (LLM) based deepfake audio, there is an urgent need for effective detection methods. Previous deepfake audio generation methods typically involve a multi-step generation process, with the final step using a vocoder to predict the waveform from handcrafted features. However, LLM-based audio is directly generated from discrete neural codecs in an end-to…
▽ More
With the proliferation of Large Language Model (LLM) based deepfake audio, there is an urgent need for effective detection methods. Previous deepfake audio generation methods typically involve a multi-step generation process, with the final step using a vocoder to predict the waveform from handcrafted features. However, LLM-based audio is directly generated from discrete neural codecs in an end-to-end generation process, skipping the final step of vocoder processing. This poses a significant challenge for current audio deepfake detection (ADD) models based on vocoder artifacts. To effectively detect LLM-based deepfake audio, we focus on the core of the generation process, the conversion from neural codec to waveform. We propose Codecfake dataset, which is generated by seven representative neural codec methods. Experiment results show that codec-trained ADD models exhibit a 41.406% reduction in average equal error rate compared to vocoder-trained ADD models on the Codecfake test set.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
HDMba: Hyperspectral Remote Sensing Imagery Dehazing with State Space Model
Authors:
Hang Fu,
Genyun Sun,
Yinhe Li,
Jinchang Ren,
Aizhu Zhang,
Cheng Jing,
Pedram Ghamisi
Abstract:
Haze contamination in hyperspectral remote sensing images (HSI) can lead to spatial visibility degradation and spectral distortion. Haze in HSI exhibits spatial irregularity and inhomogeneous spectral distribution, with few dehazing networks available. Current CNN and Transformer-based dehazing methods fail to balance global scene recovery, local detail retention, and computational efficiency. Ins…
▽ More
Haze contamination in hyperspectral remote sensing images (HSI) can lead to spatial visibility degradation and spectral distortion. Haze in HSI exhibits spatial irregularity and inhomogeneous spectral distribution, with few dehazing networks available. Current CNN and Transformer-based dehazing methods fail to balance global scene recovery, local detail retention, and computational efficiency. Inspired by the ability of Mamba to model long-range dependencies with linear complexity, we explore its potential for HSI dehazing and propose the first HSI Dehazing Mamba (HDMba) network. Specifically, we design a novel window selective scan module (WSSM) that captures local dependencies within windows and global correlations between windows by partitioning them. This approach improves the ability of conventional Mamba in local feature extraction. By modeling the local and global spectral-spatial information flow, we achieve a comprehensive analysis of hazy regions. The DehazeMamba layer (DML), constructed by WSSM, and residual DehazeMamba (RDM) blocks, composed of DMLs, are the core components of the HDMba framework. These components effectively characterize the complex distribution of haze in HSIs, aiding in scene reconstruction and dehazing. Experimental results on the Gaofen-5 HSI dataset demonstrate that HDMba outperforms other state-of-the-art methods in dehazing performance. The code will be available at https://github.com/RsAI-lab/HDMba.
△ Less
Submitted 9 June, 2024;
originally announced June 2024.
-
SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion
Authors:
Bingsong Bai,
Fengping Wang,
Yingming Gao,
Ya Li
Abstract:
Diffusion-based singing voice conversion (SVC) models have shown better synthesis quality compared to traditional methods. However, in cross-domain SVC scenarios, where there is a significant disparity in pitch between the source and target voice domains, the models tend to generate audios with hoarseness, posing challenges in achieving high-quality vocal outputs. Therefore, in this paper, we prop…
▽ More
Diffusion-based singing voice conversion (SVC) models have shown better synthesis quality compared to traditional methods. However, in cross-domain SVC scenarios, where there is a significant disparity in pitch between the source and target voice domains, the models tend to generate audios with hoarseness, posing challenges in achieving high-quality vocal outputs. Therefore, in this paper, we propose a Self-supervised Pitch Augmentation method for Singing Voice Conversion (SPA-SVC), which can enhance the voice quality in SVC tasks without requiring additional data or increasing model parameters. We innovatively introduce a cycle pitch shifting training strategy and Structural Similarity Index (SSIM) loss into our SVC model, effectively enhancing its performance. Experimental results on the public singing datasets M4Singer indicate that our proposed method significantly improves model performance in both general SVC scenarios and particularly in cross-domain SVC scenarios.
△ Less
Submitted 9 June, 2024;
originally announced June 2024.
-
Near-Field Channel Estimation for Extremely Large-Scale Terahertz Communications
Authors:
Songjie Yang,
Yizhou Peng,
Wanting Lyu,
Ya Li,
Hongjun He,
Zhongpei Zhang,
Chau Yuen
Abstract:
Future Terahertz communications exhibit significant potential in accommodating ultra-high-rate services. Employing extremely large-scale array antennas is a key approach to realize this potential, as they can harness substantial beamforming gains to overcome the severe path loss and leverage the electromagnetic advantages in the near field. This paper proposes novel estimation methods designed to…
▽ More
Future Terahertz communications exhibit significant potential in accommodating ultra-high-rate services. Employing extremely large-scale array antennas is a key approach to realize this potential, as they can harness substantial beamforming gains to overcome the severe path loss and leverage the electromagnetic advantages in the near field. This paper proposes novel estimation methods designed to enhance efficiency in Terahertz widely-spaced multi-subarray (WSMS) systems. Initially, we introduce three sparse channel representation methods: polar-domain representation (PD-R), multi-angular-domain representation (MAD-R), and two-dimensional polar-angular-domain representation (2D-PAD-R). Each method is meticulously developed for near-field WSMS channels, capitalizing on their sparsity characteristics. Building on this, we propose four estimation frameworks using the sparse recovery theory: polar-domain estimation (PD-E), multi-angular-domain estimation (MAD-E), two-stage polar-angular-domain estimation (TS-PAD-E), and two-dimensional polar-angular-domain estimation (2D-PAD-E). Particularly, 2D-PAD-E, integrating a 2D dictionary process, and TS-PAD-E, with its sequential approach to angle and distance estimation, stand out as particularly effective for near-field angle-distance estimation, enabling decoupled calculation of these parameters. Overall, these frameworks provide versatile and efficient solutions for WSMS channel estimation, balancing low complexity with high-performance outcomes. Additionally, they represent a fresh perspective on near-field signal processing.
△ Less
Submitted 8 June, 2024;
originally announced June 2024.