-
HDN: Hybrid Deep-learning and Non-line-of-sight Reconstruction Framework for Photoacoustic Brain Imaging
Authors:
Pengcheng Wan,
Fan Zhang,
Yuting Shen,
Xin Shang,
Hulin Zhao,
Shuangli Liu,
Xiaohua Feng,
Fei Gao
Abstract:
Photoacoustic imaging (PAI) combines the high contrast of optical imaging with the deep penetration depth of ultrasonic imaging, showing great potential in cerebrovascular disease detection. However, the ultrasonic wave suffers strong attenuation and multi-scattering when it passes through the skull tissue, resulting in the distortion of the collected photoacoustic (PA) signal. In this paper, inspired by the principles of deep learning and non-line-of-sight (NLOS) imaging, we propose an image reconstruction framework named HDN (Hybrid Deep-learning and Non-line-of-sight), which consists of the signal extraction part and difference utilization part. The signal extraction part is used to correct the distorted signal and reconstruct an initial image. The difference utilization part is used to make further use of the signal difference between the distorted signal and corrected signal, reconstructing the residual image between the initial image and the target image. The test results on a PA digital brain simulation dataset show that compared with the traditional delay-and-sum (DAS) method and deep-learning-based method, HDN achieved superior performance in both signal correction and image reconstruction. Specifically for the SSIM index, the HDN reached 0.606 in imaging results, compared to 0.154 for the DAS method and 0.307 for the deep-learning-based method.
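For orientation, the delay-and-sum (DAS) baseline mentioned above can be sketched as follows. This is a generic textbook formulation, not the authors' code: it assumes a uniform speed of sound `c`, ignores the skull-induced attenuation and scattering that HDN targets, and uses hypothetical names (`delay_and_sum`, `sensors`, `grid`, `fs`).

```python
import math


def delay_and_sum(signals, sensors, grid, c, fs):
    """Generic delay-and-sum reconstruction (illustrative sketch).

    signals: one list of samples per sensor (hypothetical layout)
    sensors: (x, y) position of each sensor, in metres
    grid:    (x, y) positions of the pixels to reconstruct, in metres
    c:       assumed uniform speed of sound, in m/s
    fs:      sampling rate, in Hz
    """
    image = []
    for px, py in grid:
        total = 0.0
        for sig, (sx, sy) in zip(signals, sensors):
            # Time of flight from pixel to sensor, converted to a sample index.
            k = int(round(math.hypot(px - sx, py - sy) / c * fs))
            if 0 <= k < len(sig):
                total += sig[k]
        image.append(total)
    return image
```

Each pixel value is the sum of the sensor samples that would have arrived from that pixel, so a true source adds coherently while other locations do not. A heterogeneous medium such as the skull breaks the uniform-`c` assumption, which is the distortion the abstract describes.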
Submitted 21 August, 2024;
originally announced August 2024.
-
GAIA -- A Large Language Model for Advanced Power Dispatch
Authors:
Yuheng Cheng,
Huan Zhao,
Xiyuan Zhou,
Junhua Zhao,
Yuji Cao,
Chao Yang
Abstract:
Power dispatch is essential for providing stable, cost-effective, and eco-friendly electricity to society. However, traditional methods falter as power systems grow in scale and complexity, struggling with multitasking, swift problem-solving, and human-machine collaboration. This paper introduces GAIA, the pioneering Large Language Model (LLM) tailored for power dispatch tasks. We have developed a novel dataset construction technique that harnesses a range of data sources to fine-tune GAIA for optimal performance in this domain. This approach streamlines LLM training, allowing for the seamless integration of multidimensional data in power system management. Additionally, we have crafted specialized prompt strategies to boost GAIA's input-output efficiency in dispatch scenarios. When evaluated on the ElecBench benchmark, GAIA surpasses the baseline model LLaMA2 on multiple metrics. In practical applications, GAIA has demonstrated its ability to enhance decision-making processes, improve operational efficiency, and facilitate better human-machine interactions in power dispatch operations. This paper expands the application of LLMs to power dispatch and validates their practical utility, paving the way for future innovations in this field.
Submitted 7 August, 2024;
originally announced August 2024.
-
Semantics Guided Disentangled GAN for Chest X-ray Image Rib Segmentation
Authors:
Lili Huang,
Dexin Ma,
Xiaowei Zhao,
Chenglong Li,
Haifeng Zhao,
Jin Tang,
Chuanfu Li
Abstract:
The label annotations for chest X-ray image rib segmentation are time-consuming and laborious, and the labeling quality heavily relies on the medical knowledge of annotators. To reduce the dependency on annotated data, existing works often utilize generative adversarial networks (GANs) to generate training data. However, GAN-based methods overlook the nuanced information specific to individual organs, which degrades the generation quality of chest X-ray images. Hence, we propose a novel Semantics guided Disentangled GAN (SD-GAN), which can generate high-quality training data by fully utilizing the semantic information of different organs, for chest X-ray image rib segmentation. In particular, we use three ResNet50 branches to disentangle features of different organs, then use a decoder to combine features and generate corresponding images. To ensure that the generated images correspond semantically to the input organ labels, we employ a semantics guidance module to perform semantic guidance on the generated images. To evaluate the efficacy of SD-GAN in generating high-quality samples, we introduce modified TransUNet (MTUNet), a specialized segmentation network designed for multi-scale contextual information extraction and multi-branch decoding, effectively tackling the challenge of organ overlap. We also propose a new chest X-ray image dataset (CXRS). It includes 1250 samples from various medical institutions. Lungs, clavicles, and 24 ribs are simultaneously annotated on each chest X-ray image. The visualization and quantitative results demonstrate the efficacy of SD-GAN in generating high-quality chest X-ray image-mask pairs. Using generated data, our trained MTUNet overcomes the limitations of the data scale and outperforms other segmentation networks.
Submitted 22 July, 2024;
originally announced July 2024.
-
Exploring a Physics-Informed Decision Transformer for Distribution System Restoration: Methodology and Performance Analysis
Authors:
Hong Zhao,
Jin Wei-Kocsis,
Adel Heidari Akhijahani,
Karen L Butler-Purry
Abstract:
Driven by advancements in sensing and computing, deep reinforcement learning (DRL)-based methods have demonstrated significant potential in effectively tackling distribution system restoration (DSR) challenges under uncertain operational scenarios. However, the data-intensive nature of DRL poses obstacles in achieving satisfactory DSR solutions for large-scale, complex distribution systems. Inspired by the transformative impact of emerging foundation models, including large language models (LLMs), across various domains, this paper explores an innovative approach harnessing LLMs' powerful computing capabilities to address scalability challenges inherent in conventional DRL methods for solving DSR. To our knowledge, this study represents the first exploration of foundation models, including LLMs, in revolutionizing conventional DRL applications in power system operations. Our contributions are twofold: 1) introducing a novel LLM-powered Physics-Informed Decision Transformer (PIDT) framework that leverages LLMs to transform conventional DRL methods for DSR operations, and 2) conducting comparative studies to assess the performance of the proposed LLM-powered PIDT framework at its initial development stage for solving DSR problems. While our primary focus in this paper is on DSR operations, the proposed PIDT framework can be generalized to optimize sequential decision-making across various power system operations.
Submitted 30 June, 2024;
originally announced July 2024.
-
Prioritized experience replay-based DDQN for Unmanned Vehicle Path Planning
Authors:
Liu Lipeng,
Letian Xu,
Jiabei Liu,
Haopeng Zhao,
Tongzhou Jiang,
Tianyao Zheng
Abstract:
The path planning module is a key component of autonomous vehicle navigation and directly affects the vehicle's operating efficiency and safety. In complex environments with many obstacles, traditional planning algorithms often cannot meet the needs of intelligent navigation, which may lead to problems such as unmanned vehicles becoming trapped in dead zones. This paper proposes a path planning algorithm based on the double deep Q-network (DDQN) and combines it with the prioritized experience replay method to solve the problem that traditional path planning algorithms often fall into dead zones. A series of simulation experiments shows that the DDQN-based path planning algorithm is significantly better than other methods in terms of speed and accuracy, especially in its ability to break through dead zones in extreme environments. The results also show that the DDQN-based algorithm performs well in terms of path quality and safety. These results provide an important reference for research on automatic navigation of autonomous vehicles.
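As a rough illustration of the two ingredients named in this abstract, the sketch below shows a proportional prioritized replay buffer and a double-DQN target. It follows the commonly used formulation (sampling probability proportional to priority raised to `alpha`, importance weights scaled by `beta`, priority set to the absolute TD error); the abstract does not state which variant or hyperparameters the authors use, so the class name, defaults, and list-based storage are all assumptions.

```python
import random


class PrioritizedReplayBuffer:
    """Proportional prioritized replay (illustrative sketch, list-based)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity = capacity
        self.alpha = alpha      # how strongly priorities skew sampling
        self.eps = eps          # keeps every priority strictly positive
        self.data = []
        self.priorities = []
        self.pos = 0            # next slot to overwrite once full

    def add(self, transition):
        # New transitions get the current maximum priority so they are
        # likely to be sampled at least once.
        max_p = max(self.priorities, default=1.0)
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(max_p)
        else:
            self.data[self.pos] = transition
            self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4, rng=random):
        scaled = [p ** self.alpha for p in self.priorities]
        total = sum(scaled)
        probs = [s / total for s in scaled]
        idx = rng.choices(range(len(self.data)), weights=probs, k=batch_size)
        # Importance-sampling weights, normalized by the largest in the batch.
        n = len(self.data)
        weights = [(n * probs[i]) ** (-beta) for i in idx]
        max_w = max(weights)
        return idx, [self.data[i] for i in idx], [w / max_w for w in weights]

    def update(self, indices, td_errors):
        for i, err in zip(indices, td_errors):
            self.priorities[i] = abs(err) + self.eps


def ddqn_target(reward, done, gamma, q_online_next, q_target_next):
    """Double-DQN target: the online net picks the action, the target net scores it."""
    if done:
        return reward
    best = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best]
```

In a training loop, the TD errors computed from `ddqn_target` would be passed back through `update`, so transitions the network predicts poorly are replayed more often. A production buffer would normally use a sum tree rather than lists to avoid the linear-time sampling used here.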
Submitted 25 June, 2024;
originally announced June 2024.
-
Benchmarking Semantic Communications for Image Transmission Over MIMO Interference Channels
Authors:
Yanhu Wang,
Shuaishuai Guo,
Anming Dong,
Hui Zhao
Abstract:
Semantic communications offer promising prospects for enhancing data transmission efficiency. However, existing schemes have predominantly concentrated on point-to-point transmissions. In this paper, we aim to investigate the validity of this claim in interference scenarios compared to baseline approaches. Specifically, our focus is on general multiple-input multiple-output (MIMO) interference channels, where we propose an interference-robust semantic communication (IRSC) scheme. This scheme involves the development of transceivers based on neural networks (NNs), which integrate channel state information (CSI) either solely at the receiver or at both transmitter and receiver ends. Moreover, we establish a composite loss function for training IRSC transceivers, along with a dynamic mechanism for updating the weights of various components in the loss function to enhance system fairness among users. Experimental results demonstrate that the proposed IRSC scheme effectively learns to mitigate interference and outperforms baseline approaches, particularly in low signal-to-noise ratio (SNR) regimes.
Submitted 10 April, 2024;
originally announced June 2024.
-
The Music Maestro or The Musically Challenged, A Massive Music Evaluation Benchmark for Large Language Models
Authors:
Jiajia Li,
Lu Yang,
Mingni Tang,
Cong Chen,
Zuchao Li,
Ping Wang,
Hai Zhao
Abstract:
Benchmarks play a pivotal role in assessing the advancements of large language models (LLMs). While numerous benchmarks have been proposed to evaluate LLMs' capabilities, there is a notable absence of a dedicated benchmark for assessing their musical abilities. To address this gap, we present ZIQI-Eval, a comprehensive and large-scale music benchmark specifically designed to evaluate the music-related capabilities of LLMs. ZIQI-Eval encompasses a wide range of questions, covering 10 major categories and 56 subcategories, resulting in over 14,000 meticulously curated data entries. By leveraging ZIQI-Eval, we conduct a comprehensive evaluation of 16 LLMs to analyze their performance in the domain of music. Results indicate that all LLMs perform poorly on the ZIQI-Eval benchmark, suggesting significant room for improvement in their musical capabilities. With ZIQI-Eval, we aim to provide a standardized and robust evaluation framework that facilitates a comprehensive assessment of LLMs' music-related abilities. The dataset is available on GitHub (https://github.com/zcli-charlie/ZIQI-Eval) and HuggingFace (https://huggingface.co/datasets/MYTH-Lab/ZIQI-Eval).
Submitted 22 June, 2024;
originally announced June 2024.
-
Fair Computation Offloading for RSMA-Assisted Mobile Edge Computing Networks
Authors:
Ding Xu,
Lingjie Duan,
Haitao Zhao,
Hongbo Zhu
Abstract:
Rate splitting multiple access (RSMA) provides a flexible transmission framework that can be applied in mobile edge computing (MEC) systems. However, research on RSMA-assisted MEC systems is still in its infancy and many design issues remain unsolved, such as the MEC server and channel allocation problem in general multi-server and multi-channel scenarios as well as user fairness issues. In this regard, we study an RSMA-assisted MEC system with multiple MEC servers, channels and devices, and consider the fairness among devices. A max-min fairness computation offloading problem to maximize the minimum computation offloading rate is investigated. Since the problem is difficult to solve optimally, we develop an efficient algorithm to obtain a suboptimal solution. In particular, the time allocation and the computing frequency allocation are derived as closed-form functions of the transmit power allocation and the successive interference cancellation (SIC) decoding order, while the transmit power allocation and the SIC decoding order are jointly optimized via the alternating optimization method, the bisection search method and the successive convex approximation method. For the channel and MEC server allocation problem, we transform it into a hypergraph matching problem and solve it by matching theory. Simulation results demonstrate that the proposed RSMA-assisted MEC system outperforms current MEC systems under various system setups.
Submitted 1 August, 2024; v1 submitted 16 June, 2024;
originally announced June 2024.
-
WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark
Authors:
Linhan Ma,
Dake Guo,
Kun Song,
Yuepeng Jiang,
Shuai Wang,
Liumeng Xue,
Weiming Xu,
Huan Zhao,
Binbin Zhang,
Lei Xie
Abstract:
With the development of large text-to-speech (TTS) models and scale-up of the training data, state-of-the-art TTS systems have achieved impressive performance. In this paper, we present WenetSpeech4TTS, a multi-domain Mandarin corpus derived from the open-sourced WenetSpeech dataset. To tailor it for text-to-speech tasks, we refined WenetSpeech by adjusting segment boundaries, enhancing the audio quality, and eliminating speaker mixing within each segment. Following a more accurate transcription process and a quality-based data filtering process, the obtained WenetSpeech4TTS corpus contains 12,800 hours of paired audio-text data. Furthermore, we have created subsets of varying sizes, categorized by segment quality scores to allow for TTS model training and fine-tuning. VALL-E and NaturalSpeech 2 systems are trained and fine-tuned on these subsets to validate the usability of WenetSpeech4TTS, establishing baselines on the benchmark for fair comparison of TTS systems. The corpus and corresponding benchmarks are publicly available on huggingface.
Submitted 19 June, 2024; v1 submitted 9 June, 2024;
originally announced June 2024.
-
HDRT: Infrared Capture for HDR Imaging
Authors:
Jingchao Peng,
Thomas Bashford-Rogers,
Francesco Banterle,
Haitao Zhao,
Kurt Debattista
Abstract:
Capturing real-world lighting is a long-standing challenge in imaging, and most practical methods acquire High Dynamic Range (HDR) images by either fusing multiple exposures or boosting the dynamic range of Standard Dynamic Range (SDR) images. Multiple exposure capture is problematic as it requires longer capture times, which can often lead to ghosting problems. The main alternative, inverse tone mapping, is an ill-defined problem that is especially challenging as single captured exposures usually contain clipped and quantized values, and are therefore missing substantial amounts of content. To alleviate this, we propose a new approach, High Dynamic Range Thermal (HDRT), for HDR acquisition using a separate, commonly available, thermal infrared (IR) sensor. We propose a novel deep neural method (HDRTNet) which combines IR and SDR content to generate HDR images. HDRTNet learns to exploit IR features linked to the RGB image and the IR-specific parameters are subsequently used in a dual branch method that fuses features at shallow layers. This produces an HDR image that is significantly superior to that generated using naive fusion approaches. To validate our method, we have created the first HDR and thermal dataset, and performed extensive experiments comparing HDRTNet with the state-of-the-art. We show substantial quantitative and qualitative quality improvements on both over- and under-exposed images, showing that our approach is robust to capturing in multiple different lighting conditions.
Submitted 8 June, 2024;
originally announced June 2024.
-
An Improved Robust Total Logistic Distance Metric algorithm for Generalized Gaussian Noise and Noisy Input
Authors:
Haiquan Zhao,
Yi Peng,
Zian Cao
Abstract:
Although the known maximum total generalized correntropy (MTGC) and generalized maximum Blake-Zisserman total correntropy (GMBZTC) algorithms can maintain good performance under the errors-in-variables (EIV) model disrupted by generalized Gaussian noise, they require excessive manual adjustment of parameters, which greatly increases the practical difficulty of use. To solve this problem, the total arctangent based on logical distance metric (TACLDM) algorithm is proposed. It exploits the small number of parameters in logical distance metric (LDM) theory and improves the convergence behavior by means of the arctangent function. Compared with other competing algorithms, the TACLDM algorithm not only has fewer parameters, but also has better robustness to generalized Gaussian noise and significantly reduces the steady-state error. Furthermore, the performance of the algorithm in the generalized Gaussian noise environment is analyzed in detail in this paper. Finally, computer simulations demonstrate the outstanding performance of the TACLDM algorithm and validate the theoretical analysis in this paper.
Submitted 21 May, 2024;
originally announced May 2024.
-
3D Vessel Reconstruction from Sparse-View Dynamic DSA Images via Vessel Probability Guided Attenuation Learning
Authors:
Zhentao Liu,
Huangxuan Zhao,
Wenhui Qin,
Zhenghong Zhou,
Xinggang Wang,
Wenping Wang,
Xiaochun Lai,
Chuansheng Zheng,
Dinggang Shen,
Zhiming Cui
Abstract:
Digital Subtraction Angiography (DSA) is one of the gold standards in vascular disease diagnosis. With the help of a contrast agent, time-resolved 2D DSA images deliver comprehensive insights into blood flow information and can be utilized to reconstruct 3D vessel structures. Current commercial DSA systems typically demand hundreds of scanning views to perform reconstruction, resulting in substantial radiation exposure. However, sparse-view DSA reconstruction, aimed at reducing radiation dosage, is still underexplored in the research community. The dynamic blood flow and insufficient input of sparse-view DSA images present significant challenges to the 3D vessel reconstruction task. In this study, we propose to use a time-agnostic vessel probability field to solve this problem effectively. Our approach, termed vessel probability guided attenuation learning, represents the DSA imaging as a complementary weighted combination of static and dynamic attenuation fields, with the weights derived from the vessel probability field. Functioning as a dynamic mask, vessel probability provides proper gradients for both static and dynamic fields adaptive to different scene types. This mechanism facilitates a self-supervised decomposition between static backgrounds and dynamic contrast agent flow, and significantly improves the reconstruction quality. Our model is trained by minimizing the disparity between synthesized projections and real captured DSA images. We further employ two training strategies to improve our reconstruction quality: (1) coarse-to-fine progressive training to achieve better geometry and (2) temporal perturbed rendering loss to enforce temporal consistency. Experimental results have demonstrated superior quality on both 3D vessel reconstruction and 2D view synthesis.
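One plausible way to write the "complementary weighted combination" described above is given below. The symbols are assumptions introduced here for illustration: $p(\mathbf{x}) \in [0, 1]$ is the time-agnostic vessel probability, $\mu_s$ the static attenuation field, and $\mu_d$ the dynamic one. The abstract does not say which field receives which weight, so assigning $p$ to the dynamic field is a guess.

```latex
% Assumed form: weights p and (1 - p) sum to one ("complementary").
\mu(\mathbf{x}, t) \;=\; \bigl(1 - p(\mathbf{x})\bigr)\,\mu_s(\mathbf{x})
                   \;+\; p(\mathbf{x})\,\mu_d(\mathbf{x}, t),
\qquad p(\mathbf{x}) \in [0, 1].
```

Under this reading, points with $p \approx 0$ are explained by the static background alone, and points with $p \approx 1$ by the time-varying contrast agent flow, which is consistent with the self-supervised static/dynamic decomposition the abstract mentions.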
Submitted 17 May, 2024;
originally announced May 2024.
-
Adaptive Speech Emotion Representation Learning Based On Dynamic Graph
Authors:
Yingxue Gao,
Huan Zhao,
Zixing Zhang
Abstract:
Graph representation learning has become a hot research topic due to its powerful nonlinear fitting capability in extracting representative node embeddings. However, for sequential data such as speech signals, most traditional methods merely focus on the static graph created within a sequence, and largely overlook the intrinsic evolving patterns of these data. This may reduce the efficiency of graph representation learning for sequential data. For this reason, we propose an adaptive graph representation learning method based on dynamically evolved graphs, which are consecutively constructed on a series of subsequences segmented by a sliding window. In doing so, the method better captures local and global context information within a long sequence. Moreover, we introduce a weighted approach to update the node representation rather than the conventional average one, where the weights are calculated by a novel matrix computation based on the degree of neighboring nodes. Finally, we construct a learnable graph convolutional layer that combines the graph structure loss and classification loss to optimize the graph structure. To verify the effectiveness of the proposed method, we conducted experiments for speech emotion recognition on the IEMOCAP and RAVDESS datasets. Experimental results show that the proposed method outperforms the latest (non-)graph-based models.
Submitted 6 May, 2024;
originally announced May 2024.
-
Real-Time Convolutional Neural Network-Based Star Detection and Centroiding Method for CubeSat Star Tracker
Authors:
Hongrui Zhao,
Michael F. Lembeck,
Adrian Zhuang,
Riya Shah,
Jesse Wei
Abstract:
Star trackers are one of the most accurate celestial sensors used for absolute attitude determination. The devices detect stars in captured images and accurately compute their projected centroids on an imaging focal plane with subpixel precision. Traditional algorithms for star detection and centroiding often rely on threshold adjustments for star pixel detection and pixel brightness weighting for centroid computation. However, challenges like high sensor noise and stray light can compromise algorithm performance. This article introduces a Convolutional Neural Network (CNN)-based approach for star detection and centroiding, tailored to address the issues posed by noisy star tracker images in the presence of stray light and other artifacts. Trained using simulated star images overlaid with real sensor noise and stray light, the CNN produces both a binary segmentation map distinguishing star pixels from the background and a distance map indicating each pixel's proximity to the nearest star centroid. Leveraging this distance information alongside pixel coordinates transforms centroid calculations into a set of trilateration problems solvable via the least squares method. Our method employs efficient UNet variants for the underlying CNN architectures, and the variants' performances are evaluated. Comprehensive testing has been undertaken with synthetic image evaluations, hardware-in-the-loop assessments, and night sky tests. The tests consistently demonstrated that our method outperforms several existing algorithms in centroiding accuracy and exhibits superior resilience to high sensor noise and stray light interference. An additional benefit of our algorithms is that they can be executed in real-time on low-power edge AI processors.
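The trilateration step described above can be illustrated with a standard linearized least-squares solve. This is a minimal sketch under stated assumptions, not the authors' implementation: it takes the pixel coordinates of one star and the predicted distance from each pixel to the centroid, subtracts the first circle equation from the others to obtain a linear system, and solves the 2x2 normal equations directly. The function name and input layout are hypothetical.

```python
import math


def trilaterate(pixels, distances):
    """Least-squares centroid estimate from pixel coordinates and distances.

    pixels:    list of (x, y) pixel coordinates belonging to one star
    distances: predicted distance from each pixel to the star centroid
    """
    if len(pixels) != len(distances) or len(pixels) < 3:
        raise ValueError("need at least three pixels with matching distances")
    x0, y0 = pixels[0]
    k0 = x0 * x0 + y0 * y0 - distances[0] ** 2

    # Each remaining pixel gives one linear equation
    #   2 (x_i - x0) cx + 2 (y_i - y0) cy = (x_i^2 + y_i^2 - d_i^2) - k0.
    # Accumulate the normal equations (A^T A) c = A^T b for the 2 unknowns.
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (x, y), d in zip(pixels[1:], distances[1:]):
        ax, ay = 2.0 * (x - x0), 2.0 * (y - y0)
        rhs = (x * x + y * y - d * d) - k0
        a11 += ax * ax
        a12 += ax * ay
        a22 += ay * ay
        b1 += ax * rhs
        b2 += ay * rhs

    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("pixels are collinear; centroid is not determined")
    cx = (a22 * b1 - a12 * b2) / det
    cy = (a11 * b2 - a12 * b1) / det
    return cx, cy
```

With exact distances the estimate recovers the centroid to floating-point precision; with noisy network outputs, using every star pixel rather than the minimum three averages the errors, which is presumably why a least-squares formulation is attractive here.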
Submitted 29 April, 2024;
originally announced April 2024.
-
Generating Comprehensive Lithium Battery Charging Data with Generative AI
Authors:
Lidang Jiang,
Changyan Hu,
Sibei Ji,
Hang Zhao,
Junxiong Chen,
Ge He
Abstract:
In optimizing performance and extending the lifespan of lithium batteries, accurate state prediction is pivotal. Traditional regression and classification methods have achieved some success in battery state prediction. However, the efficacy of these data-driven approaches heavily relies on the availability and quality of public datasets. Additionally, generating electrochemical data predominantly through battery experiments is a lengthy and costly process, making it challenging to acquire high-quality electrochemical data. This difficulty, coupled with data incompleteness, significantly impacts prediction accuracy. Addressing these challenges, this study introduces the End of Life (EOL) and Equivalent Cycle Life (ECL) as conditions for generative AI models. By integrating an embedding layer into the CVAE model, we developed the Refined Conditional Variational Autoencoder (RCVAE). Through preprocessing data into a quasi-video format, our study achieves an integrated synthesis of electrochemical data, including voltage, current, temperature, and charging capacity, which is then processed by the RCVAE model. Coupled with customized training and inference algorithms, this model can generate specific electrochemical data for EOL and ECL under supervised conditions. This method provides users with a comprehensive electrochemical dataset, pioneering a new research domain for the artificial synthesis of lithium battery data. Furthermore, based on the detailed synthetic data, various battery state indicators can be calculated, offering new perspectives and possibilities for lithium battery performance prediction.
Submitted 11 April, 2024;
originally announced April 2024.
-
Ultraman: Single Image 3D Human Reconstruction with Ultra Speed and Detail
Authors:
Mingjin Chen,
Junhao Chen,
Xiaojun Ye,
Huan-ang Gao,
Xiaoxue Chen,
Zhaoxin Fan,
Hao Zhao
Abstract:
3D human body reconstruction has been a challenge in the field of computer vision. Previous methods are often time-consuming and struggle to capture the detailed appearance of the human body. In this paper, we propose a new method called Ultraman for fast reconstruction of textured 3D human models from a single image. Compared to existing techniques, Ultraman greatly improves the reconstruction speed and accuracy while preserving high-quality texture details. We present a new framework for human reconstruction consisting of three parts: geometric reconstruction, texture generation, and texture mapping. First, a mesh reconstruction framework is used, which accurately extracts 3D human shapes from a single image. At the same time, we propose a method to generate multi-view consistent images of the human body based on a single image. This is finally combined with a novel texture mapping method to optimize texture details and ensure color consistency during reconstruction. Through extensive experiments and evaluations, we demonstrate the superior performance of Ultraman on various standard datasets. In addition, Ultraman outperforms state-of-the-art methods in terms of human rendering quality and speed. Upon acceptance of the article, we will make the code and data publicly available.
Submitted 18 March, 2024;
originally announced March 2024.
-
MoreStyle: Relax Low-frequency Constraint of Fourier-based Image Reconstruction in Generalizable Medical Image Segmentation
Authors:
Haoyu Zhao,
Wenhui Dong,
Rui Yu,
Zhou Zhao,
Du Bo,
Yongchao Xu
Abstract:
The task of single-source domain generalization (SDG) in medical image segmentation is crucial due to frequent domain shifts in clinical image datasets. To address the challenge of poor generalization across different domains, we introduce a Plug-and-Play module for data augmentation called MoreStyle. MoreStyle diversifies image styles by relaxing low-frequency constraints in Fourier space, guiding the image reconstruction network. With the help of adversarial learning, MoreStyle further expands the style range and pinpoints the most intricate style combinations within latent features. To handle significant style variations, we introduce an uncertainty-weighted loss. This loss emphasizes hard-to-classify pixels resulting only from style shifts while mitigating true hard-to-classify pixels in both MoreStyle-generated and original images. Extensive experiments on two widely used benchmarks demonstrate that the proposed MoreStyle effectively helps to achieve good domain generalization ability, and has the potential to further boost the performance of some state-of-the-art SDG methods. Source code is available at https://github.com/zhaohaoyu376/morestyle.
Submitted 1 July, 2024; v1 submitted 18 March, 2024;
originally announced March 2024.
-
WIA-LD2ND: Wavelet-based Image Alignment for Self-supervised Low-Dose CT Denoising
Authors:
Haoyu Zhao,
Yuliang Gu,
Zhou Zhao,
Bo Du,
Yongchao Xu,
Rui Yu
Abstract:
In clinical examinations and diagnoses, low-dose computed tomography (LDCT) is crucial for minimizing health risks compared with normal-dose computed tomography (NDCT). However, reducing the radiation dose compromises the signal-to-noise ratio, leading to degraded quality of CT images. To address this, we analyze the LDCT denoising task from the frequency perspective based on experimental results, and then introduce a novel self-supervised CT image denoising method called WIA-LD2ND, which uses only NDCT data. The proposed WIA-LD2ND comprises two modules: Wavelet-based Image Alignment (WIA) and Frequency-Aware Multi-scale Loss (FAM). First, WIA is introduced to align NDCT with LDCT by mainly adding noise to the high-frequency components, which is the main difference between LDCT and NDCT. Second, to better capture high-frequency components and detailed information, FAM is proposed, which effectively utilizes the multi-scale feature space. Extensive experiments on two public LDCT denoising datasets demonstrate that WIA-LD2ND, using only NDCT data, outperforms several existing state-of-the-art weakly-supervised and self-supervised methods. Source code is available at https://github.com/zhaohaoyu376/WI-LD2ND.
Submitted 1 July, 2024; v1 submitted 18 March, 2024;
originally announced March 2024.
-
Matched-filter Precoded Rate Splitting Multiple Access: A Simple and Energy-efficient Design
Authors:
Hui Zhao,
Dirk Slock
Abstract:
We introduce an energy-efficient downlink rate splitting multiple access (RSMA) scheme, employing a simple matched filter (MF) for precoding. We consider a transmitter equipped with multiple antennas, serving several single-antenna users at the same frequency-time resource, each with distinct message requests. Within the conventional 1-layer RSMA framework, requested messages undergo splitting into common and private streams, which are then precoded separately before transmission. In contrast, we propose a novel strategy where only an MF is employed to precode both the common and private streams in RSMA, promising significantly improved energy efficiency and reduced complexity. We demonstrate that this MF-precoded RSMA achieves the same delivery performance as conventional RSMA, where the common stream is beamformed using maximal ratio transmission (MRT) and the private streams are precoded by MF. Taking into account imperfect channel state information at the transmitter, we proceed to analyze the delivery performance of the MF-precoded RSMA. We derive the ergodic rates for decoding the common and private streams at a target user respectively in the massive MIMO regime. Finally, numerical simulations validate the accuracy of our analytical models, as well as demonstrate the advantages over conventional RSMA.
Submitted 7 March, 2024;
originally announced March 2024.
-
APISR: Anime Production Inspired Real-World Anime Super-Resolution
Authors:
Boyang Wang,
Fengyu Yang,
Xihang Yu,
Chao Zhang,
Hanbin Zhao
Abstract:
While real-world anime super-resolution (SR) has gained increasing attention in the SR community, existing methods still adopt techniques from the photorealistic domain. In this paper, we analyze the anime production workflow and rethink how to use its characteristics for real-world anime SR. First, we argue that video networks and datasets are not necessary for anime SR due to the repeated use of hand-drawn frames. Instead, we propose an anime image collection pipeline that chooses the least compressed and most informative frames from the video sources. Based on this pipeline, we introduce the Anime Production-oriented Image (API) dataset. In addition, we identify two anime-specific challenges: distorted and faint hand-drawn lines, and unwanted color artifacts. We address the first issue by introducing a prediction-oriented compression module in the image degradation model and a pseudo-ground truth preparation with enhanced hand-drawn lines. In addition, we introduce the balanced twin perceptual loss, combining both anime and photorealistic high-level features, to mitigate unwanted color artifacts and increase visual clarity. We evaluate our method through extensive experiments on the public benchmark, showing that our method outperforms state-of-the-art anime dataset-trained approaches.
Submitted 4 April, 2024; v1 submitted 3 March, 2024;
originally announced March 2024.
-
Message-Enhanced DeGroot Model
Authors:
Huisheng Wang,
Zhanjiang Chen,
H. Vicky Zhao
Abstract:
Understanding the impact of messages on agents' opinions over social networks is important. However, to the best of our knowledge, there has been limited quantitative investigation into this phenomenon in prior work. To address this gap, this paper proposes the Message-Enhanced DeGroot model. The Bounded Brownian Message model provides a quantitative description of the message evolution, jointly considering temporal continuity, randomness, and polarization from mass media theory. The Message-Enhanced DeGroot model, combining the Bounded Brownian Message model with the traditional DeGroot model, quantitatively describes the evolution of agents' opinions under the influence of messages. We theoretically study the probability distribution and statistics of the messages and agents' opinions and quantitatively analyze the impact of messages on opinions. We also conduct simulations to validate our analyses.
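For orientation, the traditional DeGroot model that this abstract builds on can be sketched as repeated weighted averaging of opinions. The sketch below covers only that classical baseline and is not the paper's Message-Enhanced model; the weight matrix `W`, the initial opinions `x0`, and the function names are illustrative assumptions.

```python
def degroot_step(weights, opinions):
    """One classical DeGroot update: each agent adopts a weighted average of all opinions.

    `weights` is assumed to be row-stochastic (each row sums to 1).
    """
    return [sum(w * x for w, x in zip(row, opinions)) for row in weights]


def degroot_run(weights, opinions, steps):
    """Apply the DeGroot update `steps` times and return the final opinions."""
    for _ in range(steps):
        opinions = degroot_step(weights, opinions)
    return opinions


# Hypothetical 3-agent network; values chosen only for illustration.
W = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
x0 = [0.0, 0.5, 1.0]
final = degroot_run(W, x0, 50)
print(final)
```

With a connected, aperiodic trust network like this one, the opinions approach a common consensus value; a message term as described in the abstract would enter as an additional influence on each update.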
Submitted 29 February, 2024;
originally announced February 2024.
-
Power Optimization for Integrated Active and Passive Sensing in DFRC Systems
Authors:
Xingliang Lou,
Wenchao Xia,
Kai-Kit Wong,
Haitao Zhao,
Tony Q. S. Quek,
Hongbo Zhu
Abstract:
Most existing works on dual-function radar-communication (DFRC) systems focus mainly on active sensing and ignore passive sensing. To leverage multi-static sensing capability, we explore integrated active and passive sensing (IAPS) in DFRC systems to improve sensing performance. The multi-antenna base station (BS) is responsible for communication and active sensing by transmitting signals to user equipments while detecting a target according to echo signals. In contrast, passive sensing is performed at the receive access points (RAPs). We consider both the case where the capacity of the backhaul links between the RAPs and the BS is unlimited and the case where it is limited, and adopt different fusion strategies. Specifically, when the backhaul capacity is unlimited, the BS and RAPs transfer the sensing signals they have received to the central controller (CC) for signal fusion. The CC processes the signals and leverages the generalized likelihood ratio test detector to determine the presence of a target. However, when the backhaul capacity is limited, each RAP, as well as the BS, makes decisions independently and sends its binary inference results to the CC for result fusion via voting aggregation. Then, aiming to maximize the target detection probability under communication quality of service constraints, two power optimization algorithms are proposed. Finally, numerical simulations demonstrate that the sensing performance in the case of unlimited backhaul capacity is much better than that in the case of limited backhaul capacity. Moreover, the results imply that the proposed IAPS scheme outperforms only-passive and only-active sensing schemes, especially in the unlimited capacity case.
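The result-fusion step for the limited-backhaul case can be illustrated with a simple counting rule over the binary decisions. The abstract says only "voting aggregation", so the k-out-of-n rule, the majority threshold, and the function names below are assumptions made for illustration.

```python
def vote_fusion(decisions, threshold):
    """Declare a target (1) when at least `threshold` nodes reported 1, else 0.

    `decisions` holds one binary decision per sensing node (BS and RAPs).
    The k-out-of-n rule is an assumed fusion rule, not taken from the paper.
    """
    if not all(d in (0, 1) for d in decisions):
        raise ValueError("decisions must be binary (0 or 1)")
    return 1 if sum(decisions) >= threshold else 0


def majority_threshold(n_nodes):
    """Smallest vote count that is a strict majority of `n_nodes`."""
    return n_nodes // 2 + 1


# Hypothetical example: one BS and three RAPs each send a one-bit decision.
local_decisions = [1, 0, 1, 1]
fused = vote_fusion(local_decisions, majority_threshold(len(local_decisions)))
print(fused)
```

Lowering the threshold makes the fused detector more sensitive at the cost of more false alarms, which is the usual trade-off in counting-rule fusion.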
Submitted 17 February, 2024;
originally announced February 2024.
-
MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning
Authors:
Hang Zhao,
Yifei Xin,
Zhesong Yu,
Bilei Zhu,
Lu Lu,
Zejun Ma
Abstract:
In the realm of audio-language pre-training (ALP), the challenge of achieving cross-modal alignment is significant. Moreover, the integration of audio inputs with diverse distributions and task variations poses challenges in developing generic audio-language models. In this study, we present MINT, a novel ALP framework boosting audio-language models through multi-target pre-training and instruction tuning. MINT leverages the strength of frozen pre-trained audio encoders and large language models (LLMs) to improve audio-language pre-training, enabling effective transferability to both audio-text understanding and generation tasks. To address the modality gap, we introduce Bridge-Net, a trainable module that enhances cross-modality alignment and the model's ability to follow instructions for a variety of audio-text tasks. Bridge-Net is pivotal within MINT, initially enhancing audio-language representation learning through a multi-target pre-training approach. Subsequently, Bridge-Net further boosts audio-to-language generative learning by integrating a frozen language model with instruction tuning. This integration empowers MINT to extract features in a flexible and effective manner, specifically tailored to the provided instructions for diverse tasks. Experimental results demonstrate that MINT attains superior performance across various audio-language understanding and generation tasks, highlighting its robust generalization capabilities even in zero-shot scenarios.
Submitted 11 June, 2024; v1 submitted 12 February, 2024;
originally announced February 2024.
-
KS-Net: Multi-band joint speech restoration and enhancement network for 2024 ICASSP SSI Challenge
Authors:
Guochen Yu,
Runqiang Han,
Chenglin Xu,
Haoran Zhao,
Nan Li,
Chen Zhang,
Xiguang Zheng,
Chao Zhou,
Qi Huang,
Bing Yu
Abstract:
This paper presents the speech restoration and enhancement system created by the 1024K team for the ICASSP 2024 Speech Signal Improvement (SSI) Challenge. Our system consists of a generative adversarial network (GAN) in complex-domain for speech restoration and a fine-grained multi-band fusion module for speech enhancement. In the blind test set of SSI, the proposed system achieves an overall mean opinion score (MOS) of 3.49 based on ITU-T P.804 and a Word Accuracy Rate (WAcc) of 0.78 for the real-time track, as well as an overall P.804 MOS of 3.43 and a WAcc of 0.78 for the non-real-time track, ranking 1st in both tracks.
Submitted 2 February, 2024;
originally announced February 2024.
-
Active Support of Inverters for Improving Short-Term Voltage Security in 100% IBRs-Penetrated Power Systems
Authors:
Yinhong Lin,
Bin Wang,
Qinglai Guo,
Haotian Zhao,
Hongbin Sun
Abstract:
Due to the energy crisis and environmental pollution, the installed capacity of inverter-based resources (IBRs) in power grids is rapidly increasing, and grid-following control (GFL) is the most prevalent at present. Meanwhile, grid-forming control-based (GFM) devices have been installed in the grid to provide active support for frequency and voltage. In the future, GFL devices combined with GFM devices will be promising, especially in power systems with high penetration of IBRs or 100% IBRs. When a short-circuit fault occurs in the grid, the controlled current source characteristic of the GFL devices leads to insufficient dynamic voltage support (DVS), while the GFM devices usually reduce the internal voltage to limit the current. Thus, deep voltage sags and undesired disconnections of IBRs may occur. Moreover, due to the dispersed locations of IBRs and the diversity of their control strategies, the voltage support of different devices may not be fully coordinated, which is not conducive to short-term voltage security (STVS). To address this issue, a control scheme based on the simulation of transient characteristics of synchronous machines (SMs) is proposed. Then, a new fault ride-through (FRT) strategy is proposed based on the characteristic differences between GFL and GFM devices, and an optimization model of multi-device control parameters is formulated to meet the short-term voltage security constraints (SVSCs) and device capacity constraints. Finally, a fast solution method based on analytical modeling is proposed for the model. Test results based on the double-generator-one-load system, the IEEE 14-bus system, and other systems of different sizes show that the proposed method can effectively enhance the active support capability of GFL and GFM devices for the grid voltage and avoid the large-scale disconnection of IBRs.
Submitted 2 February, 2024;
originally announced February 2024.
-
Exact SINR Analysis of Matched-filter Precoder
Authors:
Hui Zhao,
Dirk Slock,
Petros Elia
Abstract:
This paper answers a fundamental question about the exact distribution of the signal-to-interference-plus-noise ratio (SINR) under matched-filter (MF) precoding. Specifically, we derive the exact expressions for the cumulative distribution function (CDF) and the probability density function (PDF) of SINR under MF precoding over Rayleigh fading channels. Based on the exact analysis, we then rigorously prove that the SINR converges to some specific distributions separately in high SNR and in massive MIMO. To simplify the exact result in general cases, we develop a good approximation by modelling the interference as a Beta distribution. We then shift to the exact analysis of the transmit rate, and answer the fundamental question: How does the exact rate converge to the well-known asymptotic rate in massive MIMO? After that, we propose a novel approximation for the ergodic rate, which performs better than various existing approximations. Finally, we present some numerical results to demonstrate the accuracy of the derived analytical models.
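As a rough numerical companion to the setting described above, the SINR under MF precoding can be sampled by Monte Carlo simulation over Rayleigh fading channels. This is an illustrative sketch only, not the paper's exact analysis: the per-user normalization of the precoder, the equal power split, the parameter values, and all function names are assumptions.

```python
import random


def rayleigh_channel(n_tx, rng):
    """Draw one user's channel: i.i.d. CN(0, 1) entries, one per transmit antenna."""
    scale = 0.5 ** 0.5
    return [complex(rng.gauss(0.0, scale), rng.gauss(0.0, scale)) for _ in range(n_tx)]


def mf_sinr(channels, user, total_power, noise_power):
    """SINR of `user` when every user j is served with the MF precoder conj(h_j)/||h_j||.

    Assumes equal power per user and per-user precoder normalization.
    """
    n_users = len(channels)
    power_per_user = total_power / n_users
    h_k = channels[user]

    def received_power(j):
        # Power received by `user` from the stream precoded for user j.
        h_j = channels[j]
        norm_j = sum(abs(c) ** 2 for c in h_j) ** 0.5
        inner = sum(a * b.conjugate() for a, b in zip(h_k, h_j)) / norm_j
        return power_per_user * abs(inner) ** 2

    signal = received_power(user)
    interference = sum(received_power(j) for j in range(n_users) if j != user)
    return signal / (interference + noise_power)


def empirical_cdf(samples, threshold):
    """Fraction of samples at or below `threshold`."""
    return sum(s <= threshold for s in samples) / len(samples)


# Hypothetical setup: 8 transmit antennas, 4 single-antenna users.
rng = random.Random(0)
samples = []
for _ in range(2000):
    channels = [rayleigh_channel(8, rng) for _ in range(4)]
    samples.append(mf_sinr(channels, 0, total_power=10.0, noise_power=1.0))
print(empirical_cdf(samples, 1.0))
```

An empirical CDF of this kind is what closed-form CDF expressions would typically be checked against in numerical validation.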
Submitted 30 January, 2024;
originally announced January 2024.
-
Continuous Target Speech Extraction: Enhancing Personalized Diarization and Extraction on Complex Recordings
Authors:
He Zhao,
Hangting Chen,
Jianwei Yu,
Yuehai Wang
Abstract:
Target speaker extraction (TSE) aims to extract the target speaker's voice from the input mixture. Previous studies have concentrated on high-overlapping scenarios. However, real-world applications usually involve more complex scenarios, such as variable speaker overlap and target speaker absence. In this paper, we introduce a framework to perform continuous TSE (C-TSE), comprising a target speaker voice activation detection (TSVAD) model and a TSE model. This framework significantly improves TSE performance on similar speakers and enhances personalization, which is lacking in traditional diarization methods. In detail, unlike conventional TSVAD deployed to refine the diarization results, the proposed Attention-target speaker voice activation detection (A-TSVAD) directly generates timestamps of the target speaker. We also explore different methods of integrating A-TSVAD and TSE by comparing the cascaded and parallel methods. The framework's effectiveness is assessed using a range of metrics, including diarization and enhancement metrics. Our experiments demonstrate that A-TSVAD outperforms conventional methods in reducing diarization errors. Furthermore, the integration of A-TSVAD and TSE in a sequential cascaded manner further enhances extraction accuracy.
Submitted 29 January, 2024;
originally announced January 2024.
-
Power System Fault Diagnosis with Quantum Computing and Efficient Gate Decomposition
Authors:
Xiang Fei,
Huan Zhao,
Xiyuan Zhou,
Junhua Zhao,
Ting Shu,
Fushuan Wen
Abstract:
Power system fault diagnosis is crucial for identifying the location and causes of faults and providing decision-making support for power dispatchers. However, most classical methods suffer from significant time consumption, memory overhead, and computational complexity as the scale of the power system concerned increases. With the rapid development of quantum computing technology, combinatorial optimization methods based on quantum computing have shown certain advantages in computational time over existing methods. Given this background, this paper proposes a quantum computing based power system fault diagnosis method with the Quantum Approximate Optimization Algorithm (QAOA). The proposed method reformulates the fault diagnosis problem as a Hamiltonian using the Ising model, which completely preserves the coupling relationship between faulty components and various operations of protective relays and circuit breakers. Additionally, to enhance problem-solving efficiency under current equipment limitations, a symmetric equivalent decomposition method for the multi-z-rotation gate is proposed. Furthermore, the small probability characteristics of power system events are utilized to reduce the number of qubits. Simulation results based on the test system show that the proposed methods can achieve the same optimal results at a faster speed compared with the classical higher-order solver provided by D-Wave.
Submitted 18 January, 2024;
originally announced January 2024.
-
Optimal Investment with Herd Behaviour Using Rational Decision Decomposition
Authors:
Huisheng Wang,
H. Vicky Zhao
Abstract:
In this paper, we study the optimal investment problem considering the herd behaviour between two agents, including one leading expert and one following agent whose decisions are influenced by those of the leading expert. In the objective functional of the optimal investment problem, we introduce the average deviation term to measure the distance between the two agents' decisions and use the variational method to find its analytical solution. To theoretically analyze the impact of the following agent's herd behaviour on his/her decision, we decompose his/her optimal decision into a convex linear combination of the two agents' rational decisions, which we call the rational decision decomposition. Furthermore, we define the weight function in the rational decision decomposition as the following agent's investment opinion to measure the preference of his/her own rational decision over that of the leading expert. We use the investment opinion to quantitatively analyze the impact of the herd behaviour, the following agent's initial wealth, the excess return, and the volatility of the risky asset on the optimal decision. We validate our analyses through numerical experiments on real stock data. This study is crucial to understanding investors' herd behaviour in decision-making and designing effective mechanisms to guide their decisions.
Submitted 15 July, 2024; v1 submitted 13 January, 2024;
originally announced January 2024.
-
Data-Driven Estimation of Failure Probabilities in Correlated Structure-Preserving Stochastic Power System Models
Authors:
Hongli Zhao,
Tyler E. Maltba,
D. Adrian Maldonado,
Emil Constantinescu,
Mihai Anitescu
Abstract:
We propose a data-driven approach for propagating uncertainty in stochastic power grid simulations and apply it to the estimation of transmission line failure probabilities. A reduced-order equation governing the evolution of the observed line energy probability density function is derived from the Fokker-Planck equation of the full-order continuous Markov process. Our method consists of estimates produced by numerically integrating this reduced equation. Numerical experiments for scalar- and vector-valued energy functions are conducted using the classical multimachine model under spatiotemporally correlated noise perturbation. The method demonstrates a more sample-efficient approach for computing probabilities of tail events when compared with kernel density estimation. Moreover, it produces vastly more accurate estimates of joint event occurrence when compared with independent models.
Submitted 4 January, 2024;
originally announced January 2024.
-
Exploiting Multipath Information for Integrated Localization and Sensing via PHD Filtering
Authors:
Yinuo Du,
Hanying Zhao,
Yang Liu,
Xinlei Yu,
Yuan Shen
Abstract:
Accurate localization and perception are pivotal for enhancing the safety and reliability of vehicles. However, current localization methods suffer from reduced accuracy when the line-of-sight (LOS) path is obstructed, or a combination of reflections and scatterings is present. In this paper, we present an integrated localization and sensing method that delivers superior performance in complex environments while being computationally efficient. Our method uniformly leverages various types of multipath components (MPCs) through the lens of random finite sets (RFSs), encompassing reflections, scatterings, and their combinations. This advancement eliminates the need for the multipath identification step and streamlines the filtering process by removing the necessity for distinct filters for different multipath types, a requirement that was critical in previous research. The simulation results demonstrate the superior performance of our method in both robustness and effectiveness, particularly in complex environments where the LOS MPC is obscured and in situations involving clutter and missed detection of MPC measurements.
Submitted 15 August, 2024; v1 submitted 24 December, 2023;
originally announced December 2023.
-
Audio Deepfake Detection with Self-Supervised WavLM and Multi-Fusion Attentive Classifier
Authors:
Yinlin Guo,
Haofan Huang,
Xi Chen,
He Zhao,
Yuehai Wang
Abstract:
With the rapid development of speech synthesis and voice conversion technologies, Audio Deepfake has become a serious threat to the Automatic Speaker Verification (ASV) system. Numerous countermeasures have been proposed to detect this type of attack. In this paper, we report our efforts to combine the self-supervised WavLM model and Multi-Fusion Attentive classifier for audio deepfake detection. Our method is the first to exploit the WavLM model to extract features that are more conducive to spoofing detection. Then, we propose a novel Multi-Fusion Attentive (MFA) classifier based on the Attentive Statistics Pooling (ASP) layer. The MFA captures the complementary information of audio features at both time and layer levels. Experiments demonstrate that our methods achieve state-of-the-art results on the ASVspoof 2021 DF set and provide competitive results on the ASVspoof 2019 and 2021 LA sets.
Submitted 9 January, 2024; v1 submitted 13 December, 2023;
originally announced December 2023.
-
Projective Parallel Single-Pixel Imaging: 3D Structured Light Scanning Under Global Illumination
Authors:
Yuxi Li,
Hongzhi Jiang,
Huijie Zhao,
Xudong Li
Abstract:
We present projective parallel single-pixel imaging (pPSI), a 3D photography method that provides a robust and efficient way to analyze the light transport behavior and enables separation of light effects due to global illumination, thereby achieving 3D structured light scanning under global illumination. The light transport behavior is described by the light transport coefficients (LTC), which contain complete information for a projector-camera pair and form a 4D data set. However, the capture of LTC is generally time-consuming. The 4D LTC in pPSI are reduced to projection functions, thereby enabling a highly efficient data capture process. We introduce the local maximum constraint, which constrains the location of candidate correspondence matching points when projections are captured. The local slice extension (LSE) method is introduced to accelerate the capture of projection functions. Optimization is conducted for pPSI under several situations. The number of projection functions required for pPSI is optimized, and the influence of the capture ratio in LSE on the accuracy of the correspondence matching points is investigated. Discussions and experiments include two typical kinds of global illumination: inter-reflections and subsurface scattering. The proposed method is validated in several challenging scenarios and outperforms the state-of-the-art methods.
Submitted 13 December, 2023;
originally announced December 2023.
-
Holistic Evaluation of GPT-4V for Biomedical Imaging
Authors:
Zhengliang Liu,
Hanqi Jiang,
Tianyang Zhong,
Zihao Wu,
Chong Ma,
Yiwei Li,
Xiaowei Yu,
Yutong Zhang,
Yi Pan,
Peng Shu,
Yanjun Lyu,
Lu Zhang,
Junjie Yao,
Peixin Dong,
Chao Cao,
Zhenxiang Xiao,
Jiaqi Wang,
Huan Zhao,
Shaochen Xu,
Yaonai Wei,
Jingyuan Chen,
Haixing Dai,
Peilong Wang,
Hao He,
Zewei Wang
, et al. (25 additional authors not shown)
Abstract:
In this paper, we present a large-scale evaluation probing GPT-4V's capabilities and limitations for biomedical image analysis. GPT-4V represents a breakthrough in artificial general intelligence (AGI) for computer vision, with applications in the biomedical domain. We assess GPT-4V's performance across 16 medical imaging categories, including radiology, oncology, ophthalmology, pathology, and more. Tasks include modality recognition, anatomy localization, disease diagnosis, report generation, and lesion detection. The extensive experiments provide insights into GPT-4V's strengths and weaknesses. Results show GPT-4V's proficiency in modality and anatomy recognition but difficulty with disease diagnosis and localization. GPT-4V excels at diagnostic report generation, indicating strong image captioning skills. While promising for biomedical imaging AI, GPT-4V requires further enhancement and validation before clinical deployment. We emphasize responsible development and testing for trustworthy integration of biomedical AGI. This rigorous evaluation of GPT-4V on diverse medical images advances understanding of multimodal large language models (LLMs) and guides future work toward impactful healthcare applications.
Submitted 10 November, 2023;
originally announced December 2023.
-
Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization
Authors:
Huan Zhao,
Li Zhang,
Yue Li,
Yannan Wang,
Hongji Wang,
Wei Rao,
Qing Wang,
Lei Xie
Abstract:
The scarcity of labeled audio-visual datasets is a constraint for training superior audio-visual speaker diarization systems. To improve the performance of audio-visual speaker diarization, we leverage pre-trained supervised and self-supervised speech models for audio-visual speaker diarization. Specifically, we adopt supervised (ResNet and ECAPA-TDNN) and self-supervised pre-trained models (WavLM and HuBERT) as the speaker and audio embedding extractors in an end-to-end audio-visual speaker diarization (AVSD) system. Then we explore the effectiveness of different frameworks, including Transformer, Conformer, and cross-attention mechanism, in the audio-visual decoder. To mitigate the degradation of performance caused by separate training, we jointly train the audio encoder, speaker encoder, and audio-visual decoder in the AVSD system. Experiments on the MISP dataset demonstrate that the proposed method achieves superior performance and obtained third place in MISP Challenge 2022.
Submitted 7 December, 2023;
originally announced December 2023.
-
Applying Large Language Models to Power Systems: Potential Security Threats
Authors:
Jiaqi Ruan,
Gaoqi Liang,
Huan Zhao,
Guolong Liu,
Xianzhuo Sun,
Jing Qiu,
Zhao Xu,
Fushuan Wen,
Zhao Yang Dong
Abstract:
Applying large language models (LLMs) to modern power systems presents a promising avenue for enhancing decision-making and operational efficiency. However, this action may also incur potential security threats, which have not been fully recognized so far. To this end, this article analyzes potential threats incurred by applying LLMs to power systems, emphasizing the need for urgent research and development of countermeasures.
Submitted 24 January, 2024; v1 submitted 22 November, 2023;
originally announced November 2023.
-
Secure Rate-Splitting Multiple Access Transmissions in LMS Systems
Authors:
Minjue He,
Hui Zhao,
Xiaqing Miao,
Shuai Wang,
Gaofeng Pan
Abstract:
This letter investigates the secure delivery performance of the rate-splitting multiple access scheme in land mobile satellite (LMS) systems, considering that the private messages intended for a terminal can be eavesdropped by any others from the broadcast signals. Specifically, the considered system has an N-antenna satellite and numerous single-antenna land users. Maximum ratio transmission (MRT) and matched-filtering (MF) precoding techniques are adopted at the satellite separately for the common messages (CMs) and for the private messages (PMs), which are both implemented based on the estimated LMS channels suffering from Shadowed-Rician fading. Then, closed-form expressions are derived for the ergodic rates for decoding the CM and for decoding the PM at the intended user, respectively, and more importantly, we also derive the ergodic secrecy rate against eavesdropping. Finally, numerical results are provided to validate the correctness of the proposed analysis models, as well as to show some interesting comparisons.
Submitted 12 November, 2023;
originally announced November 2023.
-
A Unified Remote Sensing Anomaly Detector Across Modalities and Scenes via Deviation Relationship Learning
Authors:
Jingtao Li,
Xinyu Wang,
Hengwei Zhao,
Liangpei Zhang,
Yanfei Zhong
Abstract:
Remote sensing anomaly detectors can find objects deviating from the background as potential targets. Given the diversity in earth anomaly types, a unified anomaly detector across modalities and scenes should be cost-effective and flexible to new earth observation sources and anomaly types. However, the current anomaly detectors are limited to a single modality and single scene, since they aim to learn the varying background distribution. Motivated by the universal anomaly deviation pattern, in that anomalies exhibit deviations from their local context, we exploit this characteristic to build a unified anomaly detector. Firstly, we reformulate the anomaly detection task as an undirected bilayer graph based on the deviation relationship, where the anomaly score is modeled as the conditional probability, given the pattern of the background and normal objects. The learning objective is then expressed as a conditional probability ranking problem. Furthermore, we design an instantiation of the reformulation in the data, architecture, and optimization aspects. Simulated spectral and spatial anomalies drive the instantiated architecture. The model is optimized directly for the conditional probability ranking. The proposed model was validated in five modalities (hyperspectral, visible light, synthetic aperture radar (SAR), infrared, and low light) to show its unified detection ability.
Submitted 11 October, 2023;
originally announced October 2023.
-
Plane Constraints Aided Multi-Vehicle Cooperative Positioning Using Factor Graph Optimization
Authors:
Chen Zhuang,
Hongbo Zhao
Abstract:
The development of vehicle-to-vehicle (V2V) communication facilitates the study of cooperative positioning (CP) techniques for vehicular applications. The CP methods can improve the positioning availability and accuracy by inter-vehicle ranging and data exchange between vehicles. However, the inter-vehicle ranging can be easily interrupted due to many factors such as obstacles in between two cars. Without inter-vehicle ranging, the other cooperative data such as vehicle positions will be wasted, leading to performance degradation of range-based CP methods. To fully utilize the cooperative data and mitigate the impact of inter-vehicle ranging loss, a novel cooperative positioning method aided by plane constraints is proposed in this paper. The positioning results received from cooperative vehicles are used to construct the road plane for each vehicle. The plane parameters are then introduced into the CP scheme to impose constraints on positioning solutions. The state-of-the-art factor graph optimization (FGO) algorithm is employed to integrate the plane constraints with raw data of Global Navigation Satellite Systems (GNSS) as well as inter-vehicle ranging measurements. The proposed CP method has the ability to resist the interruptions of inter-vehicle ranging since the plane constraints are computed by just using position-related data. A vehicle can still benefit from the position data of cooperative vehicles even if the inter-vehicle ranging is unavailable. The experimental results indicate the superiority of the proposed CP method in positioning performance over the existing methods, especially when the inter-vehicle ranging interruptions occur.
Submitted 10 October, 2023;
originally announced October 2023.
-
Cross-adversarial local distribution regularization for semi-supervised medical image segmentation
Authors:
Thanh Nguyen-Duc,
Trung Le,
Roland Bammer,
He Zhao,
Jianfei Cai,
Dinh Phung
Abstract:
Medical semi-supervised segmentation is a technique where a model is trained to segment objects of interest in medical images with limited annotated data. Existing semi-supervised segmentation methods are usually based on the smoothness assumption. This assumption implies that the model output distributions of two similar data samples are encouraged to be invariant. In other words, the smoothness assumption states that similar samples (e.g., adding small perturbations to an image) should have similar outputs. In this paper, we introduce a novel cross-adversarial local distribution (Cross-ALD) regularization to further enhance the smoothness assumption for the semi-supervised medical image segmentation task. Comprehensive experiments show that Cross-ALD achieves state-of-the-art performance against many recent methods on the public LA and ACDC datasets.
Submitted 2 October, 2023;
originally announced October 2023.
-
Scalable Neural Dynamic Equivalence for Power Systems
Authors:
Qing Shen,
Yifan Zhou,
Huanfeng Zhao,
Peng Zhang,
Qiang Zhang,
Slava Maslenniko,
Xiaochuan Luo
Abstract:
Traditional grid analytics are model-based, relying strongly on accurate models of power systems, especially the dynamic models of generators, controllers, loads and other dynamic components. However, acquiring thorough power system models can be impractical in real operation due to inaccessible system parameters and privacy of consumers, which necessitate data-driven dynamic equivalencing of unknown subsystems. Learning reliable dynamic equivalent models for the external systems from SCADA and PMU data, however, is a long-standing intractable problem in power system analysis due to complicated nonlinearity and unforeseeable dynamic modes of power systems. This paper advances a practical application of neural dynamic equivalence (NeuDyE) called Driving Port NeuDyE (DP-NeuDyE), which exploits physics-informed machine learning and neural-ordinary-differential-equations (ODE-NET) to discover a dynamic equivalence of external power grids while preserving its dynamic behaviors after disturbances. The new contributions are threefold: (1) a NeuDyE formulation to enable a continuous-time, data-driven dynamic equivalence of power systems, saving the effort and expense of acquiring inaccessible system; (2) an introduction of a Physics-Informed NeuDyE learning (PI-NeuDyE) to actively control the closed-loop accuracy of NeuDyE; and (3) a DP-NeuDyE to reduce the number of inputs required for the training. We conduct extensive case studies on the NPCC system to validate the generalizability and accuracy of both PI-NeuDyE and DP-NeuDyE, which span a multitude of scenarios, differing in the time required for fault clearance, the specific fault locations, and the limitations of data. Test results have demonstrated the scalability and practicality of NeuDyE, showing its potential to be used in ISO and utility control centers for online transient stability analysis and for planning purposes.
Submitted 21 March, 2024; v1 submitted 28 September, 2023;
originally announced September 2023.
-
Adaptive Unscented Kalman Filter under Minimum Error Entropy with Fiducial Points for Non-Gaussian Systems
Authors:
Boyu Tian,
Haiquan Zhao
Abstract:
The minimum error entropy (MEE) has been extensively used in the unscented Kalman filter (UKF) to handle impulsive noises or abnormal measurement data in non-Gaussian systems. However, the MEE-UKF has poor numerical stability due to the inverse operation of a singular matrix. In this paper, a novel UKF based on minimum error entropy with fiducial points (MEEF) is proposed to address the problem of a non-positive definite key matrix. By adding the correntropy to the error entropy, the proposed algorithm further enhances the ability to suppress impulse noise and outliers. At the same time, considering the uncertainty of noise distribution, the modified Sage-Husa estimator of noise statistics is introduced to adaptively update the noise covariance matrix. In addition, the convergence analysis of the proposed algorithm provides guidance for the selection of kernel width. The robustness and estimation accuracy of the proposed algorithm are demonstrated by state tracking examples under complex non-Gaussian noises.
Submitted 18 September, 2023;
originally announced September 2023.
-
Intelligent machines work in unstructured environments by differential neuromorphic computing
Authors:
Shengbo Wang,
Shuo Gao,
Chenyu Tang,
Edoardo Occhipinti,
Cong Li,
Shurui Wang,
Jiaqi Wang,
Hubin Zhao,
Guohua Hu,
Arokia Nathan,
Ravinder Dahiya,
Luigi Occhipinti
Abstract:
Efficient operation of intelligent machines in the real world requires methods that allow them to understand and predict the uncertainties presented by unstructured environments with good accuracy, scalability and generalization, similar to humans. Current methods rely on pretrained networks instead of continuously learning from the dynamic signal properties of working environments and suffer inherent limitations, such as data-hungry procedures and limited generalization capabilities. Herein, we present a memristor-based differential neuromorphic computing, perceptual signal processing and learning method for intelligent machines. The main features of environmental information, such as amplification (>720%) and adaptation (<50%) of mechanical stimuli encoded in memristors, are extracted to obtain human-like processing in unstructured environments. The developed method takes advantage of the intrinsic multi-state property of memristors and exhibits good scalability and generalization, as confirmed by validation in two different application scenarios: object grasping and autonomous driving. In the former, a robot hand experimentally realizes safe and stable grasping by quickly learning (in ~1 ms) unknown object features (e.g., sharp corner and smooth surface) with a single memristor. In the latter, the decision-making information of 10 unstructured environments in autonomous driving (e.g., overtaking cars, pedestrians) is accurately (94%) extracted with a 40×25 memristor array. By mimicking the intrinsic nature of human low-level perception mechanisms, the electronic memristive neuromorphic circuit-based method presented here shows the potential for adapting to diverse sensing technologies and helping intelligent machines generate smart high-level decisions in the real world.
Submitted 17 November, 2023; v1 submitted 15 September, 2023;
originally announced September 2023.
-
Opinion Dynamics in Two-Step Process: Message Sources, Opinion Leaders and Normal Agents
Authors:
Huisheng Wang,
Yuejiang Li,
Yiqing Lin,
H. Vicky Zhao
Abstract:
According to mass media theory, the dissemination of messages and the evolution of opinions in social networks follow a two-step process. First, opinion leaders receive the message from the message sources, and then they transmit their opinions to normal agents. However, most opinion models only consider the evolution of opinions within a single network, which fails to capture the two-step process accurately. To address this limitation, we propose a unified framework called the Two-Step Model, which analyzes the communication process among message sources, opinion leaders, and normal agents. In this study, we examine the steady-state opinions and stability of the Two-Step Model. Our findings reveal that several factors, such as message distribution, initial opinion, level of stubbornness, and preference coefficient, influence the sample mean and variance of steady-state opinions. Notably, normal agents' opinions tend to be influenced by opinion leaders in the two-step process. We also conduct numerical and social experiments to validate the accuracy of the Two-Step Model, which outperforms other models on average. Our results provide valuable insights into the factors that shape social opinions and can guide the development of effective strategies for opinion guidance in social networks.
Submitted 11 September, 2023;
originally announced September 2023.
-
Generalized Minimum Error with Fiducial Points Criterion for Robust Learning
Authors:
Haiquan Zhao,
Yuan Gao,
Yingying Zhu
Abstract:
The conventional Minimum Error Entropy (MEE) criterion has its limitations, showing reduced sensitivity to error mean values and uncertainty regarding error probability density function locations. To overcome this, an MEE with fiducial points (MEEF) criterion was presented. However, the efficacy of the MEEF is not consistent due to its reliance on a fixed Gaussian kernel. In this paper, a generalized minimum error with fiducial points criterion (GMEEF) is presented by adopting the Generalized Gaussian Density (GGD) function as the kernel. The GGD extends the Gaussian distribution by introducing a shape parameter that provides more control over the tail behavior and peakedness. In addition, due to the high computational complexity of the GMEEF criterion, the quantization idea is introduced to notably lower the computational load of the GMEEF-type algorithm. Finally, the proposed criteria are applied to the domains of adaptive filtering, kernel recursive algorithms, and multilayer perceptrons. Several numerical simulations, which cover system identification, acoustic echo cancellation, time series prediction, and supervised classification, indicate that the novel algorithms perform excellently.
Submitted 8 September, 2023;
originally announced September 2023.
-
Mask-CTC-based Encoder Pre-training for Streaming End-to-End Speech Recognition
Authors:
Huaibo Zhao,
Yosuke Higuchi,
Yusuke Kida,
Tetsuji Ogawa,
Tetsunori Kobayashi
Abstract:
Achieving high accuracy with low latency has always been a challenge in streaming end-to-end automatic speech recognition (ASR) systems. By attending to more future contexts, a streaming ASR model achieves higher accuracy but results in larger latency, which hurts the streaming performance. In the Mask-CTC framework, an encoder network is trained to learn the feature representation that anticipates long-term contexts, which is desirable for streaming ASR. Mask-CTC-based encoder pre-training has been shown to be beneficial in achieving low latency and high accuracy for triggered attention-based ASR. However, the effectiveness of this method has not been demonstrated for various model architectures, nor has it been verified that the encoder has the expected look-ahead capability to reduce latency. This study, therefore, examines the effectiveness of Mask-CTC-based pre-training for models with different architectures, such as Transformer-Transducer and contextual block streaming ASR. We also discuss the effect of the proposed pre-training method on obtaining accurate output spike timing.
Submitted 8 September, 2023;
originally announced September 2023.
-
TiAVox: Time-aware Attenuation Voxels for Sparse-view 4D DSA Reconstruction
Authors:
Zhenghong Zhou,
Huangxuan Zhao,
Jiemin Fang,
Dongqiao Xiang,
Lei Chen,
Lingxia Wu,
Feihong Wu,
Wenyu Liu,
Chuansheng Zheng,
Xinggang Wang
Abstract:
Four-dimensional Digital Subtraction Angiography (4D DSA) plays a critical role in the diagnosis of many medical diseases, such as Arteriovenous Malformations (AVM) and Arteriovenous Fistulas (AVF). Despite its significant application value, the reconstruction of 4D DSA demands numerous views to effectively model the intricate vessels and radiocontrast flow, thereby implying a significant radiation dose. To address this high radiation issue, we propose a Time-aware Attenuation Voxel (TiAVox) approach for sparse-view 4D DSA reconstruction, which paves the way for high-quality 4D imaging. Additionally, 2D and 3D DSA imaging results can be generated from the reconstructed 4D DSA images. TiAVox introduces 4D attenuation voxel grids, which reflect attenuation properties from both spatial and temporal dimensions. It is optimized by minimizing discrepancies between the rendered images and sparse 2D DSA images. Without any neural network involved, TiAVox enjoys specific physical interpretability. The parameters of each learnable voxel represent the attenuation coefficients. We validated the TiAVox approach on both clinical and simulated datasets, achieving a Peak Signal-to-Noise Ratio (PSNR) of 31.23 for novel view synthesis using only 30 views on the clinically sourced dataset, whereas traditional Feldkamp-Davis-Kress methods required 133 views. Similarly, with merely 10 views from the synthetic dataset, TiAVox yielded a PSNR of 34.32 for novel view synthesis and 41.40 for 3D reconstruction. We also executed ablation studies to corroborate the essential components of TiAVox. The code will be publicly available.
Submitted 19 December, 2023; v1 submitted 5 September, 2023;
originally announced September 2023.
-
Design and Control of a Bio-inspired Wheeled Bipedal Robot
Authors:
Haizhou Zhao,
Lei Yu,
Siying Qin,
Gumin Jin,
Yuqing Chen
Abstract:
Wheeled bipedal robots (WBRs) have the capability to execute agile and versatile locomotion tasks. This paper focuses on improving the dynamic performance of WBRs through innovations in both hardware and software development. Inspired by the human barbell squat, a bionic mechanical design is proposed and implemented as shown in Fig. 1. It distributes the weight onto its hip and knee joints to improve the effectiveness of joint motors while maintaining a relatively large workspace of the base link. Meanwhile, a novel model-based controller is devised, synthesizing a height-variable wheeled linear inverted pendulum (HV-wLIP) model, a Control Lyapunov Function (CLF) and whole-body dynamics for theoretically guaranteed stability and efficient computation. Compared with other alternatives, as a more accurate approximation of the WBR dynamics, the HV-wLIP can enable a more agile response and provide a theoretical basis for WBR controller design. Experimental results demonstrate that the robot can perform a human-like deep squat and is capable of tracking CoM velocity while manipulating base states. Furthermore, it exhibited robustness against external disturbances and unknown terrains even in the wild.
Submitted 16 July, 2024; v1 submitted 25 August, 2023;
originally announced August 2023.
-
Radio2Text: Streaming Speech Recognition Using mmWave Radio Signals
Authors:
Running Zhao,
Jiangtao Yu,
Hang Zhao,
Edith C. H. Ngai
Abstract:
Millimeter wave (mmWave) based speech recognition opens up more possibilities for audio-related applications, such as conference speech transcription and eavesdropping. However, considering the practicality in real scenarios, latency and recognizable vocabulary size are two critical factors that cannot be overlooked. In this paper, we propose Radio2Text, the first mmWave-based system for streaming automatic speech recognition (ASR) with a vocabulary size exceeding 13,000 words. Radio2Text is based on a tailored streaming Transformer that is capable of effectively learning representations of speech-related features, paving the way for streaming ASR with a large vocabulary. To alleviate the inability of streaming networks to access entire future inputs, we propose Guidance Initialization, which facilitates the transfer of feature knowledge related to the global context from the non-streaming Transformer to the tailored streaming Transformer through weight inheritance. Further, we propose a cross-modal structure based on knowledge distillation (KD), named cross-modal KD, to mitigate the negative effect of low-quality mmWave signals on recognition performance. In the cross-modal KD, the audio streaming Transformer provides feature and response guidance that inherit fruitful and accurate speech information to supervise the training of the tailored radio streaming Transformer. The experimental results show that our Radio2Text can achieve a character error rate of 5.7% and a word error rate of 9.4% for the recognition of a vocabulary consisting of over 13,000 words.
Submitted 15 August, 2023;
originally announced August 2023.
-
Striking The Right Balance: Three-Dimensional Ocean Sound Speed Field Reconstruction Using Tensor Neural Networks
Authors:
Siyuan Li,
Lei Cheng,
Ting Zhang,
Hangfang Zhao,
Jianlong Li
Abstract:
Accurately reconstructing a three-dimensional ocean sound speed field (3D SSF) is essential for various ocean acoustic applications, but the sparsity and uncertainty of sound speed samples across a vast ocean region make it a challenging task. To tackle this challenge, a large body of reconstruction methods has been developed, including spline interpolation, matrix/tensor-based completion, and deep neural networks-based reconstruction. However, a principled analysis of their effectiveness in 3D SSF reconstruction is still lacking. This paper performs a thorough analysis of the reconstruction error and highlights the need for a balanced representation model that integrates both expressiveness and conciseness. To meet this requirement, a 3D SSF-tailored tensor deep neural network is proposed, which utilizes tensor computations and deep neural network architectures to achieve remarkable 3D SSF reconstruction. The proposed model not only includes the previous tensor-based SSF representation model as a special case, but also has a natural ability to reject noise. The numerical results using the South China Sea 3D SSF data demonstrate that the proposed method outperforms state-of-the-art methods. The code is available at https://github.com/OceanSTARLab/Tensor-Neural-Network.
Submitted 9 August, 2023;
originally announced August 2023.