Search | arXiv e-print repository

Performance Analysis of Pair-wise Symbol Detection in Uplink NOMA-ISaC Systems

Authors: Haofeng Liu, Emad Alsusa, Arafat Al-Dweik

Abstract: This paper investigates the bit error rate (BER) and outage probability performance of integrated sensing and communication (ISaC) in uplink non-orthogonal multiple access (NOMA) based Internet of Things (IoT) systems. Specifically, we consider an ISaC system where the radar signal is designed to be orthogonal to the communication signal over two symbol periods so that its interference on the comm… ▽ More This paper investigates the bit error rate (BER) and outage probability performance of integrated sensing and communication (ISaC) in uplink non-orthogonal multiple access (NOMA) based Internet of Things (IoT) systems. Specifically, we consider an ISaC system where the radar signal is designed to be orthogonal to the communication signal over two symbol periods so that its interference on the communication signal is completely eliminated when detecting the data in pairs of consecutive symbols. This is akin to multi-symbol rate NOMA systems except in this case as the radar bears no data, its waveform is manipulated to be orthogonal to the transmitted communication signal. To eliminate potential decision ambiguity during the pair-wise data detection, a constant phase-offset between adjacent communication symbols is applied at the transmitter. The performance of such a system is analyzed through deriving analytical expressions for the exact BER of zero-forcing (ZF) based receivers. In addition, close-form expressions for the upper BER bound and the outage probability for both ZF and the joint maximum likelihood (JML) receivers are presented. The results show that the derived expressions are perfectly matched with the simulation results. The obtained expressions provide an insight into the performance of this novel ISaC system including demonstrating the impact of various parameters and showing how the ZF receiver provides a useful trade-off between performance and complexity relative to the JML receiver. △ Less

Submitted 30 August, 2024; originally announced August 2024.

arXiv:2408.15632 [pdf, other]

Structural Optimization of Lightweight Bipedal Robot via SERL

Authors: Yi Cheng, Chenxi Han, Yuheng Min, Linqi Ye, Houde Liu, Hang Liu

Abstract: Designing a bipedal robot is a complex and challenging task, especially when dealing with a multitude of structural parameters. Traditional design methods often rely on human intuition and experience. However, such approaches are time-consuming, labor-intensive, lack theoretical guidance and hard to obtain optimal design results within vast design spaces, thus failing to full exploit the inherent… ▽ More Designing a bipedal robot is a complex and challenging task, especially when dealing with a multitude of structural parameters. Traditional design methods often rely on human intuition and experience. However, such approaches are time-consuming, labor-intensive, lack theoretical guidance and hard to obtain optimal design results within vast design spaces, thus failing to full exploit the inherent performance potential of robots. In this context, this paper introduces the SERL (Structure Evolution Reinforcement Learning) algorithm, which combines reinforcement learning for locomotion tasks with evolution algorithms. The aim is to identify the optimal parameter combinations within a given multidimensional design space. Through the SERL algorithm, we successfully designed a bipedal robot named Wow Orin, where the optimal leg length are obtained through optimization based on body structure and motor torque. We have experimentally validated the effectiveness of the SERL algorithm, which is capable of optimizing the best structure within specified design space and task conditions. Additionally, to assess the performance gap between our designed robot and the current state-of-the-art robots, we compared Wow Orin with mainstream bipedal robots Cassie and Unitree H1. A series of experimental results demonstrate the Outstanding energy efficiency and performance of Wow Orin, further validating the feasibility of applying the SERL algorithm to practical design. △ Less

Submitted 28 August, 2024; originally announced August 2024.

arXiv:2408.14585 [pdf, other]

Global-Local Distillation Network-Based Audio-Visual Speaker Tracking with Incomplete Modalities

Authors: Yidi Li, Yihan Li, Yixin Guo, Bin Ren, Zhenhuan Xu, Hao Guo, Hong Liu, Nicu Sebe

Abstract: In speaker tracking research, integrating and complementing multi-modal data is a crucial strategy for improving the accuracy and robustness of tracking systems. However, tracking with incomplete modalities remains a challenging issue due to noisy observations caused by occlusion, acoustic noise, and sensor failures. Especially when there is missing data in multiple modalities, the performance of… ▽ More In speaker tracking research, integrating and complementing multi-modal data is a crucial strategy for improving the accuracy and robustness of tracking systems. However, tracking with incomplete modalities remains a challenging issue due to noisy observations caused by occlusion, acoustic noise, and sensor failures. Especially when there is missing data in multiple modalities, the performance of existing multi-modal fusion methods tends to decrease. To this end, we propose a Global-Local Distillation-based Tracker (GLDTracker) for robust audio-visual speaker tracking. GLDTracker is driven by a teacher-student distillation model, enabling the flexible fusion of incomplete information from each modality. The teacher network processes global signals captured by camera and microphone arrays, and the student network handles local information subject to visual occlusion and missing audio channels. By transferring knowledge from teacher to student, the student network can better adapt to complex dynamic scenes with incomplete observations. In the student network, a global feature reconstruction module based on the generative adversarial network is constructed to reconstruct global features from feature embedding with missing local information. Furthermore, a multi-modal multi-level fusion attention is introduced to integrate the incomplete feature and the reconstructed feature, leveraging the complementarity and consistency of audio-visual and global-local features. Experimental results on the AV16.3 dataset demonstrate that the proposed GLDTracker outperforms existing state-of-the-art audio-visual trackers and achieves leading performance on both standard and incomplete modalities datasets, highlighting its superiority and robustness in complex conditions. The code and models will be available. △ Less

Submitted 26 August, 2024; originally announced August 2024.

Comments: Audio-Visual Speaker Tracking with Incomplete Modalities

arXiv:2408.09357 [pdf, other]

Meta-Learning Empowered Meta-Face: Personalized Speaking Style Adaptation for Audio-Driven 3D Talking Face Animation

Authors: Xukun Zhou, Fengxin Li, Ziqiao Peng, Kejian Wu, Jun He, Biao Qin, Zhaoxin Fan, Hongyan Liu

Abstract: Audio-driven 3D face animation is increasingly vital in live streaming and augmented reality applications. While remarkable progress has been observed, most existing approaches are designed for specific individuals with predefined speaking styles, thus neglecting the adaptability to varied speaking styles. To address this limitation, this paper introduces MetaFace, a novel methodology meticulously… ▽ More Audio-driven 3D face animation is increasingly vital in live streaming and augmented reality applications. While remarkable progress has been observed, most existing approaches are designed for specific individuals with predefined speaking styles, thus neglecting the adaptability to varied speaking styles. To address this limitation, this paper introduces MetaFace, a novel methodology meticulously crafted for speaking style adaptation. Grounded in the novel concept of meta-learning, MetaFace is composed of several key components: the Robust Meta Initialization Stage (RMIS) for fundamental speaking style adaptation, the Dynamic Relation Mining Neural Process (DRMN) for forging connections between observed and unobserved speaking styles, and the Low-rank Matrix Memory Reduction Approach to enhance the efficiency of model optimization as well as learning style details. Leveraging these novel designs, MetaFace not only significantly outperforms robust existing baselines but also establishes a new state-of-the-art, as substantiated by our experimental results. △ Less

Submitted 18 August, 2024; originally announced August 2024.

arXiv:2408.07931 [pdf, other]

Surgical SAM 2: Real-time Segment Anything in Surgical Video by Efficient Frame Pruning

Authors: Haofeng Liu, Erli Zhang, Junde Wu, Mingxuan Hong, Yueming Jin

Abstract: Surgical video segmentation is a critical task in computer-assisted surgery and is vital for enhancing surgical quality and patient outcomes. Recently, the Segment Anything Model 2 (SAM2) framework has shown superior advancements in image and video segmentation. However, SAM2 struggles with efficiency due to the high computational demands of processing high-resolution images and complex and long-r… ▽ More Surgical video segmentation is a critical task in computer-assisted surgery and is vital for enhancing surgical quality and patient outcomes. Recently, the Segment Anything Model 2 (SAM2) framework has shown superior advancements in image and video segmentation. However, SAM2 struggles with efficiency due to the high computational demands of processing high-resolution images and complex and long-range temporal dynamics in surgical videos. To address these challenges, we introduce Surgical SAM 2 (SurgSAM-2), an advanced model to utilize SAM2 with an Efficient Frame Pruning (EFP) mechanism, to facilitate real-time surgical video segmentation. The EFP mechanism dynamically manages the memory bank by selectively retaining only the most informative frames, reducing memory usage and computational cost while maintaining high segmentation accuracy. Our extensive experiments demonstrate that SurgSAM-2 significantly improves both efficiency and segmentation accuracy compared to the vanilla SAM2. Remarkably, SurgSAM-2 achieves a 3$\times$ FPS compared with SAM2, while also delivering state-of-the-art performance after fine-tuning with lower-resolution data. These advancements establish SurgSAM-2 as a leading model for surgical video analysis, making real-time surgical video segmentation in resource-constrained environments a feasible reality. △ Less

Submitted 15 August, 2024; originally announced August 2024.

Comments: 16 pages, 2 figures

arXiv:2408.07919 [pdf, other]

doi 10.1145/3664647.3681145

Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training

Authors: Yiming Li, Zhifang Guo, Xiangdong Wang, Hong Liu

Abstract: Recent advances have been witnessed in audio-language joint learning, such as CLAP, that shows much success in multi-modal understanding tasks. These models usually aggregate uni-modal local representations, namely frame or word features, into global ones, on which the contrastive loss is employed to reach coarse-grained cross-modal alignment. However, frame-level correspondence with texts may be… ▽ More Recent advances have been witnessed in audio-language joint learning, such as CLAP, that shows much success in multi-modal understanding tasks. These models usually aggregate uni-modal local representations, namely frame or word features, into global ones, on which the contrastive loss is employed to reach coarse-grained cross-modal alignment. However, frame-level correspondence with texts may be ignored, making it ill-posed on explainability and fine-grained challenges which may also undermine performances on coarse-grained tasks. In this work, we aim to improve both coarse- and fine-grained audio-language alignment in large-scale contrastive pre-training. To unify the granularity and latent distribution of two modalities, a shared codebook is adopted to represent multi-modal global features with common bases, and each codeword is regularized to encode modality-shared semantics, bridging the gap between frame and word features. Based on it, a locality-aware block is involved to purify local patterns, and a hard-negative guided loss is devised to boost alignment. Experiments on eleven zero-shot coarse- and fine-grained tasks suggest that our model not only surpasses the baseline CLAP significantly but also yields superior or competitive results compared to current SOTA works. △ Less

Submitted 15 August, 2024; originally announced August 2024.

Comments: ACM MM 2024 (Oral)

arXiv:2408.06790 [pdf, other]

Residual Deep Reinforcement Learning for Inverter-based Volt-Var Control

Authors: Qiong Liu, Ye Guo, Lirong Deng, Haotian Liu, Dongyu Li, Hongbin Sun

Abstract: A residual deep reinforcement learning (RDRL) approach is proposed by integrating DRL with model-based optimization for inverter-based volt-var control in active distribution networks when the accurate power flow model is unknown. RDRL learns a residual action with a reduced residual action space, based on the action of the model-based approach with an approximate model. RDRL inherits the control… ▽ More A residual deep reinforcement learning (RDRL) approach is proposed by integrating DRL with model-based optimization for inverter-based volt-var control in active distribution networks when the accurate power flow model is unknown. RDRL learns a residual action with a reduced residual action space, based on the action of the model-based approach with an approximate model. RDRL inherits the control capability of the approximate-model-based optimization and enhances the policy optimization capability by residual policy learning. Additionally, it improves the approximation accuracy of the critic and reduces the search difficulties of the actor by reducing residual action space. To address the issues of "too small" or "too large" residual action space of RDRL and further improve the optimization performance, we extend RDRL to a boosting RDRL approach. It selects a much smaller residual action space and learns a residual policy by using the policy of RDRL as a base policy. Simulations demonstrate that RDRL and boosting RDRL improve the optimization performance considerably throughout the learning stage and verify their rationales point-by-point, including 1) inheriting the capability of the approximate model-based optimization, 2) residual policy learning, and 3) learning in a reduced action space. △ Less

Submitted 13 August, 2024; originally announced August 2024.

Comments: arXiv admin note: text overlap with arXiv:2210.07360

arXiv:2408.05777 [pdf, other]

Seg-CycleGAN : SAR-to-optical image translation guided by a downstream task

Authors: Hannuo Zhang, Huihui Li, Jiarui Lin, Yujie Zhang, Jianghua Fan, Hang Liu

Abstract: Optical remote sensing and Synthetic Aperture Radar(SAR) remote sensing are crucial for earth observation, offering complementary capabilities. While optical sensors provide high-quality images, they are limited by weather and lighting conditions. In contrast, SAR sensors can operate effectively under adverse conditions. This letter proposes a GAN-based SAR-to-optical image translation method name… ▽ More Optical remote sensing and Synthetic Aperture Radar(SAR) remote sensing are crucial for earth observation, offering complementary capabilities. While optical sensors provide high-quality images, they are limited by weather and lighting conditions. In contrast, SAR sensors can operate effectively under adverse conditions. This letter proposes a GAN-based SAR-to-optical image translation method named Seg-CycleGAN, designed to enhance the accuracy of ship target translation by leveraging semantic information from a pre-trained semantic segmentation model. Our method utilizes the downstream task of ship target semantic segmentation to guide the training of image translation network, improving the quality of output Optical-styled images. The potential of foundation-model-annotated datasets in SAR-to-optical translation tasks is revealed. This work suggests broader research and applications for downstream-task-guided frameworks. The code will be available at https://github.com/NPULHH/ △ Less

Submitted 11 August, 2024; originally announced August 2024.

Comments: 8 pages, 5 figures

arXiv:2408.04912 [pdf, other]

doi 10.1145/3675094.3678488

AcousAF: Acoustic Sensing-Based Atrial Fibrillation Detection System for Mobile Phones

Authors: Xuanyu Liu, Haoxian Liu, Jiao Li, Zongqi Yang, Yi Huang, Jin Zhang

Abstract: Atrial fibrillation (AF) is characterized by irregular electrical impulses originating in the atria, which can lead to severe complications and even death. Due to the intermittent nature of the AF, early and timely monitoring of AF is critical for patients to prevent further exacerbation of the condition. Although ambulatory ECG Holter monitors provide accurate monitoring, the high cost of these d… ▽ More Atrial fibrillation (AF) is characterized by irregular electrical impulses originating in the atria, which can lead to severe complications and even death. Due to the intermittent nature of the AF, early and timely monitoring of AF is critical for patients to prevent further exacerbation of the condition. Although ambulatory ECG Holter monitors provide accurate monitoring, the high cost of these devices hinders their wider adoption. Current mobile-based AF detection systems offer a portable solution. However, these systems have various applicability issues, such as being easily affected by environmental factors and requiring significant user effort. To overcome the above limitations, we present AcousAF, a novel AF detection system based on acoustic sensors of smartphones. Particularly, we explore the potential of pulse wave acquisition from the wrist using smartphone speakers and microphones. In addition, we propose a well-designed framework comprised of pulse wave probing, pulse wave extraction, and AF detection to ensure accurate and reliable AF detection. We collect data from 20 participants utilizing our custom data collection application on the smartphone. Extensive experimental results demonstrate the high performance of our system, with 92.8% accuracy, 86.9% precision, 87.4% recall, and 87.1% F1 Score. △ Less

Submitted 9 August, 2024; originally announced August 2024.

Comments: Accepted for publication in Companion of the 2024 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp Companion '24)

arXiv:2408.04837 [pdf, ps, other]

Multi-User MISO with Stacked Intelligent Metasurfaces: A DRL-Based Sum-Rate Optimization Approach

Authors: Hao Liu, Jiancheng An, George C. Alexandropoulos, Derrick Wing Kwan Ng, Chau Yuen, Lu Gan

Abstract: Stacked intelligent metasurfaces (SIMs) represent a novel signal processing paradigm that enables over-the-air processing of electromagnetic waves at the speed of light. Their multi-layer architecture exhibits customizable computational capabilities compared to conventional single-layer reconfigurable intelligent surfaces and metasurface lenses. In this paper, we deploy SIM to improve the performa… ▽ More Stacked intelligent metasurfaces (SIMs) represent a novel signal processing paradigm that enables over-the-air processing of electromagnetic waves at the speed of light. Their multi-layer architecture exhibits customizable computational capabilities compared to conventional single-layer reconfigurable intelligent surfaces and metasurface lenses. In this paper, we deploy SIM to improve the performance of multi-user multiple-input single-output (MISO) wireless systems through a low complexity manner with reduced numbers of transmit radio frequency chains. In particular, an optimization formulation for the joint design of the SIM phase shifts and the transmit power allocation is presented, which is efficiently tackled via a customized deep reinforcement learning (DRL) approach that systematically explores pre-designed states of the SIM-parametrized smart wireless environment. The presented performance evaluation results demonstrate the proposed method's capability to effectively learn from the wireless environment, while consistently outperforming conventional precoding schemes under low transmit power conditions. Furthermore, the implementation of hyperparameter tuning and whitening process significantly enhance the robustness of the proposed DRL framework. △ Less

Submitted 8 August, 2024; originally announced August 2024.

Comments: 34 pages, 11 figures, 2 tables. arXiv admin note: text overlap with arXiv:2402.09006

arXiv:2408.04777 [pdf]

Deep Learning-based Unsupervised Domain Adaptation via a Unified Model for Prostate Lesion Detection Using Multisite Bi-parametric MRI Datasets

Authors: Hao Li, Han Liu, Heinrich von Busch, Robert Grimm, Henkjan Huisman, Angela Tong, David Winkel, Tobias Penzkofer, Ivan Shabunin, Moon Hyung Choi, Qingsong Yang, Dieter Szolar, Steven Shea, Fergus Coakley, Mukesh Harisinghani, Ipek Oguz, Dorin Comaniciu, Ali Kamen, Bin Lou

Abstract: Our hypothesis is that UDA using diffusion-weighted images, generated with a unified model, offers a promising and reliable strategy for enhancing the performance of supervised learning models in multi-site prostate lesion detection, especially when various b-values are present. This retrospective study included data from 5,150 patients (14,191 samples) collected across nine different imaging cent… ▽ More Our hypothesis is that UDA using diffusion-weighted images, generated with a unified model, offers a promising and reliable strategy for enhancing the performance of supervised learning models in multi-site prostate lesion detection, especially when various b-values are present. This retrospective study included data from 5,150 patients (14,191 samples) collected across nine different imaging centers. A novel UDA method using a unified generative model was developed for multi-site PCa detection. This method translates diffusion-weighted imaging (DWI) acquisitions, including apparent diffusion coefficient (ADC) and individual DW images acquired using various b-values, to align with the style of images acquired using b-values recommended by Prostate Imaging Reporting and Data System (PI-RADS) guidelines. The generated ADC and DW images replace the original images for PCa detection. An independent set of 1,692 test cases (2,393 samples) was used for evaluation. The area under the receiver operating characteristic curve (AUC) was used as the primary metric, and statistical analysis was performed via bootstrapping. For all test cases, the AUC values for baseline SL and UDA methods were 0.73 and 0.79 (p<.001), respectively, for PI-RADS>=3, and 0.77 and 0.80 (p<.001) for PI-RADS>=4 PCa lesions. In the 361 test cases under the most unfavorable image acquisition setting, the AUC values for baseline SL and UDA were 0.49 and 0.76 (p<.001) for PI-RADS>=3, and 0.50 and 0.77 (p<.001) for PI-RADS>=4 PCa lesions. The results indicate the proposed UDA with generated images improved the performance of SL methods in multi-site PCa lesion detection across datasets with various b values, especially for images acquired with significant deviations from the PI-RADS recommended DWI protocol (e.g. with an extremely high b-value). △ Less

Submitted 8 August, 2024; originally announced August 2024.

Comments: Accept at Radiology: Artificial Intelligence. Journal reference and external DOI will be added once published

arXiv:2408.02586 [pdf, other]

Massive MIMO-OTFS-Based Random Access for Cooperative LEO Satellite Constellations

Authors: Boxiao Shen, Yongpeng Wu, Shiqi Gong, Heng Liu, Björn Ottersten, Wenjun Zhang

Abstract: This paper investigates joint device identification, channel estimation, and symbol detection for cooperative multi-satellite-enhanced random access, where orthogonal time-frequency space modulation with the large antenna array is utilized to combat the dynamics of the terrestrial-satellite links (TSLs). We introduce the generalized complex exponential basis expansion model to parameterize TSLs, t… ▽ More This paper investigates joint device identification, channel estimation, and symbol detection for cooperative multi-satellite-enhanced random access, where orthogonal time-frequency space modulation with the large antenna array is utilized to combat the dynamics of the terrestrial-satellite links (TSLs). We introduce the generalized complex exponential basis expansion model to parameterize TSLs, thereby reducing the pilot overhead. By exploiting the block sparsity of the TSLs in the angular domain, a message passing algorithm is designed for initial channel estimation. Subsequently, we examine two cooperative modes to leverage the spatial diversity within satellite constellations: the centralized mode, where computations are performed at a high-power central server, and the distributed mode, where computations are offloaded to edge satellites with minimal signaling overhead. Specifically, in the centralized mode, device identification is achieved by aggregating backhaul information from edge satellites, and channel estimation and symbol detection are jointly enhanced through a structured approximate expectation propagation (AEP) algorithm. In the distributed mode, edge satellites share channel information and exchange soft information about data symbols, leading to a distributed version of AEP. The introduced basis expansion model for TSLs enables the efficient implementation of both centralized and distributed algorithms via fast Fourier transform. Simulation results demonstrate that proposed schemes significantly outperform conventional algorithms in terms of the activity error rate, the normalized mean squared error, and the symbol error rate. Notably, the distributed mode achieves performance comparable to the centralized mode with only two exchanges of soft information about data symbols within the constellation. △ Less

Submitted 5 August, 2024; originally announced August 2024.

Comments: This paper has been accepted by IEEE Journal on Selected Areas in Communications

arXiv:2407.21394 [pdf, other]

Force Sensing Guided Artery-Vein Segmentation via Sequential Ultrasound Images

Authors: Yimeng Geng, Gaofeng Meng, Mingcong Chen, Guanglin Cao, Mingyang Zhao, Jianbo Zhao, Hongbin Liu

Abstract: Accurate identification of arteries and veins in ultrasound images is crucial for vascular examinations and interventions in robotics-assisted surgeries. However, current methods for ultrasound vessel segmentation face challenges in distinguishing between arteries and veins due to their morphological similarities. To address this challenge, this study introduces a novel force sensing guided segmen… ▽ More Accurate identification of arteries and veins in ultrasound images is crucial for vascular examinations and interventions in robotics-assisted surgeries. However, current methods for ultrasound vessel segmentation face challenges in distinguishing between arteries and veins due to their morphological similarities. To address this challenge, this study introduces a novel force sensing guided segmentation approach to enhance artery-vein segmentation accuracy by leveraging their distinct deformability. Our proposed method utilizes force magnitude to identify key frames with the most significant vascular deformation in a sequence of ultrasound images. These key frames are then integrated with the current frame through attention mechanisms, with weights assigned in accordance with force magnitude. Our proposed force sensing guided framework can be seamlessly integrated into various segmentation networks and achieves significant performance improvements in multiple U-shaped networks such as U-Net, Swin-unet and Transunet. Furthermore, we contribute the first multimodal ultrasound artery-vein segmentation dataset, Mus-V, which encompasses both force and image data simultaneously. The dataset comprises 3114 ultrasound images of carotid and femoral vessels extracted from 105 videos, with corresponding force data recorded by the force sensor mounted on the US probe. Our code and dataset will be publicly available. △ Less

Submitted 31 July, 2024; originally announced July 2024.

arXiv:2407.20622 [pdf, other]

Decoding Linguistic Representations of Human Brain

Authors: Yu Wang, Heyang Liu, Yuhao Wang, Chuan Xuan, Yixuan Hou, Sheng Feng, Hongcheng Liu, Yusheng Liao, Yanfeng Wang

Abstract: Language, as an information medium created by advanced organisms, has always been a concern of neuroscience regarding how it is represented in the brain. Decoding linguistic representations in the evoked brain has shown groundbreaking achievements, thanks to the rapid improvement of neuroimaging, medical technology, life sciences and artificial intelligence. In this work, we present a taxonomy of… ▽ More Language, as an information medium created by advanced organisms, has always been a concern of neuroscience regarding how it is represented in the brain. Decoding linguistic representations in the evoked brain has shown groundbreaking achievements, thanks to the rapid improvement of neuroimaging, medical technology, life sciences and artificial intelligence. In this work, we present a taxonomy of brain-to-language decoding of both textual and speech formats. This work integrates two types of research: neuroscience focusing on language understanding and deep learning-based brain decoding. Generating discernible language information from brain activity could not only help those with limited articulation, especially amyotrophic lateral sclerosis (ALS) patients but also open up a new way for the next generation's brain-computer interface (BCI). This article will help brain scientists and deep-learning researchers to gain a bird's eye view of fine-grained language perception, and thus facilitate their further investigation and research of neural process and language decoding. △ Less

Submitted 30 July, 2024; originally announced July 2024.

arXiv:2407.17944 [pdf, other]

Time-Optimal Planning for Long-Range Quadrotor Flights: An Automatic Optimal Synthesis Approach

Authors: Chao Qin, Jingxiang Chen, Yifan Lin, Abhishek Goudar, Angela P. Schoellig, Hugh H. -T. Liu

Abstract: Time-critical tasks such as drone racing typically cover large operation areas. However, it is difficult and computationally intensive for current time-optimal motion planners to accommodate long flight distances since a large yet unknown number of knot points is required to represent the trajectory. We present a polynomial-based automatic optimal synthesis (AOS) approach that can address this cha… ▽ More Time-critical tasks such as drone racing typically cover large operation areas. However, it is difficult and computationally intensive for current time-optimal motion planners to accommodate long flight distances since a large yet unknown number of knot points is required to represent the trajectory. We present a polynomial-based automatic optimal synthesis (AOS) approach that can address this challenge. Our method not only achieves superior time optimality but also maintains a consistently low computational cost across different ranges while considering the full quadrotor dynamics. First, we analyze the properties of time-optimal quadrotor maneuvers to determine the minimal number of polynomial pieces required to capture the dominant structure of time-optimal trajectories. This enables us to represent substantially long minimum-time trajectories with a minimal set of variables. Then, a robust optimization scheme is developed to handle arbitrary start and end conditions as well as intermediate waypoints. Extensive comparisons show that our approach is faster than the state-of-the-art approach by orders of magnitude with comparable time optimality. Real-world experiments further validate the quality of the resulting trajectories, demonstrating aggressive time-optimal maneuvers with a peak velocity of 8.86 m/s. △ Less

Submitted 25 July, 2024; originally announced July 2024.

Comments: 19 pages, 19 figures

arXiv:2407.14329 [pdf, other]

Efficient Audio Captioning with Encoder-Level Knowledge Distillation

Authors: Xuenan Xu, Haohe Liu, Mengyue Wu, Wenwu Wang, Mark D. Plumbley

Abstract: Significant improvement has been achieved in automated audio captioning (AAC) with recent models. However, these models have become increasingly large as their performance is enhanced. In this work, we propose a knowledge distillation (KD) framework for AAC. Our analysis shows that in the encoder-decoder based AAC models, it is more effective to distill knowledge into the encoder as compared with… ▽ More Significant improvement has been achieved in automated audio captioning (AAC) with recent models. However, these models have become increasingly large as their performance is enhanced. In this work, we propose a knowledge distillation (KD) framework for AAC. Our analysis shows that in the encoder-decoder based AAC models, it is more effective to distill knowledge into the encoder as compared with the decoder. To this end, we incorporate encoder-level KD loss into training, in addition to the standard supervised loss and sequence-level KD loss. We investigate two encoder-level KD methods, based on mean squared error (MSE) loss and contrastive loss, respectively. Experimental results demonstrate that contrastive KD is more robust than MSE KD, exhibiting superior performance in data-scarce situations. By leveraging audio-only data into training in the KD framework, our student model achieves competitive performance, with an inference speed that is 19 times faster\footnote{An online demo is available at \url{https://huggingface.co/spaces/wsntxxn/efficient_audio_captioning}}. △ Less

Submitted 19 July, 2024; originally announced July 2024.

Comments: Interspeech 2024

arXiv:2407.13220 [pdf, other]

MEDIC: Zero-shot Music Editing with Disentangled Inversion Control

Authors: Huadai Liu, Jialei Wang, Rongjie Huang, Yang Liu, Jiayang Xu, Zhou Zhao

Abstract: Text-guided diffusion models catalyze a paradigm shift in audio generation, facilitating the adaptability of source audio to conform to specific textual prompts. Recent advancements introduce inversion techniques, like DDIM inversion, to zero-shot editing, exploiting pre-trained diffusion models for audio modification. Nonetheless, our investigation exposes that DDIM inversion suffers from an accu… ▽ More Text-guided diffusion models catalyze a paradigm shift in audio generation, facilitating the adaptability of source audio to conform to specific textual prompts. Recent advancements introduce inversion techniques, like DDIM inversion, to zero-shot editing, exploiting pre-trained diffusion models for audio modification. Nonetheless, our investigation exposes that DDIM inversion suffers from an accumulation of errors across each diffusion step, undermining its efficacy. And the lack of attention control hinders the fine-grained manipulations of music. To counteract these limitations, we introduce the \textit{Disentangled Inversion} technique, which is designed to disentangle the diffusion process into triple branches, thereby magnifying their individual capabilities for both precise editing and preservation. Furthermore, we propose the \textit{Harmonized Attention Control} framework, which unifies the mutual self-attention and cross-attention with an additional Harmonic Branch to achieve the desired composition and structural information in the target music. Collectively, these innovations comprise the \textit{Disentangled Inversion Control (DIC)} framework, enabling accurate music editing whilst safeguarding structural integrity. To benchmark audio editing efficacy, we introduce \textit{ZoME-Bench}, a comprehensive music editing benchmark hosting 1,100 samples spread across 10 distinct editing categories, which facilitates both zero-shot and instruction-based music editing tasks. Our method demonstrates unparalleled performance in edit fidelity and essential content preservation, outperforming contemporary state-of-the-art inversion techniques. △ Less

Submitted 20 August, 2024; v1 submitted 18 July, 2024; originally announced July 2024.

arXiv:2407.03566 [pdf, ps, other]

Stacked Intelligent Metasurfaces for Wireless Sensing and Communication: Applications and Challenges

Authors: Hao Liu, Jiancheng An, Xing Jia, Shining Lin, Xianghao Yao, Lu Gan, Bruno Clerckx, Chau Yuen, Mehdi Bennis, Mérouane Debbah

Abstract: The rapid advancement of wireless communication technologies has precipitated an unprecedented demand for high data rates, extremely low latency, and ubiquitous connectivity. In order to achieve these goals, stacked intelligent metasurfaces (SIM) has been developed as a novel solution to perform advanced signal processing tasks directly in the electromagnetic wave domain, thus achieving ultra-fast… ▽ More The rapid advancement of wireless communication technologies has precipitated an unprecedented demand for high data rates, extremely low latency, and ubiquitous connectivity. In order to achieve these goals, stacked intelligent metasurfaces (SIM) has been developed as a novel solution to perform advanced signal processing tasks directly in the electromagnetic wave domain, thus achieving ultra-fast computing speed and reducing hardware complexity. This article provides an overview of the SIM technology by discussing its hardware architectures, advantages, and potential applications for wireless sensing and communication. Specifically, we explore the utilization of SIMs in enabling wave-domain beamforming, channel modeling and estimation in SIM-assisted communication systems. Furthermore, we elaborate on the potential of utilizing a SIM to build a hybrid optical-electronic neural network (HOENN) and demonstrate its efficacy by examining two case studies: disaster monitoring and direction-of-arrival estimation. Finally, we identify key implementation challenges, including practical hardware imperfections, efficient SIM configuration for realizing wave-domain signal processing, and performance analysis to motivate future research on this important and far-reaching topic. △ Less

Submitted 3 July, 2024; originally announced July 2024.

Comments: 8 pages, 5 figures, 1 table

arXiv:2407.03374 [pdf]

An Outline of Prognostics and Health Management Large Model: Concepts, Paradigms, and Challenges

Authors: Laifa Tao, Shangyu Li, Haifei Liu, Qixuan Huang, Liang Ma, Guoao Ning, Yiling Chen, Yunlong Wu, Bin Li, Weiwei Zhang, Zhengduo Zhao, Wenchao Zhan, Wenyan Cao, Chao Wang, Hongmei Liu, Jian Ma, Mingliang Suo, Yujie Cheng, Yu Ding, Dengwei Song, Chen Lu

Abstract: Prognosis and Health Management (PHM), critical for ensuring task completion by complex systems and preventing unexpected failures, is widely adopted in aerospace, manufacturing, maritime, rail, energy, etc. However, PHM's development is constrained by bottlenecks like generalization, interpretation and verification abilities. Presently, generative artificial intelligence (AI), represented by Larg… ▽ More Prognosis and Health Management (PHM), critical for ensuring task completion by complex systems and preventing unexpected failures, is widely adopted in aerospace, manufacturing, maritime, rail, energy, etc. However, PHM's development is constrained by bottlenecks like generalization, interpretation and verification abilities. Presently, generative artificial intelligence (AI), represented by Large Model, heralds a technological revolution with the potential to fundamentally reshape traditional technological fields and human production methods. Its capabilities, including strong generalization, reasoning, and generative attributes, present opportunities to address PHM's bottlenecks. To this end, based on a systematic analysis of the current challenges and bottlenecks in PHM, as well as the research status and advantages of Large Model, we propose a novel concept and three progressive paradigms of Prognosis and Health Management Large Model (PHM-LM) through the integration of the Large Model with PHM. Subsequently, we provide feasible technical approaches for PHM-LM to bolster PHM's core capabilities within the framework of the three paradigms. Moreover, to address core issues confronting PHM, we discuss a series of technical challenges of PHM-LM throughout the entire process of construction and application. This comprehensive effort offers a holistic PHM-LM technical framework, and provides avenues for new PHM technologies, methodologies, tools, platforms and applications, which also potentially innovates design, research & development, verification and application mode of PHM. And furthermore, a new generation of PHM with AI will also capably be realized, i.e., from custom to generalized, from discriminative to generative, and from theoretical conditions to practical applications. △ Less

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.03188 [pdf, other]

MuDiT & MuSiT: Alignment with Colloquial Expression in Description-to-Song Generation

Authors: Zihao Wang, Haoxuan Liu, Jiaxing Yu, Tao Zhang, Yan Liu, Kejun Zhang

Abstract: Amid the rising intersection of generative AI and human artistic processes, this study probes the critical yet less-explored terrain of alignment in human-centric automatic song composition. We propose a novel task of Colloquial Description-to-Song Generation, which focuses on aligning the generated content with colloquial human expressions. This task is aimed at bridging the gap between colloquia… ▽ More Amid the rising intersection of generative AI and human artistic processes, this study probes the critical yet less-explored terrain of alignment in human-centric automatic song composition. We propose a novel task of Colloquial Description-to-Song Generation, which focuses on aligning the generated content with colloquial human expressions. This task is aimed at bridging the gap between colloquial language understanding and auditory expression within an AI model, with the ultimate goal of creating songs that accurately satisfy human auditory expectations and structurally align with musical norms. Current datasets are limited due to their narrow descriptive scope, semantic gaps and inaccuracies. To overcome data scarcity in this domain, we present the Caichong Music Dataset (CaiMD). CaiMD is manually annotated by both professional musicians and amateurs, offering diverse perspectives and a comprehensive understanding of colloquial descriptions. Unlike existing datasets pre-set with expert annotations or auto-generated ones with inherent biases, CaiMD caters more sufficiently to our purpose of aligning AI-generated music with widespread user-desired results. Moreover, we propose an innovative single-stage framework called MuDiT/MuSiT for enabling effective human-machine alignment in song creation. This framework not only achieves cross-modal comprehension between colloquial language and auditory music perceptions but also ensures generated songs align with user-desired results. MuDiT/MuSiT employs one DiT/SiT model for end-to-end generation of musical components like melody, harmony, rhythm, vocals, and instrumentation. The approach ensures harmonious sonic cohesiveness amongst all generated musical components, facilitating better resonance with human auditory expectations. △ Less

Submitted 10 July, 2024; v1 submitted 3 July, 2024; originally announced July 2024.

Comments: 19 pages, 5 figures

MSC Class: 68Txx(Primary)14F05; 91Fxx(Secondary) ACM Class: I.2.7; J.5

arXiv:2407.01421 [pdf]

C-MP: A decentralized adaptive-coordinated traffic signal control using the Max Pressure framework

Authors: Tanveer Ahmed, Hao Liu, Vikash V. Gayah

Abstract: Coordinated traffic signals seek to provide uninterrupted flow through a series of closely spaced intersections, typically using pre-defined fixed signal timings and offsets. Adaptive traffic signals dynamically change signal timings based on observed traffic conditions in a way that might disrupt coordinated movements, particularly when these decisions are made independently at each intersection.… ▽ More Coordinated traffic signals seek to provide uninterrupted flow through a series of closely spaced intersections, typically using pre-defined fixed signal timings and offsets. Adaptive traffic signals dynamically change signal timings based on observed traffic conditions in a way that might disrupt coordinated movements, particularly when these decisions are made independently at each intersection. To alleviate this issue, this paper introduces a novel Max Pressure-based traffic signal framework that can provide coordination even under decentralized decision-making. The proposed Coordinated Max Pressure (C-MP) algorithm uses the space mean speeds of vehicles to explicitly detect freely flowing platoons of vehicles and prioritizes their movement along a corridor. Specifically, upstream platoons are detected and their weight in the MP framework increased to provide priority, while downstream platoons are detected and their weight reduced to ensure smooth traffic flow across corridors. The study analytically proves that C-MP maintains the desirable maximum stability property, while micro-simulation analyses conducted on an arterial network demonstrate its ability to achieve a larger stable region compared to benchmark MP control policies. Simulation results also reveal that the proposed control algorithm can effectively coordinate traffic signals in both directions along an arterial without explicitly assigned offsets or constraints. The results also reveal C-MP's superiority to benchmark coordination strategies in reducing travel time, and fuel consumption both at the corridor level and the network level by balancing the negative impact imparted to vehicles in the minor direction. △ Less

Submitted 1 July, 2024; originally announced July 2024.

Comments: Submitted to Transportation Research Part C: Emerging Technologies

arXiv:2407.00579 [pdf, ps, other]

Active-RIS-Aided Covert Communications in NOMA-Inspired ISAC Wireless Systems

Authors: Miaomiao Zhu, Pengxu Chen, Liang Yang, Alexandros-Apostolos A. Boulogeorgos, Theodoros A. Tsiftsis, Hongwu Liu

Abstract: Non-orthogonal multiple access (NOMA)-inspired integrated sensing and communication (ISAC) facilitates spectrum sharing for radar sensing and NOMA communications, whereas facing privacy and security challenges due to open wireless propagation. In this paper, active reconfigurable intelligent surface (RIS) is employed to aid covert communications in NOMA-inspired ISAC wireless system with the aim o… ▽ More Non-orthogonal multiple access (NOMA)-inspired integrated sensing and communication (ISAC) facilitates spectrum sharing for radar sensing and NOMA communications, whereas facing privacy and security challenges due to open wireless propagation. In this paper, active reconfigurable intelligent surface (RIS) is employed to aid covert communications in NOMA-inspired ISAC wireless system with the aim of maximizing the covert rate. Specifically, a dual-function base-station (BS) transmits the superposition signal to sense multiple targets, while achieving covert and reliable communications for a pair of NOMA covert and public users, respectively, in the presence of a warden. Two superposition transmission schemes, namely, the transmissions with dedicated sensing signal (w-DSS) and without dedicated sensing signal (w/o-DSS), are respectively considered in the formulations of the joint transmission and reflection beamforming optimization problems. Numerical results demonstrate that active-RIS-aided NOMA-ISAC system outperforms the passive-RIS-aided and without-RIS counterparts in terms of covert rate and trade-off between covert communication and sensing performance metrics. Finally, the w/o-DSS scheme, which omits the dedicated sensing signal, achieves a higher covert rate than the w-DSS scheme by allocating more transmit power for the covert transmissions, while preserving a comparable multi-target sensing performance. △ Less

Submitted 29 June, 2024; originally announced July 2024.

arXiv:2406.19305 [pdf, other]

A Max Pressure Algorithm for Traffic Signals Considering Pedestrian Queues

Authors: Hao Liu, Vikash V. Gayah, Michael Levin

Abstract: This paper proposes a novel max-pressure (MP) algorithm that incorporates pedestrian traffic into the MP control architecture. Pedestrians are modeled as being included in one of two groups: those walking on sidewalks and those queued at intersections waiting to cross. Traffic dynamics models for both groups are developed. Under the proposed control policy, the signal timings are determined based… ▽ More This paper proposes a novel max-pressure (MP) algorithm that incorporates pedestrian traffic into the MP control architecture. Pedestrians are modeled as being included in one of two groups: those walking on sidewalks and those queued at intersections waiting to cross. Traffic dynamics models for both groups are developed. Under the proposed control policy, the signal timings are determined based on the queue length of both vehicles and pedestrians waiting to cross the intersection. The proposed algorithm maintains the decentralized control structure, and the paper proves that it also exhibits the maximum stability property for both vehicles and pedestrians. Microscopic traffic simulation results demonstrate that the proposed model can improve the overall operational efficiency -- i.e., reduce person travel delays -- under various vehicle demand levels compared to the original queue-based MP (Q-MP) algorithm and a recently developed rule-based MP algorithm considering pedestrians. The Q-MP ignores the yielding behavior of right-turn vehicles to conflicting pedestrian movements, which leads to high delay for vehicles. On the other hand, the delay incurred by pedestrians is high from the rule-based model since it imposes large waiting time tolerance to guarantee the operational efficiency of vehicles. The proposed algorithm outperforms both models since the states of both vehicles and pedestrians are taken into consideration to determine signal timings. △ Less

Submitted 27 June, 2024; originally announced June 2024.

arXiv:2406.19269 [pdf, other]

doi 10.1016/j.trc.2024.104795

OCC-MP: A Max-Pressure framework to prioritize transit and high occupancy vehicles

Authors: Tanveer Ahmed, Hao Liu, Vikash V. Gayah

Abstract: Max-pressure (MP) is a decentralized adaptive traffic signal control approach that has been shown to maximize throughput for private vehicles. However, MP-based signal control algorithms do not differentiate the movement of transit vehicles from private vehicles or between high and single-occupancy private vehicles. Prioritizing the movement of transit or other high occupancy vehicles (HOVs) is vi… ▽ More Max-pressure (MP) is a decentralized adaptive traffic signal control approach that has been shown to maximize throughput for private vehicles. However, MP-based signal control algorithms do not differentiate the movement of transit vehicles from private vehicles or between high and single-occupancy private vehicles. Prioritizing the movement of transit or other high occupancy vehicles (HOVs) is vital to reduce congestion and improve the reliability and efficiency of transit operations. This study proposes OCC-MP: a novel MP-based algorithm that considers both vehicle queues and passenger occupancies in computing the weights of movements. By weighing movements with higher passenger occupancies more heavily, transit and other HOVs are implicitly provided with priority, while accounting for any negative impacts of that priority on single occupancy vehicles. And, unlike rule-based transit signal priority (TSP) strategies, OCC-MP more naturally also accommodates conflicting transit routes at a signalized intersection and facilitates their movement, even in mixed traffic without dedicated lanes. Simulations on a grid network under varying demands and transit configurations demonstrate the effectiveness of OCC-MP at providing TSP while simultaneously reducing the negative impact imparted onto lower occupancy private vehicles. Furthermore, OCC-MP is shown to have a larger stable region for demand compared to rule-based TSP strategies integrated into the MP framework. The performance of OCC-MP is also shown to be robust to errors in passenger occupancy information from transit vehicles and can be applied when passenger occupancies of private vehicles are not available. Finally, OCC-MP can be applied in a partially connected vehicle (CV) environment when a subset of vehicles is able to provide information to the signal controller, outperforming baseline methods at low CV penetration rates. △ Less

Submitted 4 August, 2024; v1 submitted 27 June, 2024; originally announced June 2024.

Comments: This paper will be published in Transportation Research Part C: Emerging Technologies

Journal ref: Transportation Research Part C: Emerging Technologies 166 (2024): 104795

arXiv:2406.17800 [pdf, other]

Fish Tracking, Counting, and Behaviour Analysis in Digital Aquaculture: A Comprehensive Review

Authors: Meng Cui, Xubo Liu, Haohe Liu, Jinzheng Zhao, Daoliang Li, Wenwu Wang

Abstract: Digital aquaculture leverages advanced technologies and data-driven methods, providing substantial benefits over traditional aquaculture practices. Fish tracking, counting, and behaviour analysis are crucial components of digital aquaculture, which are essential for optimizing production efficiency, enhancing fish welfare, and improving resource management. Previous reviews have focused on single… ▽ More Digital aquaculture leverages advanced technologies and data-driven methods, providing substantial benefits over traditional aquaculture practices. Fish tracking, counting, and behaviour analysis are crucial components of digital aquaculture, which are essential for optimizing production efficiency, enhancing fish welfare, and improving resource management. Previous reviews have focused on single modalities, limiting their ability to address the diverse challenges encountered in these tasks comprehensively. This review provides a comprehensive analysis of the current state of aquaculture digital technologies, including vision-based, acoustic-based, and biosensor-based methods. We examine the advantages, limitations, and applications of these methods, highlighting recent advancements and identifying critical research gaps. The scarcity of comprehensive fish datasets and the lack of unified evaluation standards, which make it difficult to compare the performance of different technologies, are identified as major obstacles hindering progress in this field. To overcome current limitations and improve the accuracy, robustness, and efficiency of fish monitoring systems, we explore the potential of emerging technologies such as multimodal data fusion and deep learning. Additionally, we contribute to the field by providing a summary of existing datasets available for fish tracking, counting, and behaviour analysis. Future research directions are outlined, emphasizing the need for comprehensive datasets and evaluation standards to facilitate meaningful comparisons between technologies and promote their practical implementation in real-world aquaculture settings. △ Less

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.16058 [pdf, other]

Text-Queried Target Sound Event Localization

Authors: Jinzheng Zhao, Xinyuan Qian, Yong Xu, Haohe Liu, Yin Cao, Davide Berghi, Wenwu Wang

Abstract: Sound event localization and detection (SELD) aims to determine the appearance of sound classes, together with their Direction of Arrival (DOA). However, current SELD systems can only predict the activities of specific classes, for example, 13 classes in DCASE challenges. In this paper, we propose text-queried target sound event localization (SEL), a new paradigm that allows the user to input the… ▽ More Sound event localization and detection (SELD) aims to determine the appearance of sound classes, together with their Direction of Arrival (DOA). However, current SELD systems can only predict the activities of specific classes, for example, 13 classes in DCASE challenges. In this paper, we propose text-queried target sound event localization (SEL), a new paradigm that allows the user to input the text to describe the sound event, and the SEL model can predict the location of the related sound event. The proposed task presents a more user-friendly way for human-computer interaction. We provide a benchmark study for the proposed task and perform experiments on datasets created by simulated room impulse response (RIR) and real RIR to validate the effectiveness of the proposed methods. We hope that our benchmark will inspire the interest and additional research for text-queried sound source localization. △ Less

Submitted 23 June, 2024; originally announced June 2024.

Comments: Accepted by EUSIPCO 2024

arXiv:2406.15716 [pdf, other]

Predicting fluorescent labels in label-free microscopy images with pix2pix and adaptive loss in Light My Cells challenge

Authors: Han Liu, Hao Li, Jiacheng Wang, Yubo Fan, Zhoubing Xu, Ipek Oguz

Abstract: Fluorescence labeling is the standard approach to reveal cellular structures and other subcellular constituents for microscopy images. However, this invasive procedure may perturb or even kill the cells and the procedure itself is highly time-consuming and complex. Recently, in silico labeling has emerged as a promising alternative, aiming to use machine learning models to directly predict the flu… ▽ More Fluorescence labeling is the standard approach to reveal cellular structures and other subcellular constituents for microscopy images. However, this invasive procedure may perturb or even kill the cells and the procedure itself is highly time-consuming and complex. Recently, in silico labeling has emerged as a promising alternative, aiming to use machine learning models to directly predict the fluorescently labeled images from label-free microscopy. In this paper, we propose a deep learning-based in silico labeling method for the Light My Cells challenge. Built upon pix2pix, our proposed method can be trained using the partially labeled datasets with an adaptive loss. Moreover, we explore the effectiveness of several training strategies to handle different input modalities, such as training them together or separately. The results show that our method achieves promising performance for in silico labeling. Our code is available at https://github.com/MedICL-VU/LightMyCells. △ Less

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.14977 [pdf, other]

Trustworthy Enhanced Multi-view Multi-modal Alzheimer's Disease Prediction with Brain-wide Imaging Transcriptomics Data

Authors: Shan Cong, Zhoujie Fan, Hongwei Liu, Yinghan Zhang, Xin Wang, Haoran Luo, Xiaohui Yao

Abstract: Brain transcriptomics provides insights into the molecular mechanisms by which the brain coordinates its functions and processes. However, existing multimodal methods for predicting Alzheimer's disease (AD) primarily rely on imaging and sometimes genetic data, often neglecting the transcriptomic basis of brain. Furthermore, while striving to integrate complementary information between modalities,… ▽ More Brain transcriptomics provides insights into the molecular mechanisms by which the brain coordinates its functions and processes. However, existing multimodal methods for predicting Alzheimer's disease (AD) primarily rely on imaging and sometimes genetic data, often neglecting the transcriptomic basis of brain. Furthermore, while striving to integrate complementary information between modalities, most studies overlook the informativeness disparities between modalities. Here, we propose TMM, a trusted multiview multimodal graph attention framework for AD diagnosis, using extensive brain-wide transcriptomics and imaging data. First, we construct view-specific brain regional co-function networks (RRIs) from transcriptomics and multimodal radiomics data to incorporate interaction information from both biomolecular and imaging perspectives. Next, we apply graph attention (GAT) processing to each RRI network to produce graph embeddings and employ cross-modal attention to fuse transcriptomics-derived embedding with each imagingderived embedding. Finally, a novel true-false-harmonized class probability (TFCP) strategy is designed to assess and adaptively adjust the prediction confidence of each modality for AD diagnosis. We evaluate TMM using the AHBA database with brain-wide transcriptomics data and the ADNI database with three imaging modalities (AV45-PET, FDG-PET, and VBM-MRI). The results demonstrate the superiority of our method in identifying AD, EMCI, and LMCI compared to state-of-the-arts. Code and data are available at https://github.com/Yaolab-fantastic/TMM. △ Less

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.13705 [pdf, other]

EndoUIC: Promptable Diffusion Transformer for Unified Illumination Correction in Capsule Endoscopy

Authors: Long Bai, Tong Chen, Qiaozhi Tan, Wan Jun Nah, Yanheng Li, Zhicheng He, Sishen Yuan, Zhen Chen, Jinlin Wu, Mobarakol Islam, Zhen Li, Hongbin Liu, Hongliang Ren

Abstract: Wireless Capsule Endoscopy (WCE) is highly valued for its non-invasive and painless approach, though its effectiveness is compromised by uneven illumination from hardware constraints and complex internal dynamics, leading to overexposed or underexposed images. While researchers have discussed the challenges of low-light enhancement in WCE, the issue of correcting for different exposure levels rema… ▽ More Wireless Capsule Endoscopy (WCE) is highly valued for its non-invasive and painless approach, though its effectiveness is compromised by uneven illumination from hardware constraints and complex internal dynamics, leading to overexposed or underexposed images. While researchers have discussed the challenges of low-light enhancement in WCE, the issue of correcting for different exposure levels remains underexplored. To tackle this, we introduce EndoUIC, a WCE unified illumination correction solution using an end-to-end promptable diffusion transformer (DiT) model. In our work, the illumination prompt module shall navigate the model to adapt to different exposure levels and perform targeted image enhancement, in which the Adaptive Prompt Integration (API) and Global Prompt Scanner (GPS) modules shall further boost the concurrent representation learning between the prompt parameters and features. Besides, the U-shaped restoration DiT model shall capture the long-range dependencies and contextual information for unified illumination restoration. Moreover, we present a novel Capsule-endoscopy Exposure Correction (CEC) dataset, including ground-truth and corrupted image pairs annotated by expert photographers. Extensive experiments against a variety of state-of-the-art (SOTA) methods on four datasets showcase the effectiveness of our proposed method and components in WCE illumination restoration, and the additional downstream experiments further demonstrate its utility for clinical diagnosis and surgical assistance. △ Less

Submitted 8 July, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

Comments: To appear in MICCAI 2024. Code and dataset availability: https://github.com/longbai1006/EndoUIC

arXiv:2406.12596

Beyond Near-Field: Far-Field Location Division Multiple Access in Downlink MIMO Systems

Authors: Haoyan Liu, Caijian Jie, Min Yang, Chengguang Li

Abstract: Exploring channel dimensions has been the driving force behind breakthroughs in successive generations of mobile communication systems. In 5G, space division multiple access (SDMA) leveraging massive MIMO has been crucial in enhancing system capacity through spatial differentiation of users. However, SDMA can only finely distinguish users at adjacent angles in ultra-dense networks by extremely lar… ▽ More Exploring channel dimensions has been the driving force behind breakthroughs in successive generations of mobile communication systems. In 5G, space division multiple access (SDMA) leveraging massive MIMO has been crucial in enhancing system capacity through spatial differentiation of users. However, SDMA can only finely distinguish users at adjacent angles in ultra-dense networks by extremely large-scale antenna arrays. For a long time, most research has focused on the angle domain of the space, overlooking the potential of the distance domain. Near-field location division multiple access (LDMA) was proposed based on the beam-focusing effect yielded by near-field spherical propagation model, partitioning channel resources by both angle and distance. To achieve a similar idea in the far-field region, this paper introduces a far-field LDMA scheme for wideband systems based on orthogonal frequency division multiplexing (OFDM). Benefiting from frequency diverse arrays (FDA), it becomes possible to manipulate beams in the distance domain. Combined with OFDM, the inherent cyclic prefix ensures a complete OFDM symbol can be received without losing distance information, while the matched filter of OFDM helps eliminate the time-variance of FDA steering vectors. Theoretical and simulation results show that LDMA can fully exploit the additional degrees of freedom in the distance domain to significantly improve spectral efficiency, especially in narrow sector multiple access (MA) scenarios. Moreover, LDMA can maintain independence between array elements even in single-path channels, making it stand out in MA schemes at millimeter-wave and higher frequency bands. △ Less

Submitted 5 August, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

Comments: We have omitted an important detail of the baseband equivalent model, which may mislead the reader. We are currently trying to resolve this issue, please withdraw our submission

arXiv:2406.12254 [pdf, other]

Enhancing Single-Slice Segmentation with 3D-to-2D Unpaired Scan Distillation

Authors: Xin Yu, Qi Yang, Han Liu, Ho Hin Lee, Yucheng Tang, Lucas W. Remedios, Michael E. Kim, Rendong Zhang, Shunxing Bao, Yuankai Huo, Ann Zenobia Moore, Luigi Ferrucci, Bennett A. Landman

Abstract: 2D single-slice abdominal computed tomography (CT) enables the assessment of body habitus and organ health with low radiation exposure. However, single-slice data necessitates the use of 2D networks for segmentation, but these networks often struggle to capture contextual information effectively. Consequently, even when trained on identical datasets, 3D networks typically achieve superior segmenta… ▽ More 2D single-slice abdominal computed tomography (CT) enables the assessment of body habitus and organ health with low radiation exposure. However, single-slice data necessitates the use of 2D networks for segmentation, but these networks often struggle to capture contextual information effectively. Consequently, even when trained on identical datasets, 3D networks typically achieve superior segmentation results. In this work, we propose a novel 3D-to-2D distillation framework, leveraging pre-trained 3D models to enhance 2D single-slice segmentation. Specifically, we extract the prediction distribution centroid from the 3D representations, to guide the 2D student by learning intra- and inter-class correlation. Unlike traditional knowledge distillation methods that require the same data input, our approach employs unpaired 3D CT scans with any contrast to guide the 2D student model. Experiments conducted on 707 subjects from the single-slice Baltimore Longitudinal Study of Aging (BLSA) dataset demonstrate that state-of-the-art 2D multi-organ segmentation methods can benefit from the 3D teacher model, achieving enhanced performance in single-slice multi-organ segmentation. Notably, our approach demonstrates considerable efficacy in low-data regimes, outperforming the model trained with all available training subjects even when utilizing only 200 training subjects. Thus, this work underscores the potential to alleviate manual annotation burdens. △ Less

Submitted 12 July, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.11568 [pdf, other]

Towards an End-to-End Framework for Invasive Brain Signal Decoding with Large Language Models

Authors: Sheng Feng, Heyang Liu, Yu Wang, Yanfeng Wang

Abstract: In this paper, we introduce a groundbreaking end-to-end (E2E) framework for decoding invasive brain signals, marking a significant advancement in the field of speech neuroprosthesis. Our methodology leverages the comprehensive reasoning abilities of large language models (LLMs) to facilitate direct decoding. By fully integrating LLMs, we achieve results comparable to the state-of-the-art cascade m… ▽ More In this paper, we introduce a groundbreaking end-to-end (E2E) framework for decoding invasive brain signals, marking a significant advancement in the field of speech neuroprosthesis. Our methodology leverages the comprehensive reasoning abilities of large language models (LLMs) to facilitate direct decoding. By fully integrating LLMs, we achieve results comparable to the state-of-the-art cascade models. Our findings underscore the immense potential of E2E frameworks in speech neuroprosthesis, particularly as the technology behind brain-computer interfaces (BCIs) and the availability of relevant datasets continue to evolve. This work not only showcases the efficacy of combining LLMs with E2E decoding for enhancing speech neuroprosthesis but also sets a new direction for future research in BCI applications, underscoring the impact of LLMs in decoding complex neural signals for communication restoration. Code will be made available at https://github.com/FsFrancis15/BrainLLM. △ Less

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.11336 [pdf, other]

LFPLM: A General and Flexible Load Forecasting Framework based on Pre-trained Language Model

Authors: Mingyang Gao, Suyang Zhou, Wei Gu, Zhi Wu, Zijian Hu, Hong Zhu, Haiquan Liu

Abstract: Accurate load forecasting is essential for maintaining the power balance between generators and consumers, especially with the increasing integration of renewable energy sources, which introduce significant intermittent volatility. With the development of data-driven methods, machine learning and deep learning-based models have become the predominant approach for load forecasting tasks. In recent… ▽ More Accurate load forecasting is essential for maintaining the power balance between generators and consumers, especially with the increasing integration of renewable energy sources, which introduce significant intermittent volatility. With the development of data-driven methods, machine learning and deep learning-based models have become the predominant approach for load forecasting tasks. In recent years, pre-trained language models (PLMs) have made significant advancements, demonstrating superior performance in various fields. This paper proposes a load forecasting method based on PLMs, which offers not only accurate predictive ability but also general and flexible applicability. Additionally, a data modeling method is proposed to effectively transform load sequence data into natural language for PLM training. Furthermore, we introduce a data enhancement strategy that eliminate the impact of PLM hallucinations on forecasting results. The effectiveness of the proposed method has been validated on two real-world datasets. Compared with existing methods, our approach shows state-of-the-art performance across all validation metrics. △ Less

Submitted 17 June, 2024; originally announced June 2024.

Comments: 7 pages, 5 figures and 5 tables

arXiv:2406.08837 [pdf]

Research on Deep Learning Model of Feature Extraction Based on Convolutional Neural Network

Authors: Houze Liu, Iris Li, Yaxin Liang, Dan Sun, Yining Yang, Haowei Yang

Abstract: Neural networks with relatively shallow layers and simple structures may have limited ability in accurately identifying pneumonia. In addition, deep neural networks also have a large demand for computing resources, which may cause convolutional neural networks to be unable to be implemented on terminals. Therefore, this paper will carry out the optimal classification of convolutional neural networ… ▽ More Neural networks with relatively shallow layers and simple structures may have limited ability in accurately identifying pneumonia. In addition, deep neural networks also have a large demand for computing resources, which may cause convolutional neural networks to be unable to be implemented on terminals. Therefore, this paper will carry out the optimal classification of convolutional neural networks. Firstly, according to the characteristics of pneumonia images, AlexNet and InceptionV3 were selected to obtain better image recognition results. Combining the features of medical images, the forward neural network with deeper and more complex structure is learned. Finally, knowledge extraction technology is used to extract the obtained data into the AlexNet model to achieve the purpose of improving computing efficiency and reducing computing costs. The results showed that the prediction accuracy, specificity, and sensitivity of the trained AlexNet model increased by 4.25 percentage points, 7.85 percentage points, and 2.32 percentage points, respectively. The graphics processing usage has decreased by 51% compared to the InceptionV3 mode. △ Less

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.06979 [pdf, other]

AudioMarkBench: Benchmarking Robustness of Audio Watermarking

Authors: Hongbin Liu, Moyang Guo, Zhengyuan Jiang, Lun Wang, Neil Zhenqiang Gong

Abstract: The increasing realism of synthetic speech, driven by advancements in text-to-speech models, raises ethical concerns regarding impersonation and disinformation. Audio watermarking offers a promising solution via embedding human-imperceptible watermarks into AI-generated audios. However, the robustness of audio watermarking against common/adversarial perturbations remains understudied. We present A… ▽ More The increasing realism of synthetic speech, driven by advancements in text-to-speech models, raises ethical concerns regarding impersonation and disinformation. Audio watermarking offers a promising solution via embedding human-imperceptible watermarks into AI-generated audios. However, the robustness of audio watermarking against common/adversarial perturbations remains understudied. We present AudioMarkBench, the first systematic benchmark for evaluating the robustness of audio watermarking against watermark removal and watermark forgery. AudioMarkBench includes a new dataset created from Common-Voice across languages, biological sexes, and ages, 3 state-of-the-art watermarking methods, and 15 types of perturbations. We benchmark the robustness of these methods against the perturbations in no-box, black-box, and white-box settings. Our findings highlight the vulnerabilities of current watermarking techniques and emphasize the need for more robust and fair audio watermarking solutions. Our dataset and code are publicly available at \url{https://github.com/moyangkuo/AudioMarkBench}. △ Less

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.06880 [pdf, other]

doi 10.1109/TIA.2024.3395570

Multi-Objective Sizing Optimization Method of Microgrid Considering Cost and Carbon Emissions

Authors: Xiang Zhu, Guangchun Ruan, Hua Geng, Honghai Liu, Mingfei Bai, Chao Peng

Abstract: Microgrid serves as a promising solution to integrate and manage distributed renewable energy resources. In this paper, we establish a stochastic multi-objective sizing optimization (SMOSO) model for microgrid planning, which fully captures the battery degradation characteristics and the total carbon emissions. The microgrid operator aims to simultaneously maximize the economic benefits and minimi… ▽ More Microgrid serves as a promising solution to integrate and manage distributed renewable energy resources. In this paper, we establish a stochastic multi-objective sizing optimization (SMOSO) model for microgrid planning, which fully captures the battery degradation characteristics and the total carbon emissions. The microgrid operator aims to simultaneously maximize the economic benefits and minimize carbon emissions, and the degradation of the battery energy storage system (BESS) is modeled as a nonlinear function of power throughput. A self-adaptive multi-objective genetic algorithm (SAMOGA) is proposed to solve the SMOSO model, and this algorithm is enhanced by pre-grouped hierarchical selection and self-adaptive probabilities of crossover and mutation. Several case studies are conducted to determine the microgrid size by analyzing Pareto frontiers, and the simulation results validate that the proposed method has superior performance over other algorithms on the solution quality of optimum and diversity. △ Less

Submitted 10 June, 2024; originally announced June 2024.

Comments: Accepted by IEEE Transactions on Industry Applications

arXiv:2406.06295 [pdf, other]

Zero-Shot Audio Captioning Using Soft and Hard Prompts

Authors: Yiming Zhang, Xuenan Xu, Ruoyi Du, Haohe Liu, Yuan Dong, Zheng-Hua Tan, Wenwu Wang, Zhanyu Ma

Abstract: In traditional audio captioning methods, a model is usually trained in a fully supervised manner using a human-annotated dataset containing audio-text pairs and then evaluated on the test sets from the same dataset. Such methods have two limitations. First, these methods are often data-hungry and require time-consuming and expensive human annotations to obtain audio-text pairs. Second, these model… ▽ More In traditional audio captioning methods, a model is usually trained in a fully supervised manner using a human-annotated dataset containing audio-text pairs and then evaluated on the test sets from the same dataset. Such methods have two limitations. First, these methods are often data-hungry and require time-consuming and expensive human annotations to obtain audio-text pairs. Second, these models often suffer from performance degradation in cross-domain scenarios, i.e., when the input audio comes from a different domain than the training set, which, however, has received little attention. We propose an effective audio captioning method based on the contrastive language-audio pre-training (CLAP) model to address these issues. Our proposed method requires only textual data for training, enabling the model to generate text from the textual feature in the cross-modal semantic space.In the inference stage, the model generates the descriptive text for the given audio from the audio feature by leveraging the audio-text alignment from CLAP.We devise two strategies to mitigate the discrepancy between text and audio embeddings: a mixed-augmentation-based soft prompt and a retrieval-based acoustic-aware hard prompt. These approaches are designed to enhance the generalization performance of our proposed model, facilitating the model to generate captions more robustly and accurately. Extensive experiments on AudioCaps and Clotho benchmarks show the effectiveness of our proposed method, which outperforms other zero-shot audio captioning approaches for in-domain scenarios and outperforms the compared methods for cross-domain scenarios, underscoring the generalization ability of our method. △ Less

Submitted 10 June, 2024; originally announced June 2024.

Comments: Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing

arXiv:2406.02918 [pdf, other]

U-KAN Makes Strong Backbone for Medical Image Segmentation and Generation

Authors: Chenxin Li, Xinyu Liu, Wuyang Li, Cheng Wang, Hengyu Liu, Yifan Liu, Zhen Chen, Yixuan Yuan

Abstract: U-Net has become a cornerstone in various visual applications such as image segmentation and diffusion probability models. While numerous innovative designs and improvements have been introduced by incorporating transformers or MLPs, the networks are still limited to linearly modeling patterns as well as the deficient interpretability. To address these challenges, our intuition is inspired by the… ▽ More U-Net has become a cornerstone in various visual applications such as image segmentation and diffusion probability models. While numerous innovative designs and improvements have been introduced by incorporating transformers or MLPs, the networks are still limited to linearly modeling patterns as well as the deficient interpretability. To address these challenges, our intuition is inspired by the impressive results of the Kolmogorov-Arnold Networks (KANs) in terms of accuracy and interpretability, which reshape the neural network learning via the stack of non-linear learnable activation functions derived from the Kolmogorov-Anold representation theorem. Specifically, in this paper, we explore the untapped potential of KANs in improving backbones for vision tasks. We investigate, modify and re-design the established U-Net pipeline by integrating the dedicated KAN layers on the tokenized intermediate representation, termed U-KAN. Rigorous medical image segmentation benchmarks verify the superiority of U-KAN by higher accuracy even with less computation cost. We further delved into the potential of U-KAN as an alternative U-Net noise predictor in diffusion models, demonstrating its applicability in generating task-oriented model architectures. These endeavours unveil valuable insights and sheds light on the prospect that with U-KAN, you can make strong backbone for medical image segmentation and generation. Project page:\url{https://yes-u-kan.github.io/}. △ Less

Submitted 22 August, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

arXiv:2406.00356 [pdf, other]

AudioLCM: Text-to-Audio Generation with Latent Consistency Models

Authors: Huadai Liu, Rongjie Huang, Yang Liu, Hengyuan Cao, Jialei Wang, Xize Cheng, Siqi Zheng, Zhou Zhao

Abstract: Recent advancements in Latent Diffusion Models (LDMs) have propelled them to the forefront of various generative tasks. However, their iterative sampling process poses a significant computational burden, resulting in slow generation speeds and limiting their application in text-to-audio generation deployment. In this work, we introduce AudioLCM, a novel consistency-based model tailored for efficie… ▽ More Recent advancements in Latent Diffusion Models (LDMs) have propelled them to the forefront of various generative tasks. However, their iterative sampling process poses a significant computational burden, resulting in slow generation speeds and limiting their application in text-to-audio generation deployment. In this work, we introduce AudioLCM, a novel consistency-based model tailored for efficient and high-quality text-to-audio generation. AudioLCM integrates Consistency Models into the generation process, facilitating rapid inference through a mapping from any point at any time step to the trajectory's initial point. To overcome the convergence issue inherent in LDMs with reduced sample iterations, we propose the Guided Latent Consistency Distillation with a multi-step Ordinary Differential Equation (ODE) solver. This innovation shortens the time schedule from thousands to dozens of steps while maintaining sample quality, thereby achieving fast convergence and high-quality generation. Furthermore, to optimize the performance of transformer-based neural network architectures, we integrate the advanced techniques pioneered by LLaMA into the foundational framework of transformers. This architecture supports stable and efficient training, ensuring robust performance in text-to-audio synthesis. Experimental results on text-to-sound generation and text-to-music synthesis tasks demonstrate that AudioLCM needs only 2 iterations to synthesize high-fidelity audios, while it maintains sample quality competitive with state-of-the-art models using hundreds of steps. AudioLCM enables a sampling speed of 333x faster than real-time on a single NVIDIA 4090Ti GPU, making generative models practically applicable to text-to-audio generation deployment. Our extensive preliminary analysis shows that each design in AudioLCM is effective. △ Less

Submitted 9 July, 2024; v1 submitted 1 June, 2024; originally announced June 2024.

arXiv:2405.16258 [pdf, other]

USD: Unsupervised Soft Contrastive Learning for Fault Detection in Multivariate Time Series

Authors: Hong Liu, Xiuxiu Qiu, Yiming Shi, Zelin Zang

Abstract: Unsupervised fault detection in multivariate time series is critical for maintaining the integrity and efficiency of complex systems, with current methodologies largely focusing on statistical and machine learning techniques. However, these approaches often rest on the assumption that data distributions conform to Gaussian models, overlooking the diversity of patterns that can manifest in both nor… ▽ More Unsupervised fault detection in multivariate time series is critical for maintaining the integrity and efficiency of complex systems, with current methodologies largely focusing on statistical and machine learning techniques. However, these approaches often rest on the assumption that data distributions conform to Gaussian models, overlooking the diversity of patterns that can manifest in both normal and abnormal states, thereby diminishing discriminative performance. Our innovation addresses this limitation by introducing a combination of data augmentation and soft contrastive learning, specifically designed to capture the multifaceted nature of state behaviors more accurately. The data augmentation process enriches the dataset with varied representations of normal states, while soft contrastive learning fine-tunes the model's sensitivity to the subtle differences between normal and abnormal patterns, enabling it to recognize a broader spectrum of anomalies. This dual strategy significantly boosts the model's ability to distinguish between normal and abnormal states, leading to a marked improvement in fault detection performance across multiple datasets and settings, thereby setting a new benchmark for unsupervised fault detection in complex systems. The code of our method is available at \url{https://github.com/zangzelin/code_USD.git}. △ Less

Submitted 25 May, 2024; originally announced May 2024.

Comments: 19 pages, 7 figures, under review

arXiv:2405.12996 [pdf, other]

Dose-aware Diffusion Model for 3D Low-dose PET: Multi-institutional Validation with Reader Study and Real Low-dose Data

Authors: Huidong Xie, Weijie Gan, Bo Zhou, Ming-Kai Chen, Michal Kulon, Annemarie Boustani, Benjamin A. Spencer, Reimund Bayerlein, Xiongchao Chen, Qiong Liu, Xueqi Guo, Menghua Xia, Yinchi Zhou, Hui Liu, Liang Guo, Hongyu An, Ulugbek S. Kamilov, Hanzhong Wang, Biao Li, Axel Rominger, Kuangyu Shi, Ge Wang, Ramsey D. Badawi, Chi Liu

Abstract: As PET imaging is accompanied by radiation exposure and potentially increased cancer risk, reducing radiation dose in PET scans without compromising the image quality is an important topic. Deep learning (DL) techniques have been investigated for low-dose PET imaging. However, existing models have often resulted in compromised image quality when achieving low-dose PET and have limited generalizabi… ▽ More As PET imaging is accompanied by radiation exposure and potentially increased cancer risk, reducing radiation dose in PET scans without compromising the image quality is an important topic. Deep learning (DL) techniques have been investigated for low-dose PET imaging. However, existing models have often resulted in compromised image quality when achieving low-dose PET and have limited generalizability to different image noise-levels, acquisition protocols, patient populations, and hospitals. Recently, diffusion models have emerged as the new state-of-the-art generative model to generate high-quality samples and have demonstrated strong potential for medical imaging tasks. However, for low-dose PET imaging, existing diffusion models failed to generate consistent 3D reconstructions, unable to generalize across varying noise-levels, often produced visually-appealing but distorted image details, and produced images with biased tracer uptake. Here, we develop DDPET-3D, a dose-aware diffusion model for 3D low-dose PET imaging to address these challenges. Collected from 4 medical centers globally with different scanners and clinical protocols, we extensively evaluated the proposed model using a total of 9,783 18F-FDG studies (1,596 patients) with low-dose/low-count levels ranging from 1% to 50%. With a cross-center, cross-scanner validation, the proposed DDPET-3D demonstrated its potential to generalize to different low-dose levels, different scanners, and different clinical protocols. As confirmed with reader studies performed by nuclear medicine physicians, the proposed method produced superior denoised results that are comparable to or even better than the 100% full-count images as well as previous DL baselines. The presented results show the potential of achieving low-dose PET while maintaining image quality. Lastly, a group of real low-dose scans was also included for evaluation. △ Less

Submitted 2 May, 2024; originally announced May 2024.

Comments: 16 Pages, 15 Figures, 4 Tables. Paper under review. arXiv admin note: substantial text overlap with arXiv:2311.04248

arXiv:2405.12609 [pdf, other]

Mamba in Speech: Towards an Alternative to Self-Attention

Authors: Xiangyu Zhang, Qiquan Zhang, Hexin Liu, Tianyi Xiao, Xinyuan Qian, Beena Ahmed, Eliathamby Ambikairajah, Haizhou Li, Julien Epps

Abstract: Transformer and its derivatives have achieved success in diverse tasks across computer vision, natural language processing, and speech processing. To reduce the complexity of computations within the multi-head self-attention mechanism in Transformer, Selective State Space Models (i.e., Mamba) were proposed as an alternative. Mamba exhibited its effectiveness in natural language processing and comp… ▽ More Transformer and its derivatives have achieved success in diverse tasks across computer vision, natural language processing, and speech processing. To reduce the complexity of computations within the multi-head self-attention mechanism in Transformer, Selective State Space Models (i.e., Mamba) were proposed as an alternative. Mamba exhibited its effectiveness in natural language processing and computer vision tasks, but its superiority has rarely been investigated in speech signal processing. This paper explores solutions for applying Mamba to speech processing using two typical speech processing tasks: speech recognition, which requires semantic and sequential information, and speech enhancement, which focuses primarily on sequential patterns. The experimental results exhibit the superiority of bidirectional Mamba (BiMamba) for speech processing to vanilla Mamba. Moreover, experiments demonstrate the effectiveness of BiMamba as an alternative to the self-attention module in Transformer and its derivates, particularly for the semantic-aware task. The crucial technologies for transferring Mamba to speech are then summarized in ablation studies and the discussion section to offer insights for future research. △ Less

Submitted 30 June, 2024; v1 submitted 21 May, 2024; originally announced May 2024.

arXiv:2405.12464 [pdf, other]

doi 10.1109/IV55156.2024.10588618

Evaluation of Connected Vehicle Identification-Aware Mixed Traffic Freeway Cooperative Merging

Authors: Haoji Liu, Fatemeh Jahedinia, Zeyu Mu, B. Brian Park

Abstract: Cooperative on-ramp merging control for connected automated vehicles (CAVs) has been extensively investigated. However, they did neglect the connected vehicle identification process, which is a must for CAV cooperations. In this paper, we introduced a connected vehicle identification system (VIS) into the on-ramp merging control process for the first time and proposed an evaluation framework to as… ▽ More Cooperative on-ramp merging control for connected automated vehicles (CAVs) has been extensively investigated. However, they did neglect the connected vehicle identification process, which is a must for CAV cooperations. In this paper, we introduced a connected vehicle identification system (VIS) into the on-ramp merging control process for the first time and proposed an evaluation framework to assess the impacts of VIS on on-ramp merging performance. First, the mixed-traffic cooperative merging problem was formulated. Then, a real-world merging trajectory dataset was processed to generate dangerous merging scenarios. Aiming at resolving the potential collision risks in mixed traffic where CAVs and traditional human-driven vehicles (THVs) coexist, we proposed on-ramp merging strategies for CAVs in different mixed traffic situations considering the connected vehicle identification process. The performances were evaluated via simulations. Results indicated that while safety was assured for all cases with CAVs, the cases with VIS had delayed initiation of cooperation, limiting the range of cooperative merging and leading to increased fuel consumption and acceleration variations. △ Less

Submitted 20 May, 2024; originally announced May 2024.

Comments: @2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Journal ref: 2024 IEEE Intelligent Vehicles Symposium (IV)

arXiv:2405.12357 [pdf]

Paired Conditional Generative Adversarial Network for Highly Accelerated Liver 4D MRI

Authors: Di Xu, Xin Miao, Hengjie Liu, Jessica E. Scholey, Wensha Yang, Mary Feng, Michael Ohliger, Hui Lin, Yi Lao, Yang Yang, Ke Sheng

Abstract: Purpose: 4D MRI with high spatiotemporal resolution is desired for image-guided liver radiotherapy. Acquiring densely sampling k-space data is time-consuming. Accelerated acquisition with sparse samples is desirable but often causes degraded image quality or long reconstruction time. We propose the Reconstruct Paired Conditional Generative Adversarial Network (Re-Con-GAN) to shorten the 4D MRI rec… ▽ More Purpose: 4D MRI with high spatiotemporal resolution is desired for image-guided liver radiotherapy. Acquiring densely sampling k-space data is time-consuming. Accelerated acquisition with sparse samples is desirable but often causes degraded image quality or long reconstruction time. We propose the Reconstruct Paired Conditional Generative Adversarial Network (Re-Con-GAN) to shorten the 4D MRI reconstruction time while maintaining the reconstruction quality. Methods: Patients who underwent free-breathing liver 4D MRI were included in the study. Fully- and retrospectively under-sampled data at 3, 6 and 10 times (3x, 6x and 10x) were first reconstructed using the nuFFT algorithm. Re-Con-GAN then trained input and output in pairs. Three types of networks, ResNet9, UNet and reconstruction swin transformer, were explored as generators. PatchGAN was selected as the discriminator. Re-Con-GAN processed the data (3D+t) as temporal slices (2D+t). A total of 48 patients with 12332 temporal slices were split into training (37 patients with 10721 slices) and test (11 patients with 1611 slices). Results: Re-Con-GAN consistently achieved comparable/better PSNR, SSIM, and RMSE scores compared to CS/UNet models. The inference time of Re-Con-GAN, UNet and CS are 0.15s, 0.16s, and 120s. The GTV detection task showed that Re-Con-GAN and CS, compared to UNet, better improved the dice score (3x Re-Con-GAN 80.98%; 3x CS 80.74%; 3x UNet 79.88%) of unprocessed under-sampled images (3x 69.61%). Conclusion: A generative network with adversarial training is proposed with promising and efficient reconstruction results demonstrated on an in-house dataset. The rapid and qualitative reconstruction of 4D liver MR has the potential to facilitate online adaptive MR-guided radiotherapy for liver cancer. △ Less

Submitted 20 May, 2024; originally announced May 2024.

arXiv:2405.10948 [pdf, other]

Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery

Authors: Guankun Wang, Long Bai, Wan Jun Nah, Jie Wang, Zhaoxi Zhang, Zhen Chen, Jinlin Wu, Mobarakol Islam, Hongbin Liu, Hongliang Ren

Abstract: Recent advancements in Surgical Visual Question Answering (Surgical-VQA) and related region grounding have shown great promise for robotic and medical applications, addressing the critical need for automated methods in personalized surgical mentorship. However, existing models primarily provide simple structured answers and struggle with complex scenarios due to their limited capability in recogni… ▽ More Recent advancements in Surgical Visual Question Answering (Surgical-VQA) and related region grounding have shown great promise for robotic and medical applications, addressing the critical need for automated methods in personalized surgical mentorship. However, existing models primarily provide simple structured answers and struggle with complex scenarios due to their limited capability in recognizing long-range dependencies and aligning multimodal information. In this paper, we introduce Surgical-LVLM, a novel personalized large vision-language model tailored for complex surgical scenarios. Leveraging the pre-trained large vision-language model and specialized Visual Perception LoRA (VP-LoRA) blocks, our model excels in understanding complex visual-language tasks within surgical contexts. In addressing the visual grounding task, we propose the Token-Interaction (TIT) module, which strengthens the interaction between the grounding module and the language responses of the Large Visual Language Model (LVLM) after projecting them into the latent space. We demonstrate the effectiveness of Surgical-LVLM on several benchmarks, including EndoVis-17-VQLA, EndoVis-18-VQLA, and a newly introduced EndoVis Conversations dataset, which sets new performance standards. Our work contributes to advancing the field of automated surgical mentorship by providing a context-aware solution. △ Less

Submitted 22 March, 2024; originally announced May 2024.

arXiv:2405.10606 [pdf, other]

Carrier Aggregation Enabled MIMO-OFDM Integrated Sensing and Communication

Authors: Haotian Liu, Zhiqing Wei, Jinghui Piao, Huici Wu, Xingwang Li, Zhiyong Feng

Abstract: In the evolution towards the forthcoming era of sixth-generation (6G) mobile communication systems characterized by ubiquitous intelligence, integrated sensing and communication (ISAC) is in a phase of burgeoning development. However, the capabilities of communication and sensing within single frequency band fall short of meeting the escalating demands. To this end, this paper introduces a carrier… ▽ More In the evolution towards the forthcoming era of sixth-generation (6G) mobile communication systems characterized by ubiquitous intelligence, integrated sensing and communication (ISAC) is in a phase of burgeoning development. However, the capabilities of communication and sensing within single frequency band fall short of meeting the escalating demands. To this end, this paper introduces a carrier aggregation (CA)- enabled multi-input multi-output orthogonal frequency division multiplexing (MIMO-OFDM) ISAC system fusing the sensing data on high and low-frequency bands by symbol-level fusion for ultimate communication experience and high-accuracy sensing. The challenges in sensing signal processing introduced by CA include the initial phase misalignment of the echo signals on high and low-frequency bands due to attenuation and radar cross section, and the fusion of the sensing data on high and lowfrequency bands with different physical-layer parameters. To this end, the sensing signal processing is decomposed into two stages. In the first stage, the problem of initial phase misalignment of the echo signals on high and low-frequency bands is solved by the angle compensation, space-domain diversity and vector crosscorrelation operations. In the second stage, this paper realizes symbol-level fusion of the sensing data on high and low-frequency bands through sensing vector rearrangement and cyclic prefix adjustment operations, thereby obtaining high-precision sensing performance. Then, the closed-form communication mutual information (MI) and sensing Cramer-Rao lower bound (CRLB) for the proposed ISAC system are derived to explore the theoretical performance bound with CA. Simulation results validate the feasibility and superiority of the proposed ISAC system. △ Less

Submitted 17 May, 2024; originally announced May 2024.

Comments: 13page, 9figures, Submitted to IEEE Transactions on Wireless Communications

arXiv:2405.09179 [pdf, other]

Integrated Sensing and Communication Enabled Cooperative Passive Sensing Using Mobile Communication System

Authors: Zhiqing Wei, Haotian Liu, Hujun Li, Wangjun Jiang, Zhiyong Feng, Huici Wu, Ping Zhang

Abstract: Integrated sensing and communication (ISAC) is a potential technology of the sixth-generation (6G) mobile communication system, which enables communication base station (BS) with sensing capability. However, the performance of single-BS sensing is limited, which can be overcome by multi-BS cooperative sensing. There are three types of multi-BS cooperative sensing, including cooperative active sens… ▽ More Integrated sensing and communication (ISAC) is a potential technology of the sixth-generation (6G) mobile communication system, which enables communication base station (BS) with sensing capability. However, the performance of single-BS sensing is limited, which can be overcome by multi-BS cooperative sensing. There are three types of multi-BS cooperative sensing, including cooperative active sensing, cooperative passive sensing, and cooperative active and passive sensing, where the multi-BS cooperative passive sensing has the advantages of low hardware modification cost and large sensing coverage. However, multi-BS cooperative passive sensing faces the challenges of synchronization offsets mitigation and sensing information fusion. To address these challenges, a non-line of sight (NLoS) and line of sight (LoS) signal cross-correlation (NLCC) method is proposed to mitigate carrier frequency offset (CFO) and time offset (TO). Besides, a symbol-level multi-BS sensing information fusion method is proposed. The discrete samplings of echo signals from multiple BSs are matched independently and coherent accumulated to improve sensing accuracy. Moreover, a lowcomplexity joint angle-of-arrival (AoA) and angle-of-departure (AoD) estimation method is proposed to reduce the computational complexity. Simulation results show that symbol-level multi-BS cooperative passive sensing scheme has an order of magnitude higher sensing accuracy than single-BS passive sensing. This work provides a reference for the research on multi-BS cooperative passive sensing. △ Less

Submitted 15 May, 2024; originally announced May 2024.

Comments: 16 pages, 11 figures, Submitted to IEEE Transactions on Mobile Computing

arXiv:2405.02873 [pdf, other]

Target Localization with Macro and Micro Base Stations Cooperative Sensing

Authors: Haotian Liu, Zhiqing Wei, Furong Yang, Huici Wu, Kaifeng Han, Zhiyong Feng

Abstract: Addressing the communication and sensing demands of sixth-generation (6G) mobile communication system, integrated sensing and communication (ISAC) has garnered traction in academia and industry. With the sensing limitation of single base station (BS), multi-BS cooperative sensing is regarded as a promising solution. The coexistence and overlapped coverage of macro BS (MBS) and micro BS (MiBS) are… ▽ More Addressing the communication and sensing demands of sixth-generation (6G) mobile communication system, integrated sensing and communication (ISAC) has garnered traction in academia and industry. With the sensing limitation of single base station (BS), multi-BS cooperative sensing is regarded as a promising solution. The coexistence and overlapped coverage of macro BS (MBS) and micro BS (MiBS) are common in the development of 6G, making the cooperative sensing between MBS and MiBS feasible. Since MBS and MiBS work in low and high frequency bands, respectively, the challenges of MBS and MiBS cooperative sensing lie in the fusion method of the sensing information in high and low-frequency bands. To this end, this paper introduces a symbol-level fusion method and a grid-based three-dimensional discrete Fourier transform (3D-GDFT) algorithm to achieve precise localization of multiple targets with limited resources. Simulation results demonstrate that the proposed MBS and MiBS cooperative sensing scheme outperforms traditional single BS (MBS/MiBS) sensing scheme, showcasing superior sensing performance △ Less

Submitted 5 May, 2024; originally announced May 2024.

Comments: 7 pages 6 figures, submitted to 2024 IEEE GLOBECOM

arXiv:2405.02788 [pdf, other]

Antenna Failure Resilience: Deep Learning-Enabled Robust DOA Estimation with Single Snapshot Sparse Arrays

Authors: Ruxin Zheng, Shunqiao Sun, Hongshan Liu, Honglei Chen, Mojtaba Soltanalian, Jian Li

Abstract: Recent advancements in Deep Learning (DL) for Direction of Arrival (DOA) estimation have highlighted its superiority over traditional methods, offering faster inference, enhanced super-resolution, and robust performance in low Signal-to-Noise Ratio (SNR) environments. Despite these advancements, existing research predominantly focuses on multi-snapshot scenarios, a limitation in the context of aut… ▽ More Recent advancements in Deep Learning (DL) for Direction of Arrival (DOA) estimation have highlighted its superiority over traditional methods, offering faster inference, enhanced super-resolution, and robust performance in low Signal-to-Noise Ratio (SNR) environments. Despite these advancements, existing research predominantly focuses on multi-snapshot scenarios, a limitation in the context of automotive radar systems which demand high angular resolution and often rely on limited snapshots, sometimes as scarce as a single snapshot. Furthermore, the increasing interest in sparse arrays for automotive radar, owing to their cost-effectiveness and reduced antenna element coupling, presents additional challenges including susceptibility to random sensor failures. This paper introduces a pioneering DL framework featuring a sparse signal augmentation layer, meticulously crafted to bolster single snapshot DOA estimation across diverse sparse array setups and amidst antenna failures. To our best knowledge, this is the first work to tackle this issue. Our approach improves the adaptability of deep learning techniques to overcome the unique difficulties posed by sparse arrays with single snapshot. We conduct thorough evaluations of our network's performance using simulated and real-world data, showcasing the efficacy and real-world viability of our proposed solution. The code and real-world dataset employed in this study are available at https://github.com/ruxinzh/Deep_RSA_DOA. △ Less

Submitted 4 May, 2024; originally announced May 2024.

Comments: Invited paper for IEEE Asilomar conference 2024

arXiv:2405.01104 [pdf, other]

Multi-user ISAC through Stacked Intelligent Metasurfaces: New Algorithms and Experiments

Authors: Ziqing Wang, Hongzheng Liu, Jianan Zhang, Rujing Xiong, Kai Wan, Xuewen Qian, Marco Di Renzo, Robert Caiming Qiu

Abstract: This paper investigates a Stacked Intelligent Metasurfaces (SIM)-assisted Integrated Sensing and Communications (ISAC) system. An extended target model is considered, where the BS aims to estimate the complete target response matrix relative to the SIM. Under the constraints of minimum Signal-to-Interference-plus-Noise Ratio (SINR) for the communication users (CUs) and maximum transmit power, we j… ▽ More This paper investigates a Stacked Intelligent Metasurfaces (SIM)-assisted Integrated Sensing and Communications (ISAC) system. An extended target model is considered, where the BS aims to estimate the complete target response matrix relative to the SIM. Under the constraints of minimum Signal-to-Interference-plus-Noise Ratio (SINR) for the communication users (CUs) and maximum transmit power, we jointly optimize the transmit beamforming at the base station (BS) and the end-to-end transmission matrix of the SIM, to minimize the Cramér-Rao Bound (CRB) for target estimation. Effective algorithms such as the alternating optimization (AO) and semidefinite relaxation (SDR) are employed to solve the non-convex SINR-constrained CRB minimization problem. Finally, we design and build an experimental platform for SIM, and evaluate the performance of the proposed algorithms for communication and sensing tasks. △ Less

Submitted 2 May, 2024; originally announced May 2024.

Showing 1–50 of 487 results for author: Liu, H