Search | arXiv e-print repository

Foundation Models for Music: A Survey

Authors: Yinghao Ma, Anders Øland, Anton Ragni, Bleiz MacSen Del Sette, Charalampos Saitis, Chris Donahue, Chenghua Lin, Christos Plachouras, Emmanouil Benetos, Elio Quinton, Elona Shatri, Fabio Morreale, Ge Zhang, György Fazekas, Gus Xia, Huan Zhang, Ilaria Manco, Jiawen Huang, Julien Guinot, Liwei Lin, Luca Marinelli, Max W. Y. Lam, Megha Sharma, Qiuqiang Kong, Roger B. Dannenberg, et al. (18 additional authors not shown)

Abstract: In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music, spanning representation learning, generative learning and multimodal learning. We first contextualise the significance of music in various industries and trace the evolution of AI in music. By delineating the modalities targeted by foundation models, we discover many of the music representations are underexplored in FM development. Then, emphasis is placed on the lack of versatility of previous methods on diverse music applications, along with the potential of FMs in music understanding, generation and medical application. By comprehensively exploring the details of the model pre-training paradigm, architectural choices, tokenisation, finetuning methodologies and controllability, we emphasise the important topics that should have been well explored, like instruction tuning and in-context learning, scaling law and emergent ability, as well as long-sequence modelling. A dedicated section presents insights into music agents, accompanied by a thorough analysis of datasets and evaluations essential for pre-training and downstream tasks. Finally, by underscoring the vital importance of ethical considerations, we advocate that subsequent research on FMs for music should focus more on issues such as interpretability, transparency, human responsibility, and copyright. The paper offers insights into future challenges and trends on FMs for music, aiming to shape the trajectory of human-AI collaboration in the music realm.

Submitted 27 August, 2024; v1 submitted 26 August, 2024; originally announced August 2024.

arXiv:2408.12255 [pdf, ps, other]

Fast Iterative ELAA-MIMO Detection Exploiting Static Channel Components

Authors: Jiuyu Liu, Yi Ma, Rahim Tafazolli

Abstract: Extremely large aperture array (ELAA) is a promising multiple-input multiple-output (MIMO) technique for next generation mobile networks. In this paper, we propose two novel approaches to accelerate the convergence of current iterative MIMO detectors in ELAA channels. Our approaches exploit the static components of the ELAA channel, which include line of sight (LoS) paths and deterministic non-LoS (NLoS) components due to channel hardening effects. This paper proposes novel convergence acceleration techniques for fast iterative ELAA-MIMO detection by leveraging the static channel component, including the LoS paths and deterministic NLoS components that arise due to channel hardening. Specifically, these static channel components are utilized in two ways: as preconditioning matrices for general iterative algorithms, and as initialization for quasi-Newton (QN) methods. Simulation results show that the proposed approaches converge significantly faster compared to current iterative MIMO detectors, especially under strong LoS conditions with high Rician K-factor. Furthermore, QN methods with the proposed initialization matrix consistently achieve the best convergence performance while maintaining low complexity.

Submitted 22 August, 2024; originally announced August 2024.

Comments: This work has been accepted by the IEEE Information Theory Workshop (ITW) 2024. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2408.08746 [pdf, other]

Accelerating Iteratively Linear Detectors in Multi-User (ELAA-)MIMO Systems with UW-SVD

Authors: Jiuyu Liu, Yi Ma, Jinfei Wang, Rahim Tafazolli

Abstract: Current iterative multiple-input multiple-output (MIMO) detectors suffer from slow convergence when the wireless channel is ill-conditioned. The ill-conditioning is mainly caused by spatial correlation between channel columns corresponding to the same user equipment, known as intra-user interference. In addition, in the emerging MIMO systems using an extremely large aperture array (ELAA), spatial non-stationarity can make the channel even more ill-conditioned. In this paper, user-wise singular value decomposition (UW-SVD) is proposed to accelerate the convergence of iterative MIMO detectors. Its basic principle is to perform SVD on each user's sub-channel matrix to eliminate intra-user interference. Then, the MIMO signal model is effectively transformed into an equivalent signal (e-signal) model, comprising an e-channel matrix and an e-signal vector. Existing iterative algorithms can be used to recover the e-signal vector, which undergoes post-processing to obtain the signal vector. It is proven that the e-channel matrix is better conditioned than the original MIMO channel for spatially correlated (ELAA-)MIMO channels. This implies that UW-SVD can accelerate current iterative algorithms, which is confirmed by our simulation results. Specifically, it can speed up convergence by up to 10 times in both uncoded and coded systems.

Submitted 16 August, 2024; originally announced August 2024.

Comments: This work has been accepted by IEEE Transactions on Wireless Communications. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2408.07592 [pdf, other]

Multi-periodicity dependency Transformer based on spectrum offset for radio frequency fingerprint identification

Authors: Jing Xiao, Wenrui Ding, Zeqi Shao, Duona Zhang, Yanan Ma, Yufeng Wang, Jian Wang

Abstract: Radio Frequency Fingerprint Identification (RFFI) has emerged as a pivotal task for reliable device authentication. Despite advancements in RFFI methods, background noise and intentional modulation features result in weak energy and subtle differences in the RFF features. These challenges diminish the capability of RFFI methods in feature representation, complicating the effective identification of device identities. This paper proposes a novel Multi-Periodicity Dependency Transformer (MPDFormer) to address these challenges. The MPDFormer employs a spectrum offset-based periodic embedding representation to augment the discrepancy of intrinsic features. We delve into the intricacies of the periodicity-dependency attention mechanism, integrating both inter-period and intra-period attention mechanisms. This mechanism facilitates the extraction of both long and short-range periodicity-dependency features, accentuating the feature distinction whilst concurrently attenuating the perturbations caused by background noise and weak-periodicity features. Empirical results demonstrate MPDFormer's superiority over established baseline methods, achieving a 0.07s inference time on NVIDIA Jetson Orin NX.

Submitted 14 August, 2024; originally announced August 2024.

arXiv:2408.07325 [pdf, other]

RoCoSDF: Row-Column Scanned Neural Signed Distance Fields for Freehand 3D Ultrasound Imaging Shape Reconstruction

Authors: Hongbo Chen, Yuchong Gao, Shuhang Zhang, Jiangjie Wu, Yuexin Ma, Rui Zheng

Abstract: The reconstruction of high-quality shape geometry is crucial for developing freehand 3D ultrasound imaging. However, the shape reconstruction of multi-view ultrasound data remains challenging due to the elevation distortion caused by thick transducer probes. In this paper, we present a novel learning-based framework RoCoSDF, which can effectively generate an implicit surface through continuous shape representations derived from row-column scanned datasets. In RoCoSDF, we encode the datasets from different views into the corresponding neural signed distance function (SDF) and then operate all SDFs in a normalized 3D space to restore the actual surface contour. Without requiring pre-training on large-scale ground truth shapes, our approach can synthesize a smooth and continuous signed distance field from multi-view SDFs to implicitly represent the actual geometry. Furthermore, two regularizers are introduced to facilitate shape refinement by constraining the SDF near the surface. Experiments on data from twelve shapes acquired by two ultrasound transducer probes validate that RoCoSDF can effectively reconstruct accurate geometric shapes from multi-view ultrasound data, outperforming current reconstruction methods. Code is available at https://github.com/chenhbo/RoCoSDF.

Submitted 14 August, 2024; originally announced August 2024.

Comments: Accepted by MICCAI 2024

arXiv:2408.03393 [pdf, other]

Biomedical Image Segmentation: A Systematic Literature Review of Deep Learning Based Object Detection Methods

Authors: Fazli Wahid, Yingliang Ma, Dawar Khan, Muhammad Aamir, Syed U. K. Bukhari

Abstract: Biomedical image segmentation plays a vital role in the diagnosis of diseases across various organs. Deep learning-based object detection methods are commonly used for such segmentation. There is extensive research on this topic. However, there is no standard review of it. Existing surveys often lack a standardized approach or focus on broader segmentation techniques. In this paper, we conducted a systematic literature review (SLR), collecting and analysing 148 articles that explore deep learning object detection methods for biomedical image segmentation. We critically analyzed these methods, identified the key challenges, and discussed the future directions. From the selected articles we extracted the results including the deep learning models, targeted imaging modalities, targeted diseases, and the metrics for the analysis of the methods. The results have been presented in tabular and/or charted forms. The results are presented in three major categories including two stage detection models, one stage detection models and point-based detection models. Each article is individually analyzed along with its pros and cons. Finally, we discuss open challenges, potential benefits, and future research directions. This SLR aims to provide the research community with a quick yet deeper understanding of these segmentation models, ultimately facilitating the development of more powerful solutions for biomedical image analysis.

Submitted 28 August, 2024; v1 submitted 6 August, 2024; originally announced August 2024.

arXiv:2408.02943 [pdf, other]

Recent Advances in Data-driven Intelligent Control for Wireless Communication: A Comprehensive Survey

Authors: Wei Huo, Huiwen Yang, Nachuan Yang, Zhaohua Yang, Jiuzhou Zhang, Fuhai Nan, Xingzhou Chen, Yifan Mao, Suyang Hu, Pengyu Wang, Xuanyu Zheng, Mingming Zhao, Ling Shi

Abstract: The advent of next-generation wireless communication systems heralds an era characterized by high data rates, low latency, massive connectivity, and superior energy efficiency. These systems necessitate innovative and adaptive strategies for resource allocation and device behavior control in wireless networks. Traditional optimization-based methods have been found inadequate in meeting the complex demands of these emerging systems. As the volume of data continues to escalate, the integration of data-driven methods has become indispensable for enabling adaptive and intelligent control mechanisms in future wireless communication systems. This comprehensive survey explores recent advancements in data-driven methodologies applied to wireless communication networks. It focuses on developments over the past five years and their application to various control objectives within wireless cyber-physical systems. It encompasses critical areas such as link adaptation, user scheduling, spectrum allocation, beam management, power control, and the co-design of communication and control systems. We provide an in-depth exploration of the technical underpinnings that support these data-driven approaches, including the algorithms, models, and frameworks developed to enhance network performance and efficiency. We also examine the challenges that current data-driven algorithms face, particularly in the context of the dynamic and heterogeneous nature of next-generation wireless networks. The paper provides a critical analysis of these challenges and offers insights into potential solutions and future research directions. This includes discussing the adaptability, integration with 6G, and security of data-driven methods in the face of increasing network complexity and data volume.

Submitted 6 August, 2024; originally announced August 2024.

arXiv:2407.21531 [pdf, other]

Can LLMs "Reason" in Music? An Evaluation of LLMs' Capability of Music Understanding and Generation

Authors: Ziya Zhou, Yuhang Wu, Zhiyue Wu, Xinyue Zhang, Ruibin Yuan, Yinghao Ma, Lu Wang, Emmanouil Benetos, Wei Xue, Yike Guo

Abstract: Symbolic Music, akin to language, can be encoded in discrete symbols. Recent research has extended the application of large language models (LLMs) such as GPT-4 and Llama2 to the symbolic music domain including understanding and generation. Yet scant research explores the details of how these LLMs perform on advanced music understanding and conditioned generation, especially from the multi-step reasoning perspective, which is a critical aspect in the conditioned, editable, and interactive human-computer co-creation process. This study conducts a thorough investigation of LLMs' capability and limitations in symbolic music processing. We identify that current LLMs exhibit poor performance in song-level multi-step music reasoning, and typically fail to leverage learned music knowledge when addressing complex musical tasks. An analysis of LLMs' responses distinctly highlights their pros and cons. Our findings suggest achieving advanced musical capability is not intrinsically obtained by LLMs, and future research should focus more on bridging the gap between music knowledge and reasoning, to improve the co-creation experience for musicians.

Submitted 31 July, 2024; originally announced July 2024.

Comments: Accepted by ISMIR2024

arXiv:2407.13703 [pdf, other]

Energy-Efficient Channel Decoding for Wireless Federated Learning: Convergence Analysis and Adaptive Design

Authors: Linping Qu, Yuyi Mao, Shenghui Song, Chi-Ying Tsui

Abstract: One of the most critical challenges for deploying distributed learning solutions, such as federated learning (FL), in wireless networks is the limited battery capacity of mobile clients. While it is a common belief that the major energy consumption of mobile clients comes from the uplink data transmission, this paper presents a novel finding, namely that the channel decoding operation also contributes significantly to the overall energy consumption of mobile clients in FL. Motivated by this new observation, we propose an energy-efficient adaptive channel decoding scheme that leverages the intrinsic robustness of FL to model errors. In particular, the robustness is exploited to reduce the energy consumption of channel decoders at mobile clients by adaptively adjusting the number of decoding iterations. We theoretically prove that wireless FL with communication errors can converge at the same rate as the case with error-free communication as long as the bit error rate (BER) is properly constrained. An adaptive channel decoding scheme is then proposed to improve the energy efficiency of wireless FL systems. Experimental results demonstrate that the proposed method maintains the same learning accuracy while reducing the channel decoding energy consumption by 20% when compared to existing approaches.

Submitted 19 July, 2024; v1 submitted 26 June, 2024; originally announced July 2024.

Comments: This work has been submitted to the IEEE TWC for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2407.08948 [pdf, other]

Symmetry Awareness Encoded Deep Learning Framework for Brain Imaging Analysis

Authors: Yang Ma, Dongang Wang, Peilin Liu, Lynette Masters, Michael Barnett, Weidong Cai, Chenyu Wang

Abstract: The heterogeneity of neurological conditions, ranging from structural anomalies to functional impairments, presents a significant challenge in medical imaging analysis tasks. Moreover, the limited availability of well-annotated datasets constrains the development of robust analysis models. Against this backdrop, this study introduces a novel approach leveraging the inherent anatomical symmetrical features of the human brain to enhance the subsequent detection and segmentation analysis for brain diseases. A novel Symmetry-Aware Cross-Attention (SACA) module is proposed to encode symmetrical features of left and right hemispheres, and a proxy task to detect symmetrical features as the Symmetry-Aware Head (SAH) is proposed, which guides the pretraining of the whole network on a vast 3D brain imaging dataset comprising both healthy and diseased brain images across various MRI and CT. Through meticulous experimentation on downstream tasks, including both classification and segmentation for brain diseases, our model demonstrates superior performance over state-of-the-art methodologies, particularly highlighting the significance of symmetry-aware learning. Our findings advocate for the effectiveness of incorporating symmetry awareness into pretraining and set a new benchmark for medical imaging analysis, promising significant strides toward accurate and efficient diagnostic processes. Code is available at https://github.com/bitMyron/sa-swin.

Submitted 11 July, 2024; originally announced July 2024.

Comments: MICCAI 2024

ACM Class: I.2.10; I.4.10

arXiv:2407.06530 [pdf, ps, other]

RS-BNN: A Deep Learning Framework for the Optimal Beamforming Design of Rate-Splitting Multiple Access

Authors: Yiwen Wang, Yijie Mao, Sijie Ji

Abstract: Rate splitting multiple access (RSMA) relies on beamforming design for attaining spectral efficiency and energy efficiency gains over traditional multiple access schemes. While conventional optimization approaches such as weighted minimum mean square error (WMMSE) achieve suboptimal solutions for RSMA beamforming optimization, they are computationally demanding. A novel approach based on fractional programming (FP) has unveiled the optimal beamforming structure (OBS) for RSMA. This method, combined with a hyperplane fixed point iteration (HFPI) approach, named FP-HFPI, provides suboptimal beamforming solutions with identical sum rate performance but much lower computational complexity compared to WMMSE. Inspired by such an approach, in this work, a novel deep unfolding framework based on FP-HFPI, named rate-splitting-beamforming neural network (RS-BNN), is proposed to unfold the FP-HFPI algorithm. Numerical results indicate that the proposed RS-BNN attains a level of performance closely matching that of WMMSE and FP-HFPI, while dramatically reducing the computational complexity.

Submitted 8 July, 2024; originally announced July 2024.

arXiv:2407.05155 [pdf, other]

Wi-Fi Beyond Communications: Experimental Evaluation of Respiration Monitoring and Motion Detection Using COTS Devices

Authors: Jiuyu Liu, Yi Ma, Rahim Tafazolli

Abstract: Wi-Fi sensing has become an attractive option for non-invasive monitoring of human activities and vital signs. This paper explores the feasibility of using state-of-the-art commercial off-the-shelf (COTS) devices for Wi-Fi sensing applications, particularly respiration monitoring and motion detection. We utilize the Intel AX210 network interface card (NIC) to transmit Wi-Fi signals in both 2.4 GHz and 6 GHz frequency bands. Our experiments rely on channel frequency response (CFR) and received signal strength indicator (RSSI) data, which are processed using a moving average algorithm to extract human behavior patterns. The experimental results demonstrate the effectiveness of our approach in capturing and representing human respiration and motion patterns. Furthermore, we compare the performance of Wi-Fi sensing across different frequency bands, highlighting the advantages of using higher frequencies for improved sensitivity and clarity. Our findings showcase the practicality of using COTS devices for Wi-Fi sensing and lay the groundwork for the development of non-invasive, contactless sensing systems. These systems have potential applications in various fields, including healthcare, smart homes, and the Metaverse.

Submitted 6 July, 2024; originally announced July 2024.

Comments: This work has been accepted by IEEE ICCC Workshop 2024. Copyright may be transferred without notice, after which this version may no longer be accessible
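
Illustrative note: the moving-average processing mentioned in the abstract above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `moving_average`, the trailing-window form, the window length, and the sample values are all hypothetical choices made for illustration.

```python
def moving_average(samples, window):
    """Trailing moving average over a list of numeric samples.

    Assumption: a simple trailing window; the first few outputs average
    over fewer than `window` samples because no earlier data exists.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(samples):
        running_sum += value
        if i >= window:
            # Drop the sample that just left the window.
            running_sum -= samples[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


# Hypothetical RSSI-like readings in dBm (made-up values, not measured data).
rssi_dbm = [-52, -50, -55, -51, -49, -53, -50]
print(moving_average(rssi_dbm, window=3))
```

A longer window suppresses more noise but also blurs fast changes, so the window length would in practice be chosen relative to the sampling rate and the period of the pattern of interest.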

arXiv:2407.03050 [pdf, other]

Semantic-Aware Power Allocation for Generative Semantic Communications with Foundation Models

Authors: Chunmei Xu, Mahdi Boloursaz Mashhadi, Yi Ma, Rahim Tafazolli

Abstract: Recent advancements in diffusion models have made a significant breakthrough in generative modeling. The combination of the generative model and semantic communication (SemCom) enables high-fidelity semantic information exchange at ultra-low rates. A novel generative SemCom framework for image tasks is proposed, wherein pre-trained foundation models serve as semantic encoders and decoders for semantic feature extractions and image regenerations, respectively. The mathematical relationship between the transmission reliability and the perceptual quality of the regenerated image and the semantic values of semantic features are modeled, which are obtained by conducting numerical simulations on the Kodak dataset. We also investigate the semantic-aware power allocation problem, with the objective of minimizing the total power consumption while guaranteeing semantic performance. To solve this problem, two semantic-aware power allocation methods are proposed by constraint decoupling and bisection search, respectively. Numerical results show that the proposed semantic-aware methods demonstrate superior performance compared to the conventional one in terms of total power consumption.

Submitted 3 July, 2024; originally announced July 2024.
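
Illustrative note: the abstract above names bisection search as one of its two solution methods. The sketch below shows only the generic technique, finding the smallest power that meets a quality target when quality is non-decreasing in power. It is not the paper's algorithm: `min_power_bisection`, `toy_quality`, the exponential quality curve, and the numeric bounds are hypothetical and introduced purely for illustration.

```python
import math


def min_power_bisection(quality, target, p_low, p_high, tol=1e-6):
    """Smallest power in [p_low, p_high] with quality(power) >= target.

    Assumption: `quality` is non-decreasing in power, so the feasible
    set is an interval and bisection applies.
    """
    if quality(p_high) < target:
        raise ValueError("target not reachable within the power budget")
    while p_high - p_low > tol:
        mid = 0.5 * (p_low + p_high)
        if quality(mid) >= target:
            p_high = mid  # mid is feasible, so try less power
        else:
            p_low = mid   # mid is infeasible, so more power is needed
    return p_high


def toy_quality(power):
    """Hypothetical saturating quality curve, used only for this sketch."""
    return 1.0 - math.exp(-power)


p_min = min_power_bisection(toy_quality, target=0.5, p_low=0.0, p_high=10.0)
print(p_min)
```

For this toy curve the exact answer is ln 2, which gives a convenient check on the search. Each iteration halves the interval, so the number of quality evaluations grows only logarithmically with the required precision.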

arXiv:2407.00196 [pdf, other]

Multi-Satellite MIMO Systems for Direct User-Satellite Communications: A Survey

Authors: Zohre Mashayekh Bakhsh, Yasaman Omid, Gaojie Chen, Farbod Kayhan, Yi Ma, Rahim Tafazolli

Abstract: Advancements in satellite technology have made direct-to-device connectivity a viable solution for ensuring global access. This method is designed to provide internet connectivity to remote, rural, or underserved areas where traditional cellular or broadband networks are lacking or insufficient. This paper is a survey providing an in-depth review of multi-satellite Multiple Input Multiple Output (MIMO) systems as a potential solution for addressing the link budget challenge in direct user-satellite communication. Special attention is given to works considering multi-satellite MIMO systems, both with and without satellite collaboration. In this context, collaboration refers to sharing data between satellites to improve the performance of the system. This survey begins by explaining several fundamental aspects of satellite communications (SatComs), which are vital prerequisites before investigating the multi-satellite MIMO systems. These aspects encompass satellite orbits, the structure of satellite systems, SatCom links, including the inter-satellite links (ISL) which facilitate satellite cooperation, satellite frequency bands, satellite antenna design, and satellite channel models, which should be known or estimated for effective data transmission to and from multiple satellites. Furthermore, this survey distinguishes itself by providing more comprehensive insights in comparison to other surveys. It specifically delves into the Orthogonal Time Frequency Space (OTFS) within the channel model section. It goes into detail about ISL noise and channel models, and it extends the ISL section by thoroughly investigating hybrid FSO/RF ISLs. Furthermore, analytical comparisons of simulation results from these works are presented to highlight the advantages of employing multi-satellite MIMO systems.

Submitted 28 June, 2024; originally announced July 2024.

Comments: 29 pages, 11 figures, 6 tables, IEEE Communication Survey and Tutorials

arXiv:2406.18549 [pdf]

Advancements in Feature Extraction Recognition of Medical Imaging Systems Through Deep Learning Technique

Authors: Qishi Zhan, Dan Sun, Erdi Gao, Yuhan Ma, Yaxin Liang, Haowei Yang

Abstract: This study introduces a novel unsupervised medical image feature extraction method that employs spatial stratification techniques. An objective function based on weight is proposed to achieve the purpose of fast image recognition. The algorithm divides the pixels of the image into multiple subdomains and uses a quadtree to access the image. A technique for threshold optimization utilizing a simplex algorithm is presented. Aiming at the nonlinear characteristics of hyperspectral images, a generalized discriminant analysis algorithm based on kernel function is proposed. In this project, a hyperspectral remote sensing image is taken as the object, and we investigate its mathematical modeling, solution methods, and feature extraction techniques. It is found that different types of objects are independent of each other and compact in image processing. Compared with the traditional linear discrimination method, the result of image segmentation is better. This method can not only overcome the disadvantage of the traditional method which is easy to be affected by light, but also extract the features of the object quickly and accurately. It has important reference significance for clinical diagnosis.

Submitted 23 May, 2024; originally announced June 2024.

Comments: conference

arXiv:2406.16323 [pdf, other]

Low-Complexity CSI Feedback for FDD Massive MIMO Systems via Learning to Optimize

Authors: Yifan Ma, Hengtao He, Shenghui Song, Jun Zhang, Khaled B. Letaief

Abstract: In frequency-division duplex (FDD) massive multiple-input multiple-output (MIMO) systems, the growing number of base station antennas leads to prohibitive feedback overhead for downlink channel state information (CSI). To address this challenge, state-of-the-art (SOTA) fully data-driven deep learning (DL)-based CSI feedback schemes have been proposed. However, the high computational complexity and memory requirements of these methods hinder their practical deployment on resource-constrained devices like mobile phones. To solve the problem, we propose a model-driven DL-based CSI feedback approach by integrating the wisdom of compressive sensing and learning to optimize (L2O). Specifically, only a linear learnable projection is adopted at the encoder side to compress the CSI matrix, thereby significantly cutting down the user-side complexity and memory expenditure. On the other hand, the decoder incorporates two specially designed components, i.e., a learnable sparse transformation and an element-wise L2O reconstruction module. The former is developed to learn a sparse basis for CSI within the angular domain, which explores channel sparsity effectively. The latter shares the same long short-term memory (LSTM) network across all elements of the optimization variable, eliminating the retraining cost when the problem scale changes. Simulation results show that the proposed method achieves a comparable performance with the SOTA CSI feedback scheme but with much-reduced complexity, and enables multiple-rate feedback.

Submitted 24 June, 2024; originally announced June 2024.

Comments: submitted to IEEE for publication

arXiv:2406.14333 [pdf, other]

LARP: Language Audio Relational Pre-training for Cold-Start Playlist Continuation

Authors: Rebecca Salganik, Xiaohao Liu, Yunshan Ma, Jian Kang, Tat-Seng Chua

Abstract: As online music consumption increasingly shifts towards playlist-based listening, the task of playlist continuation, in which an algorithm suggests songs to extend a playlist in a personalized and musically cohesive manner, has become vital to the success of music streaming. Currently, many existing playlist continuation approaches rely on collaborative filtering methods to perform recommendation. However, such methods will struggle to recommend songs that lack interaction data, an issue known as the cold-start problem. Current approaches to this challenge design complex mechanisms for extracting relational signals from sparse collaborative data and integrating them into content representations. However, these approaches leave content representation learning out of scope and utilize frozen, pre-trained content models that may not be aligned with the distribution or format of a specific musical setting. Furthermore, even the musical state-of-the-art content modules are either (1) incompatible with the cold-start setting or (2) unable to effectively integrate cross-modal and relational signals. In this paper, we introduce LARP, a multi-modal cold-start playlist continuation model, to effectively overcome these limitations. LARP is a three-stage contrastive learning framework that integrates both multi-modal and relational signals into its learned representations. Our framework uses increasing stages of task-specific abstraction: within-track (language-audio) contrastive loss, track-track contrastive loss, and track-playlist contrastive loss. Experimental results on two publicly available datasets demonstrate the efficacy of LARP over uni-modal and multi-modal models for playlist continuation in a cold-start setting. Code and dataset are released at: https://github.com/Rsalganik1123/LARP.
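The LARP abstract names three contrastive stages without stating the loss itself. As a rough illustration only, the sketch below implements a generic InfoNCE-style contrastive loss in plain Python. The cosine similarity, the temperature of 0.1, and the function names `cosine` and `info_nce` are assumptions made for this example; they are not taken from the paper or its repository.

```python
import math

def cosine(u, v):
    # Cosine similarity between two non-zero vectors given as lists of floats.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.1):
    # Mean InfoNCE loss: positives[i] is the match for anchors[i];
    # every other entry of `positives` serves as a negative for anchors[i].
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

Under these assumptions, the same function could in principle be reused across stages by changing what counts as a pair: (text, audio) embeddings of one track, two related tracks, or a track and its playlist.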

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.14264 [pdf, other]

Zero-Shot Image Denoising for High-Resolution Electron Microscopy

Authors: Xuanyu Tian, Zhuoya Dong, Xiyue Lin, Yue Gao, Hongjiang Wei, Yanhang Ma, Jingyi Yu, Yuyao Zhang

Abstract: High-resolution electron microscopy (HREM) imaging technique is a powerful tool for directly visualizing a broad range of materials in real space. However, it faces challenges in denoising due to ultra-low signal-to-noise ratio (SNR) and scarce data availability. In this work, we propose Noise2SR, a zero-shot self-supervised learning (ZS-SSL) denoising framework for HREM. Within our framework, we propose a super-resolution (SR) based self-supervised training strategy, incorporating the Random Sub-sampler module. The Random Sub-sampler is designed to generate approximately infinite noisy pairs from a single noisy image, serving as an effective data augmentation in zero-shot denoising. Noise2SR trains the network with paired noisy images of different resolutions, which is conducted via SR strategy. The SR-based training facilitates the network adopting more pixels for supervision, and the random sub-sampling helps compel the network to learn continuous signals, enhancing the robustness. Meanwhile, we mitigate the uncertainty caused by random sampling by adopting minimum mean squared error (MMSE) estimation for the denoised results. With the distinctive integration of training strategy and proposed designs, Noise2SR can achieve superior denoising performance using a single noisy HREM image. We evaluate the performance of Noise2SR in both simulated and real HREM denoising tasks. It outperforms state-of-the-art ZS-SSL methods and achieves comparable denoising performance with supervised methods. The success of Noise2SR suggests its potential for improving the SNR of images in material imaging domains.

Submitted 20 June, 2024; originally announced June 2024.

Comments: 12 pages, 12 figures

arXiv:2406.04740 [pdf, other]

Activation Map-based Vector Quantization for 360-degree Image Semantic Communication

Authors: Yang Ma, Wenchi Cheng, Jingqing Wang, Wei Zhang

Abstract: In virtual reality (VR) applications, 360-degree images play a pivotal role in crafting immersive experiences and offering panoramic views, thus improving user Quality of Experience (QoE). However, the voluminous data generated by 360-degree images poses challenges in network storage and bandwidth. To address these challenges, we propose a novel Activation Map-based Vector Quantization (AM-VQ) framework, which is designed to reduce communication overhead for wireless transmission. The proposed AM-VQ scheme uses deep neural networks (DNNs) with vector quantization (VQ) to extract and compress semantic features. Particularly, the AM-VQ framework utilizes activation maps to adaptively quantize semantic features, thus reducing data distortion caused by the quantization operation. To further enhance the reconstruction quality of the 360-degree image, adversarial training with a generative adversarial network (GAN) discriminator is incorporated. Numerical results show that our proposed AM-VQ scheme achieves better performance than the existing deep learning (DL) based coding and the traditional coding schemes under the same transmission symbols.

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2406.02483 [pdf, other]

How Do Neural Spoofing Countermeasures Detect Partially Spoofed Audio?

Authors: Tianchi Liu, Lin Zhang, Rohan Kumar Das, Yi Ma, Ruijie Tao, Haizhou Li

Abstract: Partially manipulating a sentence can greatly change its meaning. Recent work shows that countermeasures (CMs) trained on partially spoofed audio can effectively detect such spoofing. However, the current understanding of the decision-making process of CMs is limited. We utilize Grad-CAM and introduce a quantitative analysis metric to interpret CMs' decisions. We find that CMs prioritize the artifacts of transition regions created when concatenating bona fide and spoofed audio. This focus differs from that of CMs trained on fully spoofed audio, which concentrate on the pattern differences between bona fide and spoofed parts. Our further investigation explains the varying nature of CMs' focus while making correct or incorrect predictions. These insights provide a basis for the design of CM models and the creation of datasets. Moreover, this work lays a foundation of interpretability in the field of partially spoofed audio detection that has not been well explored previously.

Submitted 4 June, 2024; originally announced June 2024.

Comments: Accepted at Interspeech 2024

arXiv:2406.02009 [pdf, other]

Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis

Authors: Kun Zhou, Shengkui Zhao, Yukun Ma, Chong Zhang, Hao Wang, Dianwen Ng, Chongjia Ni, Nguyen Trung Hieu, Jia Qi Yip, Bin Ma

Abstract: Recent language model-based text-to-speech (TTS) frameworks demonstrate scalability and in-context learning capabilities. However, they suffer from robustness issues due to the accumulation of errors in speech unit predictions during autoregressive language modeling. In this paper, we propose a phonetic enhanced language modeling method to improve the performance of TTS models. We leverage self-supervised representations that are phonetically rich as the training target for the autoregressive language model. Subsequently, a non-autoregressive model is employed to predict discrete acoustic codecs that contain fine-grained acoustic details. The TTS model focuses solely on linguistic modeling during autoregressive training, thereby reducing the error propagation that occurs in non-autoregressive training. Both objective and subjective evaluations validate the effectiveness of our proposed method.

Submitted 11 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

arXiv:2406.00233 [pdf, other]

Plug-in UL-CSI-Assisted Precoder Upsampling Approach in Cellular FDD Systems

Authors: Yu-Chien Lin, Yan Xin, Ta-Sung Lee, Charlie Zhang, Yibo Ma, Zhi Ding

Abstract: Acquiring downlink channel state information (CSI) is crucial for optimizing performance in massive Multiple Input Multiple Output (MIMO) systems operating under Frequency-Division Duplexing (FDD). Most cellular wireless communication systems employ codebook-based precoder designs, which offer advantages such as simpler, more efficient feedback mechanisms and reduced feedback overhead. Common codebook-based approaches include Type II and eType II precoding methods defined in the 3GPP standards. Feedback in these systems is typically standardized per subband (SB), allowing user equipment (UE) to select the optimal precoder from the codebook for each SB, thereby reducing feedback overhead. However, this subband-level feedback resolution may not suffice for frequency-selective channels. This paper addresses this issue by introducing an uplink CSI-assisted precoder upsampling module deployed at the gNodeB. This module upsamples SB-level precoders to resource block (RB)-level precoders, acting as a plug-in compatible with existing gNodeB or base stations.

Submitted 31 May, 2024; originally announced June 2024.

arXiv:2406.00085 [pdf, other]

Augmentation-based Unsupervised Cross-Domain Functional MRI Adaptation for Major Depressive Disorder Identification

Authors: Yunling Ma, Chaojun Zhang, Xiaochuan Wang, Qianqian Wang, Liang Cao, Limei Zhang, Mingxia Liu

Abstract: Major depressive disorder (MDD) is a common mental disorder that typically affects a person's mood, cognition, behavior, and physical health. Resting-state functional magnetic resonance imaging (rs-fMRI) data are widely used for computer-aided diagnosis of MDD. While multi-site fMRI data can provide more data for training reliable diagnostic models, significant cross-site data heterogeneity would result in poor model generalizability. Many domain adaptation methods are designed to reduce the distributional differences between sites to some extent, but usually ignore the overfitting problem of the model on the source domain. Intuitively, target data augmentation can alleviate the overfitting problem by forcing the model to learn more generalized features and reduce the dependence on source domain data. In this work, we propose a new augmentation-based unsupervised cross-domain fMRI adaptation (AUFA) framework for automatic diagnosis of MDD. The AUFA consists of 1) a graph representation learning module for extracting rs-fMRI features with spatial attention, 2) a domain adaptation module for feature alignment between source and target data, 3) an augmentation-based self-optimization module for alleviating model overfitting on the source domain, and 4) a classification module. Experimental results on 1,089 subjects suggest that AUFA outperforms several state-of-the-art methods in MDD identification. Our approach not only reduces data heterogeneity between different sites, but also localizes disease-related functional connectivity abnormalities and provides interpretability for the model.

Submitted 6 June, 2024; v1 submitted 31 May, 2024; originally announced June 2024.

arXiv:2405.09552 [pdf, other]

ODFormer: Semantic Fundus Image Segmentation Using Transformer for Optic Nerve Head Detection

Authors: Jiayi Wang, Yi-An Mao, Xiaoyu Ma, Sicen Guo, Yuting Shao, Xiao Lv, Wenting Han, Mark Christopher, Linda M. Zangwill, Yanlong Bi, Rui Fan

Abstract: Optic nerve head (ONH) detection has been a crucial area of study in ophthalmology for years. However, the significant discrepancy between fundus image datasets, each generated using a single type of fundus camera, poses challenges to the generalizability of ONH detection approaches developed based on semantic segmentation networks. Despite the numerous recent advancements in general-purpose semantic segmentation methods using convolutional neural networks (CNNs) and Transformers, there is currently a lack of benchmarks for these state-of-the-art (SoTA) networks specifically trained for ONH detection. Therefore, in this article, we make contributions from three key aspects: network design, the publication of a dataset, and the establishment of a comprehensive benchmark. Our newly developed ONH detection network, referred to as ODFormer, is based upon the Swin Transformer architecture and incorporates two novel components: a multi-scale context aggregator and a lightweight bidirectional feature recalibrator. Our published large-scale dataset, known as TongjiU-DROD, provides multi-resolution fundus images for each participant, captured using two distinct types of cameras. Our established benchmark involves three datasets: DRIONS-DB, DRISHTI-GS1, and TongjiU-DROD, created by researchers from different countries and containing fundus images captured from participants of diverse races and ages. Extensive experimental results demonstrate that our proposed ODFormer outperforms other state-of-the-art (SoTA) networks in terms of performance and generalizability. Our dataset and source code are publicly available at mias.group/ODFormer.

Submitted 2 June, 2024; v1 submitted 15 April, 2024; originally announced May 2024.

arXiv:2405.08288 [pdf, other]

Orthogonal Delay-Doppler Division Multiplexing Modulation with Tomlinson-Harashima Precoding

Authors: Yiyan Ma, Akram Shafie, Jinhong Yuan, Guoyu Ma, Zhangdui Zhong, Bo Ai

Abstract: The orthogonal delay-Doppler (DD) division multiplexing (ODDM) modulation has been recently proposed as a promising modulation scheme for next-generation communication systems with high mobility. Despite its benefits, ODDM modulation and other DD domain modulation schemes face the challenge of excessive equalization complexity. To address this challenge, we propose time domain Tomlinson-Harashima precoding (THP) for the ODDM transmitter, to make the DD domain single-tap equalizer feasible, thereby reducing the equalization complexity. In our design, we first pre-cancel the inter-symbol interference (ISI) using the linear time-varying (LTV) channel information. Second, different from classical THP designs, we introduce a modified modulo operation with an adaptive modulus, by which the joint DD domain data multiplexing and time domain ISI pre-cancellation can be realized without excessively increasing the bit errors. We then analytically study the losses encountered in this design, namely the power loss, the modulo noise loss, and the modulo signal loss. Based on this analysis, BER lower bounds of the ODDM system with time domain THP are derived when 4-QAM or 16-QAM modulations are adopted for symbol mapping in the DD domain. Finally, through numerical results, we validate our analysis and then demonstrate that the ODDM system with time domain THP is a promising solution to realize better BER performance over LTV channels compared to orthogonal frequency division multiplexing systems with single-tap equalizer and ODDM systems with maximum ratio combining.
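For readers unfamiliar with the classical technique that the ODDM paper modifies, the sketch below shows textbook Tomlinson-Harashima precoding on a real-valued, time-invariant ISI channel. It is not the paper's adaptive-modulus design for LTV channels. The function names, the fixed modulus `M`, and the assumption that the first channel tap equals 1 are all choices made for this illustration.

```python
import math

def modulo(x, M):
    # Symmetric modulo: fold x into the interval [-M, M).
    return x - 2 * M * math.floor((x + M) / (2 * M))

def thp_precode(symbols, taps, M):
    # Classical THP: subtract the ISI caused by earlier transmitted samples,
    # then fold the result back into [-M, M) to bound the transmit amplitude.
    # Assumes taps[0] == 1 (normalised main tap).
    tx = []
    for n, s in enumerate(symbols):
        isi = sum(taps[k] * tx[n - k] for k in range(1, len(taps)) if n - k >= 0)
        tx.append(modulo(s - isi, M))
    return tx

def channel(tx, taps):
    # Noise-free causal FIR channel.
    return [
        sum(taps[k] * tx[n - k] for k in range(len(taps)) if n - k >= 0)
        for n in range(len(tx))
    ]

def thp_receive(rx, M):
    # The receiver applies the same modulo; no further equalisation is needed.
    return [modulo(y, M) for y in rx]
```

With symbols drawn from {-1, +1} and `M = 2`, passing the output of `thp_precode` through `channel` and then `thp_receive` recovers the original symbols in the noise-free case, which is the sense in which precoding makes a single-tap receiver sufficient.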

Submitted 13 May, 2024; originally announced May 2024.

arXiv:2405.00739 [pdf, other]

Why does Knowledge Distillation Work? Rethink its Attention and Fidelity Mechanism

Authors: Chenqi Guo, Shiwei Zhong, Xiaofeng Liu, Qianli Feng, Yinglong Ma

Abstract: Does Knowledge Distillation (KD) really work? Conventional wisdom viewed it as a knowledge transfer procedure where a perfect mimicry of the student to its teacher is desired. However, paradoxical studies indicate that closely replicating the teacher's behavior does not consistently improve student generalization, posing questions on its possible causes. Confronted with this gap, we hypothesize that diverse attentions in teachers contribute to better student generalization at the expense of reduced fidelity in ensemble KD setups. By increasing data augmentation strengths, our key findings reveal a decrease in the Intersection over Union (IoU) of attentions between teacher models, leading to reduced student overfitting and decreased fidelity. We propose this low-fidelity phenomenon as an underlying characteristic rather than a pathology when training KD. This suggests that stronger data augmentation fosters a broader perspective provided by the divergent teacher ensemble and lower student-teacher mutual information, benefiting generalization performance. These insights clarify the mechanism behind the low-fidelity phenomenon in KD. Thus, we offer new perspectives on optimizing student model performance, by emphasizing increased diversity in teacher attentions and reduced mimicry behavior between teachers and student.
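The abstract measures teacher diversity by the Intersection over Union (IoU) of attention maps but does not say how the maps are binarised. A minimal sketch, assuming each map is a flat list of scores in [0, 1] and that a fixed threshold of 0.5 marks the attended region (both assumptions for illustration, not the paper's procedure):

```python
def attention_iou(map_a, map_b, threshold=0.5):
    # Binarise both attention maps, then compare the attended regions.
    mask_a = [value >= threshold for value in map_a]
    mask_b = [value >= threshold for value in map_b]
    intersection = sum(a and b for a, b in zip(mask_a, mask_b))
    union = sum(a or b for a, b in zip(mask_a, mask_b))
    # Convention chosen here: two empty masks count as identical.
    return intersection / union if union else 1.0
```

A lower value means the two teachers attend to more disjoint regions, which is the direction the abstract associates with stronger data augmentation.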

Submitted 29 April, 2024; originally announced May 2024.

arXiv:2404.18081 [pdf, other]

ComposerX: Multi-Agent Symbolic Music Composition with LLMs

Authors: Qixin Deng, Qikai Yang, Ruibin Yuan, Yipeng Huang, Yi Wang, Xubo Liu, Zeyue Tian, Jiahao Pan, Ge Zhang, Hanfeng Lin, Yizhi Li, Yinghao Ma, Jie Fu, Chenghua Lin, Emmanouil Benetos, Wenwu Wang, Guangyu Xia, Wei Xue, Yike Guo

Abstract: Music composition represents the creative side of humanity, and itself is a complex task that requires abilities to understand and generate information with long dependency and harmony constraints. While demonstrating impressive capabilities in STEM subjects, current LLMs easily fail in this task, generating ill-written music even when equipped with modern techniques like In-Context-Learning and Chain-of-Thoughts. To further explore and enhance LLMs' potential in music composition by leveraging their reasoning ability and the large knowledge base in music history and theory, we propose ComposerX, an agent-based symbolic music generation framework. We find that applying a multi-agent approach significantly improves the music composition quality of GPT-4. The results demonstrate that ComposerX is capable of producing coherent polyphonic music compositions with captivating melodies, while adhering to user instructions.

Submitted 30 April, 2024; v1 submitted 28 April, 2024; originally announced April 2024.

arXiv:2404.12604 [pdf, ps, other]

Transmitter Side Beyond-Diagonal RIS for mmWave Integrated Sensing and Communications

Authors: Kexin Chen, Yijie Mao

Abstract: This work initiates the study of a beyond-diagonal reconfigurable intelligent surface (BD-RIS)-aided transmitter architecture for integrated sensing and communication (ISAC) in the millimeter-wave (mmWave) frequency band. Deploying BD-RIS at the transmitter side not only alleviates the need for extensive fully digital radio frequency (RF) chains but also enhances both communication and sensing performance. These benefits are facilitated by the additional design flexibility introduced by the fully-connected scattering matrix of BD-RIS. To achieve the aforementioned benefits, in this work, we propose an efficient two-stage algorithm to design the digital beamforming of the transmitter and the scattering matrix of the BD-RIS with the aim of jointly maximizing the sum rate for multiple communication users and minimizing the largest eigenvalue of the Cramer-Rao bound (CRB) matrix for multiple sensing targets. Numerical results show that the transmitter-side BD-RIS-aided mmWave ISAC outperforms the conventional diagonal-RIS-aided ones in both communication and sensing performance.

Submitted 25 April, 2024; v1 submitted 18 April, 2024; originally announced April 2024.

arXiv:2404.12595 [pdf, other]

Deep Reinforcement Learning-aided Transmission Design for Energy-efficient Link Optimization in Vehicular Communications

Authors: Zhengpeng Wang, Yanqun Tang, Yingzhe Mao, Tao Wang, Xiunan Huang

Abstract: This letter presents a deep reinforcement learning (DRL) approach for transmission design to optimize the energy efficiency in vehicle-to-vehicle (V2V) communication links. Considering the dynamic environment of vehicular communications, the optimization problem is non-convex and mathematically difficult to solve. Hence, we propose scenario identification-based double and Dueling deep Q-Network (SI-D3QN), a DRL algorithm integrating both double deep Q-Network and Dueling deep Q-Network, for the joint design of modulation and coding scheme (MCS) selection and power control. To be more specific, we employ the SI technique to enhance link performance and assist the D3QN agent in refining its decision-making processes. The experiment results demonstrate that, across various optimization tasks, our proposed SI-D3QN agent outperforms the benchmark algorithms in terms of the valid actions and link performance metrics. Particularly, while ensuring significant improvement in energy efficiency, the agent facilitates a 29.6% enhancement in the link throughput under the same energy consumption.

Submitted 18 April, 2024; originally announced April 2024.

Comments: 5 pages, 3 figures

arXiv:2404.11383 [pdf, other]

Lower Limb Movements Recognition Based on Feature Recursive Elimination and Backpropagation Neural Network

Authors: Yongkai Ma, Shili Liang, Zekun Chen

Abstract: Surface electromyographic (sEMG) signals serve as a signal source commonly used for lower limb movement recognition, reflecting the intent of human movement. However, it has been a challenge in this research area to improve the movement recognition rate while using fewer features. In this paper, a method for lower limb movement recognition based on support vector machine recursive feature elimination and a backpropagation neural network is proposed. First, the sEMG signals of five subjects performing eight different lower limb movements were recorded using a BIOPAC collector. The optimal feature subset consists of 25 feature vectors, determined using Recursive Feature Elimination based on Support Vector Machine (SVM-RFE). Finally, this study used five supervised classification algorithms to recognize these eight different lower limb movements. The results of the experimental study show that the combination of the BPNN classifier and the SVM-RFE feature selection algorithm is able to achieve an excellent action recognition accuracy of 95%, which provides sufficient support for the feasibility of this approach.

Submitted 17 April, 2024; originally announced April 2024.

arXiv:2404.10343 [pdf, other]

The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report

Authors: Bin Ren, Yawei Li, Nancy Mehta, Radu Timofte, Hongyuan Yu, Cheng Wan, Yuxin Hong, Bingnan Han, Zhuoyuan Wu, Yajun Zou, Yuqing Liu, Jizhe Li, Keji He, Chao Fan, Heng Zhang, Xiaolin Zhang, Xuanwu Yin, Kunlong Zuo, Bohao Liao, Peizhe Xia, Long Peng, Zhibo Du, Xin Di, Wangkai Li, Yang Wang , et al. (109 additional authors not shown)

Abstract: This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such as runtime, parameters, and FLOPs, while still maintaining a peak signal-to-noise ratio (PSNR) of approximately 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. In addition, this challenge has 4 tracks including the main track (overall performance), sub-track 1 (runtime), sub-track 2 (FLOPs), and sub-track 3 (parameters). In the main track, all three metrics (i.e., runtime, FLOPs, and parameter count) were considered. The ranking of the main track is calculated based on a weighted sum of the scores of all other sub-tracks. In sub-track 1, the practical runtime performance of the submissions was evaluated, and the corresponding score was used to determine the ranking. In sub-track 2, the number of FLOPs was considered. The score calculated based on the corresponding FLOPs was used to determine the ranking. In sub-track 3, the number of parameters was considered. The score calculated based on the corresponding parameters was used to determine the ranking. RLFN is set as the baseline for efficiency measurement. The challenge had 262 registered participants, and 34 teams made valid submissions. They gauge the state-of-the-art in efficient single-image super-resolution. To facilitate the reproducibility of the challenge and enable other researchers to build upon these findings, the code and the pre-trained model of validated solutions are made publicly available at https://github.com/Amazingren/NTIRE2024_ESR/.

Submitted 25 June, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

Comments: The report paper of NTIRE2024 Efficient Super-resolution, accepted by CVPRW2024
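
The PSNR target quoted in the abstract above can be made concrete with a minimal sketch. It uses the standard definition PSNR = 10 log10(MAX^2 / MSE); the function name `psnr`, the 8-bit peak value of 255, and the toy pixel values are assumptions for illustration, and the challenge's own evaluation protocol (colour space, border cropping) is not reproduced here.

```python
import math


def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB for two equally sized pixel sequences."""
    if len(reference) != len(reconstructed) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error between the two flattened images.
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 4-pixel "images" (illustrative values only).
reference = [52, 55, 61, 59]
reconstructed = [54, 55, 60, 57]
value = psnr(reference, reconstructed)
print(f"PSNR: {value:.2f} dB")
```

A higher value means a closer reconstruction, so a fixed PSNR floor lets entries compete on runtime, FLOPs, and parameter count alone.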

arXiv:2404.07473 [pdf]

LUCF-Net: Lightweight U-shaped Cascade Fusion Network for Medical Image Segmentation

Authors: Songkai Sun, Qingshan She, Yuliang Ma, Rihui Li, Yingchun Zhang

Abstract: In this study, the performance of existing U-shaped neural network architectures was enhanced for medical image segmentation by adding Transformer. Although Transformer architectures are powerful at extracting global information, their ability to capture local information is limited due to their high complexity. To address this challenge, we proposed a new lightweight U-shaped cascade fusion network (LUCF-Net) for medical image segmentation. It utilized an asymmetrical structural design and incorporated both local and global modules to enhance its capacity for local and global modeling. Additionally, a multi-layer cascade fusion decoding network was designed to further bolster the network's information fusion capabilities. Validation results achieved on multi-organ datasets in CT format, cardiac segmentation datasets in MRI format, and dermatology datasets in image format demonstrated that the proposed model outperformed other state-of-the-art methods in handling local-global information, achieving an improvement of 1.54% in Dice coefficient and 2.6 mm in Hausdorff distance on multi-organ segmentation. Furthermore, as a network that combines Convolutional Neural Network and Transformer architectures, it achieves competitive segmentation performance with only 6.93 million parameters and 6.6 gigabytes of floating point operations, without the need for pre-training. In summary, the proposed method demonstrated enhanced performance while retaining a simpler model design compared to other Transformer-based segmentation networks.

Submitted 11 April, 2024; originally announced April 2024.
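
The Dice coefficient reported in the abstract above is defined for binary masks as 2|P ∩ T| / (|P| + |T|). A minimal sketch follows; the function name `dice_coefficient` and the toy masks are assumptions for illustration, not the authors' evaluation code.

```python
def dice_coefficient(pred, target):
    """Dice overlap between two flattened binary masks (sequences of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Pixels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # both masks empty: treated here as perfect agreement
    return 2.0 * intersection / total


# Hypothetical 6-pixel masks (illustrative values only).
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 1, 1, 0]
score = dice_coefficient(pred, target)
print(f"Dice: {score:.3f}")
```

Treating two empty masks as a perfect match is one common convention; some toolkits return 0 or leave the case undefined.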

arXiv:2404.06393 [pdf, other]

MuPT: A Generative Symbolic Music Pretrained Transformer

Authors: Xingwei Qu, Yuelin Bai, Yinghao Ma, Ziya Zhou, Ka Man Lo, Jiaheng Liu, Ruibin Yuan, Lejun Min, Xueling Liu, Tianyu Zhang, Xinrun Du, Shuyue Guo, Yiming Liang, Yizhi Li, Shangda Wu, Junting Zhou, Tianyu Zheng, Ziyang Ma, Fengze Han, Wei Xue, Gus Xia, Emmanouil Benetos, Xiang Yue, Chenghua Lin, Xu Tan, et al. (4 additional authors not shown)

Abstract: In this paper, we explore the application of Large Language Models (LLMs) to the pre-training of music. While the prevalent use of MIDI in music modeling is well-established, our findings suggest that LLMs are inherently more compatible with ABC Notation, which aligns more closely with their design and strengths, thereby enhancing the model's performance in musical composition. To address the challenges associated with misaligned measures from different tracks during generation, we propose the development of a Synchronized Multi-Track ABC Notation (SMT-ABC Notation), which aims to preserve coherence across multiple musical tracks. Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set. Furthermore, we explore the implications of the Symbolic Music Scaling Law (SMS Law) on model performance. The results indicate a promising direction for future research in music generation, offering extensive resources for community-led research through our open-source contributions.

Submitted 10 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

arXiv:2404.04916 [pdf, other]

Correcting Diffusion-Based Perceptual Image Compression with Privileged End-to-End Decoder

Authors: Yiyang Ma, Wenhan Yang, Jiaying Liu

Abstract: The images produced by diffusion models can attain excellent perceptual quality. However, it is challenging for diffusion models to guarantee distortion, hence the integration of diffusion models and image compression models still needs more comprehensive explorations. This paper presents a diffusion-based image compression method that employs a privileged end-to-end decoder model as correction, which achieves better perceptual quality while guaranteeing the distortion to an extent. We build a diffusion model and design a novel paradigm that combines the diffusion model and an end-to-end decoder, and the latter is responsible for transmitting the privileged information extracted at the encoder side. Specifically, we theoretically analyze the reconstruction process of the diffusion models at the encoder side with the original images being visible. Based on the analysis, we introduce an end-to-end convolutional decoder to provide a better approximation of the score function $\nabla_{\mathbf{x}_t}\log p(\mathbf{x}_t)$ at the encoder side and effectively transmit the combination. Experiments demonstrate the superiority of our method in both distortion and perception compared with previous perceptual compression methods.

Submitted 2 May, 2024; v1 submitted 7 April, 2024; originally announced April 2024.

Comments: Accepted by ICML 2024

arXiv:2404.01716 [pdf, other]

Effective internal language model training and fusion for factorized transducer model

Authors: Jinxi Guo, Niko Moritz, Yingyi Ma, Frank Seide, Chunyang Wu, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer

Abstract: The internal language model (ILM) of the neural transducer has been widely studied. In most prior work, it is mainly used for estimating the ILM score and is subsequently subtracted during inference to facilitate improved integration with external language models. Recently, various factorized transducer models have been proposed, which explicitly embrace a standalone internal language model for non-blank token prediction. However, even with the adoption of factorized transducer models, limited improvement has been observed compared to shallow fusion. In this paper, we propose a novel ILM training and decoding strategy for factorized transducer models, which effectively combines the blank, acoustic and ILM scores. Our experiments show a 17% relative improvement over the standard decoding method when utilizing a well-trained ILM and the proposed decoding strategy on LibriSpeech datasets. Furthermore, when compared to a strong RNN-T baseline enhanced with external LM fusion, the proposed model yields a 5.5% relative improvement on general-sets and an 8.9% WER reduction for rare words. The proposed model can achieve superior performance without relying on external language models, rendering it highly efficient for production use-cases. To further improve the performance, we propose a novel and memory-efficient ILM-fusion-aware minimum word error rate (MWER) training method which improves ILM integration significantly.

Submitted 2 April, 2024; originally announced April 2024.

Comments: Accepted to ICASSP 2024
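
The prior-work practice mentioned in the abstract above, subtracting an estimated ILM score at inference so that an external language model integrates better, can be sketched as a log-linear score combination. This illustrates that general idea only, not the decoding strategy the paper proposes; the function name `fused_score`, the weights, and the log-probabilities are all hypothetical.

```python
def fused_score(log_p_transducer, log_p_ilm, log_p_external,
                ilm_weight=0.3, ext_weight=0.5):
    """Hypothesis score with the ILM estimate subtracted and an external LM added.

    All inputs are log-probabilities; both weights are assumed tuning knobs.
    """
    return log_p_transducer - ilm_weight * log_p_ilm + ext_weight * log_p_external


# Hypothetical scores: (transducer, internal LM, external LM) per hypothesis.
hypotheses = {
    "recognize speech": (-4.0, -6.0, -3.0),
    "wreck a nice beach": (-3.8, -5.0, -9.0),
}
best = max(hypotheses, key=lambda h: fused_score(*hypotheses[h]))
print(best)
```

In this toy case the second hypothesis has the higher transducer score on its own, but the external LM penalises it enough that the first one wins after fusion.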

arXiv:2403.19127 [pdf, ps, other]

Decentralizing Coherent Joint Transmission Precoding via Fast ADMM with Deterministic Equivalents

Authors: Xinyu Bian, Yuhao Liu, Yizhou Xu, Tianqi Hou, Wenjie Wang, Yuyi Mao, Jun Zhang

Abstract: Inter-cell interference (ICI) suppression is critical for multi-cell multi-user networks. In this paper, we investigate advanced precoding techniques for coordinated multi-point (CoMP) with downlink coherent joint transmission, an effective approach for ICI suppression. Different from the centralized precoding schemes that require frequent information exchange among the cooperating base stations, we propose a decentralized scheme to minimize the total power consumption. In particular, based on the covariance matrices of global channel state information, we estimate the ICI bounds via the deterministic equivalents and decouple the original design problem into sub-problems, each of which can be solved in a decentralized manner. To solve the sub-problems at each base station, we develop a low-complexity solver based on the alternating direction method of multipliers (ADMM) in conjunction with the convex-concave procedure (CCCP). Simulation results demonstrate the effectiveness of our proposed decentralized precoding scheme, which achieves performance similar to the optimal centralized precoding scheme. Besides, our proposed ADMM solver can substantially reduce the computational complexity, while maintaining outstanding performance.

Submitted 27 March, 2024; originally announced March 2024.

arXiv:2403.11155 [pdf, other]

Interactive $360^{\circ}$ Video Streaming Using FoV-Adaptive Coding with Temporal Prediction

Authors: Yixiang Mao, Liyang Sun, Yong Liu, Yao Wang

Abstract: For $360^{\circ}$ video streaming, FoV-adaptive coding that allocates more bits for the predicted user's field of view (FoV) is an effective way to maximize the rendered video quality under the limited bandwidth. We develop a low-latency FoV-adaptive coding and streaming system for interactive applications that is robust to bandwidth variations and FoV prediction errors. To minimize the end-to-end delay and yet maximize the coding efficiency, we propose a frame-level FoV-adaptive inter-coding structure. In each frame, regions that are in or near the predicted FoV are coded using temporal and spatial prediction, while a small rotating region is coded with spatial prediction only. This rotating intra region periodically refreshes the entire frame, thereby providing robustness to both FoV prediction errors and frame losses due to transmission errors. The system adapts the sizes and rates of different regions for each video segment to maximize the rendered video quality under the predicted bandwidth constraint. Integrating such frame-level FoV adaptation with temporal prediction is challenging due to the temporal variations of the FoV. We propose novel ways for modeling the influence of FoV dynamics on the quality-rate performance of temporal predictive coding. We further develop LSTM-based machine learning models to predict the user's FoV and network bandwidth. The proposed system is compared with three benchmark systems, using real-world network bandwidth traces and FoV traces, and is shown to significantly improve the rendered video quality, while achieving very low end-to-end delay and low frame-freeze probability.

Submitted 17 March, 2024; originally announced March 2024.

arXiv:2403.09958 [pdf, other]

Decentralizing Coherent Joint Transmission Precoding via Deterministic Equivalents

Authors: Yuhao Liu, Xinyu Bian, Yizhou Xu, Tianqi Hou, Wenjie Wang, Yuyi Mao, Jun Zhang

Abstract: In order to control the inter-cell interference for a multi-cell multi-user multiple-input multiple-output network, we consider the precoder design for coordinated multi-point with downlink coherent joint transmission. To avoid costly information exchange among the cooperating base stations in a centralized precoding scheme, we propose a decentralized one by considering the power minimization problem. By approximating the inter-cell interference using the deterministic equivalents, this problem is decoupled into sub-problems which are solved in a decentralized manner at different base stations. Simulation results demonstrate the effectiveness of our proposed decentralized precoding scheme, where only 2% to 7% more transmit power is needed compared with the optimal centralized precoder.

Submitted 14 March, 2024; originally announced March 2024.

arXiv:2402.17996 [pdf, ps, other]

Joint Activity-Delay Detection and Channel Estimation for Asynchronous Massive Random Access: A Free Probability Theory Approach

Authors: Xinyu Bian, Yuyi Mao, Jun Zhang

Abstract: Grant-free random access (RA) has been recognized as a promising solution to support massive connectivity due to the removal of the uplink grant request procedures. While most endeavours assume perfect synchronization among users and the base station, this paper investigates asynchronous grant-free massive RA, and develops efficient algorithms for joint user activity detection, synchronization delay detection, and channel estimation. Considering the sparsity of user activity, we formulate a sparse signal recovery problem and propose to utilize the framework of orthogonal approximate message passing (OAMP) to deal with the non-independent and identically distributed (i.i.d.) Gaussian pilot matrices caused by the synchronization delays. In particular, an OAMP-based algorithm is developed to fully harness the common sparsity among received pilot signals from multiple base station antennas. To reduce the computational complexity, we further propose a free probability AMP (FPAMP)-based algorithm, which exploits the rectangular free cumulants to make the cost-effective AMP framework compatible with general pilot matrices. Simulation results demonstrate that the two proposed algorithms outperform various baselines, and the FPAMP-based algorithm reduces 40% of the computations while maintaining comparable detection/estimation accuracy with the OAMP-based algorithm.

Submitted 27 February, 2024; originally announced February 2024.

Comments: arXiv admin note: text overlap with arXiv:2305.12372

arXiv:2402.17487 [pdf, other]

Bit Rate Matching Algorithm Optimization in JPEG-AI Verification Model

Authors: Panqi Jia, A. Burakhan Koyuncu, Jue Mao, Ze Cui, Yi Ma, Tiansheng Guo, Timofey Solovyev, Alexander Karabutov, Yin Zhao, Jing Wang, Elena Alshina, Andre Kaup

Abstract: The research on neural network (NN) based image compression has shown superior performance compared to classical compression frameworks. Unlike the hand-engineered transforms in the classical frameworks, NN-based models learn the non-linear transforms providing more compact bit representations, and achieve faster coding speed on parallel devices over their classical counterparts. Those properties evoked the attention of both scientific and industrial communities, resulting in the standardization activity JPEG-AI. The verification model for the standardization process of JPEG-AI is already in development and has surpassed the advanced VVC intra codec. To generate reconstructed images with the desired bits per pixel and assess the BD-rate performance of both the JPEG-AI verification model and VVC intra, bit rate matching is employed. However, the current state of the JPEG-AI verification model experiences significant slowdowns during bit rate matching, resulting in suboptimal performance due to an unsuitable model. The proposed methodology offers a gradual algorithmic optimization for matching bit rates, resulting in a fourfold acceleration and over 1% improvement in BD-rate at the base operation point. At the high operation point, the acceleration increases up to sixfold.

Submitted 27 February, 2024; originally announced February 2024.

Comments: Accepted at (IEEE) PCS 2024; 6 pages

arXiv:2402.16153 [pdf, other]

ChatMusician: Understanding and Generating Music Intrinsically with LLM

Authors: Ruibin Yuan, Hanfeng Lin, Yi Wang, Zeyue Tian, Shangda Wu, Tianhao Shen, Ge Zhang, Yuhang Wu, Cong Liu, Ziya Zhou, Ziyang Ma, Liumeng Xue, Ziyu Wang, Qin Liu, Tianyu Zheng, Yizhi Li, Yinghao Ma, Yiming Liang, Xiaowei Chi, Ruibo Liu, Zili Wang, Pengfei Li, Jingcheng Wu, Chenghua Lin, Qifeng Liu, et al. (10 additional authors not shown)

Abstract: While Large Language Models (LLMs) demonstrate impressive capabilities in text generation, we find that their ability has yet to be generalized to music, humanity's creative language. We introduce ChatMusician, an open-source LLM that integrates intrinsic musical abilities. It is based on continual pre-training and finetuning LLaMA2 on a text-compatible music representation, ABC notation, and the music is treated as a second language. ChatMusician can understand and generate music with a pure text tokenizer without any external multi-modal neural structures or tokenizers. Interestingly, endowing musical abilities does not harm language abilities, even achieving a slightly higher MMLU score. Our model is capable of composing well-structured, full-length music, conditioned on texts, chords, melodies, motifs, musical forms, etc., surpassing the GPT-4 baseline. On our meticulously curated college-level music understanding benchmark, MusicTheoryBench, ChatMusician surpasses LLaMA2 and GPT-3.5 in the zero-shot setting by a noticeable margin. Our work reveals that LLMs can be an excellent compressor for music, but there remains significant territory to be conquered. We release our 4B token music-language corpora MusicPile, the collected MusicTheoryBench, code, model and demo on GitHub.

Submitted 25 February, 2024; originally announced February 2024.

Comments: GitHub: https://shanghaicannon.github.io/ChatMusician/

arXiv:2402.15334 [pdf, other]

Iterative Inversion of (ELAA-)MIMO Channels Using Symmetric Rank-$1$ Regularization

Authors: Jinfei Wang, Yi Ma, Rahim Tafazolli

Abstract: While iterative matrix inversion methods excel in computational efficiency, memory optimization, and support for parallel and distributed computing when managing large matrices, their limitations are also evident in multiple-input multiple-output (MIMO) fading channels. These methods encounter challenges related to slow convergence and diminished accuracy, especially in ill-conditioned scenarios, hindering their application in future MIMO networks such as extra-large aperture array (ELAA). To address these challenges, this paper proposes a novel matrix regularization method termed symmetric rank-$1$ regularization (SR-$1$R). The proposed method functions by augmenting the channel matrix with a symmetric rank-$1$ matrix, with the primary goal of minimizing the condition number of the resultant regularized matrix. This significantly improves the matrix condition, enabling fast and accurate iterative inversion of the regularized matrix. Then, the inverse of the original channel matrix is obtained by applying the Sherman-Morrison transform on the outcome of iterative inversions. Our eigenvalue analysis unveils the best channel condition that can be achieved by an optimized SR-$1$R matrix. Moreover, a power iteration-assisted (PIA) approach is proposed to find the optimum SR-$1$R matrix without the need for eigenvalue decomposition. The proposed approach exhibits logarithmic algorithm-depth in parallel computing for MIMO precoding. Finally, computer simulations demonstrate that SR-$1$R has the potential to reduce the number of iterations by up to $33\%$, while also significantly improving symbol error probability by approximately an order of magnitude.

Submitted 23 February, 2024; originally announced February 2024.

Comments: 13 pages, 12 figures
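
The two-stage idea in the abstract above (invert a rank-1-regularized matrix iteratively, then recover the original inverse with the Sherman-Morrison formula) can be sketched on a toy real-valued example. Several things here are assumptions and not the paper's method: the regularized matrix is taken to be A = H + u u^T, the vector u is arbitrary and not optimized for condition number, and Newton-Schulz stands in for whatever iterative inversion the authors use. All function names are hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def newton_schulz_inverse(a, iterations=30):
    """Iterative inverse X <- X(2I - AX), started from A^T / (||A||_1 ||A||_inf)."""
    n = len(a)
    norm_1 = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))
    norm_inf = max(sum(abs(v) for v in row) for row in a)
    x = [[a[j][i] / (norm_1 * norm_inf) for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        ax = matmul(a, x)
        corr = [[(2.0 if i == j else 0.0) - ax[i][j] for j in range(n)]
                for i in range(n)]
        x = matmul(x, corr)
    return x


def sherman_morrison_downdate(a_inv, u):
    """Return (A - u u^T)^{-1} from A^{-1} using the Sherman-Morrison formula."""
    n = len(u)
    w = [sum(a_inv[i][k] * u[k] for k in range(n)) for i in range(n)]  # A^{-1} u
    z = [sum(u[k] * a_inv[k][j] for k in range(n)) for j in range(n)]  # u^T A^{-1}
    denom = 1.0 - sum(u[k] * w[k] for k in range(n))
    if abs(denom) < 1e-12:
        raise ValueError("rank-1 downdate is singular")
    return [[a_inv[i][j] + w[i] * z[j] / denom for j in range(n)]
            for i in range(n)]


# Toy 2x2 "channel" matrix H and an arbitrary (assumed) regularization vector u.
h = [[2.0, 1.0], [1.0, 3.0]]
u = [1.0, 0.5]
a = [[h[i][j] + u[i] * u[j] for j in range(2)] for i in range(2)]  # A = H + u u^T
a_inv = newton_schulz_inverse(a)            # iterative inversion of A
h_inv = sherman_morrison_downdate(a_inv, u)  # recover H^{-1}
print(h_inv)
```

The downdate costs only a few matrix-vector products, so nearly all the work sits in the iterative inversion, which is the stage that benefits from a better-conditioned matrix.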

arXiv:2402.15047 [pdf]

Networked Collaborative Sensing using Multi-domain Measurements: Architectures, Performance Limits and Algorithms

Authors: Yihua Ma, Shuqiang Xia, Chen bai, Yuxin Wang, Zhongbin Wang, Songqian Li

Abstract: As a promising 6G technology, integrated sensing and communication (ISAC) gains growing interest. ISAC provides integration gain via sharing spectrum, hardware, and software. However, concerns exist regarding its sensing performance when compared to dedicated radar systems. To address this issue, the advantages of widely deployed networks should be utilized, and this paper proposes networked collaborative sensing (NCS) using multi-domain measurements (MM), including range, Doppler, and two-dimension angle of arrival. In the NCS-MM architecture, this paper proposes a novel multi-domain decoupling model and a novel guard band-based protocol. The proposed model simplifies multi-domain derivations and algorithm designs, and the proposed protocol conserves resources and mitigates NCS interference. To determine the performance limits, this paper derives the Cramér-Rao lower bound (CRLB) of three-dimension position and velocity in NCS-MM. An accumulated single-dimension channel model is used to obtain the CRLB of MM, which is proven to be equivalent to that of the multi-dimension model. The algorithms of both MM estimation and fusion are proposed. An arbitrary-dimension Newtonized orthogonal matched pursuit (AD-NOMP) is proposed to accurately estimate grid-less MM. The degree-of-freedom (DoF) of MM is analyzed, and a novel DoF-based two-stage weighted least squares (TSWLS) is proposed to reduce equations without DoF loss. The numerical results show that the performances of the proposed algorithms are close to their performance limits.

Submitted 22 February, 2024; originally announced February 2024.

arXiv:2402.10071 [pdf, other]

Approximate Message Passing-Enhanced Graph Neural Network for OTFS Data Detection

Authors: Wenhao Zhuang, Yuyi Mao, Hengtao He, Lei Xie, Shenghui Song, Yao Ge, Zhi Ding

Abstract: Orthogonal time frequency space (OTFS) modulation has emerged as a promising solution to support high-mobility wireless communications, for which cost-effective data detectors are critical. Although graph neural network (GNN)-based data detectors can achieve decent detection accuracy at reasonable computational cost, they fail to best harness prior information of transmitted data. To further minimize the data detection error of OTFS systems, this letter develops an AMP-GNN-based detector, leveraging the approximate message passing (AMP) algorithm to iteratively improve the symbol estimates of a GNN. Given that the inter-Doppler interference (IDI) symbols incur substantial computational overhead to the constructed GNN, learning-based IDI approximation is implemented to sustain low detection complexity. Simulation results demonstrate a remarkable bit error rate (BER) performance achieved by the proposed AMP-GNN-based detector compared to existing baselines. Meanwhile, the proposed IDI approximation scheme avoids a large number of computations with negligible BER degradation.

Submitted 14 April, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

Comments: 8 pages, 7 figures, and 3 tables. Part of this article was submitted to IEEE for possible publication

arXiv:2402.08445 [pdf, other]

$1$-Bit SubTHz RIS with Planar Tightly Coupled Dipoles: Beam Shaping and Prototypes

Authors: Xianjun Ma, Yonggang Zhou, Qi Luo, Yihan Ma, Kyriakos Stylianopoulos, George C. Alexandropoulos

Abstract: In this paper, a proof-of-concept study of a $1$-bit wideband reconfigurable intelligent surface (RIS) comprising planar tightly coupled dipoles (PTCD) is presented. The developed RIS operates at subTHz frequencies and a $3$-dB gain bandwidth of $27.4\%$ with the center frequency at $102$ GHz is shown to be obtainable via full-wave electromagnetic simulations. The binary phase shift offered by each RIS unit element is enabled by changing the polarization of the reflected wave by $180^\circ$. The proposed PTCD-based RIS has a planar configuration with one dielectric layer bonded to a ground plane, and hence, it can be fabricated by using cost-effective printed circuit board (PCB) technology. We analytically calculate the response of the entire designed RIS and showcase that a good agreement between that result and equivalent full-wave simulations is obtained. To efficiently compute the $1$-bit RIS response for different pointing directions, thus, designing a directive beam codebook, we devise a fast approximate beamforming optimization approach, which is compared with time-consuming full-wave simulations. Finally, to prove our concept, we present several passive prototypes with frozen beams for the proposed $1$-bit wideband RIS.

Submitted 13 February, 2024; originally announced February 2024.

Comments: 5 pages, 11 figures, 18th European Conference on Antennas and Propagation (EuCAP) - to be presented

arXiv:2401.17014 [pdf, other]

Near-Field Fading Channel Modeling for ELAAs: From Communication to ISAC

Authors: Jiuyu Liu, Yi Ma, Ahmed Elzanaty, Rahim Tafazolli

Abstract: Extremely large aperture array (ELAA) is anticipated to serve as a pivotal feature of future multiple-input multiple-output (MIMO) systems in 6G. Near-field (NF) fading channel models are essential for reliable link-level simulation and ELAA system design. In this article, we propose a framework designed to generate NF fading channels for both communication and integrated sensing and communication (ISAC) applications. The framework allows a mix of line of sight (LoS) and non-LoS (NLoS) links. It also considers the spherical wave model and spatially non-stationary shadow fading. Based on this framework, we propose a three-dimensional (3D) fading channel model for ELAA systems deployed with a uniform rectangular array (URA). It can capture the impact of sensing objects for ISAC applications. Moreover, all parameters involved in the framework are based on specifications or measurements from the 3rd Generation Partnership Project (3GPP) documents. Therefore, the proposed framework and channel model have the potential to contribute to the standard in various aspects, including ISAC, extra-large (XL-) MIMO, and reconfigurable intelligent surface (RIS) aided MIMO systems. Finally, future directions for ELAA are presented, including not only NF channel modeling but also the design of next-generation transceivers.

Submitted 30 January, 2024; originally announced January 2024.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2401.15955 [pdf]

A Novel Geometric Solution for Moving Target Localization through Multistatic Sensing in the ISAC System

Authors: S. Zhuge, Y. Ma, Z. Lin, Y. Zeng

Abstract: This paper proposes a novel geometric solution for tracking a moving target through multistatic sensing. In contrast to existing two-step weighted least square (2SWLS) methods which use the bistatic range (BR) and bistatic range rate (BRR) measurements, the proposed method incorporates an additional direction of arrival (DOA) measurement of the target obtained from a communication receiver in an integrated sensing and communication (ISAC) system. Unlike the existing 2SWLS methods that require at least three transmitter-receiver (TX-RX) pairs to operate, the proposed algorithm can conduct location estimation with a single TX-RX pair and velocity estimation with two TX-RX pairs. Simulations reveal that the proposed method exhibits superior performance compared to existing 2SWLS methods, particularly when dealing with moderate levels of noise in DOA measurements.

Submitted 29 January, 2024; originally announced January 2024.

arXiv:2401.15105 [pdf, other]

Diffusion Enhancement for Cloud Removal in Ultra-Resolution Remote Sensing Imagery

Authors: Jialu Sui, Yiyang Ma, Wenhan Yang, Xiaokang Zhang, Man-On Pun, Jiaying Liu

Abstract: The presence of cloud layers severely compromises the quality and effectiveness of optical remote sensing (RS) images. However, existing deep-learning (DL)-based Cloud Removal (CR) techniques encounter difficulties in accurately reconstructing the original visual authenticity and detailed semantic content of the images. To tackle this challenge, this work proposes to encompass enhancements at the data and methodology fronts. On the data side, an ultra-resolution benchmark named CUHK Cloud Removal (CUHK-CR) of 0.5m spatial resolution is established. This benchmark incorporates rich detailed textures and diverse cloud coverage, serving as a robust foundation for designing and assessing CR models. From the methodology perspective, a novel diffusion-based framework for CR called Diffusion Enhancement (DE) is proposed to perform progressive texture detail recovery, which mitigates the training difficulty with improved inference accuracy. Additionally, a Weight Allocation (WA) network is developed to dynamically adjust the weights for feature fusion, thereby further improving performance, particularly in the context of ultra-resolution image generation. Furthermore, a coarse-to-fine training strategy is applied to effectively expedite training convergence while reducing the computational complexity required to handle ultra-resolution images. Extensive experiments on the newly established CUHK-CR and existing datasets such as RICE confirm that the proposed DE framework outperforms existing DL-based methods in terms of both perceptual quality and signal fidelity.

Submitted 25 January, 2024; originally announced January 2024.

arXiv:2401.14978 [pdf, other]

Robust Dual-Modal Speech Keyword Spotting for XR Headsets

Authors: Zhuojiang Cai, Yuhan Ma, Feng Lu

Abstract: While speech interaction finds widespread utility within the Extended Reality (XR) domain, conventional vocal speech keyword spotting systems continue to grapple with formidable challenges, including suboptimal performance in noisy environments, impracticality in situations requiring silence, and susceptibility to inadvertent activations when others speak nearby. These challenges, however, can potentially be surmounted through the cost-effective fusion of voice and lip movement information. Consequently, we propose a novel vocal-echoic dual-modal keyword spotting system designed for XR headsets. We devise two different modal fusion approaches and conduct experiments to test the system's performance across diverse scenarios. The results show that our dual-modal system not only consistently outperforms its single-modal counterparts, demonstrating higher precision in both typical and noisy environments, but also excels in accurately identifying silent utterances. Furthermore, we have successfully applied the system in real-time demonstrations, achieving promising results. The code is available at https://github.com/caizhuojiang/VE-KWS.

Submitted 26 January, 2024; originally announced January 2024.

Comments: Accepted to IEEE VR 2024

arXiv:2401.10392 [pdf, other]

Deep learning and random light structuring ensure robust free-space communications

Authors: Xiaofei Li, Yu Wang, Xin Liu, Yuan Ma, Yangjian Cai, Sergey A. Ponomarenko, Xianlong Liu

Abstract: Having shown early promise, free-space optical communications (FSO) face formidable challenges in the age of information explosion. The ever-growing demand for greater channel communication capacity is one of the challenges. The inter-channel crosstalk, which severely degrades the quality of transmitted information, creates another roadblock in the way of efficient FSO implementation. Here we advance theoretically and realize experimentally a potentially high-capacity FSO protocol that enables high-fidelity transfer of an image, or set of images through a complex environment. In our protocol, we complement random light structuring at the transmitter with a deep learning image classification platform at the receiver. Multiplexing novel, independent, mutually orthogonal degrees of freedom available to structured random light can potentially significantly boost the channel communication capacity of our protocol without introducing any deleterious crosstalk. Specifically, we show how one can multiplex the degrees of freedom associated with the source coherence radius and a spatial position of a beamlet within an array of structured random beams to greatly enhance the capacity of our communication link. The superb resilience of structured random light to environmental noise, as well as extreme efficiency of deep learning networks at classifying images guarantees high-fidelity image transfer within the framework of our protocol.

Submitted 18 January, 2024; originally announced January 2024.

Comments: 18 pages, 13 figures

Showing 1–50 of 274 results for author: Ma, Y