-
Reconfigurable Intelligent Surface Aided Vehicular Edge Computing: Joint Phase-shift Optimization and Multi-User Power Allocation
Authors:
Kangwei Qi,
Qiong Wu,
Pingyi Fan,
Nan Cheng,
Wen Chen,
Khaled B. Letaief
Abstract:
Vehicular edge computing (VEC) is an emerging technology with significant potential in the field of internet of vehicles (IoV), enabling vehicles to perform intensive computational tasks locally or offload them to nearby edge devices. However, the quality of communication links may be severely deteriorated by obstacles such as buildings, impeding the offloading process. To address this challenge, we introduce the use of Reconfigurable Intelligent Surfaces (RIS), which provide alternative communication pathways to assist vehicular communication. By dynamically adjusting the phase-shift of the RIS, the performance of VEC systems can be substantially improved. In this work, we consider a RIS-assisted VEC system, and design an optimal scheme for local execution power, offloading power, and RIS phase-shift, where random task arrivals and channel variations are taken into account. To solve this problem, we propose an innovative deep reinforcement learning (DRL) framework that combines the Deep Deterministic Policy Gradient (DDPG) algorithm for optimizing RIS phase-shift coefficients and the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm for optimizing the power allocation of vehicle users (VUs). Simulation results show that our proposed scheme outperforms the traditional centralized DDPG, Twin Delayed Deep Deterministic Policy Gradient (TD3) and some typical stochastic schemes.
Submitted 17 July, 2024;
originally announced July 2024.
-
Resource Allocation for Twin Maintenance and Computing Task Processing in Digital Twin Vehicular Edge Computing Network
Authors:
Yu Xie,
Qiong Wu,
Pingyi Fan,
Nan Cheng,
Wen Chen,
Jiangzhou Wang,
Khaled B. Letaief
Abstract:
As a promising technology, vehicular edge computing (VEC) can provide computing and caching services by deploying VEC servers near vehicles. However, VEC networks still face challenges such as high vehicle mobility. Digital twin (DT), an emerging technology, can predict, estimate, and analyze real-time states by digitally modeling objects in the physical world. By integrating DT with VEC, a virtual vehicle DT can be created in the VEC server to monitor the real-time operating status of vehicles. However, maintaining the vehicle DT model requires ongoing attention from the VEC server, which also needs to offer computing services for the vehicles. Therefore, effective allocation and scheduling of VEC server resources are crucial. This study focuses on a general VEC network with a single VEC service and multiple vehicles, examining the two types of delays caused by twin maintenance and computational processing within the network. By transforming the problem using satisfaction functions, we propose an optimization problem aimed at maximizing each vehicle's resource utility to determine the optimal resource allocation strategy. Given the non-convex nature of the problem, we employ multi-agent Markov decision processes to reformulate it. Subsequently, we propose the twin maintenance and computing task processing resource collaborative scheduling (MADRL-CSTC) algorithm, which leverages multi-agent deep reinforcement learning. Through experimental comparisons with alternative algorithms, we demonstrate that our proposed approach is effective in terms of resource allocation.
Submitted 10 July, 2024;
originally announced July 2024.
-
Enhancing Robustness and Security in ISAC Network Design: Leveraging Transmissive Reconfigurable Intelligent Surface with RSMA
Authors:
Ziwei Liu,
Wen Chen,
Qingqing Wu,
Zhendong Li,
Xusheng Zhu,
Qiong Wu,
Nan Cheng
Abstract:
In this paper, we propose a novel transmissive reconfigurable intelligent surface transceiver-enhanced robust and secure integrated sensing and communication network. A time-division sensing communication mechanism is designed for the scenario, which enables communication and sensing to share wireless resources. To address the interference management problem and hinder eavesdropping, we implement rate-splitting multiple access (RSMA), where the common stream is designed as both a useful signal and an artificial noise, while taking into account the imperfect channel state information, modeling the channel for the illegal users in a fine-grained manner, and giving an upper bound on the error. We introduce the secrecy outage probability and construct an optimization problem with the secrecy sum-rate as the objective function to optimize the common stream beamforming matrix, the private stream beamforming matrix and the timeslot duration variable. Due to the coupling of the optimization variables and the infinity of the error set, the proposed problem is a nonconvex optimization problem that cannot be solved directly. To address these challenges, a block coordinate descent-based second-order cone programming algorithm is used to decouple the optimization variables and solve the problem. Specifically, the problem is decoupled into two subproblems concerning the common stream beamforming matrix, the private stream beamforming matrix, and the timeslot duration variable, which are solved by alternating optimization until convergence is reached. Within these subproblems, the S-procedure, Bernstein's inequality and successive convex approximation are employed to deal with the objective function and non-convex constraints. Numerical simulation results verify the superiority of the proposed scheme in improving the secrecy energy efficiency and the Cramér-Rao bound.
Submitted 9 July, 2024;
originally announced July 2024.
-
Graph Neural Networks and Deep Reinforcement Learning Based Resource Allocation for V2X Communications
Authors:
Maoxin Ji,
Qiong Wu,
Pingyi Fan,
Nan Cheng,
Wen Chen,
Jiangzhou Wang,
Khaled B. Letaief
Abstract:
In the rapidly evolving landscape of Internet of Vehicles (IoV) technology, Cellular Vehicle-to-Everything (C-V2X) communication has attracted much attention due to its superior performance in coverage, latency, and throughput. Resource allocation within C-V2X is crucial for ensuring the transmission of safety information and meeting the stringent requirements for ultra-low latency and high reliability in Vehicle-to-Vehicle (V2V) communication. This paper proposes a method that integrates Graph Neural Networks (GNN) with Deep Reinforcement Learning (DRL) to address this challenge. By constructing a dynamic graph with communication links as nodes and employing the Graph Sample and Aggregation (GraphSAGE) model to adapt to changes in graph structure, the model aims to ensure a high success rate for V2V communication while minimizing interference on Vehicle-to-Infrastructure (V2I) links, thereby ensuring the successful transmission of V2V link information and maintaining high transmission rates for V2I links. The proposed method retains the global feature learning capabilities of GNN and supports distributed network deployment, allowing vehicles to extract low-dimensional features that include structural information from the graph network based on local observations and to make independent resource allocation decisions. Simulation results indicate that the introduction of GNN, with a modest increase in computational load, effectively enhances the decision-making quality of agents, demonstrating superiority to other methods. This study not only provides a theoretically efficient resource allocation strategy for V2V and V2I communications but also paves a new technical path for resource management in practical IoV environments.
Submitted 8 July, 2024;
originally announced July 2024.
-
Disciplined Geodesically Convex Programming
Authors:
Andrew Cheng,
Vaibhav Dixit,
Melanie Weber
Abstract:
Convex programming plays a fundamental role in machine learning, data science, and engineering. Testing convexity structure in nonlinear programs relies on verifying the convexity of objectives and constraints. Grant et al. (2006) introduced a framework, Disciplined Convex Programming (DCP), for automating this verification task for a wide range of convex functions that can be decomposed into basic convex functions (atoms) using convexity-preserving compositions and transformations (rules). However, the restriction to Euclidean convexity concepts can limit the applicability of the framework. For instance, many notable instances of statistical estimators and matrix-valued (sub)routines in machine learning applications are Euclidean non-convex, but exhibit geodesic convexity through a more general Riemannian lens. In this work, we extend disciplined programming to this setting by introducing Disciplined Geodesically Convex Programming (DGCP). We determine convexity-preserving compositions and transformations for geodesically convex functions on general Cartan-Hadamard manifolds, as well as for the special case of symmetric positive definite matrices, a common setting in matrix-valued optimization. For the latter, we also define a basic set of atoms. Our paper is accompanied by a Julia package SymbolicAnalysis.jl, which provides functionality for testing and certifying DGCP-compliant expressions. Our library interfaces with manifold optimization software, which allows for directly solving verified geodesically convex programs.
Submitted 7 July, 2024;
originally announced July 2024.
-
Reliable Projection Based Unsupervised Learning for Semi-Definite QCQP with Application of Beamforming Optimization
Authors:
Xiucheng Wang,
Qi Qiu,
Nan Cheng
Abstract:
In this paper, we investigate a special class of quadratic-constrained quadratic programming (QCQP) with semi-definite constraints. Since such a problem is non-convex and NP-hard, the neural network (NN) is regarded as a promising method to obtain a high-performing solution. However, due to the inherent prediction error, it is challenging to ensure that every solution output by the NN is feasible. Although some existing works propose naive remedies, they only focus on reducing the constraint violation probability, so not all solutions are guaranteed to be feasible. To deal with this challenge, a computationally efficient and reliable projection is proposed in this paper, which ensures that all solutions output by the NN are feasible. Moreover, unsupervised learning is used, so the NN can be trained effectively and efficiently without labels. Theoretically, the solution of the NN after projection is proven to be feasible, and we also prove that the projection method can enhance the convergence performance and speed of the NN. To evaluate our proposed method, the quality of service (QoS)-contained beamforming scenario is studied, where the simulation results show that the proposed method achieves high performance that is competitive with the lower bound.
Submitted 9 July, 2024; v1 submitted 4 July, 2024;
originally announced July 2024.
-
Optimizing Age of Information in Vehicular Edge Computing with Federated Graph Neural Network Multi-Agent Reinforcement Learning
Authors:
Wenhua Wang,
Qiong Wu,
Pingyi Fan,
Nan Cheng,
Wen Chen,
Jiangzhou Wang,
Khaled B. Letaief
Abstract:
With the rapid development of intelligent vehicles and Intelligent Transport Systems (ITS), sensors such as cameras and LiDAR installed on intelligent vehicles provide higher capacity for executing computation-intensive and delay-sensitive tasks, thereby raising deployment costs. To address this issue, Vehicular Edge Computing (VEC) has been proposed to process data through Road Side Units (RSUs) to support real-time applications. This paper focuses on the Age of Information (AoI) as a key metric for data freshness and explores task offloading issues for vehicles under RSU communication resource constraints. We adopt a Multi-agent Deep Reinforcement Learning (MADRL) approach, allowing vehicles to autonomously make optimal data offloading decisions. However, MADRL poses risks of vehicle information leakage during communication learning and centralized training. To mitigate this, we employ a Federated Learning (FL) framework that shares model parameters instead of raw data to protect the privacy of vehicle users. Building on this, we propose an innovative distributed federated learning framework combining Graph Neural Networks (GNN), named Federated Graph Neural Network Multi-Agent Reinforcement Learning (FGNN-MADRL), to optimize AoI across the system. For the first time, road scenarios are constructed as graph data structures, and a GNN-based federated learning framework is proposed, effectively combining distributed and centralized federated aggregation. Furthermore, we propose a new MADRL algorithm that simplifies decision making and enhances offloading efficiency, further reducing the decision complexity. Simulation results demonstrate the superiority of our proposed approach over other methods.
Submitted 1 July, 2024;
originally announced July 2024.
-
Trapezoidal Gradient Descent for Effective Reinforcement Learning in Spiking Networks
Authors:
Yuhao Pan,
Xiucheng Wang,
Nan Cheng,
Qi Qiu
Abstract:
With the rapid development of artificial intelligence technology, the field of reinforcement learning has continuously achieved breakthroughs in both theory and practice. However, traditional reinforcement learning algorithms often entail high energy consumption during interactions with the environment. Spiking Neural Networks (SNNs), with their low energy consumption characteristics and performance comparable to deep neural networks, have garnered widespread attention. To reduce the energy consumption of practical applications of reinforcement learning, researchers have successively proposed the Pop-SAN and MDC-SAN algorithms. Nonetheless, these algorithms use rectangular functions to approximate the spike network during the training process, resulting in low sensitivity and leaving room for improvement in the training effectiveness of SNNs. Based on this, we propose a trapezoidal approximation gradient method to replace the spike network, which not only preserves the original stable learning state but also enhances the model's adaptability and response sensitivity under various signal dynamics. Simulation results show that the improved algorithm, using the trapezoidal approximation gradient to replace the spike network, achieves better convergence speed and performance compared to the original algorithm and demonstrates good training stability.
Submitted 19 June, 2024;
originally announced June 2024.
-
Constructing and Evaluating Digital Twins: An Intelligent Framework for DT Development
Authors:
Longfei Ma,
Nan Cheng,
Xiucheng Wang,
Jiong Chen,
Yinjun Gao,
Dongxiao Zhang,
Jun-Jie Zhang
Abstract:
The development of Digital Twins (DTs) represents a transformative advance for simulating and optimizing complex systems in a controlled digital space. Despite their potential, the challenge of constructing DTs that accurately replicate and predict the dynamics of real-world systems remains substantial. This paper introduces an intelligent framework for the construction and evaluation of DTs, specifically designed to enhance the accuracy and utility of DTs in testing algorithmic performance. We propose a novel construction methodology that integrates deep learning-based policy gradient techniques to dynamically tune the DT parameters, ensuring high fidelity in the digital replication of physical systems. Moreover, the Mean STate Error (MSTE) is proposed as a robust metric for evaluating the performance of algorithms within these digital spaces. The efficacy of our framework is demonstrated through extensive simulations that show our DT not only accurately mirrors the physical reality but also provides a reliable platform for algorithm evaluation. This work lays a foundation for future research into DT technologies, highlighting pathways for both theoretical enhancements and practical implementations in various industries.
Submitted 18 June, 2024;
originally announced June 2024.
-
PFID: Privacy First Inference Delegation Framework for LLMs
Authors:
Haoyan Yang,
Zhitao Li,
Yong Zhang,
Jianzong Wang,
Ning Cheng,
Ming Li,
Jing Xiao
Abstract:
This paper introduces a novel privacy-preservation framework named PFID for LLMs that addresses critical privacy concerns by localizing user data through model sharding and singular value decomposition. When users interact with LLM systems, their prompts could be exposed to eavesdroppers within or outside LLM system providers who are interested in collecting users' input. In this work, we propose a framework to camouflage user input, so as to alleviate privacy issues. Our framework places model shards on the client and the public server, and sends compressed hidden states instead of prompts to and from servers. Clients hold back information that can re-privatize the hidden states, so that overall system performance is comparable to traditional LLM services. Our framework is designed to be communication efficient: computation can be delegated to the local client so that the server's computation burden can be lightened. We conduct extensive experiments on machine translation tasks to verify our framework's performance.
Submitted 17 June, 2024;
originally announced June 2024.
-
Reconfigurable Intelligent Surface Assisted VEC Based on Multi-Agent Reinforcement Learning
Authors:
Kangwei Qi,
Qiong Wu,
Pingyi Fan,
Nan Cheng,
Qiang Fan,
Jiangzhou Wang
Abstract:
Vehicular edge computing (VEC) is an emerging technology that enables vehicles to perform high-intensity tasks by executing tasks locally or offloading them to nearby edge devices. However, obstacles such as buildings may degrade the communications and incur communication interruptions, and thus the vehicle may not meet the requirement for task offloading. Reconfigurable intelligent surfaces (RIS) are introduced to support vehicle communication and provide an alternative communication path. The system performance can be improved by flexibly adjusting the phase-shift of the RIS. For a RIS-assisted VEC system where tasks arrive randomly, we design a control scheme that considers offloading power, local power allocation and phase-shift optimization. To solve this non-convex problem, we propose a new deep reinforcement learning (DRL) framework that employs a modified multi-agent deep deterministic policy gradient (MADDPG) approach to optimize the power allocation for vehicle users (VUs) and a block coordinate descent (BCD) algorithm to optimize the phase-shift of the RIS. Simulation results show that our proposed scheme outperforms the centralized deep deterministic policy gradient (DDPG) scheme and a random scheme.
Submitted 17 June, 2024;
originally announced June 2024.
-
Deep-Reinforcement-Learning-Based AoI-Aware Resource Allocation for RIS-Aided IoV Networks
Authors:
Kangwei Qi,
Qiong Wu,
Pingyi Fan,
Nan Cheng,
Wen Chen,
Jiangzhou Wang,
Khaled B. Letaief
Abstract:
Reconfigurable Intelligent Surface (RIS) is a pivotal technology in communication, offering an alternative path that significantly enhances the link quality in wireless communication environments. In this paper, we propose a RIS-assisted internet of vehicles (IoV) network, considering the vehicle-to-everything (V2X) communication method. In addition, in order to improve the timeliness of vehicle-to-infrastructure (V2I) links and the stability of vehicle-to-vehicle (V2V) links, we introduce the age of information (AoI) model and the payload transmission probability model. Therefore, with the objective of minimizing the AoI of V2I links and prioritizing transmission of V2V link payloads, we formulate this optimization problem as a Markov decision process (MDP) in which the BS serves as an agent to allocate resources and control phase-shift for the vehicles using the soft actor-critic (SAC) algorithm, which gradually converges and maintains high stability. An AoI-aware joint vehicular resource allocation and RIS phase-shift control scheme based on the SAC algorithm is proposed, and simulation results show that its convergence speed, cumulative reward, AoI performance, and payload transmission probability outperform those of proximal policy optimization (PPO), deep deterministic policy gradient (DDPG), twin delayed deep deterministic policy gradient (TD3) and stochastic algorithms.
Submitted 17 June, 2024;
originally announced June 2024.
-
Multiple Intelligent Reflecting Surfaces Collaborative Wireless Localization System
Authors:
Ziheng Zhang,
Wen Chen,
Qingqing Wu,
Zhendong Li,
Xusheng Zhu,
Jingfeng Chen,
Nan Cheng
Abstract:
This paper studies a multiple intelligent reflecting surfaces (IRSs) collaborative localization system where multiple semi-passive IRSs are deployed in the network to locate one or more targets based on time-of-arrival. It is assumed that each semi-passive IRS is equipped with reflective elements and sensors, which are used to establish the line-of-sight links from the base station (BS) to multiple targets and process echo signals, respectively. Based on the above model, we derive the Fisher information matrix of the echo signal with respect to the time delay. By employing the chain rule and exploiting the geometric relationship between time delay and position, the Cramer-Rao bound (CRB) for estimating the target's Cartesian coordinate position is derived. Then, we propose a two-stage algorithmic framework to minimize the CRB in single- and multi-target localization systems by jointly optimizing active beamforming at the BS, passive beamforming at multiple IRSs and IRS selection. For the single-target case, we derive the optimal closed-form solution for the multiple IRSs coefficients design and propose a low-complexity algorithm based on the alternating direction method of multipliers to obtain the optimal solution for the active beamforming design. For the multi-target case, alternating optimization is used to transform the original problem into two subproblems where semi-definite relaxation and successive convex approximation are applied to tackle the quadraticity and indefiniteness in the CRB expression, respectively. Finally, numerical simulation results validate the effectiveness of the proposed algorithm for the multiple IRSs collaborative localization system compared to other benchmark schemes, as well as the significant performance gains.
Submitted 17 June, 2024; v1 submitted 14 June, 2024;
originally announced June 2024.
-
Semantic-Aware Resource Allocation Based on Deep Reinforcement Learning for 5G-V2X HetNets
Authors:
Zhiyu Shao,
Qiong Wu,
Pingyi Fan,
Nan Cheng,
Qiang Fan,
Jiangzhou Wang
Abstract:
This letter proposes a semantic-aware resource allocation (SARA) framework with a flexible duty cycle (DC) coexistence mechanism (SARADC) for 5G-V2X Heterogeneous Networks (HetNets) based on deep reinforcement learning (DRL) proximal policy optimization (PPO). Specifically, we investigate V2X networks within a two-tiered HetNets structure. In response to the needs of high-speed vehicular networking in urban environments, we design a semantic communication system and introduce two resource allocation metrics: high-speed semantic transmission rate (HSR) and semantic spectrum efficiency (HSSE). Our main goal is to maximize HSSE. Additionally, we address the coexistence of vehicular users and WiFi users in 5G New Radio Unlicensed (NR-U) networks. To tackle this complex challenge, we propose a novel approach that jointly optimizes the flexible DC coexistence mechanism and the allocation of resources and base stations (BSs). Unlike traditional bit transmission methods, our approach integrates the semantic communication paradigm into the communication system. Experimental results demonstrate that our proposed solution outperforms traditional bit transmission methods with the traditional DC coexistence mechanism in terms of HSSE and semantic throughput (ST) for both vehicular and WiFi users.
Submitted 12 June, 2024;
originally announced June 2024.
-
Toward Enhanced Reinforcement Learning-Based Resource Management via Digital Twin: Opportunities, Applications, and Challenges
Authors:
Nan Cheng,
Xiucheng Wang,
Zan Li,
Zhisheng Yin,
Tom Luan,
Xuemin Shen
Abstract:
This article presents a digital twin (DT)-enhanced reinforcement learning (RL) framework aimed at optimizing performance and reliability in network resource management, since traditional RL methods face several common challenges when applied to physical networks, including limited exploration efficiency, slow convergence, poor long-term performance, and safety concerns during the exploration phase. To deal with the above challenges, a comprehensive DT-based framework is proposed to enhance the convergence speed and performance for unified RL-based resource management. The proposed framework provides safe action exploration, more accurate estimates of long-term returns, faster training convergence, higher convergence performance, and real-time adaptation to varying network conditions. Then, two case studies on ultra-reliable and low-latency communication (URLLC) services and a multiple unmanned aerial vehicle (UAV) network are presented, demonstrating improvements of the proposed framework in performance, convergence speed, and training cost reduction for both traditional RL and neural network based Deep RL (DRL). Finally, the article identifies and explores some of the research challenges and open issues in this rapidly evolving field.
Submitted 15 June, 2024; v1 submitted 12 June, 2024;
originally announced June 2024.
-
Erasing Radio Frequency Fingerprints via Active Adversarial Perturbation
Authors:
Zhaoyi Lu,
Wenchao Xu,
Ming Tu,
Xin Xie,
Cunqing Hua,
Nan Cheng
Abstract:
Radio Frequency (RF) fingerprinting identifies a wireless device from the uniqueness of its analog circuitry or hardware imperfections. However, unlike the MAC address, which can be modified, such a hardware feature is unavoidable in the signal emitted over the air and can reveal device whereabouts, e.g., a sniffer can use a pre-trained model to identify a nearby device when receiving its signal. Such a fingerprint may expose critical private information, e.g., the associated upper-layer applications or the end-user. In this paper, we propose to erase such RF features for wireless devices, which can prevent fingerprinting through active perturbation from the signal perspective. Specifically, we consider a common RF fingerprinting scenario, where machine learning models are trained from pilot signal data for identification. A novel adversarial attack solution is designed to generate proper perturbations, whereby the perturbed pilot signal can hide the hardware feature and cause the model to misclassify. We theoretically show that the perturbation would not affect the communication function within a tolerable perturbation threshold. We also implement the pilot signal fingerprinting and the proposed perturbation process in a practical LTE system. Extensive experimental results demonstrate that the RF fingerprints can be effectively erased to protect user privacy.
Submitted 12 June, 2024; v1 submitted 11 June, 2024;
originally announced June 2024.
-
Semantic-Aware Spectrum Sharing in Internet of Vehicles Based on Deep Reinforcement Learning
Authors:
Zhiyu Shao,
Qiong Wu,
Pingyi Fan,
Nan Cheng,
Wen Chen,
Jiangzhou Wang,
Khaled B. Letaief
Abstract:
This work aims to investigate semantic communication in high-speed mobile Internet of vehicles (IoV) environments, with a focus on the spectrum sharing between vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communications. We specifically address spectrum scarcity and network traffic and then propose a semantic-aware spectrum sharing algorithm (SSS) based on the deep reinforcement learning (DRL) soft actor-critic (SAC) approach. Firstly, we delve into the extraction of semantic information. Secondly, we redefine metrics for semantic information in V2V and V2I spectrum sharing in IoV environments, introducing high-speed semantic spectrum efficiency (HSSE) and semantic transmission rate (HSR). Finally, we employ the SAC algorithm for decision optimization in V2V and V2I spectrum sharing based on semantic information. This optimization encompasses the optimal link of V2V and V2I sharing strategies, the transmission power for vehicles sending semantic information, and the length of transmitted semantic symbols, aiming at maximizing the HSSE of V2I and enhancing the success rate of effective semantic information transmission (SRS) of V2V. Experimental results demonstrate that the SSS algorithm outperforms other baseline algorithms, including traditional-communication-based spectrum sharing algorithms and spectrum sharing algorithms using other reinforcement learning approaches. The SSS algorithm exhibits a 15% increase in HSSE and approximately a 7% increase in SRS.
Submitted 17 June, 2024; v1 submitted 11 June, 2024;
originally announced June 2024.
-
Touch100k: A Large-Scale Touch-Language-Vision Dataset for Touch-Centric Multimodal Representation
Authors:
Ning Cheng,
Changhao Guan,
Jing Gao,
Weihao Wang,
You Li,
Fandong Meng,
Jie Zhou,
Bin Fang,
Jinan Xu,
Wenjuan Han
Abstract:
Touch holds a pivotal position in enhancing the perceptual and interactive capabilities of both humans and robots. Despite its significance, current tactile research mainly focuses on visual and tactile modalities, overlooking the language domain. Inspired by this, we construct Touch100k, a paired touch-language-vision dataset at the scale of 100k, featuring tactile sensation descriptions in multiple granularities (i.e., sentence-level natural expressions with rich semantics, including contextual and dynamic relationships, and phrase-level descriptions capturing the key features of tactile sensations). Based on the dataset, we propose a pre-training method, Touch-Language-Vision Representation Learning through Curriculum Linking (TLV-Link, for short), inspired by the concept of curriculum learning. TLV-Link aims to learn a tactile representation for the GelSight sensor and capture the relationship between tactile, language, and visual modalities. We evaluate our representation's performance across two task categories (namely, material property identification and robot grasping prediction), focusing on tactile representation and zero-shot touch understanding. The experimental evaluation showcases the effectiveness of our representation. By enabling TLV-Link to achieve substantial improvements and establish a new state-of-the-art in touch-centric multimodal representation learning, Touch100k demonstrates its value as a resource for research. Project page: https://cocacola-lab.github.io/Touch100k/.
Submitted 6 June, 2024;
originally announced June 2024.
-
Movable Antenna Empowered Downlink NOMA Systems: Power Allocation and Antenna Position Optimization
Authors:
Yufeng Zhou,
Wen Chen,
Qingqing Wu,
Xusheng Zhu,
Nan Cheng
Abstract:
This paper investigates a novel communication paradigm employing movable antennas (MAs) within a multiple-input single-output (MISO) non-orthogonal multiple access (NOMA) downlink framework, where users are equipped with MAs. Initially, leveraging the far-field response, we delineate the channel characteristics concerning both the power allocation coefficient and positions of MAs. Subsequently, we endeavor to maximize the channel capacity by jointly optimizing power allocation and antenna positions. To tackle the resultant non-convex problem, we propose an alternating optimization (AO) scheme underpinned by successive convex approximation (SCA) to converge towards a stationary point. Through numerical simulations, our findings substantiate the superiority of the MA-assisted NOMA system over both orthogonal multiple access (OMA) and conventional NOMA configurations in terms of average sum rate and outage probability.
Submitted 28 May, 2024;
originally announced May 2024.
-
Enhancing Emotion Recognition in Conversation through Emotional Cross-Modal Fusion and Inter-class Contrastive Learning
Authors:
Haoxiang Shi,
Xulong Zhang,
Ning Cheng,
Yong Zhang,
Jun Yu,
Jing Xiao,
Jianzong Wang
Abstract:
The purpose of emotion recognition in conversation (ERC) is to identify the emotion category of an utterance based on contextual information. Previous ERC methods relied on simple connections for cross-modal fusion and ignored the information differences between modalities, resulting in the model being unable to focus on modality-specific emotional information. At the same time, the shared information between modalities was not processed to generate emotions, leading to an information redundancy problem. To overcome these limitations, we propose a cross-modal fusion emotion prediction network based on vector connections. The network mainly includes two stages: the multi-modal feature fusion stage based on connection vectors and the emotion classification stage based on fused features. Furthermore, we design a supervised inter-class contrastive learning module based on emotion labels. Experimental results confirm the effectiveness of the proposed method, demonstrating excellent performance on the IEMOCAP and MELD datasets.
Submitted 28 May, 2024;
originally announced May 2024.
-
RREH: Reconstruction Relations Embedded Hashing for Semi-Paired Cross-Modal Retrieval
Authors:
Jianzong Wang,
Haoxiang Shi,
Kaiyi Luo,
Xulong Zhang,
Ning Cheng,
Jing Xiao
Abstract:
Known for efficient computation and easy storage, hashing has been extensively explored in cross-modal retrieval. The majority of current hashing models are predicated on the premise of a direct one-to-one mapping between data points. However, in real practice, data correspondence across modalities may be only partially provided. In this research, we introduce an innovative unsupervised hashing technique designed for semi-paired cross-modal retrieval tasks, named Reconstruction Relations Embedded Hashing (RREH). RREH assumes that multi-modal data share a common subspace. For paired data, RREH explores the latent consistent information of heterogeneous modalities by seeking a shared representation. For unpaired data, to effectively capture the latent discriminative features, the high-order relationships between unpaired data and anchors are embedded into the latent subspace, which are computed by efficient linear reconstruction. The anchors are sampled from paired data, which improves the efficiency of hash learning. RREH trains the underlying features and the binary encodings in a unified framework with high-order reconstruction relations preserved. With the well-devised objective function and discrete optimization algorithm, RREH is designed to be scalable, making it suitable for large-scale datasets and facilitating efficient cross-modal retrieval. In the evaluation process, the proposed method is tested with partially paired data to establish its superiority over several existing methods.
Submitted 27 May, 2024;
originally announced May 2024.
-
RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis
Authors:
Haoxiang Shi,
Jianzong Wang,
Xulong Zhang,
Ning Cheng,
Jun Yu,
Jing Xiao
Abstract:
Although current Text-To-Speech (TTS) models are able to generate high-quality speech samples, there are still challenges in developing emotion intensity controllable TTS. Most existing TTS models achieve emotion intensity control by extracting intensity information from reference speeches. Unfortunately, limited by the lack of modeling for intra-class emotion intensity and the model's information decoupling capability, the generated speech cannot achieve fine-grained emotion intensity control and suffers from information leakage issues. In this paper, we propose an emotion transfer TTS model, which defines a remapping-based sorting method to model intra-class relative intensity information, combined with Mutual Information (MI) to decouple speaker and emotion information, and synthesizes expressive speeches with perceptible intensity differences. Experiments show that our model achieves fine-grained emotion control while preserving speaker information.
Submitted 27 May, 2024;
originally announced May 2024.
-
Transformer in Touch: A Survey
Authors:
Jing Gao,
Ning Cheng,
Bin Fang,
Wenjuan Han
Abstract:
The Transformer model, initially achieving significant success in the field of natural language processing, has recently shown great potential in the application of tactile perception. This review aims to comprehensively outline the application and development of Transformers in tactile technology. We first introduce the two fundamental concepts behind the success of the Transformer: the self-attention mechanism and large-scale pre-training. Then, we delve into the application of Transformers in various tactile tasks, including but not limited to object recognition, cross-modal generation, and object manipulation, offering a concise summary of the core methodologies, performance benchmarks, and design highlights. Finally, we suggest potential areas for further research and future work, aiming to generate more interest within the community, tackle existing challenges, and encourage the use of Transformer models in the tactile field.
Submitted 21 May, 2024;
originally announced May 2024.
-
Potential and Limitations of LLMs in Capturing Structured Semantics: A Case Study on SRL
Authors:
Ning Cheng,
Zhaohui Yan,
Ziming Wang,
Zhijie Li,
Jiaming Yu,
Zilong Zheng,
Kewei Tu,
Jinan Xu,
Wenjuan Han
Abstract:
Large Language Models (LLMs) play a crucial role in capturing structured semantics to enhance language understanding, improve interpretability, and reduce bias. Nevertheless, an ongoing controversy exists over the extent to which LLMs can grasp structured semantics. To assess this, we propose using Semantic Role Labeling (SRL) as a fundamental task to explore LLMs' ability to extract structured semantics. In our assessment, we employ the prompting approach, which leads to the creation of our few-shot SRL parser, called PromptSRL. PromptSRL enables LLMs to map natural languages to explicit semantic structures, which provides an interpretable window into the properties of LLMs. We find interesting potential: LLMs can indeed capture semantic structures, and scaling-up doesn't always mirror potential. Additionally, limitations of LLMs are observed in C-arguments, etc. Lastly, we are surprised to discover that the errors made by LLMs and by untrained humans overlap significantly, with the overlap accounting for almost 30% of all errors.
Submitted 10 May, 2024;
originally announced May 2024.
-
MAIN-VC: Lightweight Speech Representation Disentanglement for One-shot Voice Conversion
Authors:
Pengcheng Li,
Jianzong Wang,
Xulong Zhang,
Yong Zhang,
Jing Xiao,
Ning Cheng
Abstract:
One-shot voice conversion aims to change the timbre of any source speech to match that of an unseen target speaker with only one speech sample. Existing methods struggle to achieve satisfactory speech representation disentanglement and suffer from sizable networks, as some of them leverage numerous complex modules for disentanglement. In this paper, we propose a model named MAIN-VC to disentangle effectively via a concise neural network. The proposed model utilizes Siamese encoders to learn clean representations, further enhanced by the designed mutual information estimator. The Siamese structure and the newly designed convolution module contribute to the lightweight design of our model while ensuring performance in diverse voice conversion tasks. The experimental results show that the proposed model achieves comparable subjective scores and exhibits improvements in objective metrics compared to existing methods in a one-shot voice conversion scenario.
Submitted 1 May, 2024;
originally announced May 2024.
-
Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation
Authors:
Yimin Deng,
Jianzong Wang,
Xulong Zhang,
Ning Cheng,
Jing Xiao
Abstract:
Voice conversion is the task of transforming the voice characteristics of source speech while preserving content information. Nowadays, self-supervised representation learning models are increasingly utilized in content extraction. However, in these representations, a lot of hidden speaker information leads to timbre leakage, while the prosodic information of hidden units is underused. To address these issues, we propose a novel framework for expressive voice conversion called "SAVC" based on soft speech units from HuBert-soft. Taking soft speech units as input, we design an attribute encoder to extract content and prosody features respectively. Specifically, we first introduce statistic perturbation imposed by adversarial style augmentation to eliminate speaker information. Then the prosody is implicitly modeled on soft speech units with knowledge distillation. Experimental results show that the intelligibility and naturalness of converted speech outperform previous work.
Submitted 1 May, 2024;
originally announced May 2024.
-
QLSC: A Query Latent Semantic Calibrator for Robust Extractive Question Answering
Authors:
Sheng Ouyang,
Jianzong Wang,
Yong Zhang,
Zhitao Li,
Ziqi Liang,
Xulong Zhang,
Ning Cheng,
Jing Xiao
Abstract:
Extractive Question Answering (EQA) in Machine Reading Comprehension (MRC) often faces the challenge of dealing with semantically identical but format-variant inputs. Our work introduces a novel approach, called the "Query Latent Semantic Calibrator (QLSC)", designed as an auxiliary module for existing MRC models. We propose a unique scaling strategy to capture latent semantic center features of queries. These features are then seamlessly integrated into traditional query and passage embeddings using an attention mechanism. By deepening the comprehension of the semantic query-passage relationship, our approach diminishes sensitivity to variations in text format and boosts the model's capability in pinpointing accurate answers. Experimental results on robust Question-Answer datasets confirm that our approach effectively handles format-variant but semantically identical queries, highlighting the effectiveness and adaptability of our proposed method.
Submitted 30 April, 2024;
originally announced April 2024.
-
EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization
Authors:
Jianzong Wang,
Ziqi Liang,
Xulong Zhang,
Ning Cheng,
Jing Xiao
Abstract:
In recent years, Transformer networks have shown remarkable performance in speech recognition tasks. However, their deployment poses challenges due to high computational and storage resource requirements. To address this issue, a lightweight model called EfficientASR is proposed in this paper, aiming to enhance the versatility of Transformer models. EfficientASR employs two primary modules: Shared Residual Multi-Head Attention (SRMHA) and Chunk-Level Feedforward Networks (CFFN). The SRMHA module effectively reduces redundant computations in the network, while the CFFN module captures spatial knowledge and reduces the number of parameters. The effectiveness of the EfficientASR model is validated on two public datasets, namely Aishell-1 and HKUST. Experimental results demonstrate a 36% reduction in parameters compared to the baseline Transformer network, along with improvements of 0.3% and 0.2% in Character Error Rate (CER) on the Aishell-1 and HKUST datasets, respectively.
Submitted 29 April, 2024;
originally announced April 2024.
-
EAD-VC: Enhancing Speech Auto-Disentanglement for Voice Conversion with IFUB Estimator and Joint Text-Guided Consistent Learning
Authors:
Ziqi Liang,
Jianzong Wang,
Xulong Zhang,
Yong Zhang,
Ning Cheng,
Jing Xiao
Abstract:
Using unsupervised learning to disentangle speech into content, rhythm, pitch, and timbre for voice conversion has become a hot research topic. Existing works generally disentangle speech components through human-crafted bottleneck features, which cannot achieve sufficient information disentanglement, and pitch and rhythm may still be mixed together. There is a risk of information overlap in the disentangling process, which results in less natural speech. To overcome such limits, we propose a two-stage model to disentangle speech representations in a self-supervised manner without a human-crafted bottleneck design, which uses the Mutual Information (MI) with the designed upper bound estimator (IFUB) to separate overlapping information between speech components. Moreover, we design a Joint Text-Guided Consistent (TGC) module to guide the extraction of speech content and eliminate timbre leakage issues. Experiments show that our model can achieve a better performance than the baseline, regarding disentanglement effectiveness, speech naturalness, and similarity. Audio samples can be found at https://largeaudiomodel.com/eadvc.
Submitted 29 April, 2024;
originally announced April 2024.
-
CONTUNER: Singing Voice Beautifying with Pitch and Expressiveness Condition
Authors:
Jianzong Wang,
Pengcheng Li,
Xulong Zhang,
Ning Cheng,
Jing Xiao
Abstract:
Singing voice beautifying is a novel task that has application value in people's daily life, aiming to correct the pitch of the singing voice and improve the expressiveness without changing the original timbre and content. Existing methods rely on paired data or only concentrate on the correction of pitch. However, professional songs and amateur songs from the same person are hard to obtain, and singing voice beautifying involves not only pitch correction but also other aspects like emotion and rhythm. To this end, we propose a fast and high-fidelity singing voice beautifying system called ConTuner, a diffusion model combined with a modified condition to generate the beautified Mel-spectrogram, where the modified condition is composed of optimized pitch and expressiveness. For pitch correction, we establish a mapping relationship from MIDI and spectrum envelope to pitch. To make amateur singing more expressive, we propose an expressiveness enhancer in the latent space to convert amateur vocal tone to professional. ConTuner achieves a satisfactory beautification effect on both Mandarin and English songs. An ablation study demonstrates that the expressiveness enhancer and the generator-based acceleration method in ConTuner are effective.
Submitted 29 April, 2024;
originally announced April 2024.
-
From Local to Global: A Graph RAG Approach to Query-Focused Summarization
Authors:
Darren Edge,
Ha Trinh,
Newman Cheng,
Joshua Bradley,
Alex Chao,
Apurva Mody,
Steven Truitt,
Jonathan Larson
Abstract:
The use of retrieval-augmented generation (RAG) to retrieve relevant information from an external knowledge source enables large language models (LLMs) to answer questions over private and/or previously unseen document collections. However, RAG fails on global questions directed at an entire text corpus, such as "What are the main themes in the dataset?", since this is inherently a query-focused summarization (QFS) task, rather than an explicit retrieval task. Prior QFS methods, meanwhile, fail to scale to the quantities of text indexed by typical RAG systems. To combine the strengths of these contrasting methods, we propose a Graph RAG approach to question answering over private text corpora that scales with both the generality of user questions and the quantity of source text to be indexed. Our approach uses an LLM to build a graph-based text index in two stages: first to derive an entity knowledge graph from the source documents, then to pregenerate community summaries for all groups of closely-related entities. Given a question, each community summary is used to generate a partial response, before all partial responses are again summarized in a final response to the user. For a class of global sensemaking questions over datasets in the 1 million token range, we show that Graph RAG leads to substantial improvements over a naïve RAG baseline for both the comprehensiveness and diversity of generated answers. An open-source, Python-based implementation of both global and local Graph RAG approaches is forthcoming at https://aka.ms/graphrag.
Submitted 24 April, 2024;
originally announced April 2024.
-
Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset
Authors:
Ning Cheng,
You Li,
Jing Gao,
Bin Fang,
Jinan Xu,
Wenjuan Han
Abstract:
Tactility provides crucial support and enhancement for the perception and interaction capabilities of both humans and robots. Nevertheless, the multimodal research related to touch primarily focuses on visual and tactile modalities, with limited exploration in the domain of language. Beyond vocabulary, sentence-level descriptions contain richer semantics. Based on this, we construct a touch-language-vision dataset named TLV (Touch-Language-Vision) by human-machine cascade collaboration, featuring sentence-level descriptions for multimode alignment. The new dataset is used to fine-tune our proposed lightweight training framework, STLV-Align (Synergistic Touch-Language-Vision Alignment), achieving effective semantic alignment with minimal parameter adjustments (1%). Project Page: https://xiaoen0.github.io/touch.page/.
Submitted 17 June, 2024; v1 submitted 14 March, 2024;
originally announced March 2024.
-
Medical Speech Symptoms Classification via Disentangled Representation
Authors:
Jianzong Wang,
Pengcheng Li,
Xulong Zhang,
Ning Cheng,
Jing Xiao
Abstract:
Intent is defined for understanding spoken language in existing works. Both textual features and acoustic features involved in medical speech contain intent, which is important for symptomatic diagnosis. In this paper, we propose a medical speech classification model named DRSC that automatically learns to disentangle intent and content representations from textual-acoustic data for classification. The intent representations of the text domain and the Mel-spectrogram domain are extracted via intent encoders, and then the reconstructed text feature and the Mel-spectrogram feature are obtained through two exchanges. After combining the intent from two domains into a joint representation, the integrated intent representation is fed into a decision layer for classification. Experimental results show that our model obtains an average accuracy rate of 95% in detecting 25 different medical symptoms.
Submitted 29 April, 2024; v1 submitted 7 March, 2024;
originally announced March 2024.
-
3D Object Visibility Prediction in Autonomous Driving
Authors:
Chuanyu Luo,
Nuo Cheng,
Ren Zhong,
Haipeng Jiang,
Wenyu Chen,
Aoli Wang,
Pu Li
Abstract:
With the rapid advancement of hardware and software technologies, research in autonomous driving has seen significant growth. The prevailing framework for multi-sensor autonomous driving encompasses sensor installation, perception, path planning, decision-making, and motion control. At the perception phase, a common approach involves utilizing neural networks to infer 3D bounding box (Bbox) attributes from raw sensor data, including classification, size, and orientation. In this paper, we present a novel attribute and its corresponding algorithm: 3D object visibility. By incorporating multi-task learning, the introduction of this attribute, visibility, negligibly affects the model's effectiveness and efficiency. Our proposal of this attribute and its computational strategy aims to expand the capabilities for downstream tasks, thereby enhancing the safety and reliability of real-time autonomous driving in real-world scenarios.
Submitted 6 March, 2024;
originally announced March 2024.
-
Structural Knowledge-Driven Meta-Learning for Task Offloading in Vehicular Networks with Integrated Communications, Sensing and Computing
Authors:
Ruijin Sun,
Yao Wen,
Nan Cheng,
Wei Wan,
Rong Chai,
Yilong Hui
Abstract:
Task offloading is a potential solution to satisfy the strict requirements of computation-intensive and latency-sensitive vehicular applications due to the limited onboard computing resources. However, the overwhelming upload traffic may lead to unacceptable uploading time. To tackle this issue, for tasks taking environmental data as input, the data perceived by roadside units (RSUs) equipped with several sensors can be directly exploited for computation, resulting in a novel task offloading paradigm with integrated communications, sensing and computing (I-CSC). With this paradigm, vehicles can choose to upload their sensed data to RSUs or transmit computing instructions to RSUs during the offloading. By optimizing the computation mode and network resources, in this paper, we investigate an I-CSC-based task offloading problem to reduce the cost caused by resource consumption while guaranteeing the latency of each task. Although this non-convex problem can be handled by the alternating minimization (AM) algorithm, which alternately minimizes the four divided sub-problems, it leads to high computational complexity and a locally optimal solution. To tackle this challenge, we propose a creative structural knowledge-driven meta-learning (SKDML) method, involving both the model-based AM algorithm and neural networks. Specifically, borrowing the iterative structure of the AM algorithm, also referred to as structural knowledge, the proposed SKDML adopts long short-term memory (LSTM) network-based meta-learning to learn an adaptive optimizer for updating variables in each sub-problem, instead of the handcrafted counterpart in the AM algorithm.
Submitted 24 February, 2024;
originally announced February 2024.
-
GS-EMA: Integrating Gradient Surgery Exponential Moving Average with Boundary-Aware Contrastive Learning for Enhanced Domain Generalization in Aneurysm Segmentation
Authors:
Fengming Lin,
Yan Xia,
Michael MacRaild,
Yash Deo,
Haoran Dou,
Qiongyao Liu,
Nina Cheng,
Nishant Ravikumar,
Alejandro F. Frangi
Abstract:
The automated segmentation of cerebral aneurysms is pivotal for accurate diagnosis and treatment planning. Confronted with significant domain shifts and class imbalance in 3D Rotational Angiography (3DRA) data from various medical institutions, the task becomes challenging. These shifts include differences in image appearance, intensity distribution, resolution, and aneurysm size, all of which complicate the segmentation process. To tackle these issues, we propose a novel domain generalization strategy that employs a gradient surgery exponential moving average (GS-EMA) optimization technique coupled with boundary-aware contrastive learning (BACL). Our approach is distinct in its ability to adapt to new, unseen domains by learning domain-invariant features, thereby improving the robustness and accuracy of aneurysm segmentation across diverse clinical datasets. The results demonstrate that our proposed approach can extract more domain-invariant features, minimizing over-segmentation and capturing more complete aneurysm structures.
Submitted 23 February, 2024;
originally announced February 2024.
-
Variational Entropy Search for Adjusting Expected Improvement
Authors:
Nuojin Cheng,
Stephen Becker
Abstract:
Bayesian optimization is a widely used technique for optimizing black-box functions, with Expected Improvement (EI) being the most commonly utilized acquisition function in this domain. While EI is often viewed as distinct from other information-theoretic acquisition functions, such as entropy search (ES) and max-value entropy search (MES), our work reveals that EI can be considered a special case of MES when approached through variational inference (VI). In this context, we have developed the Variational Entropy Search (VES) methodology and the VES-Gamma algorithm, which adapts EI by incorporating principles from information-theoretic concepts. The efficacy of VES-Gamma is demonstrated across a variety of test functions and real datasets, highlighting its theoretical and practical utilities in Bayesian optimization scenarios.
Submitted 17 February, 2024;
originally announced February 2024.
-
SGS-SLAM: Semantic Gaussian Splatting For Neural Dense SLAM
Authors:
Mingrui Li,
Shuhong Liu,
Heng Zhou,
Guohao Zhu,
Na Cheng,
Tianchen Deng,
Hongyu Wang
Abstract:
We present SGS-SLAM, the first semantic visual SLAM system based on Gaussian Splatting. It incorporates appearance, geometry, and semantic features through multi-channel optimization, addressing the oversmoothing limitations of neural implicit SLAM systems in high-quality rendering, scene understanding, and object-level geometry. We introduce a unique semantic feature loss that effectively compensates for the shortcomings of traditional depth and color losses in object optimization. Through a semantic-guided keyframe selection strategy, we prevent erroneous reconstructions caused by cumulative errors. Extensive experiments demonstrate that SGS-SLAM delivers state-of-the-art performance in camera pose estimation, map reconstruction, precise semantic segmentation, and object-level geometric accuracy, while ensuring real-time rendering capabilities.
Submitted 26 March, 2024; v1 submitted 5 February, 2024;
originally announced February 2024.
-
Knowledge-Driven Deep Learning Paradigms for Wireless Network Optimization in 6G
Authors:
Ruijin Sun,
Nan Cheng,
Changle Li,
Fangjiong Chen,
Wen Chen
Abstract:
In sixth-generation (6G) networks, newly emerging diversified services of massive users in dynamic network environments are required to be satisfied by multi-dimensional heterogeneous resources. The resulting large-scale complicated network optimization problems are beyond the capability of model-based theoretical methods due to the overwhelming computational complexity and the long processing time. Although it offers fast online inference and universal approximation ability, data-driven deep learning (DL) relies heavily on abundant training data and lacks interpretability. To address these issues, a new paradigm called knowledge-driven DL has emerged, aiming to integrate proven domain knowledge into the construction of neural networks, thereby exploiting the strengths of both methods. This article provides a systematic review of knowledge-driven DL in wireless networks. Specifically, a holistic framework of knowledge-driven DL in wireless networks is proposed, where knowledge sources, knowledge representation, knowledge integration and knowledge application form a closed loop. Then, a detailed taxonomy of knowledge integration approaches, including knowledge-assisted, knowledge-fused, and knowledge-embedded DL, is presented. Several open issues for future research are also discussed. The insights offered in this article provide a basic principle for the design of network optimization that incorporates communication-specific domain knowledge and DL, facilitating the realization of intelligent 6G networks.
Submitted 15 January, 2024;
originally announced February 2024.
-
Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning
Authors:
Ming Li,
Yong Zhang,
Shwai He,
Zhitao Li,
Hongyu Zhao,
Jianzong Wang,
Ning Cheng,
Tianyi Zhou
Abstract:
Instruction tuning is critical to improve LLMs but usually suffers from low-quality and redundant data. Data filtering for instruction tuning has proved important in improving both the efficiency and performance of the tuning process. But it also leads to extra cost and computation due to the involvement of LLMs in this process. To reduce the filtering cost, we study Superfiltering: Can we use a smaller and weaker model to select data for finetuning a larger and stronger model? Despite the performance gap between weak and strong language models, we find their highly consistent capability to perceive instruction difficulty and data selection results. This enables us to use a much smaller and more efficient model to filter the instruction data used to train a larger language model. Not only does it largely speed up the data filtering, but the filtered-data-finetuned LLM achieves even better performance on standard benchmarks. Extensive experiments validate the efficacy and efficiency of our approach.
Submitted 7 June, 2024; v1 submitted 1 February, 2024;
originally announced February 2024.
-
Leveraging Biases in Large Language Models: "bias-kNN" for Effective Few-Shot Learning
Authors:
Yong Zhang,
Hanzhang Li,
Zhitao Li,
Ning Cheng,
Ming Li,
Jing Xiao,
Jianzong Wang
Abstract:
Large Language Models (LLMs) have shown significant promise in various applications, including zero-shot and few-shot learning. However, their performance can be hampered by inherent biases. Instead of traditionally sought methods that aim to minimize or correct these biases, this study introduces a novel methodology named "bias-kNN". This approach capitalizes on the biased outputs, harnessing them as primary features for kNN and supplementing with gold labels. Our comprehensive evaluations, spanning diverse domain text classification datasets and different GPT-2 model sizes, indicate the adaptability and efficacy of the "bias-kNN" method. Remarkably, this approach not only outperforms conventional in-context learning in few-shot scenarios but also demonstrates robustness across a spectrum of samples, templates and verbalizers. This study, therefore, presents a unique perspective on harnessing biases, transforming them into assets for enhanced model performance.
Submitted 18 January, 2024;
originally announced January 2024.
-
ED-TTS: Multi-Scale Emotion Modeling using Cross-Domain Emotion Diarization for Emotional Speech Synthesis
Authors:
Haobin Tang,
Xulong Zhang,
Ning Cheng,
Jing Xiao,
Jianzong Wang
Abstract:
Existing emotional speech synthesis methods often utilize an utterance-level style embedding extracted from reference audio, neglecting the inherent multi-scale property of speech prosody. We introduce ED-TTS, a multi-scale emotional speech synthesis model that leverages Speech Emotion Diarization (SED) and Speech Emotion Recognition (SER) to model emotions at different levels. Specifically, our proposed approach integrates the utterance-level emotion embedding extracted by SER with fine-grained frame-level emotion embedding obtained from SED. These embeddings are used to condition the reverse process of the denoising diffusion probabilistic model (DDPM). Additionally, we employ cross-domain SED to accurately predict soft labels, addressing the challenge of a scarcity of fine-grained emotion-annotated datasets for supervising emotional TTS training.
Submitted 16 January, 2024;
originally announced January 2024.
-
Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval
Authors:
Yimin Deng,
Huaizhen Tang,
Xulong Zhang,
Ning Cheng,
Jing Xiao,
Jianzong Wang
Abstract:
Voice conversion refers to transferring speaker identity with well-preserved content. Better disentanglement of speech representations leads to better voice conversion. Recent studies have found that phonetic information from input audio has the potential to represent content well. Besides, speaker-style modeling with pre-trained models makes the process more complex. To tackle these issues, we introduce a new method named "CTVC" which utilizes disentangled speech representations with contrastive learning and time-invariant retrieval. Specifically, a similarity-based compression module is used to facilitate a closer connection between the frame-level hidden features and linguistic information at the phoneme level. Additionally, a time-invariant retrieval is proposed for timbre extraction based on multiple segmentations and mutual information. Experimental results demonstrate that "CTVC" outperforms previous studies and improves the sound quality and similarity of converted results.
Submitted 17 January, 2024; v1 submitted 15 January, 2024;
originally announced January 2024.
-
EmoTalker: Emotionally Editable Talking Face Generation via Diffusion Model
Authors:
Bingyuan Zhang,
Xulong Zhang,
Ning Cheng,
Jun Yu,
Jing Xiao,
Jianzong Wang
Abstract:
In recent years, the field of talking face generation has attracted considerable attention, with certain methods adept at generating virtual faces that convincingly imitate human expressions. However, existing methods face challenges related to limited generalization, particularly when dealing with challenging identities. Furthermore, methods for editing expressions are often confined to a single emotion, failing to adapt to intricate emotions. To overcome these challenges, this paper proposes EmoTalker, an emotionally editable portrait animation approach based on the diffusion model. EmoTalker modifies the denoising process to ensure preservation of the original portrait's identity during inference. To enhance emotion comprehension from text input, an Emotion Intensity Block is introduced to analyze fine-grained emotions and strengths derived from prompts. Additionally, a crafted dataset is harnessed to enhance emotion comprehension within prompts. Experiments show the effectiveness of EmoTalker in generating high-quality, emotionally customizable facial expressions.
Submitted 15 January, 2024;
originally announced January 2024.
-
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Authors:
Evan Hubinger,
Carson Denison,
Jesse Mu,
Mike Lambert,
Meg Tong,
Monte MacDiarmid,
Tamera Lanham,
Daniel M. Ziegler,
Tim Maxwell,
Newton Cheng,
Adam Jermyn,
Amanda Askell,
Ansh Radhakrishnan,
Cem Anil,
David Duvenaud,
Deep Ganguli,
Fazl Barez,
Jack Clark,
Kamal Ndousse,
Kshitij Sachan,
Michael Sellitto,
Mrinank Sharma,
Nova DasSarma,
Roger Grosse,
Shauna Kravec
, et al. (14 additional authors not shown)
Abstract:
Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.
Submitted 17 January, 2024; v1 submitted 10 January, 2024;
originally announced January 2024.
-
CP-EB: Talking Face Generation with Controllable Pose and Eye Blinking Embedding
Authors:
Jianzong Wang,
Yimin Deng,
Ziqi Liang,
Xulong Zhang,
Ning Cheng,
Jing Xiao
Abstract:
This paper proposes a talking face generation method named "CP-EB" that takes an audio signal as input and a person image as reference, to synthesize a photo-realistic video of a person talking, with head poses controlled by a short video clip and a proper eye blinking embedding. Both head pose and eye blinking are important aspects for deep fake detection. Implicit control of poses by video has already been achieved by state-of-the-art work. According to recent research, eye blinking has a weak correlation with input audio, which means that extracting eye blinks from audio and generating them are possible. Hence, we propose a GAN-based architecture to extract eye blink features from the input audio and the reference video respectively and employ contrastive training between them, then embed them into the concatenated features of identity and poses to generate talking face images. Experimental results show that the proposed method can generate photo-realistic talking faces with synchronized lip motion, natural head poses and blinking eyes.
Submitted 14 November, 2023;
originally announced November 2023.
-
CLN-VC: Text-Free Voice Conversion Based on Fine-Grained Style Control and Contrastive Learning with Negative Samples Augmentation
Authors:
Yimin Deng,
Xulong Zhang,
Jianzong Wang,
Ning Cheng,
Jing Xiao
Abstract:
Better disentanglement of speech representation is essential to improve the quality of voice conversion. Recently, contrastive learning has been applied successfully to voice conversion based on speaker labels. However, model performance degrades in conversion between similar speakers. Hence, we propose an augmented negative sample selection to address the issue. Specifically, we create hard negative samples based on the proposed speaker fusion module to improve the learning ability of the speaker encoder. Furthermore, considering the fine-grained modeling of speaker style, we employ a reference encoder to extract fine-grained style and conduct the augmented contrastive learning on global style. The experimental results show that the proposed method outperforms previous work in voice conversion tasks.
Submitted 14 November, 2023;
originally announced November 2023.
-
DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation
Authors:
Jianzong Wang,
Pengcheng Li,
Xulong Zhang,
Ning Cheng,
Jing Xiao
Abstract:
Most existing neural-based text-to-speech methods rely on extensive datasets and face challenges under low-resource conditions. In this paper, we introduce a novel semi-supervised text-to-speech synthesis model that learns from both paired and unpaired data to address this challenge. The key component of the proposed model is a dynamic quantized representation module, which is integrated into a sequential autoencoder. When given paired data, the module incorporates a trainable codebook that learns quantized representations under the supervision of the paired data. However, because paired data is limited in low-resource scenarios, it can hardly cover all phonemes. Unpaired data is then fed to expand the dynamic codebook by adding quantized representation vectors that are sufficiently distant from the existing ones during training. Experiments show that with less than 120 minutes of paired data, the proposed method outperforms existing methods in both subjective and objective metrics.
Submitted 2 February, 2024; v1 submitted 14 November, 2023;
originally announced November 2023.
-
Digital Twin-based 3D Map Management for Edge-assisted Device Pose Tracking in Mobile AR
Authors:
Conghao Zhou,
Jie Gao,
Mushu Li,
Nan Cheng,
Xuemin Shen,
Weihua Zhuang
Abstract:
Edge-device collaboration has the potential to facilitate compute-intensive device pose tracking for resource-constrained mobile augmented reality (MAR) devices. In this paper, we devise a 3D map management scheme for edge-assisted MAR, wherein an edge server constructs and updates a 3D map of the physical environment by using the camera frames uploaded from an MAR device, to support local device pose tracking. Our objective is to minimize the uncertainty of device pose tracking by periodically selecting a proper set of uploaded camera frames and updating the 3D map. To cope with the dynamics of the uplink data rate and the user's pose, we formulate a Bayes-adaptive Markov decision process problem and propose a digital twin (DT)-based approach to solve the problem. First, a DT is designed as a data model to capture the time-varying uplink data rate, thereby supporting 3D map management. Second, utilizing extensive generated data provided by the DT, a model-based reinforcement learning algorithm is developed to manage the 3D map while adapting to these dynamics. Numerical results demonstrate that the designed DT outperforms Markov models in accurately capturing the time-varying uplink data rate, and our devised DT-based 3D map management scheme surpasses benchmark schemes in reducing device pose tracking uncertainty.
Submitted 29 January, 2024; v1 submitted 8 November, 2023;
originally announced November 2023.
-
Antenna Positioning and Beamforming Design for Fluid-Antenna Enabled Multi-user Downlink Communications
Authors:
Haoran Qin,
Wen Chen,
Zhendong Li,
Qingqing Wu,
Nan Cheng,
Fangjiong Chen
Abstract:
This paper investigates a multiple-input single-output (MISO) downlink communication system in which users are equipped with fluid antennas (FAs). First, we adopt a field-response based channel model to characterize the downlink channel with respect to the FAs' positions. Then, we aim to minimize the total transmit power by jointly optimizing the FAs' positions and the beamforming matrix. To solve the resulting non-convex problem, we employ an alternating optimization (AO) algorithm based on the penalty method and successive convex approximation (SCA) to obtain a sub-optimal solution. Numerical results demonstrate that the FA-assisted communication system performs better than a conventional fixed-position antenna system.
Submitted 13 January, 2024; v1 submitted 6 November, 2023;
originally announced November 2023.