Foundation Models for Music: A Survey

Authors: Yinghao Ma, Anders Øland, Anton Ragni, Bleiz MacSen Del Sette, Charalampos Saitis, Chris Donahue, Chenghua Lin, Christos Plachouras, Emmanouil Benetos, Elio Quinton, Elona Shatri, Fabio Morreale, Ge Zhang, György Fazekas, Gus Xia, Huan Zhang, Ilaria Manco, Jiawen Huang, Julien Guinot, Liwei Lin, Luca Marinelli, Max W. Y. Lam, Megha Sharma, Qiuqiang Kong, Roger B. Dannenberg, et al. (18 additional authors not shown)

Abstract: In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music, spanning representation learning, generative learning and multimodal learning. We first contextualise the significance of music in various industries and trace the evolution of AI in music. By delineating the modalities targeted by foundation models, we discover that many music representations are underexplored in FM development. Then, emphasis is placed on the lack of versatility of previous methods on diverse music applications, along with the potential of FMs in music understanding, generation and medical application. By comprehensively exploring the details of the model pre-training paradigm, architectural choices, tokenisation, finetuning methodologies and controllability, we emphasise the important topics that should have been well explored, such as instruction tuning and in-context learning, scaling law and emergent ability, and long-sequence modelling. A dedicated section presents insights into music agents, accompanied by a thorough analysis of datasets and evaluations essential for pre-training and downstream tasks. Finally, by underscoring the vital importance of ethical considerations, we advocate that future research on FMs for music should focus more on issues such as interpretability, transparency, human responsibility, and copyright. The paper offers insights into future challenges and trends on FMs for music, aiming to shape the trajectory of human-AI collaboration in the music realm.

Submitted 27 August, 2024; v1 submitted 26 August, 2024; originally announced August 2024.

arXiv:2408.07099 [pdf]

Bearing Fault Diagnosis using Graph Sampling and Aggregation Network

Authors: Jiaying Chen, Xusheng Du, Yurong Qian, Gwanggil Jeon

Abstract: Bearing fault diagnosis technology has a wide range of practical applications in industrial production, energy and other fields. Timely and accurate detection of bearing faults plays an important role in preventing catastrophic accidents and ensuring product quality. Traditional signal analysis techniques and deep learning-based fault detection algorithms do not take into account the intricate correlation between signals, making it difficult to further improve detection accuracy. To address this problem, we introduce the Graph Sampling and Aggregation (GraphSAGE) network and propose the GraphSAGE-based Bearing fault Diagnosis (GSABFD) algorithm. The original vibration signal is first sliced through a fixed-size non-overlapping sliding window, and the sliced data is feature-transformed using signal analysis methods; then correlations are constructed for the transformed vibration signal and further transformed into vertices in the graph; then the GraphSAGE network is used for training; finally, the fault level of the object is calculated in the output layer of the network. The proposed algorithm is compared with five advanced algorithms on a real-world public dataset, and the results show that the GSABFD algorithm improves the AUC value by 5% compared with the next best algorithm.

Submitted 12 August, 2024; originally announced August 2024.
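
The first step described in the abstract above (slicing a vibration signal with a fixed-size, non-overlapping window and computing a per-slice feature) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, and root-mean-square amplitude is an assumed stand-in for the unspecified "signal analysis methods".

```python
from typing import List, Sequence


def slice_non_overlapping(signal: Sequence[float], window: int) -> List[List[float]]:
    """Cut a 1-D signal into consecutive, non-overlapping windows.

    Trailing samples that do not fill a whole window are dropped.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    n_full = len(signal) // window
    return [list(signal[i * window:(i + 1) * window]) for i in range(n_full)]


def rms(segment: Sequence[float]) -> float:
    """Root-mean-square amplitude of one slice (assumed example feature)."""
    return (sum(x * x for x in segment) / len(segment)) ** 0.5


def segment_features(signal: Sequence[float], window: int) -> List[float]:
    """One feature value per slice; each slice would later become a graph vertex."""
    return [rms(seg) for seg in slice_non_overlapping(signal, window)]
```

For example, `slice_non_overlapping([1, 2, 3, 4, 5, 6, 7], 3)` yields two slices and discards the final sample.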

arXiv:2407.18481 [pdf, ps, other]

Finite-time and bumpless transfer control of asynchronously switched systems: An output feedback control approach

Authors: Mo-Ran Liu, Zhen Wu, Xian Du, Zhongyang Fei

Abstract: In this paper, finite-time control and bumpless transfer control are investigated for switched systems under asynchronous switching. First, a class of dynamic output feedback controllers is designed to stabilize the switched system with measurable system outputs. To improve transient performance, bumpless transfer control and finite-time control are further studied in the controller design. To avoid control bumps, a practical filter is introduced to make the control signal smoother and continuous. Furthermore, to derive a finite-time bounded system state over short time intervals, finite-time analysis is considered in managing the switching process with the average dwell time. New criteria are proposed to analyze the finite-time stability and finite-time boundedness of the closed-loop system, and solvable conditions are newly proposed to optimize the controller gain. Finally, the advantages of the proposed method are validated through an application to a boost converter.

Submitted 25 July, 2024; originally announced July 2024.

arXiv:2407.00297 [pdf]

UADSN: Uncertainty-Aware Dual-Stream Network for Facial Nerve Segmentation

Authors: Guanghao Zhu, Lin Liu, Jing Zhang, Xiaohui Du, Ruqian Hao, Juanxiu Liu

Abstract: Facial nerve segmentation is crucial for preoperative path planning in cochlear implantation surgery. Recently, researchers have proposed some segmentation methods, such as atlas-based and deep learning-based methods. However, since the facial nerve is a tubular organ with a diameter of only 1.0-1.5 mm, it is challenging to locate and segment the facial nerve in CT scans. In this work, we propose an uncertainty-aware dual-stream network (UADSN). UADSN consists of a 2D segmentation stream and a 3D segmentation stream. Predictions from the two streams are used to identify uncertain regions, and a consistency loss is employed to supervise the segmentation of these regions. In addition, we introduce channel squeeze & spatial excitation modules into the skip connections of U-shaped networks to extract meaningful spatial information. To account for topology preservation, a clDice loss is introduced into the supervised loss function. Experimental results on the facial nerve dataset demonstrate the effectiveness of UADSN and our submodules.

Submitted 28 June, 2024; originally announced July 2024.

arXiv:2406.19649 [pdf]

AstMatch: Adversarial Self-training Consistency Framework for Semi-Supervised Medical Image Segmentation

Authors: Guanghao Zhu, Jing Zhang, Juanxiu Liu, Xiaohui Du, Ruqian Hao, Yong Liu, Lin Liu

Abstract: Semi-supervised learning (SSL) has shown considerable potential in medical image segmentation, primarily leveraging consistency regularization and pseudo-labeling. However, many SSL approaches only pay attention to low-level consistency and overlook the significance of pseudo-label reliability. Therefore, in this work, we propose an adversarial self-training consistency framework (AstMatch). First, we design an adversarial consistency regularization (ACR) approach to enhance knowledge transfer and strengthen prediction consistency under varying perturbation intensities. Second, we apply a feature matching loss for adversarial training to incorporate high-level consistency regularization. Additionally, we present the pyramid channel attention (PCA) and efficient channel and spatial attention (ECSA) modules to improve the discriminator's performance. Finally, we propose an adaptive self-training (AST) approach to ensure the pseudo-labels' quality. The proposed AstMatch has been extensively evaluated against cutting-edge SSL methods on three publicly available datasets. The experimental results under different labeled ratios indicate that AstMatch outperforms other existing methods, achieving new state-of-the-art performance. Our code will be available at https://github.com/GuanghaoZhu663/AstMatch.

Submitted 28 June, 2024; originally announced June 2024.

arXiv:2406.15222 [pdf]

Rapid and Accurate Diagnosis of Acute Aortic Syndrome using Non-contrast CT: A Large-scale, Retrospective, Multi-center and AI-based Study

Authors: Yujian Hu, Yilang Xiang, Yan-Jie Zhou, Yangyan He, Shifeng Yang, Xiaolong Du, Chunlan Den, Youyao Xu, Gaofeng Wang, Zhengyao Ding, Jingyong Huang, Wenjun Zhao, Xuejun Wu, Donglin Li, Qianqian Zhu, Zhenjiang Li, Chenyang Qiu, Ziheng Wu, Yunjun He, Chen Tian, Yihui Qiu, Zuodong Lin, Xiaolong Zhang, Yuan He, Zhenpeng Yuan, et al. (15 additional authors not shown)

Abstract: Chest pain symptoms are highly prevalent in emergency departments (EDs), where acute aortic syndrome (AAS) is a catastrophic cardiovascular emergency with a high fatality rate, especially when timely and accurate treatment is not administered. However, current triage practices in the ED can cause up to approximately half of patients with AAS to have an initially missed diagnosis or be misdiagnosed as having other acute chest pain conditions. Subsequently, these AAS patients will undergo clinically inaccurate or suboptimal differential diagnosis. Fortunately, even under these suboptimal protocols, nearly all these patients underwent non-contrast CT covering the aorta anatomy at the early stage of differential diagnosis. In this study, we developed an artificial intelligence model (DeepAAS) using non-contrast CT, which is highly accurate for identifying AAS and provides interpretable results to assist in clinical decision-making. Performance was assessed in two major phases: a multi-center retrospective study (n = 20,750) and an exploration in real-world emergency scenarios (n = 137,525). In the multi-center cohort, DeepAAS achieved a mean area under the receiver operating characteristic curve of 0.958 (95% CI 0.950-0.967). In the real-world cohort, DeepAAS detected 109 AAS patients with misguided initial suspicion, achieving 92.6% (95% CI 76.2%-97.5%) in mean sensitivity and 99.2% (95% CI 99.1%-99.3%) in mean specificity. Our AI model performed well on non-contrast CT at all applicable early stages of differential diagnosis workflows, effectively reduced the overall missed diagnosis and misdiagnosis rate from 48.8% to 4.8%, and shortened the diagnosis time for patients with misguided initial suspicion from an average of 681.8 (74-11,820) mins to 68.5 (23-195) mins. DeepAAS could effectively fill the gap in the current clinical workflow without requiring additional tests.

Submitted 16 July, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.04354 [pdf, other]

QiandaoEar22: A high quality noise dataset for identifying specific ship from multiple underwater acoustic targets using ship-radiated noise

Authors: Xiaoyang Du, Feng Hong

Abstract: Target identification of ship-radiated noise is a crucial area in underwater target recognition. However, there is currently a lack of multi-target ship datasets that accurately represent real-world underwater acoustic conditions. To tackle this issue, we conducted experimental data acquisition, resulting in the release of QiandaoEar22, a comprehensive underwater acoustic multi-target dataset. This dataset encompasses 9 hours and 28 minutes of real-world ship-radiated noise data and 21 hours and 58 minutes of background noise data. To demonstrate the availability of QiandaoEar22, we executed two experimental tasks. The first task focuses on assessing the presence of ship-radiated noise, while the second task involves identifying specific ships within the recognized targets in the multi-ship mixed data. In the latter task, we extracted eight features from the data and employed six deep learning networks for classification, aiming to evaluate and compare the performance of various features and networks. The experimental results reveal that ship-radiated noise can be successfully identified from background noise in over 99% of cases. Additionally, for the specific identification of individual ships, the optimal recognition accuracy reaches 99.56%. Finally, based on our findings, we provide advice on selecting appropriate features and deep learning networks, which may offer valuable insights for related research. Our work not only establishes a benchmark for algorithm evaluation but also inspires the development of innovative methods to enhance UATD and UATR systems.

Submitted 15 May, 2024; originally announced June 2024.

arXiv:2406.04353 [pdf, other]

Introducing the Brand New QiandaoEar22 Dataset for Specific Ship Identification Using Ship-Radiated Noise

Authors: Xiaoyang Du, Feng Hong

Abstract: Target identification of ship-radiated noise is a crucial area in underwater target recognition. However, there is currently a lack of multi-target ship datasets that accurately represent real-world underwater acoustic conditions. To tackle this issue, we release QiandaoEar22, an underwater acoustic multi-target dataset, which can be downloaded from https://ieee-dataport.org/documents/qiandaoear22. This dataset encompasses 9 hours and 28 minutes of real-world ship-radiated noise data and 21 hours and 58 minutes of background noise data. We demonstrate the availability of QiandaoEar22 by conducting an experiment on identifying a specific ship among multiple targets. Taking different features as the input and six deep learning networks as classifiers, we evaluate the baseline performance of different methods. The experimental results reveal that identifying the specific target of UUV from others can achieve an optimal recognition accuracy of 97.78%, and we find that using spectrum and MFCC as feature inputs and DenseNet as the classifier can achieve better recognition performance. Our work not only establishes a benchmark for the dataset but also helps the further development of innovative methods for the tasks of underwater acoustic target detection (UATD) and underwater acoustic target recognition (UATR).

Submitted 15 May, 2024; originally announced June 2024.

arXiv:2405.03254 [pdf]

Automatic Assessment of Dysarthria Using Audio-visual Vowel Graph Attention Network

Authors: Xiaokang Liu, Xiaoxia Du, Juan Liu, Rongfeng Su, Manwa Lawrence Ng, Yumei Zhang, Yudong Yang, Shaofeng Zhao, Lan Wang, Nan Yan

Abstract: Automatic assessment of dysarthria remains a highly challenging task due to high variability in acoustic signals and limited data. Currently, research on the automatic assessment of dysarthria primarily focuses on two approaches: one that utilizes expert features combined with machine learning, and the other that employs data-driven deep learning methods to extract representations. Research has demonstrated that expert features are effective in representing pathological characteristics, while deep learning methods excel at uncovering latent features. Therefore, integrating the advantages of expert features and deep learning to construct a neural network architecture based on expert knowledge may be beneficial for interpretability and assessment performance. In this context, the present paper proposes a vowel graph attention network based on audio-visual information, which effectively integrates the strengths of expert knowledge and deep learning. First, various features were combined as inputs, including knowledge-based acoustical features and deep learning-based pre-trained representations. Second, the graph network structure based on vowel space theory was designed, allowing for a deep exploration of spatial correlations among vowels. Finally, visual information was incorporated into the model to further enhance its robustness and generalizability. The method exhibited superior performance in regression experiments targeting Frenchay scores compared to existing approaches.

Submitted 6 May, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

Comments: 10 pages, 7 figures, 7 tables

arXiv:2404.13598 [pdf, other]

An Integrated Communication and Computing Scheme for Wi-Fi Networks based on Generative AI and Reinforcement Learning

Authors: Xinyang Du, Xuming Fang

Abstract: The continuous evolution of future mobile communication systems is heading towards the integration of communication and computing, with Mobile Edge Computing (MEC) emerging as a crucial means of implementing Artificial Intelligence (AI) computation. MEC could enhance the computational performance of wireless edge networks by offloading computing-intensive tasks to MEC servers. However, in edge computing scenarios, the sparse sample problem may lead to high costs of time-consuming model training. This paper proposes an MEC offloading decision and resource allocation solution that combines generative AI and deep reinforcement learning (DRL) for the communication-computing integration scenario in the 802.11ax Wi-Fi network. Initially, the optimal offloading policy is determined by the joint use of the Generative Diffusion Model (GDM) and the Twin Delayed DDPG (TD3) algorithm. Subsequently, resource allocation is accomplished by using the Hungarian algorithm. Simulation results demonstrate that the introduction of Generative AI significantly reduces model training costs, and the proposed solution exhibits significant reductions in system task processing latency and total energy consumption costs.

Submitted 21 April, 2024; originally announced April 2024.

Comments: This paper has been submitted to GlobeCom 2024 and is currently under review
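
The resource-allocation step mentioned in the abstract above is a minimum-cost assignment problem, which is what the Hungarian algorithm solves. The sketch below is not the Hungarian algorithm and not the authors' code: it is a brute-force search over all permutations that returns the same optimum for small square cost matrices, shown only to make the problem concrete. The cost matrix (for example, latency of task i on resource j) and the function name are hypothetical.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def min_cost_assignment(cost: Sequence[Sequence[float]]) -> Tuple[List[int], float]:
    """Return (assignment, total_cost) for a square cost matrix.

    assignment[i] is the column (resource) given to row (task) i.
    Brute force: O(n!) time, so only suitable for very small n.
    The Hungarian algorithm reaches the same optimum in polynomial time.
    """
    n = len(cost)
    if any(len(row) != n for row in cost):
        raise ValueError("cost matrix must be square")
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return list(best_perm), best_cost
```

With the hypothetical matrix `[[4, 1, 3], [2, 0, 5], [3, 2, 2]]`, the cheapest assignment gives task 0 resource 1, task 1 resource 0, and task 2 resource 2.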

arXiv:2404.10605 [pdf, other]

UAV Trajectory Optimization for Sensing Exploiting Target Location Distribution Map

Authors: Xiangming Du, Shuowen Zhang, Liang Liu

Abstract: In this paper, we study the trajectory optimization of a cellular-connected unmanned aerial vehicle (UAV) which aims to sense the location of a target while maintaining satisfactory communication quality with the ground base stations (GBSs). In contrast to most existing works which assumed the target's location is known, we focus on a more challenging scenario where the exact location of the target to be sensed is unknown and random, while its distribution is known a priori and stored in a novel target location distribution map. Based on this map, the probability for the UAV to successfully sense the target can be expressed as a function of the UAV's trajectory. We aim to optimize the UAV's trajectory between two pre-determined locations to maximize the overall sensing probability during its flight, subject to a GBS-UAV communication quality constraint at each time instant and a maximum mission completion time constraint. Despite the non-convexity and NP-hardness of this problem, we devise three high-quality suboptimal solutions tailored for it with polynomial complexity. Numerical results show that our proposed designs outperform various benchmark schemes.

Submitted 16 April, 2024; originally announced April 2024.

Comments: to appear in IEEE Vehicular Technology Conference (VTC) Spring, 2024

arXiv:2404.06393 [pdf, other]

MuPT: A Generative Symbolic Music Pretrained Transformer

Authors: Xingwei Qu, Yuelin Bai, Yinghao Ma, Ziya Zhou, Ka Man Lo, Jiaheng Liu, Ruibin Yuan, Lejun Min, Xueling Liu, Tianyu Zhang, Xinrun Du, Shuyue Guo, Yiming Liang, Yizhi Li, Shangda Wu, Junting Zhou, Tianyu Zheng, Ziyang Ma, Fengze Han, Wei Xue, Gus Xia, Emmanouil Benetos, Xiang Yue, Chenghua Lin, Xu Tan, et al. (4 additional authors not shown)

Abstract: In this paper, we explore the application of Large Language Models (LLMs) to the pre-training of music. While the prevalent use of MIDI in music modeling is well-established, our findings suggest that LLMs are inherently more compatible with ABC Notation, which aligns more closely with their design and strengths, thereby enhancing the model's performance in musical composition. To address the challenges associated with misaligned measures from different tracks during generation, we propose the development of a Synchronized Multi-Track ABC Notation (SMT-ABC Notation), which aims to preserve coherence across multiple musical tracks. Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set. Furthermore, we explore the implications of the Symbolic Music Scaling Law (SMS Law) on model performance. The results indicate a promising direction for future research in music generation, offering extensive resources for community-led research through our open-source contributions.

Submitted 10 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

arXiv:2403.15446 [pdf]

Shape Sensing for Continuum Robotics using Optoelectronic Sensors with Convex Reflectors

Authors: Dalia Osman, Xinli Du, Timothy Minton, Yohan Noh

Abstract: Three-dimensional shape sensing in soft and continuum robotics is a crucial aspect for stable actuation and control in fields such as minimally invasive surgery, as the estimation of complex curvatures while using continuum robotic tools is required to manipulate through fragile paths. This challenge has been addressed using a range of different sensing techniques, for example, Fibre Bragg grating (FBG) technology, inertial measurement unit (IMU) sensor networks or stretch sensors. Previously, an optics-based method using optoelectronic sensors was explored, offering a simple and cost-effective solution for shape sensing in a flexible tendon-actuated manipulator in two orientations. This was based on proximity-modulated angle estimation and has been the basis for the shape-sensing method addressed in this paper. The improved and miniaturized technique demonstrated in this paper is based on the use of a spherically shaped reflector with optoelectronic sensors integrated into a tendon-actuated robotic manipulator. Upgraded sensing capability is achieved by optimizing the spherical reflector shape in terms of sensor range and resolution, and improved calibration is achieved through the integration of spherical bearings for friction-free motion. Shape estimation is achieved in two orientations upon calibration of sensors, with a maximum Root Mean Square Error (RMS) of 3.37°.

Submitted 17 March, 2024; originally announced March 2024.
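
The error metric quoted in the abstract above, root mean square error between estimated and reference angles, follows the standard definition sketched below. The function name and the sample values are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def rmse(estimated: Sequence[float], reference: Sequence[float]) -> float:
    """Root mean square error between two equal-length sequences (e.g. angles in degrees)."""
    if len(estimated) != len(reference) or len(estimated) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - r) ** 2 for e, r in zip(estimated, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical estimated vs. ground-truth bending angles, in degrees.
error_deg = rmse([10.5, 19.0, 31.0], [10.0, 20.0, 30.0])
```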

arXiv:2402.17785 [pdf, other]

ByteComposer: a Human-like Melody Composition Method based on Language Model Agent

Authors: Xia Liang, Xingjian Du, Jiaju Lin, Pei Zou, Yuan Wan, Bilei Zhu

Abstract: Large Language Models (LLMs) have shown encouraging progress in multimodal understanding and generation tasks. However, how to design a human-aligned and interpretable melody composition system is still under-explored. To solve this problem, we propose ByteComposer, an agent framework emulating a human's creative pipeline in four separate steps: "Conception Analysis - Draft Composition - Self-Evaluation and Modification - Aesthetic Selection". This framework seamlessly blends the interactive and knowledge-understanding features of LLMs with existing symbolic music generation models, thereby achieving a melody composition agent comparable to human creators. We conduct extensive experiments on GPT4 and several open-source large language models, which substantiate our framework's effectiveness. Furthermore, professional music composers were engaged in multi-dimensional evaluations; the final results demonstrated that, across various facets of music composition, the ByteComposer agent attains the level of a novice melody composer.

Submitted 6 March, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

arXiv:2401.17133 [pdf, other]

A Proactive and Dual Prevention Mechanism against Illegal Song Covers empowered by Singing Voice Conversion

Authors: Guangke Chen, Yedi Zhang, Fu Song, Ting Wang, Xiaoning Du, Yang Liu

Abstract: Singing voice conversion (SVC) automates song covers by converting one singer's singing voice into another target singer's singing voice with the original lyrics and melody. However, it raises serious concerns about copyright and civil rights infringements affecting multiple entities. This work proposes SongBsAb, the first proactive approach to mitigate unauthorized SVC-based illegal song covers. SongBsAb introduces human-imperceptible perturbations to singing voices before releasing them, so that when they are used, the generation process of SVC is interfered with, resulting in unexpected singing voices. SongBsAb features a dual prevention effect by causing both (singer) identity disruption and lyric disruption; namely, the SVC-covered singing voice neither imitates the target singer nor preserves the original lyrics. To improve the imperceptibility of perturbations, we refine a psychoacoustic model-based loss with the backing track as an additional masker, a unique accompanying element for singing voices compared to ordinary speech voices. To enhance the transferability, we propose to utilize a frame-level interaction reduction-based loss. We demonstrate the prevention effectiveness, utility, and robustness of SongBsAb on three SVC models and two datasets using both objective and human study-based subjective metrics. Our work fosters an emerging research direction for mitigating illegal automated song covers.

Submitted 30 January, 2024; originally announced January 2024.

arXiv:2401.08136 [pdf, other]

Bias-Compensated State of Charge and State of Health Joint Estimation for Lithium Iron Phosphate Batteries

Authors: Baozhao Yi, Xinhao Du, Jiawei Zhang, Xiaogang Wu, Qiuhao Hu, Weiran Jiang, Xiaosong Hu, Ziyou Song

Abstract: Accurate estimation of the state of charge (SOC) and state of health (SOH) is crucial for the safe and reliable operation of batteries. Voltage measurement bias highly affects state estimation accuracy, especially in Lithium Iron Phosphate (LFP) batteries, which are susceptible due to their flat open-circuit voltage (OCV) curves. This work introduces a bias-compensated algorithm to reliably estimate the SOC and SOH of LFP batteries under the influence of voltage measurement bias. Specifically, SOC and SOH are estimated using the Dual Extended Kalman Filter (DEKF) in the high-slope SOC range, where voltage measurement bias effects are weak. In addition, the voltage measurement biases estimated in the low-slope SOC regions are compensated in the subsequent joint estimation of SOC and SOH to further enhance the state estimation accuracy. Experimental results indicate that the proposed algorithm significantly outperforms the traditional method, which does not consider biases, under different temperatures and aging conditions. Additionally, the bias-compensated algorithm can achieve low estimation errors of below 1.5% for SOC and 2% for SOH, even with a 30 mV voltage measurement bias. Finally, even if the voltage measurement biases change in operation, the proposed algorithm remains robust and keeps the estimated errors of states around 2%.

Submitted 12 March, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

Comments: 9 pages and 8 figures

arXiv:2401.02662 [pdf, other]

GainNet: Coordinates the Odd Couple of Generative AI and 6G Networks

Authors: Ning Chen, Jie Yang, Zhipeng Cheng, Xuwei Fan, Zhang Liu, Bangzhen Huang, Yifeng Zhao, Lianfen Huang, Xiaojiang Du, Mohsen Guizani

Abstract: The rapid expansion of AI-generated content (AIGC) reflects the iteration from assistive AI towards generative AI (GAI) with creativity. Meanwhile, the 6G networks will also evolve from the Internet-of-everything to the Internet-of-intelligence with hybrid heterogeneous network architectures. In the future, the interplay between GAI and the 6G will lead to new opportunities, where GAI can learn the knowledge of personalized data from the massive connected 6G end devices, while GAI's powerful generation ability can provide advanced network solutions for 6G network and provide 6G end devices with various AIGC services. However, they seem to be an odd couple, due to the contradiction of data and resources. To achieve a better-coordinated interplay between GAI and 6G, the GAI-native networks (GainNet), a GAI-oriented collaborative cloud-edge-end intelligence framework, is proposed in this paper. By deeply integrating GAI with 6G network design, GainNet realizes the positive closed-loop knowledge flow and sustainable-evolution GAI model optimization. On this basis, the GAI-oriented generic resource orchestration mechanism with integrated sensing, communication, and computing (GaiRom-ISCC) is proposed to guarantee the efficient operation of GainNet. Two simple case studies demonstrate the effectiveness and robustness of the proposed schemes. Finally, we envision the key challenges and future directions concerning the interplay between GAI models and 6G networks.

Submitted 5 January, 2024; originally announced January 2024.

Comments: 10 pages, 5 figures, 1 table

arXiv:2312.16014 [pdf, other]

Passive Non-Line-of-Sight Imaging with Light Transport Modulation

Authors: Jiarui Zhang, Ruixu Geng, Xiaolong Du, Yan Chen, Houqiang Li, Yang Hu

Abstract: Passive non-line-of-sight (NLOS) imaging has witnessed rapid development in recent years, due to its ability to image objects that are out of sight. The light transport condition plays an important role in this task since changing the conditions will lead to different imaging models. Existing learning-based NLOS methods usually train independent models for different light transport conditions, which is computationally inefficient and impairs the practicality of the models. In this work, we propose NLOS-LTM, a novel passive NLOS imaging method that effectively handles multiple light transport conditions with a single network. We achieve this by inferring a latent light transport representation from the projection image and using this representation to modulate the network that reconstructs the hidden image from the projection image. We train a light transport encoder together with a vector quantizer to obtain the light transport representation. To further regulate this representation, we jointly learn both the reconstruction network and the reprojection network during training. A set of light transport modulation blocks is used to modulate the two jointly trained networks in a multi-scale way. Extensive experiments on a large-scale passive NLOS dataset demonstrate the superiority of the proposed method. The code is available at https://github.com/JerryOctopus/NLOS-LTM.

Submitted 26 March, 2024; v1 submitted 26 December, 2023; originally announced December 2023.

arXiv:2311.04942 [pdf, other]

CSAM: A 2.5D Cross-Slice Attention Module for Anisotropic Volumetric Medical Image Segmentation

Authors: Alex Ling Yu Hung, Haoxin Zheng, Kai Zhao, Xiaoxi Du, Kaifeng Pang, Qi Miao, Steven S. Raman, Demetri Terzopoulos, Kyunghyun Sung

Abstract: A large portion of volumetric medical data, especially magnetic resonance imaging (MRI) data, is anisotropic, as the through-plane resolution is typically much lower than the in-plane resolution. Both 3D and purely 2D deep learning-based segmentation methods are deficient in dealing with such volumetric data since the performance of 3D methods suffers when confronting anisotropic data, and 2D methods disregard crucial volumetric information. Insufficient work has been done on 2.5D methods, in which 2D convolution is mainly used in concert with volumetric information. These models focus on learning the relationship across slices, but typically have many parameters to train. We offer a Cross-Slice Attention Module (CSAM) with minimal trainable parameters, which captures information across all the slices in the volume by applying semantic, positional, and slice attention on deep feature maps at different scales. Our extensive experiments using different network architectures and tasks demonstrate the usefulness and generalizability of CSAM. Associated code is available at https://github.com/aL3x-O-o-Hung/CSAM.

Submitted 26 November, 2023; v1 submitted 7 November, 2023; originally announced November 2023.

arXiv:2311.03815 [pdf, other]

Integrated Sensing, Communication, and Computing for Cost-effective Multimodal Federated Perception

Authors: Ning Chen, Zhipeng Cheng, Xuwei Fan, Bangzhen Huang, Yifeng Zhao, Lianfen Huang, Xiaojiang Du, Mohsen Guizani

Abstract: Federated learning (FL) is a classic paradigm of 6G edge intelligence (EI), which alleviates privacy leaks and high communication pressure caused by traditional centralized data processing in the artificial intelligence of things (AIoT). The implementation of multimodal federated perception (MFP) services involves three sub-processes, including sensing-based multimodal data generation, communication-based model transmission, and computing-based model training, ultimately relying on available underlying multi-domain physical resources such as time, frequency, and computing power. How to reasonably coordinate the multi-domain resources scheduling among sensing, communication, and computing, therefore, is crucial to the MFP networks. To address the above issues, this paper investigates service-oriented resource management with integrated sensing, communication, and computing (ISCC). With the incentive mechanism of the MFP service market, the resources management problem is redefined as a social welfare maximization problem, where the idea of "expanding resources" and "reducing costs" is used to improve learning performance gain and reduce resource costs. Experimental results demonstrate the effectiveness and robustness of the proposed resource scheduling mechanisms.

Submitted 7 November, 2023; originally announced November 2023.

arXiv:2310.13882 [pdf]

NMR Spectra Denoising with Vandermonde Constraints

Authors: Di Guo, Runmin Xu, Jinyu Wu, Meijin Lin, Xiaofeng Du, Xiaobo Qu

Abstract: Nuclear magnetic resonance (NMR) spectroscopy serves as an important tool to analyze chemicals and proteins in bioengineering. However, NMR signals are easily contaminated by noise during the data acquisition, which can affect subsequent quantitative analysis. Therefore, denoising NMR signals has been a long-time concern. In this work, we propose an optimization model-based iterative denoising method, CHORD-V, by treating the time-domain NMR signal as damped exponentials and maintaining the exponential signal form with a Vandermonde factorization. Results on both synthetic and realistic NMR data show that CHORD-V has a superior denoising performance over typical Cadzow and rQRd methods, and the state-of-the-art CHORD method. CHORD-V restores low-intensity spectral peaks more accurately, especially when the noise is relatively high.

Submitted 20 October, 2023; originally announced October 2023.

Comments: 10 pages, 9 figures

arXiv:2310.10159 [pdf, other]

Joint Music and Language Attention Models for Zero-shot Music Tagging

Authors: Xingjian Du, Zhesong Yu, Jiaju Lin, Bilei Zhu, Qiuqiang Kong

Abstract: Music tagging is a task to predict the tags of music recordings. However, previous music tagging research primarily focuses on closed-set music tagging tasks, which cannot be generalized to new tags. In this work, we propose a zero-shot music tagging system modeled by a joint music and language attention (JMLA) model to address the open-set music tagging problem. The JMLA model consists of an audio encoder modeled by a pretrained masked autoencoder and a decoder modeled by Falcon7B. We introduce a perceiver resampler to convert arbitrary-length audio into fixed-length embeddings. We introduce dense attention connections between encoder and decoder layers to improve the information flow between them. We collect a large-scale music and description dataset from the internet. We propose to use ChatGPT to convert the raw descriptions into formalized and diverse descriptions to train the JMLA models. Our proposed JMLA system achieves a zero-shot audio tagging accuracy of 64.82% on the GTZAN dataset, outperforming previous zero-shot systems, and achieves comparable results to previous systems on the FMA and the MagnaTagATune datasets.

Submitted 16 October, 2023; originally announced October 2023.

Comments: Keywords: Music tagging, joint music and language attention models, Music Foundation Model.

arXiv:2308.12770 [pdf, other]

WavMark: Watermarking for Audio Generation

Authors: Guangyu Chen, Yu Wu, Shujie Liu, Tao Liu, Xiaoyong Du, Furu Wei

Abstract: Recent breakthroughs in zero-shot voice synthesis have enabled imitating a speaker's voice using just a few seconds of recording while maintaining a high level of realism. Alongside its potential benefits, this powerful technology introduces notable risks, including voice fraud and speaker impersonation. Unlike the conventional approach of solely relying on passive methods for detecting synthetic data, watermarking presents a proactive and robust defence mechanism against these looming risks. This paper introduces an innovative audio watermarking framework that encodes up to 32 bits of watermark within a mere 1-second audio snippet. The watermark is imperceptible to human senses and exhibits strong resilience against various attacks. It can serve as an effective identifier for synthesized voices and holds potential for broader applications in audio copyright protection. Moreover, this framework boasts high flexibility, allowing for the combination of multiple watermark segments to achieve heightened robustness and expanded capacity. Utilizing 10 to 20-second audio as the host, our approach demonstrates an average Bit Error Rate (BER) of 0.48% across ten common attacks, a remarkable reduction of over 2800% in BER compared to the state-of-the-art watermarking tool. See https://aka.ms/wavmark for demos of our work.

Submitted 7 January, 2024; v1 submitted 24 August, 2023; originally announced August 2023.
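
The WavMark abstract above reports its results as a Bit Error Rate (BER). As a minimal sketch of how that metric is commonly computed (the fraction of payload bits that differ after decoding), and not the authors' evaluation code, the Python below compares an embedded payload with a decoded one. The function name `bit_error_rate` and the example payload are illustrative assumptions.

```python
# Illustrative sketch only: BER as the fraction of mismatched bits.
# The function name and the example payload are assumptions, not taken
# from the WavMark paper or its code.

def bit_error_rate(sent_bits, decoded_bits):
    """Return the fraction of positions where the decoded bit differs."""
    if not sent_bits or len(sent_bits) != len(decoded_bits):
        raise ValueError("bit sequences must be non-empty and equal in length")
    errors = sum(1 for s, d in zip(sent_bits, decoded_bits) if s != d)
    return errors / len(sent_bits)


# Hypothetical 32-bit payload with a single flipped bit after an attack.
payload = [1, 0] * 16
decoded = list(payload)
decoded[3] ^= 1  # flip one bit
ber = bit_error_rate(payload, decoded)  # 1 error out of 32 bits
```

With one wrong bit in a 32-bit payload, `ber` is 1/32, or about 3.1%.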

arXiv:2307.08556 [pdf, other]

Machine-Learning-based Colorectal Tissue Classification via Acoustic Resolution Photoacoustic Microscopy

Authors: Shangqing Tong, Peng Ge, Yanan Jiao, Zhaofu Ma, Ziye Li, Longhai Liu, Feng Gao, Xiaohui Du, Fei Gao

Abstract: Colorectal cancer is a deadly disease that has become increasingly prevalent in recent years. Early detection is crucial for saving lives, but traditional diagnostic methods such as colonoscopy and biopsy have limitations. Colonoscopy cannot provide detailed information within the tissues affected by cancer, while biopsy involves tissue removal, which can be painful and invasive. In order to improve diagnostic efficiency and reduce patient suffering, we studied a machine-learning-based approach for colorectal tissue classification that uses acoustic resolution photoacoustic microscopy (ARPAM). With this tool, we were able to classify benign and malignant tissue using multiple machine learning methods. Our results were analyzed both quantitatively and qualitatively to evaluate the effectiveness of our approach.

Submitted 17 July, 2023; originally announced July 2023.

arXiv:2306.01120 [pdf, other]

Frequency-dependent Switching Control for Disturbance Attenuation of Linear Systems

Authors: Jingjing Zhang, Jan Heiland, Peter Benner, Xin Du

Abstract: The generalized Kalman-Yakubovich-Popov lemma as established by Iwasaki and Hara in 2005 marks a milestone in the analysis and synthesis of linear systems from a finite-frequency perspective. Given a pre-specified frequency band, it allows us to produce passive controllers with excellent in-band disturbance attenuation performance at the expense of some of the out-of-band performance. This paper focuses on control design of linear systems in the presence of disturbances with non-strictly or non-stationary limited frequency spectrum. We first propose a class of frequency-dependent excited energy functions (FD-EEF) as well as frequency-dependent excited power functions (FD-EPF), which possess a desirable frequency-selectiveness property with regard to the in-band and out-of-band excited energy as well as excited power of the system. Based upon a group of frequency-selective passive controllers, we then develop a frequency-dependent switching control (FDSC) scheme that selects the most appropriate controller at runtime. We show that our FDSC scheme is capable of approximating the solid in-band performance while maintaining acceptable out-of-band performance with regard to global time horizons as well as localized time horizons. The method is illustrated by a commonly used benchmark model.

Submitted 1 June, 2023; originally announced June 2023.

arXiv:2305.07447 [pdf, other]

Universal Source Separation with Weakly Labelled Data

Authors: Qiuqiang Kong, Ke Chen, Haohe Liu, Xingjian Du, Taylor Berg-Kirkpatrick, Shlomo Dubnov, Mark D. Plumbley

Abstract: Universal source separation (USS) is a fundamental research task for computational auditory scene analysis, which aims to separate mono recordings into individual source tracks. There are three potential challenges awaiting the solution to the audio source separation task. First, previous audio source separation systems mainly focus on separating one or a limited number of specific sources. There is a lack of research on building a unified system that can separate arbitrary sources via a single model. Second, most previous systems require clean source data to train a separator, while clean source data are scarce. Third, there is a lack of USS systems that can automatically detect and separate active sound classes at a hierarchical level. To use large-scale weakly labeled/unlabeled audio data for audio source separation, we propose a universal audio source separation framework containing: 1) an audio tagging model trained on weakly labeled data as a query net; and 2) a conditional source separation model that takes query net outputs as conditions to separate arbitrary sound sources. We investigate various query nets, source separation models, and training strategies and propose a hierarchical USS strategy to automatically detect and separate sound classes from the AudioSet ontology. By solely leveraging the weakly labelled AudioSet, our USS system is successful in separating a wide variety of sound classes, including sound event separation, music source separation, and speech enhancement. The USS system achieves an average signal-to-distortion ratio improvement (SDRi) of 5.57 dB over 527 sound classes of AudioSet; 10.57 dB on the DCASE 2018 Task 2 dataset; 8.12 dB on the MUSDB18 dataset; an SDRi of 7.28 dB on the Slakh2100 dataset; and an SSNR of 9.00 dB on the voicebank-demand dataset. We release the source code at https://github.com/bytedance/uss

Submitted 11 May, 2023; originally announced May 2023.
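
The USS abstract above reports signal-to-distortion ratio improvement (SDRi). As a rough sketch of the idea, assuming the simple energy-ratio definition SDR = 10·log10(‖s‖² / ‖s − ŝ‖²) rather than the full BSS Eval decomposition, SDRi is the SDR of the separated estimate minus the SDR of the unprocessed mixture. The function names and toy signals below are hypothetical and are not taken from the paper or its repository.

```python
# Illustrative sketch only: SDR improvement under a simple energy-ratio
# definition of SDR. Function names and sample values are assumptions.
import math


def sdr_db(reference, estimate):
    """SDR in dB: reference energy over the energy of the error."""
    signal = sum(r * r for r in reference)
    noise = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if signal == 0 or noise == 0:
        raise ValueError("SDR is undefined for zero signal or zero error")
    return 10.0 * math.log10(signal / noise)


def sdr_improvement_db(reference, mixture, estimate):
    """SDRi: how much closer the estimate is to the reference than the mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


# Toy signals: the mixture is off by 0.5 per sample, the estimate by 0.1.
reference = [1.0, -1.0, 1.0, -1.0]
mixture = [1.5, -0.5, 1.5, -0.5]
estimate = [1.1, -0.9, 1.1, -0.9]
sdri = sdr_improvement_db(reference, mixture, estimate)
```

Here the error energy drops by a factor of 25 between mixture and estimate, so `sdri` is 10·log10(25), roughly 13.98 dB.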

arXiv:2305.07220 [pdf, other]

Physical-layer Adversarial Robustness for Deep Learning-based Semantic Communications

Authors: Guoshun Nan, Zhichun Li, Jinli Zhai, Qimei Cui, Gong Chen, Xin Du, Xuefei Zhang, Xiaofeng Tao, Zhu Han, Tony Q. S. Quek

Abstract: End-to-end semantic communications (ESC) rely on deep neural networks (DNN) to boost communication efficiency by only transmitting the semantics of data, showing great potential for high-demand mobile applications. We argue that central to the success of ESC is the robust interpretation of conveyed semantics at the receiver side, especially for security-critical applications such as automatic driving and smart healthcare. However, robustifying semantic interpretation is challenging as ESC is extremely vulnerable to physical-layer adversarial attacks due to the openness of wireless channels and the fragility of neural models. Toward ESC robustness in practice, we ask the following two questions: Q1: For attacks, is it possible to generate semantic-oriented physical-layer adversarial attacks that are imperceptible, input-agnostic and controllable? Q2: Can we develop a defense strategy against such semantic distortions and previously proposed adversaries? To this end, we first present MobileSC, a novel semantic communication framework that considers the computation and memory efficiency in wireless environments. Equipped with this framework, we propose SemAdv, a physical-layer adversarial perturbation generator that aims to craft semantic adversaries over the air with the abovementioned criteria, thus answering Q1. To better characterize the real-world effects for robust training and evaluation, we further introduce a novel adversarial training method, SemMixed, to harden the ESC against SemAdv attacks and existing strong threats, thus answering Q2. Extensive experiments on three public benchmarks verify the effectiveness of our proposed methods against various physical adversarial attacks. We also show some interesting findings, e.g., our MobileSC can even be more robust than classical block-wise communication systems in the low SNR regime.

Submitted 11 May, 2023; originally announced May 2023.

Comments: 17 pages, 28 figures, accepted by IEEE JSAC

arXiv:2303.11692 [pdf, other]

ByteCover3: Accurate Cover Song Identification on Short Queries

Authors: Xingjian Du, Zijie Wang, Xia Liang, Huidong Liang, Bilei Zhu, Zejun Ma

Abstract: Deep learning based methods have become a paradigm for cover song identification (CSI) in recent years, where the ByteCover systems have achieved state-of-the-art results on all the mainstream datasets of CSI. However, with the rise of short videos, many real-world applications require matching short music excerpts to full-length music tracks in the database, which is still under-explored and waiting for an industrial-level solution. In this paper, we upgrade the previous ByteCover systems to ByteCover3, which utilizes local features to further improve the identification performance of short music queries. ByteCover3 is designed with a local alignment loss (LAL) module and a two-stage feature retrieval pipeline, allowing the system to perform CSI in a more precise and efficient way. We evaluated ByteCover3 on multiple datasets with different benchmark settings, where ByteCover3 beat all the compared methods including its previous versions.

Submitted 21 March, 2023; originally announced March 2023.

Comments: Accepted by ICASSP 2023

arXiv:2303.02657 [pdf, ps, other]

Sparsity-Aware Intelligent Massive Random Access Control in Open RAN: A Reinforcement Learning Based Approach

Authors: Xiao Tang, Sicong Liu, Xiaojiang Du, Mohsen Guizani

Abstract: Massive random access of devices in the emerging Open Radio Access Network (O-RAN) poses a great challenge to access control and management. Exploiting the bursting nature of the access requests, sparse active user detection (SAUD) is an efficient enabler of efficient access management, but the sparsity might deteriorate in case of uncoordinated massive access requests. To dynamically preserve the sparsity of access requests, a reinforcement-learning (RL)-assisted scheme of closed-loop access control utilizing the access class barring technique is proposed, where the RL policy is determined through continuous interaction between the RL agent, i.e., a next generation NodeB (gNB), and the environment. The proposed scheme can be implemented by the near-real-time RAN intelligent controller (near-RT RIC) in O-RAN, supporting rapid switching between heterogeneous vertical applications, such as mMTC and uRLLC services. Moreover, a data-driven scheme of deep-RL-assisted SAUD is proposed to resolve highly complex environments with continuous and high-dimensional state and action spaces, where a replay buffer is applied for automatic large-scale data collection. An actor-critic framework is formulated to incorporate the strategy-learning modules into the near-RT RIC. Simulation results show that the proposed schemes can achieve superior performance in both access efficiency and user detection accuracy over the benchmark scheme for different heterogeneous services with massive access requests.

Submitted 5 March, 2023; originally announced March 2023.

Comments: This paper has been submitted to IEEE Journal on Selected Areas in Communications

arXiv:2212.09930 [pdf, other]

Frequency-limited H$_2$ Model Order Reduction Based on Relative Error

Authors: Umair Zulfiqar, Xin Du, Qiuyan Song, Zhi-Hua Xiao, Victor Sreeram

Abstract: Frequency-limited model order reduction aims to approximate a high-order model with a reduced-order model that maintains high fidelity within a specific frequency range. Beyond this range, a decrease in accuracy is acceptable due to the nature of the problem. The quality of the reduced-order model is typically evaluated using absolute or relative measures of approximation error. Relative error, which represents the percentage error, becomes particularly relevant when reducing a plant model for the purpose of designing a reduced-order controller. This paper derives the necessary conditions for achieving a local optimum of the frequency-limited H2 norm for the relative error system. Based on these optimality conditions, an oblique projection algorithm is proposed to ensure a small relative error within the desired frequency interval. Unlike existing algorithms, the proposed approach does not necessitate solving large-scale Lyapunov and Riccati equations. Instead, the proposed algorithm relies on solving sparse-dense Sylvester equations, which typically emerge in the majority of H2 model order reduction algorithms, but can be efficiently solved. To evaluate the performance of the proposed algorithm, a comparison is conducted with three existing techniques: frequency-limited balanced truncation, frequency-limited balanced stochastic truncation, and frequency-limited iterative Rational Krylov algorithm. The comparative analysis focuses on designing reduced-order controllers for high-order plants. Numerical results confirm that the reduced-order controllers obtained using the proposed algorithm ensure superior robust closed-loop stability.

Submitted 24 June, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

Comments: arXiv admin note: text overlap with arXiv:2212.08247

arXiv:2212.08247 [pdf, other]

Relative Error-based Time-limited H2 Model Order Reduction via Oblique Projection

Authors: Umair Zulfiqar, Xin Du, Qiuyan Song, Zhi-Hua Xiao, Victor Sreeram

Abstract: In time-limited model order reduction, a reduced-order approximation of the original high-order model is obtained that accurately approximates the original model within the desired limited time interval. Accuracy outside that time interval is not that important. The error incurred when a reduced-order model is used as a surrogate for the original model can be quantified in absolute or relative terms to assess the performance of the model reduction algorithm. The relative error is generally more meaningful than an absolute error because if the original and reduced systems' responses are of small magnitude, the absolute error is small in magnitude as well. However, this does not necessarily mean that the reduced model is accurate. The relative error in such scenarios is useful and meaningful as it quantifies percentage error irrespective of the magnitude of the system's response. In this paper, the necessary conditions for a local optimum of the time-limited H2 norm of the relative error system are derived. Inspired by these conditions, an oblique projection algorithm is proposed that ensures small H2-norm relative error within the desired time interval. Unlike the existing relative error-based model reduction algorithms, the proposed algorithm does not require solutions of large-scale Lyapunov and Riccati equations. The proposed algorithm is compared with time-limited balanced truncation, time-limited balanced stochastic truncation, and time-limited iterative Rational Krylov algorithm. Numerical results confirm the superiority of the proposed algorithm over these existing algorithms.

Submitted 15 December, 2022; originally announced December 2022.

arXiv:2209.11455 [pdf, other]

Modular Degradation Simulation and Restoration for Under-Display Camera

Authors: Yang Zhou, Yuda Song, Xin Du

Abstract: Under-display camera (UDC) provides an elegant solution for full-screen smartphones. However, UDC captured images suffer from severe degradation since sensors lie under the display. Although this issue can be tackled by image restoration networks, these networks require large-scale image pairs for training. To this end, we propose a modular network dubbed MPGNet trained using the generative adversarial network (GAN) framework for simulating UDC imaging. Specifically, we note that the UDC imaging degradation process contains brightness attenuation, blurring, and noise corruption. Thus we model each degradation with a characteristic-related modular network, and all modular networks are cascaded to form the generator. Together with a pixel-wise discriminator and supervised loss, we can train the generator to simulate the UDC imaging degradation process. Furthermore, we present a Transformer-style network named DWFormer for UDC image restoration. For practical purposes, we use depth-wise convolution instead of the multi-head self-attention to aggregate local spatial information. Moreover, we propose a novel channel attention module to aggregate global information, which is critical for brightness recovery. We conduct evaluations on the UDC benchmark, and our method surpasses the previous state-of-the-art models by 1.23 dB on the P-OLED track and 0.71 dB on the T-OLED track, respectively.

Submitted 23 September, 2022; originally announced September 2022.

arXiv:2205.05939 [pdf, other]

NLOS Error Mitigation Using Weighted Least Squares and Kalman Filter in UWB Positioning

Authors: Ruixin Fan, Xin Du

Abstract: In wireless positioning systems, non-line-of-sight (NLOS) is a challenging problem. NLOS causes great ranging bias and location error, so NLOS mitigation is essential for high accuracy positioning. In this letter, we propose the Weighted-Least-Squares Robust Kalman Filter (WLS-RKF) for NLOS identification and mitigation. WLS-RKF employs a hypothesis test based on Mahalanobis distance for NLOS identification, and updates the corresponding Kalman filter using the WLS solution. It requires no prior knowledge about NLOS distribution or signal features. We perform simulations and experiments for ultra-wideband (UWB) positioning in various scenarios. The results confirm that WLS-RKF effectively mitigates NLOS error and achieves 5cm positioning accuracy.

Submitted 12 May, 2022; originally announced May 2022.

Comments: 6 pages, 5 figures

arXiv:2205.05036 [pdf, other]

Multi-agent Reinforcement Learning for Dynamic Resource Management in 6G in-X Subnetworks

Authors: Xiao Du, Ting Wang, Qiang Feng, Chenhui Ye, Tao Tao, Yuanming Shi, Mingsong Chen

Abstract: The 6G network enables a subnetwork-wide evolution, resulting in a "network of subnetworks". However, due to the dynamic mobility of wireless subnetworks, intra-subnetwork and inter-subnetwork data transmissions will inevitably interfere with each other, which poses a great challenge to radio resource management. Moreover, most of the existing approaches require the instantaneous channel gains between subnetworks, which are usually difficult to collect. To tackle these issues, in this paper we propose a novel, effective, intelligent radio resource management method using multi-agent deep reinforcement learning (MARL), which only needs the sum of received power, named received signal strength indicator (RSSI), on each channel instead of channel gains. However, directly separating individual interference from RSSI is almost impossible. To this end, we further propose a novel MARL architecture, named GA-Net, which integrates a hard attention layer to model the importance distribution of inter-subnetwork relationships based on RSSI and exclude the impact of unrelated subnetworks, and employs a graph attention network with a multi-head attention layer to extract the features and calculate their weights that will impact individual throughput. Experimental results prove that our proposed framework significantly outperforms both traditional and MARL-based methods in various aspects.

Submitted 10 May, 2022; originally announced May 2022.

arXiv:2204.03356 [pdf, other]

Alternating Direction Based Sequential Boolean Quadratic Programming Method for Transmit Antenna Selection

Authors: Shijie Zhu, Xu Du

Abstract: The wireless mobile communication system is updated and iterated on the whole almost every decade. It is now in the development period of the application scenarios of the fifth generation mobile communication system (5G). Unfortunately, 5G relies on plenty of small base stations with a large number of antennas that consume a lot of energy. In this paper, a novel Boolean variable quadratic programming algorithm is designed for the antenna selection optimization problem to reduce power consumption. Experiments show that the proposed algorithm achieves high complementarity satisfaction accuracy with only a few steps.

Submitted 20 September, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

arXiv:2203.14011 [pdf, other]

Approximations for Optimal Experimental Design in Power System Parameter Estimation

Authors: Xu Du, Alexander Engelmann, Timm Faulwasser, Boris Houska

Abstract: This paper is about computationally tractable methods for power system parameter estimation and Optimal Experiment Design (OED). Here, the main motivation is that OED has the potential to significantly increase the accuracy of power system parameter estimates, for example, if only a few batches of data are available. The problem is, however, that solving the exact OED problem for larger power grids turns out to be computationally expensive and, in many cases, even computationally intractable. Therefore, the present paper proposes three numerical approximation techniques, which increase the computational tractability of OED for power systems. These approximation techniques are benchmarked on 5-bus and 14-bus case studies.

Submitted 16 September, 2022; v1 submitted 26 March, 2022; originally announced March 2022.

arXiv:2202.00874 [pdf, other]

HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection

Authors: Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, Shlomo Dubnov

Abstract: Audio classification is an important task of mapping audio samples into their corresponding labels. Recently, the transformer model with self-attention mechanisms has been adopted in this field. However, existing audio transformers require large GPU memories and long training time, meanwhile relying on pretrained vision models to achieve high performance, which limits the model's scalability in audio tasks. To combat these problems, we introduce HTS-AT: an audio transformer with a hierarchical structure to reduce the model size and training time. It is further combined with a token-semantic module to map final outputs into class featuremaps, thus enabling the model for the audio event detection (i.e. localization in time). We evaluate HTS-AT on three datasets of audio classification where it achieves new state-of-the-art (SOTA) results on AudioSet and ESC-50, and equals the SOTA on Speech Command V2. It also achieves better performance in event localization than the previous CNN-based models. Moreover, HTS-AT requires only 35% model parameters and 15% training time of the previous audio transformer. These results demonstrate the high performance and high efficiency of HTS-AT.

Submitted 1 February, 2022; originally announced February 2022.

Comments: Preprint version for ICASSP 2022, Singapore

arXiv:2112.07891 [pdf, other]

Zero-shot Audio Source Separation through Query-based Learning from Weakly-labeled Data

Authors: Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, Shlomo Dubnov

Abstract: Deep learning techniques for separating audio into different sound sources face several challenges. Standard architectures require training separate models for different types of audio sources. Although some universal separators employ a single model to target multiple sources, they have difficulty generalizing to unseen sources. In this paper, we propose a three-component pipeline to train a universal audio source separator from a large, but weakly-labeled dataset: AudioSet. First, we propose a transformer-based sound event detection system for processing weakly-labeled training data. Second, we devise a query-based audio separation model that leverages this data for model training. Third, we design a latent embedding processor to encode queries that specify audio targets for separation, allowing for zero-shot generalization. Our approach uses a single model for source separation of multiple sound types, and relies solely on weakly-labeled data for training. In addition, the proposed audio separator can be used in a zero-shot setting, learning to separate types of audio sources that were never seen in training. To evaluate the separation performance, we test our model on MUSDB18, while training on the disjoint AudioSet. We further verify the zero-shot performance by conducting another experiment on audio source types that are held-out from training. The model achieves comparable Source-to-Distortion Ratio (SDR) performance to current supervised models in both cases.

Submitted 12 February, 2022; v1 submitted 15 December, 2021; originally announced December 2021.

Comments: Preprint version for Association for the Advancement of Artificial Intelligence Conference, AAAI 2022

arXiv:2110.10755 [pdf, other]

Toward Real-world Image Super-resolution via Hardware-based Adaptive Degradation Models

Authors: Rui Ma, Johnathan Czernik, Xian Du

Abstract: Most single image super-resolution (SR) methods are developed on synthetic low-resolution (LR) and high-resolution (HR) image pairs, which are simulated by a predetermined degradation operation, e.g., bicubic downsampling. However, these methods only learn the inverse process of the predetermined operation, so they fail to super resolve the real-world LR images; the true formulation deviates from the predetermined operation. To address this problem, we propose a novel supervised method to simulate an unknown degradation process with the inclusion of the prior hardware knowledge of the imaging system. We design an adaptive blurring layer (ABL) in the supervised learning framework to estimate the target LR images. The hyperparameters of the ABL can be adjusted for different imaging hardware. The experiments on the real-world datasets validate that our degradation model can estimate LR images more accurately than the predetermined degradation operation, as well as facilitate existing SR methods to perform reconstructions on real-world LR images more accurately than the conventional approaches.

Submitted 20 October, 2021; originally announced October 2021.

arXiv:2109.01696 [pdf, other]

Revisiting 3D ResNets for Video Recognition

Authors: Xianzhi Du, Yeqing Li, Yin Cui, Rui Qian, Jing Li, Irwan Bello

Abstract: A recent work from Bello shows that training and scaling strategies may be more significant than model architectures for visual recognition. This short note studies effective training and scaling strategies for video recognition models. We propose a simple scaling strategy for 3D ResNets, in combination with improved training strategies and minor architectural changes. The resulting models, termed 3D ResNet-RS, attain competitive performance of 81.0 on Kinetics-400 and 83.8 on Kinetics-600 without pre-training. When pre-trained on a large Web Video Text dataset, our best model achieves 83.5 and 84.3 on Kinetics-400 and Kinetics-600. The proposed scaling rule is further evaluated in a self-supervised setup using contrastive learning, demonstrating improved performance. Code is available at: https://github.com/tensorflow/models/tree/master/official.

Submitted 3 September, 2021; originally announced September 2021.

Comments: 6 pages

arXiv:2107.06185 [pdf]

A new method for vehicle system safety design based on data mining with uncertainty modeling

Authors: Xianping Du, Binhui Jiang, Feng Zhu

Abstract: In this research, a new data mining-based design approach has been developed for designing complex mechanical systems such as a crashworthy passenger car with uncertainty modeling. The method allows exploring the big crash simulation dataset to design the vehicle at multi-levels in a top-down manner (main energy absorbing system, components, and geometric features) and derive design rules based on the whole vehicle body safety requirements to make decisions towards the component and sub-component level design. Full vehicle and component simulation datasets are mined to build decision trees, where the interrelationship among parameters can be revealed and the design rules are derived to produce designs with good performance. This method has been extended by accounting for the uncertainty in the design variables. A new decision tree algorithm for uncertain data (DTUD) is developed to produce the desired designs and evaluate the design performance variations due to the uncertainty in design variables. The framework of this method is implemented by combining the design of experiments (DOE) and crash finite element analysis (FEA) and then demonstrated by designing a passenger car subject to front impact. The results show that the new methodology could achieve the design objectives efficiently and effectively. By applying the new method, the reliability of the final designs is also increased greatly. This approach has the potential to be applied as a general design methodology for a wide range of complex structures and mechanical systems.

Submitted 12 July, 2021; originally announced July 2021.

Comments: 38 pages, 21 figures, 6 tables

arXiv:2106.11411 [pdf, other]

Attention-based cross-modal fusion for audio-visual voice activity detection in musical video streams

Authors: Yuanbo Hou, Zhesong Yu, Xia Liang, Xingjian Du, Bilei Zhu, Zejun Ma, Dick Botteldooren

Abstract: Many previous audio-visual voice-related works focus on speech, ignoring the singing voice in the growing number of musical video streams on the Internet. For processing diverse musical video data, voice activity detection is a necessary step. This paper attempts to detect the speech and singing voices of target performers in musical video streams using audiovisual information. To integrate information of audio and visual modalities, a multi-branch network is proposed to learn audio and image representations, and the representations are fused by attention based on semantic similarity to shape the acoustic representations through the probability of anchor vocalization. Experiments show the proposed audio-visual multi-branch network far outperforms the audio-only model in challenging acoustic environments, indicating the cross-modal information fusion based on semantic correlation is sensible and successful.

Submitted 21 June, 2021; originally announced June 2021.

Comments: Accepted by INTERSPEECH 2021

arXiv:2106.11277 [pdf]

Attention-based Neural Network for Driving Environment Complexity Perception

Authors: Ce Zhang, Azim Eskandarian, Xuelai Du

Abstract: Environment perception is crucial for autonomous vehicle (AV) safety. Most existing AV perception algorithms have not studied the surrounding environment complexity and failed to include the environment complexity parameter. This paper proposes a novel attention-based neural network model to predict the complexity level of the surrounding driving environment. The proposed model takes naturalistic driving videos and corresponding vehicle dynamics parameters as input. It consists of a Yolo-v3 object detection algorithm, a heat map generation algorithm, CNN-based feature extractors, and attention-based feature extractors for both video and time-series vehicle dynamics data inputs to extract features. The output from the proposed algorithm is a surrounding environment complexity parameter. The Berkeley DeepDrive dataset (BDD Dataset) and subjectively labeled surrounding environment complexity levels are used for model training and validation to evaluate the algorithm. The proposed attention-based network achieves 91.22% average classification accuracy to classify the surrounding environment complexity. It proves that the environment complexity level can be accurately predicted and applied for future AVs' environment perception studies.

Submitted 21 June, 2021; originally announced June 2021.

Comments: Accepted by 2021 IEEE Intelligent Transportation Systems Conference

arXiv:2102.09971 [pdf, other]

Speech enhancement with weakly labelled data from AudioSet

Authors: Qiuqiang Kong, Haohe Liu, Xingjian Du, Li Chen, Rui Xia, Yuxuan Wang

Abstract: Speech enhancement is a task to improve the intelligibility and perceptual quality of degraded speech signals. Recently, neural network based methods have been applied to speech enhancement. However, many neural network based methods require noisy and clean speech pairs for training. We propose a speech enhancement framework that can be trained with the large-scale weakly labelled AudioSet dataset. Weakly labelled data only contain audio tags of audio clips, but not the onset or offset times of speech. We first apply pretrained audio neural networks (PANNs) to detect anchor segments that contain speech or sound events in audio clips. Then, we randomly mix two detected anchor segments containing speech and sound events as a mixture, and build a conditional source separation network using PANNs predictions as soft conditions for speech enhancement. In inference, we input a noisy speech signal with the one-hot encoding of "Speech" as a condition to the trained system to predict enhanced speech. Our system achieves a PESQ of 2.28 and an SSNR of 8.75 dB on the VoiceBank-DEMAND dataset, outperforming the previous SEGAN system of 2.16 and 7.73 dB respectively.

Submitted 19 February, 2021; originally announced February 2021.

Comments: 5 pages

arXiv:2102.09966 [pdf, ps, other]

CatNet: music source separation system with mix-audio augmentation

Authors: Xuchen Song, Qiuqiang Kong, Xingjian Du, Yuxuan Wang

Abstract: Music source separation (MSS) is the task of separating a music piece into individual sources, such as vocals and accompaniment. Recently, neural network based methods have been applied to address the MSS problem, and can be categorized into spectrogram and time-domain based methods. However, there is a lack of research on using complementary information of spectrogram and time-domain inputs for MSS. In this article, we propose a CatNet framework that concatenates a UNet separation branch using spectrogram as input and a WavUNet separation branch using time-domain waveform as input for MSS. We propose an end-to-end and fully differentiable system that incorporates spectrogram calculation into CatNet. In addition, we propose a novel mix-audio data augmentation method that randomly mixes audio segments from the same source as augmented audio segments for training. Our proposed CatNet MSS system achieves a state-of-the-art vocals separation source distortion ratio (SDR) of 7.54 dB, outperforming MMDenseNet of 6.57 dB evaluated on the MUSDB18 dataset.

Submitted 19 February, 2021; originally announced February 2021.

Comments: 5 pages

arXiv:2102.03603 [pdf, other]

On frequency- and time-limited H2-optimal model order reduction

Authors: Umair Zulfiqar, Victor Sreeram, Xin Du

Abstract: In this paper, the problems of frequency-limited and time-limited H2-optimal model order reduction of linear time-invariant systems are considered within the oblique projection framework. It is shown that it is inherently not possible to satisfy all the necessary conditions for the local minimizer in the oblique projection framework. The conditions for exact satisfaction of the optimality conditions are also discussed. Further, the equivalence between the tangential interpolation conditions and the gramians-based necessary condition for the local optimum is established. Based on this equivalence, iterative algorithms that nearly satisfy these interpolation-based necessary conditions are proposed. The deviation in satisfaction of the optimality conditions decays as the order of the reduced model is increased in the proposed algorithms. Moreover, stationary point iteration algorithms that satisfy two out of three necessary conditions for the local minimizer are also proposed. There also, the deviation in satisfaction of the third optimality condition decays as the order of the reduced model is increased in the proposed algorithms. The efficacy of the proposed algorithms is validated by considering one illustrative and three high-order models that are considered a benchmark for testing model order reduction algorithms.

Submitted 13 September, 2021; v1 submitted 6 February, 2021; originally announced February 2021.

arXiv:2101.06745 [pdf, other]

Frequency-weighted H2-optimal model order reduction via oblique projection

Authors: Umair Zulfiqar, Victor Sreeram, Mian Ilyas Ahmad, Xin Du

Abstract: In projection-based model order reduction, a reduced-order approximation of the original full-order system is obtained by projecting it onto a reduced subspace that contains its dominant characteristics. The problem of frequency-weighted H2-optimal model order reduction is to construct a local optimum in terms of the H2-norm of the weighted error transfer function. In this paper, a projection-based model order reduction algorithm is proposed that constructs reduced-order models that nearly satisfy the first-order optimality conditions for the frequency-weighted H2-optimal model order reduction problem. It is shown that as the order of the reduced model is increased, the deviation in the satisfaction of the optimality conditions reduces further. Numerical methods are also discussed that improve the computational efficiency of the proposed algorithm. Three numerical examples are presented to demonstrate the efficacy of the proposed algorithm.

Submitted 2 May, 2021; v1 submitted 17 January, 2021; originally announced January 2021.

arXiv:2011.11020 [pdf, other]

Cryo-ZSSR: multiple-image super-resolution based on deep internal learning

Authors: Qinwen Huang, Ye Zhou, Xiaochen Du, Reed Chen, Jianyou Wang, Cynthia Rudin, Alberto Bartesaghi

Abstract: Single-particle cryo-electron microscopy (cryo-EM) is an emerging imaging modality capable of visualizing proteins and macro-molecular complexes at near-atomic resolution. The low electron doses used to prevent sample radiation damage result in images where the power of the noise is 100 times greater than the power of the signal. To overcome the low SNRs, hundreds of thousands of particle projections acquired over several days of data collection are averaged in 3D to determine the structure of interest. Meanwhile, recent image super-resolution (SR) techniques based on neural networks have shown state-of-the-art performance on natural images. Building on these advances, we present a multiple-image SR algorithm based on deep internal learning designed specifically to work under low-SNR conditions. Our approach leverages the internal image statistics of cryo-EM movies and does not require training on ground-truth data. When applied to a single-particle dataset of apoferritin, we show that the resolution of 3D structures obtained from SR micrographs can surpass the limits imposed by the imaging system. Our results indicate that the combination of low magnification imaging with image SR has the potential to accelerate cryo-EM data collection without sacrificing resolution.

Submitted 22 November, 2020; originally announced November 2020.

Comments: 11 pages, 4 figures

arXiv:2011.03988 [pdf, other]

Online power system parameter estimation and optimal operation

Authors: Xu Du, Alexander Engelmann, Timm Faulwasser, Boris Houska

Abstract: The integration of renewables into electrical grids calls for optimization-based control schemes requiring reliable grid models. Classically, parameter estimation and optimization-based control are often decoupled, which leads to high system operation cost in the estimation procedure. The present work proposes a method for simultaneously minimizing grid operation cost and optimally estimating line parameters based on methods for the optimal design of experiments. This method leads to a substantial reduction in cost for optimal estimation and to higher accuracy in the parameters compared with standard Optimal Power Flow and maximum-likelihood estimation. We illustrate the performance of the proposed method on a benchmark system.

Submitted 18 March, 2021; v1 submitted 8 November, 2020; originally announced November 2020.

arXiv:2010.14022 [pdf, other]

ByteCover: Cover Song Identification via Multi-Loss Training

Authors: Xingjian Du, Zhesong Yu, Bilei Zhu, Xiaoou Chen, Zejun Ma

Abstract: We present in this paper ByteCover, which is a new feature learning method for cover song identification (CSI). ByteCover is built based on the classical ResNet model, and two major improvements are designed to further enhance the capability of the model for CSI. In the first improvement, we introduce the integration of instance normalization (IN) and batch normalization (BN) to build IBN blocks, which are major components of our ResNet-IBN model. With the help of the IBN blocks, our CSI model can learn features that are invariant to the changes of musical attributes such as key, tempo, timbre and genre, while preserving the version information. In the second improvement, we employ the BNNeck method to allow multi-loss training and encourage our method to jointly optimize a classification loss and a triplet loss; by this means, the inter-class discrimination and intra-class compactness of cover songs can be ensured at the same time. A set of experiments demonstrated the effectiveness and efficiency of ByteCover on multiple datasets, and on the Da-TACOS dataset, ByteCover outperformed the best competitive system by 20.9%.

Submitted 23 April, 2021; v1 submitted 26 October, 2020; originally announced October 2020.
