Search | arXiv e-print repository

Heterogeneous Space Fusion and Dual-Dimension Attention: A New Paradigm for Speech Enhancement

Authors: Tao Zheng, Liejun Wang, Yinfeng Yu

Abstract: Self-supervised learning has demonstrated impressive performance in speech tasks, yet there remains ample opportunity for advancement in the realm of speech enhancement research. In addressing speech tasks, confining the attention mechanism solely to the temporal dimension poses limitations in effectively focusing on critical speech features. Considering the aforementioned issues, our study introd… ▽ More Self-supervised learning has demonstrated impressive performance in speech tasks, yet there remains ample opportunity for advancement in the realm of speech enhancement research. In addressing speech tasks, confining the attention mechanism solely to the temporal dimension poses limitations in effectively focusing on critical speech features. Considering the aforementioned issues, our study introduces a novel speech enhancement framework, HFSDA, which skillfully integrates heterogeneous spatial features and incorporates a dual-dimension attention mechanism to significantly enhance speech clarity and quality in noisy environments. By leveraging self-supervised learning embeddings in tandem with Short-Time Fourier Transform (STFT) spectrogram features, our model excels at capturing both high-level semantic information and detailed spectral data, enabling a more thorough analysis and refinement of speech signals. Furthermore, we employ the innovative Omni-dimensional Dynamic Convolution (ODConv) technology within the spectrogram input branch, enabling enhanced extraction and integration of crucial information across multiple dimensions. Additionally, we refine the Conformer model by enhancing its feature extraction capabilities not only in the temporal dimension but also across the spectral domain. Extensive experiments on the VCTK-DEMAND dataset show that HFSDA is comparable to existing state-of-the-art models, confirming the validity of our approach. △ Less

Submitted 13 August, 2024; originally announced August 2024.

Comments: Accepted for publication by IEEE International Conference on Systems, Man, and Cybernetics 2024

arXiv:2408.06906 [pdf, other]

VNet: A GAN-based Multi-Tier Discriminator Network for Speech Synthesis Vocoders

Authors: Yubing Cao, Yongming Li, Liejun Wang, Yinfeng Yu

Abstract: Since the introduction of Generative Adversarial Networks (GANs) in speech synthesis, remarkable achievements have been attained. In a thorough exploration of vocoders, it has been discovered that audio waveforms can be generated at speeds exceeding real-time while maintaining high fidelity, achieved through the utilization of GAN-based models. Typically, the inputs to the vocoder consist of band-… ▽ More Since the introduction of Generative Adversarial Networks (GANs) in speech synthesis, remarkable achievements have been attained. In a thorough exploration of vocoders, it has been discovered that audio waveforms can be generated at speeds exceeding real-time while maintaining high fidelity, achieved through the utilization of GAN-based models. Typically, the inputs to the vocoder consist of band-limited spectral information, which inevitably sacrifices high-frequency details. To address this, we adopt the full-band Mel spectrogram information as input, aiming to provide the vocoder with the most comprehensive information possible. However, previous studies have revealed that the use of full-band spectral information as input can result in the issue of over-smoothing, compromising the naturalness of the synthesized speech. To tackle this challenge, we propose VNet, a GAN-based neural vocoder network that incorporates full-band spectral information and introduces a Multi-Tier Discriminator (MTD) comprising multiple sub-discriminators to generate high-resolution signals. Additionally, we introduce an asymptotically constrained method that modifies the adversarial loss of the generator and discriminator, enhancing the stability of the training process. Through rigorous experiments, we demonstrate that the VNet model is capable of generating high-fidelity speech and significantly improving the performance of the vocoder. △ Less

Submitted 13 August, 2024; originally announced August 2024.

Comments: Accepted for publication by IEEE International Conference on Systems, Man, and Cybernetics 2024

arXiv:2408.06851 [pdf, other]

BSS-CFFMA: Cross-Domain Feature Fusion and Multi-Attention Speech Enhancement Network based on Self-Supervised Embedding

Authors: Alimjan Mattursun, Liejun Wang, Yinfeng Yu

Abstract: Speech self-supervised learning (SSL) represents has achieved state-of-the-art (SOTA) performance in multiple downstream tasks. However, its application in speech enhancement (SE) tasks remains immature, offering opportunities for improvement. In this study, we introduce a novel cross-domain feature fusion and multi-attention speech enhancement network, termed BSS-CFFMA, which leverages self-super… ▽ More Speech self-supervised learning (SSL) represents has achieved state-of-the-art (SOTA) performance in multiple downstream tasks. However, its application in speech enhancement (SE) tasks remains immature, offering opportunities for improvement. In this study, we introduce a novel cross-domain feature fusion and multi-attention speech enhancement network, termed BSS-CFFMA, which leverages self-supervised embeddings. BSS-CFFMA comprises a multi-scale cross-domain feature fusion (MSCFF) block and a residual hybrid multi-attention (RHMA) block. The MSCFF block effectively integrates cross-domain features, facilitating the extraction of rich acoustic information. The RHMA block, serving as the primary enhancement module, utilizes three distinct attention modules to capture diverse attention representations and estimate high-quality speech signals. We evaluate the performance of the BSS-CFFMA model through comparative and ablation studies on the VoiceBank-DEMAND dataset, achieving SOTA results. Furthermore, we select three types of data from the WHAMR! dataset, a collection specifically designed for speech enhancement tasks, to assess the capabilities of BSS-CFFMA in tasks such as denoising only, dereverberation only, and simultaneous denoising and dereverberation. This study marks the first attempt to explore the effectiveness of self-supervised embedding-based speech enhancement methods in complex tasks encompassing dereverberation and simultaneous denoising and dereverberation. The demo implementation of BSS-CFFMA is available online\footnote[2]{https://github.com/AlimMat/BSS-CFFMA. \label{s1}}. △ Less

Submitted 13 August, 2024; originally announced August 2024.

Comments: Accepted for publication by IEEE International Conference on Systems, Man, and Cybernetics 2024

arXiv:2407.16779 [pdf, other]

Learning Networked Dynamical System Models with Weak Form and Graph Neural Networks

Authors: Yin Yu, Daning Huang, Seho Park, Herschel C. Pangborn

Abstract: This paper presents a sequence of two approaches for the data-driven control-oriented modeling of networked systems, i.e., the systems that involve many interacting dynamical components. First, a novel deep learning approach named the weak Latent Dynamics Model (wLDM) is developed for learning generic nonlinear dynamics with control. Leveraging the weak form, the wLDM enables more numerically stab… ▽ More This paper presents a sequence of two approaches for the data-driven control-oriented modeling of networked systems, i.e., the systems that involve many interacting dynamical components. First, a novel deep learning approach named the weak Latent Dynamics Model (wLDM) is developed for learning generic nonlinear dynamics with control. Leveraging the weak form, the wLDM enables more numerically stable and computationally efficient training as well as more accurate prediction, when compared to conventional methods such as neural ordinary differential equations. Building upon the wLDM framework, we propose the weak Graph Koopman Bilinear Form (wGKBF) model, which integrates geometric deep learning and Koopman theory to learn latent space dynamics for networked systems, especially for the challenging cases having multiple timescales. The effectiveness of the wLDM framework and wGKBF model are demonstrated on three example systems of increasing complexity - a controlled double pendulum, the stiff Brusselator dynamics, and an electrified aircraft energy system. These numerical examples show that the wLDM and wGKBF achieve superior predictive accuracy and training efficiency as compared to baseline models. Parametric studies provide insights into the effects of hyperparameters in the weak form. The proposed framework shows the capability to efficiently capture control-dependent dynamics in these systems, including stiff dynamics and multi-physics interactions, offering a promising direction for learning control-oriented models of complex networked systems. △ Less

Submitted 23 July, 2024; originally announced July 2024.

arXiv:2407.12380 [pdf, other]

PCQ: Emotion Recognition in Speech via Progressive Channel Querying

Authors: Xincheng Wang, Liejun Wang, Yinfeng Yu, Xinxin Jiao

Abstract: In human-computer interaction (HCI), Speech Emotion Recognition (SER) is a key technology for understanding human intentions and emotions. Traditional SER methods struggle to effectively capture the long-term temporal correla-tions and dynamic variations in complex emotional expressions. To overcome these limitations, we introduce the PCQ method, a pioneering approach for SER via \textbf{P}rogress… ▽ More In human-computer interaction (HCI), Speech Emotion Recognition (SER) is a key technology for understanding human intentions and emotions. Traditional SER methods struggle to effectively capture the long-term temporal correla-tions and dynamic variations in complex emotional expressions. To overcome these limitations, we introduce the PCQ method, a pioneering approach for SER via \textbf{P}rogressive \textbf{C}hannel \textbf{Q}uerying. This method can drill down layer by layer in the channel dimension through the channel query technique to achieve dynamic modeling of long-term contextual information of emotions. This mul-ti-level analysis gives the PCQ method an edge in capturing the nuances of hu-man emotions. Experimental results show that our model improves the weighted average (WA) accuracy by 3.98\% and 3.45\% and the unweighted av-erage (UA) accuracy by 5.67\% and 5.83\% on the IEMOCAP and EMODB emotion recognition datasets, respectively, significantly exceeding the baseline levels. △ Less

Submitted 17 July, 2024; originally announced July 2024.

Comments: Accepted for publication by International Conference On Intelligent Computing 2024. For data and code, see <a href="https://github.com/ICIG/PCQ-Net">this https URL</a>

arXiv:2407.06525 [pdf, other]

UnmixingSR: Material-aware Network with Unsupervised Unmixing as Auxiliary Task for Hyperspectral Image Super-resolution

Authors: Yang Yu

Abstract: Deep learning-based (DL-based) hyperspectral image (HIS) super-resolution (SR) methods have achieved remarkable performance and attracted attention in industry and academia. Nonetheless, most current methods explored and learned the mapping relationship between low-resolution (LR) and high-resolution (HR) HSIs, leading to the side effect of increasing unreliability and irrationality in solving the… ▽ More Deep learning-based (DL-based) hyperspectral image (HIS) super-resolution (SR) methods have achieved remarkable performance and attracted attention in industry and academia. Nonetheless, most current methods explored and learned the mapping relationship between low-resolution (LR) and high-resolution (HR) HSIs, leading to the side effect of increasing unreliability and irrationality in solving the ill-posed SR problem. We find, quite interestingly, LR imaging is similar to the mixed pixel phenomenon. A single photodetector in sensor arrays receives the reflectance signals reflected by a number of classes, resulting in low spatial resolution and mixed pixel problems. Inspired by this observation, this paper proposes a component-aware HSI SR network called UnmixingSR, in which the unsupervised HU as an auxiliary task is used to perceive the material components of HSIs. We regard HU as an auxiliary task and incorporate it into the HSI SR process by exploring the constraints between LR and HR abundances. Instead of only learning the mapping relationship between LR and HR HSIs, we leverage the bond between LR abundances and HR abundances to boost the stability of our method in solving SR problems. Moreover, the proposed unmixing process can be embedded into existing deep SR models as a plug-in-play auxiliary task. Experimental results on hyperspectral experiments show that unmixing process as an auxiliary task incorporated into the SR problem is feasible and rational, achieving outstanding performance. The code is available at △ Less

Submitted 8 July, 2024; originally announced July 2024.

arXiv:2407.01956 [pdf, other]

Cloud-Edge-Terminal Collaborative AIGC for Autonomous Driving

Authors: Jianan Zhang, Zhiwei Wei, Boxun Liu, Xiayi Wang, Yong Yu, Rongqing Zhang

Abstract: In dynamic autonomous driving environment, Artificial Intelligence-Generated Content (AIGC) technology can supplement vehicle perception and decision making by leveraging models' generative and predictive capabilities, and has the potential to enhance motion planning, trajectory prediction and traffic simulation. This article proposes a cloud-edge-terminal collaborative architecture to support AIG… ▽ More In dynamic autonomous driving environment, Artificial Intelligence-Generated Content (AIGC) technology can supplement vehicle perception and decision making by leveraging models' generative and predictive capabilities, and has the potential to enhance motion planning, trajectory prediction and traffic simulation. This article proposes a cloud-edge-terminal collaborative architecture to support AIGC for autonomous driving. By delving into the unique properties of AIGC services, this article initiates the attempts to construct mutually supportive AIGC and network systems for autonomous driving, including communication, storage and computation resource allocation schemes to support AIGC services, and leveraging AIGC to assist system design and resource management. △ Less

Submitted 2 July, 2024; originally announced July 2024.

arXiv:2407.00995 [pdf, other]

Data on the Move: Traffic-Oriented Data Trading Platform Powered by AI Agent with Common Sense

Authors: Yi Yu, Shengyue Yao, Tianchen Zhou, Yexuan Fu, Jingru Yu, Ding Wang, Xuhong Wang, Cen Chen, Yilun Lin

Abstract: In the digital era, data has become a pivotal asset, advancing technologies such as autonomous driving. Despite this, data trading faces challenges like the absence of robust pricing methods and the lack of trustworthy trading mechanisms. To address these challenges, we introduce a traffic-oriented data trading platform named Data on The Move (DTM), integrating traffic simulation, data trading, an… ▽ More In the digital era, data has become a pivotal asset, advancing technologies such as autonomous driving. Despite this, data trading faces challenges like the absence of robust pricing methods and the lack of trustworthy trading mechanisms. To address these challenges, we introduce a traffic-oriented data trading platform named Data on The Move (DTM), integrating traffic simulation, data trading, and Artificial Intelligent (AI) agents. The DTM platform supports evident-based data value evaluation and AI-based trading mechanisms. Leveraging the common sense capabilities of Large Language Models (LLMs) to assess traffic state and data value, DTM can determine reasonable traffic data pricing through multi-round interaction and simulations. Moreover, DTM provides a pricing method validation by simulating traffic systems, multi-agent interactions, and the heterogeneity and irrational behaviors of individuals in the trading market. Within the DTM platform, entities such as connected vehicles and traffic light controllers could engage in information collecting, data pricing, trading, and decision-making. Simulation results demonstrate that our proposed AI agent-based pricing approach enhances data trading by offering rational prices, as evidenced by the observed improvement in traffic efficiency. This underscores the effectiveness and practical value of DTM, offering new perspectives for the evolution of data markets and smart cities. To the best of our knowledge, this is the first study employing LLMs in data pricing and a pioneering data trading practice in the field of intelligent vehicles and smart cities. △ Less

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2406.08761 [pdf, other]

VISinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation

Authors: Yifeng Yu, Jiatong Shi, Yuning Wu, Shinji Watanabe

Abstract: Singing Voice Synthesis (SVS) has witnessed significant advancements with the advent of deep learning techniques. However, a significant challenge in SVS is the scarcity of labeled singing voice data, which limits the effectiveness of supervised learning methods. In response to this challenge, this paper introduces a novel approach to enhance the quality of SVS by leveraging unlabeled data from pr… ▽ More Singing Voice Synthesis (SVS) has witnessed significant advancements with the advent of deep learning techniques. However, a significant challenge in SVS is the scarcity of labeled singing voice data, which limits the effectiveness of supervised learning methods. In response to this challenge, this paper introduces a novel approach to enhance the quality of SVS by leveraging unlabeled data from pre-trained self-supervised learning models. Building upon the existing VISinger2 framework, this study integrates additional spectral feature information into the system to enhance its performance. The integration aims to harness the rich acoustic features from the pre-trained models, thereby enriching the synthesis and yielding a more natural and expressive singing voice. Experimental results in various corpora demonstrate the efficacy of this approach in improving the overall quality of synthesized singing voices in both objective and subjective metrics. △ Less

Submitted 12 June, 2024; originally announced June 2024.

Comments: 4 pages, 2 figures

arXiv:2405.09572 [pdf, other]

Deep Neural Operator Enabled Digital Twin Modeling for Additive Manufacturing

Authors: Ning Liu, Xuxiao Li, Manoj R. Rajanna, Edward W. Reutzel, Brady Sawyer, Prahalada Rao, Jim Lua, Nam Phan, Yue Yu

Abstract: A digital twin (DT), with the components of a physics-based model, a data-driven model, and a machine learning (ML) enabled efficient surrogate, behaves as a virtual twin of the real-world physical process. In terms of Laser Powder Bed Fusion (L-PBF) based additive manufacturing (AM), a DT can predict the current and future states of the melt pool and the resulting defects corresponding to the inp… ▽ More A digital twin (DT), with the components of a physics-based model, a data-driven model, and a machine learning (ML) enabled efficient surrogate, behaves as a virtual twin of the real-world physical process. In terms of Laser Powder Bed Fusion (L-PBF) based additive manufacturing (AM), a DT can predict the current and future states of the melt pool and the resulting defects corresponding to the input laser parameters, evolve itself by assimilating in-situ sensor data, and optimize the laser parameters to mitigate defect formation. In this paper, we present a deep neural operator enabled computational framework of the DT for closed-loop feedback control of the L-PBF process. This is accomplished by building a high-fidelity computational model to accurately represent the melt pool states, an efficient surrogate model to approximate the melt pool solution field, followed by an physics-based procedure to extract information from the computed melt pool simulation that can further be correlated to the defect quantities of interest (e.g., surface roughness). In particular, we leverage the data generated from the high-fidelity physics-based model and train a series of Fourier neural operator (FNO) based ML models to effectively learn the relation between the input laser parameters and the corresponding full temperature field of the melt pool. Subsequently, a set of physics-informed variables such as the melt pool dimensions and the peak temperature can be extracted to compute the resulting defects. An optimization algorithm is then exercised to control laser input and minimize defects. On the other hand, the constructed DT can also evolve with the physical twin via offline finetuning and online material calibration. Finally, a probabilistic framework is adopted for uncertainty quantification. The developed DT is envisioned to guide the AM process and facilitate high-quality manufacturing. △ Less

Submitted 12 May, 2024; originally announced May 2024.

arXiv:2405.04867 [pdf, other]

MIPI 2024 Challenge on Demosaic for HybridEVS Camera: Methods and Results

Authors: Yaqi Wu, Zhihao Fan, Xiaofeng Chu, Jimmy S. Ren, Xiaoming Li, Zongsheng Yue, Chongyi Li, Shangcheng Zhou, Ruicheng Feng, Yuekun Dai, Peiqing Yang, Chen Change Loy, Senyan Xu, Zhijing Sun, Jiaying Zhu, Yurui Zhu, Xueyang Fu, Zheng-Jun Zha, Jun Cao, Cheng Li, Shu Chen, Liang Ma, Shiyang Zhou, Haijin Zeng, Kai Feng , et al. (24 additional authors not shown)

Abstract: The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photogra… ▽ More The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photography and imaging (MIPI). Building on the achievements of the previous MIPI Workshops held at ECCV 2022 and CVPR 2023, we introduce our third MIPI challenge including three tracks focusing on novel image sensors and imaging algorithms. In this paper, we summarize and review the Nighttime Flare Removal track on MIPI 2024. In total, 170 participants were successfully registered, and 14 teams submitted results in the final testing phase. The developed solutions in this challenge achieved state-of-the-art performance on Nighttime Flare Removal. More details of this challenge and the link to the dataset can be found at https://mipi-challenge.org/MIPI2024/. △ Less

Submitted 8 May, 2024; originally announced May 2024.

Comments: MIPI@CVPR2024. Website: https://mipi-challenge.org/MIPI2024/

arXiv:2405.02963 [pdf]

Preventive Audits for Data Applications Before Data Sharing in the Power IoT

Authors: Bohong Wang, Qinglai Guo, Yanxi Lin, Yang Yu

Abstract: With the increase in data volume, more types of data are being used and shared, especially in the power Internet of Things (IoT). However, the processes of data sharing may lead to unexpected information leakage because of the ubiquitous relevance among the different data, thus it is necessary for data owners to conduct preventive audits for data applications before data sharing to avoid the risk… ▽ More With the increase in data volume, more types of data are being used and shared, especially in the power Internet of Things (IoT). However, the processes of data sharing may lead to unexpected information leakage because of the ubiquitous relevance among the different data, thus it is necessary for data owners to conduct preventive audits for data applications before data sharing to avoid the risk of key information leakage. Considering that the same data may play completely different roles in different application scenarios, data owners should know the expected data applications of the data buyers in advance and provide modified data that are less relevant to the private information of the data owners and more relevant to the nonprivate information that the data buyers need. In this paper, data sharing in the power IoT is regarded as the background, and the mutual information of the data and their implicit information is selected as the data feature parameter to indicate the relevance between the data and their implicit information or the ability to infer the implicit information from the data. Therefore, preventive audits should be conducted based on changes in the data feature parameters before and after data sharing. The probability exchange adjustment method is proposed as the theoretical basis of preventive audits under simplified consumption, and the corresponding optimization models are constructed and extended to more practical scenarios with multivariate characteristics. Finally, case studies are used to validate the effectiveness of the proposed preventive audits. △ Less

Submitted 5 May, 2024; originally announced May 2024.

Comments: 19 pages, 18 figures

arXiv:2404.18418 [pdf, other]

Decomposition Model Assisted Energy-Saving Design in Radio Access Network

Authors: Xiaoxue Zhao, Yijun Yu, Yexing Li, Dong Li, Yao Wang, Chungang Yang

Abstract: The continuous emergence of novel services and massive connections involve huge energy consumption towards ultra-dense radio access networks. Moreover, there exist much more number of controllable parameters that can be adjusted to reduce the energy consumption from a network-wide perspective. However, a network-level energy-saving intent usually contains multiple network objectives and constraint… ▽ More The continuous emergence of novel services and massive connections involve huge energy consumption towards ultra-dense radio access networks. Moreover, there exist much more number of controllable parameters that can be adjusted to reduce the energy consumption from a network-wide perspective. However, a network-level energy-saving intent usually contains multiple network objectives and constraints. Therefore, it is critical to decompose a network-level energy-saving intent into multiple levels of configurated operations from a top-down refinement perspective. In this work, we utilize a softgoal interdependency graph decomposition model to assist energy-saving scheme design. Meanwhile, we propose an energy-saving approach based on deep Q-network, which achieve a better trade-off among the energy consumption, the throughput, and the first packet delay. In addition, we illustrate how the decomposition model can assist in making energy-saving decisions. Evaluation results demonstrate the performance gain of the proposed scheme in accelerating the model training process. △ Less

Submitted 29 April, 2024; originally announced April 2024.

arXiv:2404.18176 [pdf, other]

Auto-Optimized Maximum Torque Per Ampere Control of IPMSM Using Dual Control for Exploration and Exploitation

Authors: Yuefei Zuo, Yalei Yu, Jun Yang, Wen-Hua Chen

Abstract: In this paper, a maximum torque per ampere (MTPA) control strategy for the interior permanent magnet synchronous motor (IPMSM) using dual control for exploration and exploitation (DCEE). In the proposed method, the permanent magnet flux and the difference between the $d$- and $q$-axis inductance are identified by multiple estimators using the recursive least square method. The initial values of th… ▽ More In this paper, a maximum torque per ampere (MTPA) control strategy for the interior permanent magnet synchronous motor (IPMSM) using dual control for exploration and exploitation (DCEE). In the proposed method, the permanent magnet flux and the difference between the $d$- and $q$-axis inductance are identified by multiple estimators using the recursive least square method. The initial values of the estimated parameters in different estimators are different. By using multiple estimators, exploration of the operational environment to reduce knowledge uncertainty can be realized. Compared to those MTPA control strategies based on the extremum-seeking method, the proposed method has better dynamic performance when speed or load varies. The effectiveness of the proposed method is verified by simulations. △ Less

Submitted 28 April, 2024; originally announced April 2024.

arXiv:2404.13789 [pdf, other]

Anchor-aware Deep Metric Learning for Audio-visual Retrieval

Authors: Donghuo Zeng, Yanan Wang, Kazushi Ikeda, Yi Yu

Abstract: Metric learning minimizes the gap between similar (positive) pairs of data points and increases the separation of dissimilar (negative) pairs, aiming at capturing the underlying data structure and enhancing the performance of tasks like audio-visual cross-modal retrieval (AV-CMR). Recent works employ sampling methods to select impactful data points from the embedding space during training. However… ▽ More Metric learning minimizes the gap between similar (positive) pairs of data points and increases the separation of dissimilar (negative) pairs, aiming at capturing the underlying data structure and enhancing the performance of tasks like audio-visual cross-modal retrieval (AV-CMR). Recent works employ sampling methods to select impactful data points from the embedding space during training. However, the model training fails to fully explore the space due to the scarcity of training data points, resulting in an incomplete representation of the overall positive and negative distributions. In this paper, we propose an innovative Anchor-aware Deep Metric Learning (AADML) method to address this challenge by uncovering the underlying correlations among existing data points, which enhances the quality of the shared embedding space. Specifically, our method establishes a correlation graph-based manifold structure by considering the dependencies between each sample as the anchor and its semantically similar samples. Through dynamic weighting of the correlations within this underlying manifold structure using an attention-driven mechanism, Anchor Awareness (AA) scores are obtained for each anchor. These AA scores serve as data proxies to compute relative distances in metric learning approaches. Extensive experiments conducted on two audio-visual benchmark datasets demonstrate the effectiveness of our proposed AADML method, significantly surpassing state-of-the-art models. Furthermore, we investigate the integration of AA proxies with various metric learning methods, further highlighting the efficacy of our approach. △ Less

Submitted 21 April, 2024; originally announced April 2024.

Comments: 9 pages, 5 figures. Accepted by ACM ICMR 2024

arXiv:2404.13509 [pdf, ps, other]

MFHCA: Enhancing Speech Emotion Recognition Via Multi-Spatial Fusion and Hierarchical Cooperative Attention

Authors: Xinxin Jiao, Liejun Wang, Yinfeng Yu

Abstract: Speech emotion recognition is crucial in human-computer interaction, but extracting and using emotional cues from audio poses challenges. This paper introduces MFHCA, a novel method for Speech Emotion Recognition using Multi-Spatial Fusion and Hierarchical Cooperative Attention on spectrograms and raw audio. We employ the Multi-Spatial Fusion module (MF) to efficiently identify emotion-related spe… ▽ More Speech emotion recognition is crucial in human-computer interaction, but extracting and using emotional cues from audio poses challenges. This paper introduces MFHCA, a novel method for Speech Emotion Recognition using Multi-Spatial Fusion and Hierarchical Cooperative Attention on spectrograms and raw audio. We employ the Multi-Spatial Fusion module (MF) to efficiently identify emotion-related spectrogram regions and integrate Hubert features for higher-level acoustic information. Our approach also includes a Hierarchical Cooperative Attention module (HCA) to merge features from various auditory levels. We evaluate our method on the IEMOCAP dataset and achieve 2.6\% and 1.87\% improvements on the weighted accuracy and unweighted accuracy, respectively. Extensive experiments demonstrate the effectiveness of the proposed method. △ Less

Submitted 20 April, 2024; originally announced April 2024.

Comments: Main paper (5 pages). Accepted for publication by ICME 2024

arXiv:2404.09007 [pdf, other]

A Framework for Safe Probabilistic Invariance Verification of Stochastic Dynamical Systems

Authors: Taoran Wu, Yiqing Yu, Bican Xia, Ji Wang, Bai Xue

Abstract: Ensuring safety through set invariance has proven to be a valuable method in various robotics and control applications. This paper introduces a comprehensive framework for the safe probabilistic invariance verification of both discrete- and continuous-time stochastic dynamical systems over an infinite time horizon. The objective is to ascertain the lower and upper bounds of liveness probabilities… ▽ More Ensuring safety through set invariance has proven to be a valuable method in various robotics and control applications. This paper introduces a comprehensive framework for the safe probabilistic invariance verification of both discrete- and continuous-time stochastic dynamical systems over an infinite time horizon. The objective is to ascertain the lower and upper bounds of liveness probabilities for a given safe set and set of initial states. The liveness probability signifies the likelihood of the system remaining within the safe set indefinitely, starting from a state in the initial set. To address this problem, we propose optimizations for verifying safe probabilistic invariance in discrete-time and continuous-time stochastic dynamical systems. These optimizations are constructed via either using the Doob's nonnegative supermartingale inequality-based method or relaxing the equations described in [30,32], which can precisely characterize the probability of reaching a target set while avoiding unsafe states. Finally, we demonstrate the effectiveness of these optimizations through several examples using semi-definite programming tools. △ Less

Submitted 3 August, 2024; v1 submitted 13 April, 2024; originally announced April 2024.

arXiv:2403.17770 [pdf, other]

CT Synthesis with Conditional Diffusion Models for Abdominal Lymph Node Segmentation

Authors: Yongrui Yu, Hanyu Chen, Zitian Zhang, Qiong Xiao, Wenhui Lei, Linrui Dai, Yu Fu, Hui Tan, Guan Wang, Peng Gao, Xiaofan Zhang

Abstract: Despite the significant success achieved by deep learning methods in medical image segmentation, researchers still struggle in the computer-aided diagnosis of abdominal lymph nodes due to the complex abdominal environment, small and indistinguishable lesions, and limited annotated data. To address these problems, we present a pipeline that integrates the conditional diffusion model for lymph node… ▽ More Despite the significant success achieved by deep learning methods in medical image segmentation, researchers still struggle in the computer-aided diagnosis of abdominal lymph nodes due to the complex abdominal environment, small and indistinguishable lesions, and limited annotated data. To address these problems, we present a pipeline that integrates the conditional diffusion model for lymph node generation and the nnU-Net model for lymph node segmentation to improve the segmentation performance of abdominal lymph nodes through synthesizing a diversity of realistic abdominal lymph node data. We propose LN-DDPM, a conditional denoising diffusion probabilistic model (DDPM) for lymph node (LN) generation. LN-DDPM utilizes lymph node masks and anatomical structure masks as model conditions. These conditions work in two conditioning mechanisms: global structure conditioning and local detail conditioning, to distinguish between lymph nodes and their surroundings and better capture lymph node characteristics. The obtained paired abdominal lymph node images and masks are used for the downstream segmentation task. Experimental results on the abdominal lymph node datasets demonstrate that LN-DDPM outperforms other generative methods in the abdominal lymph node image synthesis and better assists the downstream abdominal lymph node segmentation task. △ Less

Submitted 26 March, 2024; originally announced March 2024.

arXiv:2403.14250 [pdf, other]

Safeguarding Medical Image Segmentation Datasets against Unauthorized Training via Contour- and Texture-Aware Perturbations

Authors: Xun Lin, Yi Yu, Song Xia, Jue Jiang, Haoran Wang, Zitong Yu, Yizhong Liu, Ying Fu, Shuai Wang, Wenzhong Tang, Alex Kot

Abstract: The widespread availability of publicly accessible medical images has significantly propelled advancements in various research and clinical fields. Nonetheless, concerns regarding unauthorized training of AI systems for commercial purposes and the duties of patient privacy protection have led numerous institutions to hesitate to share their images. This is particularly true for medical image segme… ▽ More The widespread availability of publicly accessible medical images has significantly propelled advancements in various research and clinical fields. Nonetheless, concerns regarding unauthorized training of AI systems for commercial purposes and the duties of patient privacy protection have led numerous institutions to hesitate to share their images. This is particularly true for medical image segmentation (MIS) datasets, where the processes of collection and fine-grained annotation are time-intensive and laborious. Recently, Unlearnable Examples (UEs) methods have shown the potential to protect images by adding invisible shortcuts. These shortcuts can prevent unauthorized deep neural networks from generalizing. However, existing UEs are designed for natural image classification and fail to protect MIS datasets imperceptibly as their protective perturbations are less learnable than important prior knowledge in MIS, e.g., contour and texture features. To this end, we propose an Unlearnable Medical image generation method, termed UMed. UMed integrates the prior knowledge of MIS by injecting contour- and texture-aware perturbations to protect images. Given that our target is to only poison features critical to MIS, UMed requires only minimal perturbations within the ROI and its contour to achieve greater imperceptibility (average PSNR is 50.03) and protective performance (clean average DSC degrades from 82.18% to 6.80%). △ Less

Submitted 21 March, 2024; originally announced March 2024.

arXiv:2403.12487 [pdf]

Unveiling Four Key Factors for Tire Force Control Allocation in 4WID-4WIS Electric Vehicles at Handling Limits

Authors: Ao Lu, Runfeng Li, Yunchang Yu, Ziwang Lu, Guangyu Tian

Abstract: The four-wheel independent drive and four-wheel independent steering (4WID-4WIS) configurations enhance control flexibility and dynamic performance potential for more integrated electric vehicles. This paper comprehensively analyzes the impacts of four key factors on tire force control allocation: vertical load estimation, actuator dynamic characteristics, tire force constraints, and wheel steerin… ▽ More The four-wheel independent drive and four-wheel independent steering (4WID-4WIS) configurations enhance control flexibility and dynamic performance potential for more integrated electric vehicles. This paper comprehensively analyzes the impacts of four key factors on tire force control allocation: vertical load estimation, actuator dynamic characteristics, tire force constraints, and wheel steering precision at handling limits. The study demonstrates that precise vertical load estimation enhances lateral force allocation accuracy. Additionally, the self-compensating effect of lateral tire forces minimizes the impact of small deviations in vertical load estimation on tire force control allocation. A novel control allocation method considering actuator dynamics is introduced, effectively improving yaw rate response and reducing tracking errors. Considering tire-road adhesion and actuator rate constraints, an innovative method to calculate the real-time attainable tire force volume is proposed based on the tire slip ratio and slip angle. Feedforward control with bump steer compensation is implemented to improve wheel steering precision and lateral tire force control accuracy. Matlab/Simulink and Carsim co-simulation results emphasize the importance of these key factors' individual impacts and combined effects. This analysis offers valuable insights for developing advanced tire force control allocation strategies in 4WID-4WIS electric vehicles. △ Less

Submitted 19 March, 2024; originally announced March 2024.

arXiv:2403.10384 [pdf, other]

Coordination in Noncooperative Multiplayer Matrix Games via Reduced Rank Correlated Equilibria

Authors: Jaehan Im, Yue Yu, David Fridovich-Keil, Ufuk Topcu

Abstract: Coordination in multiplayer games enables players to avoid the lose-lose outcome that often arises at Nash equilibria. However, designing a coordination mechanism typically requires the consideration of the joint actions of all players, which becomes intractable in large-scale games. We develop a novel coordination mechanism, termed reduced rank correlated equilibria, which reduces the number of j… ▽ More Coordination in multiplayer games enables players to avoid the lose-lose outcome that often arises at Nash equilibria. However, designing a coordination mechanism typically requires the consideration of the joint actions of all players, which becomes intractable in large-scale games. We develop a novel coordination mechanism, termed reduced rank correlated equilibria, which reduces the number of joint actions to be considered and thereby mitigates computational complexity. The idea is to approximate the set of all joint actions with the actions used in a set of pre-computed Nash equilibria via a convex hull operation. In a game with n players and each player having m actions, the proposed mechanism reduces the number of joint actions considered from O(m^n) to O(mn). We demonstrate the application of the proposed mechanism to an air traffic queue management problem. Compared with the correlated equilibrium-a popular benchmark coordination mechanism-the proposed approach is capable of solving a problem involving four thousand times more joint actions while yielding similar or better performance in terms of a fairness indicator and showing a maximum optimality gap of 0.066% in terms of the average delay cost. In the meantime, it yields a solution that shows up to 99.5% improvement in a fairness indicator and up to 50.4% reduction in average delay cost compared to the Nash solution, which does not involve coordination. △ Less

Submitted 12 June, 2024; v1 submitted 15 March, 2024; originally announced March 2024.

arXiv:2403.10064 [pdf, other]

Progressive Divide-and-Conquer via Subsampling Decomposition for Accelerated MRI

Authors: Chong Wang, Lanqing Guo, Yufei Wang, Hao Cheng, Yi Yu, Bihan Wen

Abstract: Deep unfolding networks (DUN) have emerged as a popular iterative framework for accelerated magnetic resonance imaging (MRI) reconstruction. However, conventional DUN aims to reconstruct all the missing information within the entire null space in each iteration. Thus it could be challenging when dealing with highly ill-posed degradation, usually leading to unsatisfactory reconstruction. In this wo… ▽ More Deep unfolding networks (DUN) have emerged as a popular iterative framework for accelerated magnetic resonance imaging (MRI) reconstruction. However, conventional DUN aims to reconstruct all the missing information within the entire null space in each iteration. Thus it could be challenging when dealing with highly ill-posed degradation, usually leading to unsatisfactory reconstruction. In this work, we propose a Progressive Divide-And-Conquer (PDAC) strategy, aiming to break down the subsampling process in the actual severe degradation and thus perform reconstruction sequentially. Starting from decomposing the original maximum-a-posteriori problem of accelerated MRI, we present a rigorous derivation of the proposed PDAC framework, which could be further unfolded into an end-to-end trainable network. Specifically, each iterative stage in PDAC focuses on recovering a distinct moderate degradation according to the decomposition. Furthermore, as part of the PDAC iteration, such decomposition is adaptively learned as an auxiliary task through a degradation predictor which provides an estimation of the decomposed sampling mask. Following this prediction, the sampling mask is further integrated via a severity conditioning module to ensure awareness of the degradation severity at each stage. Extensive experiments demonstrate that our proposed method achieves superior performance on the publicly available fastMRI and Stanford2D FSE datasets in both multi-coil and single-coil settings. △ Less

Submitted 15 March, 2024; originally announced March 2024.

Comments: Accepted to CVPR 2024

arXiv:2403.09693 [pdf, other]

A Constrained Deep Reinforcement Learning Optimization for Reliable Network Slicing in a Blockchain-Secured Low-Latency Wireless Network

Authors: Xin Hao, Phee Lep Yeoh, Changyang She, Yao Yu, Branka Vucetic, Yonghui Li

Abstract: Network slicing (NS) is a promising technology that supports diverse requirements for next-generation low-latency wireless communication networks. However, the tampering attack is a rising issue of jeopardizing NS service-provisioning. To resist tampering attacks in NS networks, we propose a novel optimization framework for reliable NS resource allocation in a blockchain-secured low-latency wirele… ▽ More Network slicing (NS) is a promising technology that supports diverse requirements for next-generation low-latency wireless communication networks. However, the tampering attack is a rising issue of jeopardizing NS service-provisioning. To resist tampering attacks in NS networks, we propose a novel optimization framework for reliable NS resource allocation in a blockchain-secured low-latency wireless network, where trusted base stations (BSs) with high reputations are selected for blockchain management and NS service-provisioning. For such a blockchain-secured network, we consider that the latency is measured by the summation of blockchain management and NS service-provisioning, whilst the NS reliability is evaluated by the BS denial-of-service (DoS) probability. To satisfy the requirements of both the latency and reliability, we formulate a constrained computing resource allocation optimization problem to minimize the total processing latency subject to the BS DoS probability. To efficiently solve the optimization, we design a constrained deep reinforcement learning (DRL) algorithm, which satisfies both latency and DoS probability requirements by introducing an additional critic neural network. The proposed constrained DRL further solves the issue of high input dimension by incorporating feature engineering technology. Simulation results validate the effectiveness of our approach in achieving reliable and low-latency NS service-provisioning in the considered blockchain-secured wireless network. △ Less

Submitted 16 February, 2024; originally announced March 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2312.08016

arXiv:2403.09407 [pdf, other]

LM2D: Lyrics- and Music-Driven Dance Synthesis

Authors: Wenjie Yin, Xuejiao Zhao, Yi Yu, Hang Yin, Danica Kragic, Mårten Björkman

Abstract: Dance typically involves professional choreography with complex movements that follow a musical rhythm and can also be influenced by lyrical content. The integration of lyrics in addition to the auditory dimension, enriches the foundational tone and makes motion generation more amenable to its semantic meanings. However, existing dance synthesis methods tend to model motions only conditioned on au… ▽ More Dance typically involves professional choreography with complex movements that follow a musical rhythm and can also be influenced by lyrical content. The integration of lyrics in addition to the auditory dimension, enriches the foundational tone and makes motion generation more amenable to its semantic meanings. However, existing dance synthesis methods tend to model motions only conditioned on audio signals. In this work, we make two contributions to bridge this gap. First, we propose LM2D, a novel probabilistic architecture that incorporates a multimodal diffusion model with consistency distillation, designed to create dance conditioned on both music and lyrics in one diffusion generation step. Second, we introduce the first 3D dance-motion dataset that encompasses both music and lyrics, obtained with pose estimation technologies. We evaluate our model against music-only baseline models with objective metrics and human evaluations, including dancers and choreographers. The results demonstrate LM2D is able to produce realistic and diverse dance matching both lyrics and music. A video summary can be accessed at: https://youtu.be/4XCgvYookvA. △ Less

Submitted 14 March, 2024; originally announced March 2024.

arXiv:2403.09157 [pdf, ps, other]

VM-UNET-V2 Rethinking Vision Mamba UNet for Medical Image Segmentation

Authors: Mingya Zhang, Yue Yu, Limei Gu, Tingsheng Lin, Xianping Tao

Abstract: In the field of medical image segmentation, models based on both CNN and Transformer have been thoroughly investigated. However, CNNs have limited modeling capabilities for long-range dependencies, making it challenging to exploit the semantic information within images fully. On the other hand, the quadratic computational complexity poses a challenge for Transformers. Recently, State Space Models… ▽ More In the field of medical image segmentation, models based on both CNN and Transformer have been thoroughly investigated. However, CNNs have limited modeling capabilities for long-range dependencies, making it challenging to exploit the semantic information within images fully. On the other hand, the quadratic computational complexity poses a challenge for Transformers. Recently, State Space Models (SSMs), such as Mamba, have been recognized as a promising method. They not only demonstrate superior performance in modeling long-range interactions, but also preserve a linear computational complexity. Inspired by the Mamba architecture, We proposed Vison Mamba-UNetV2, the Visual State Space (VSS) Block is introduced to capture extensive contextual information, the Semantics and Detail Infusion (SDI) is introduced to augment the infusion of low-level and high-level features. We conduct comprehensive experiments on the ISIC17, ISIC18, CVC-300, CVC-ClinicDB, Kvasir, CVC-ColonDB and ETIS-LaribPolypDB public datasets. The results indicate that VM-UNetV2 exhibits competitive performance in medical image segmentation tasks. Our code is available at https://github.com/nobodyplayer1/VM-UNetV2. △ Less

Submitted 14 March, 2024; originally announced March 2024.

Comments: 12 pages, 4 figures

arXiv:2402.17246 [pdf, other]

SDR-Former: A Siamese Dual-Resolution Transformer for Liver Lesion Classification Using 3D Multi-Phase Imaging

Authors: Meng Lou, Hanning Ying, Xiaoqing Liu, Hong-Yu Zhou, Yuqing Zhang, Yizhou Yu

Abstract: Automated classification of liver lesions in multi-phase CT and MR scans is of clinical significance but challenging. This study proposes a novel Siamese Dual-Resolution Transformer (SDR-Former) framework, specifically designed for liver lesion classification in 3D multi-phase CT and MR imaging with varying phase counts. The proposed SDR-Former utilizes a streamlined Siamese Neural Network (SNN) t… ▽ More Automated classification of liver lesions in multi-phase CT and MR scans is of clinical significance but challenging. This study proposes a novel Siamese Dual-Resolution Transformer (SDR-Former) framework, specifically designed for liver lesion classification in 3D multi-phase CT and MR imaging with varying phase counts. The proposed SDR-Former utilizes a streamlined Siamese Neural Network (SNN) to process multi-phase imaging inputs, possessing robust feature representations while maintaining computational efficiency. The weight-sharing feature of the SNN is further enriched by a hybrid Dual-Resolution Transformer (DR-Former), comprising a 3D Convolutional Neural Network (CNN) and a tailored 3D Transformer for processing high- and low-resolution images, respectively. This hybrid sub-architecture excels in capturing detailed local features and understanding global contextual information, thereby, boosting the SNN's feature extraction capabilities. Additionally, a novel Adaptive Phase Selection Module (APSM) is introduced, promoting phase-specific intercommunication and dynamically adjusting each phase's influence on the diagnostic outcome. The proposed SDR-Former framework has been validated through comprehensive experiments on two clinical datasets: a three-phase CT dataset and an eight-phase MR dataset. The experimental results affirm the efficacy of the proposed framework. To support the scientific community, we are releasing our extensive multi-phase MR dataset for liver lesion analysis to the public. This pioneering dataset, being the first publicly available multi-phase MR dataset in this field, also underpins the MICCAI LLD-MMRI Challenge. The dataset is accessible at:https://bit.ly/3IyYlgN. △ Less

Submitted 27 February, 2024; originally announced February 2024.

Comments: 13 pages, 7 figures

arXiv:2402.03302 [pdf, other]

Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining

Authors: Jiarun Liu, Hao Yang, Hong-Yu Zhou, Yan Xi, Lequan Yu, Yizhou Yu, Yong Liang, Guangming Shi, Shaoting Zhang, Hairong Zheng, Shanshan Wang

Abstract: Accurate medical image segmentation demands the integration of multi-scale information, spanning from local features to global dependencies. However, it is challenging for existing methods to model long-range global information, where convolutional neural networks (CNNs) are constrained by their local receptive fields, and vision transformers (ViTs) suffer from high quadratic complexity of their a… ▽ More Accurate medical image segmentation demands the integration of multi-scale information, spanning from local features to global dependencies. However, it is challenging for existing methods to model long-range global information, where convolutional neural networks (CNNs) are constrained by their local receptive fields, and vision transformers (ViTs) suffer from high quadratic complexity of their attention mechanism. Recently, Mamba-based models have gained great attention for their impressive ability in long sequence modeling. Several studies have demonstrated that these models can outperform popular vision models in various tasks, offering higher accuracy, lower memory consumption, and less computational burden. However, existing Mamba-based models are mostly trained from scratch and do not explore the power of pretraining, which has been proven to be quite effective for data-efficient medical image analysis. This paper introduces a novel Mamba-based model, Swin-UMamba, designed specifically for medical image segmentation tasks, leveraging the advantages of ImageNet-based pretraining. Our experimental results reveal the vital role of ImageNet-based training in enhancing the performance of Mamba-based models. Swin-UMamba demonstrates superior performance with a large margin compared to CNNs, ViTs, and latest Mamba-based models. Notably, on AbdomenMRI, Encoscopy, and Microscopy datasets, Swin-UMamba outperforms its closest counterpart U-Mamba_Enc by an average score of 2.72%. △ Less

Submitted 6 March, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

Comments: Code and models of Swin-UMamba are publicly available at: https://github.com/JiarunLiu/Swin-UMamba

arXiv:2401.17619 [pdf, ps, other]

Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and ACE-KiSing

Authors: Jiatong Shi, Yueqian Lin, Xinyi Bai, Keyi Zhang, Yuning Wu, Yuxun Tang, Yifeng Yu, Qin Jin, Shinji Watanabe

Abstract: In singing voice synthesis (SVS), generating singing voices from musical scores faces challenges due to limited data availability. This study proposes a unique strategy to address the data scarcity in SVS. We employ an existing singing voice synthesizer for data augmentation, complemented by detailed manual tuning, an approach not previously explored in data curation, to reduce instances of unnatu… ▽ More In singing voice synthesis (SVS), generating singing voices from musical scores faces challenges due to limited data availability. This study proposes a unique strategy to address the data scarcity in SVS. We employ an existing singing voice synthesizer for data augmentation, complemented by detailed manual tuning, an approach not previously explored in data curation, to reduce instances of unnatural voice synthesis. This innovative method has led to the creation of two expansive singing voice datasets, ACE-Opencpop and ACE-KiSing, which are instrumental for large-scale, multi-singer voice synthesis. Through thorough experimentation, we establish that these datasets not only serve as new benchmarks for SVS but also enhance SVS performance on other singing voice datasets when used as supplementary resources. The corpora, pre-trained models, and their related training recipes are publicly available at ESPnet-Muskits (\url{https://github.com/espnet/espnet}) △ Less

Submitted 12 June, 2024; v1 submitted 31 January, 2024; originally announced January 2024.

Comments: Accepted by Interspeech2024

arXiv:2401.10447 [pdf, other]

Investigating Training Strategies and Model Robustness of Low-Rank Adaptation for Language Modeling in Speech Recognition

Authors: Yu Yu, Chao-Han Huck Yang, Tuan Dinh, Sungho Ryu, Jari Kolehmainen, Roger Ren, Denis Filimonov, Prashanth G. Shivakumar, Ankur Gandhe, Ariya Rastow, Jia Xu, Ivan Bulyko, Andreas Stolcke

Abstract: The use of low-rank adaptation (LoRA) with frozen pretrained language models (PLMs) has become increasing popular as a mainstream, resource-efficient modeling approach for memory-constrained hardware. In this study, we first explore how to enhance model performance by introducing various LoRA training strategies, achieving relative word error rate reductions of 3.50\% on the public Librispeech dat… ▽ More The use of low-rank adaptation (LoRA) with frozen pretrained language models (PLMs) has become increasing popular as a mainstream, resource-efficient modeling approach for memory-constrained hardware. In this study, we first explore how to enhance model performance by introducing various LoRA training strategies, achieving relative word error rate reductions of 3.50\% on the public Librispeech dataset and of 3.67\% on an internal dataset in the messaging domain. To further characterize the stability of LoRA-based second-pass speech recognition models, we examine robustness against input perturbations. These perturbations are rooted in homophone replacements and a novel metric called N-best Perturbation-based Rescoring Robustness (NPRR), both designed to measure the relative degradation in the performance of rescoring models. Our experimental results indicate that while advanced variants of LoRA, such as dynamic rank-allocated LoRA, lead to performance degradation in $1$-best perturbation, they alleviate the degradation in $N$-best perturbation. This finding is in comparison to fully-tuned models and vanilla LoRA tuning baselines, suggesting that a comprehensive selection is needed when using LoRA-based adaptation for compute-cost savings and robust language modeling. △ Less

Submitted 18 January, 2024; originally announced January 2024.

arXiv:2401.06224 [pdf, other]

Leveraging Frequency Domain Learning in 3D Vessel Segmentation

Authors: Xinyuan Wang, Chengwei Pan, Hongming Dai, Gangming Zhao, Jinpeng Li, Xiao Zhang, Yizhou Yu

Abstract: Coronary microvascular disease constitutes a substantial risk to human health. Employing computer-aided analysis and diagnostic systems, medical professionals can intervene early in disease progression, with 3D vessel segmentation serving as a crucial component. Nevertheless, conventional U-Net architectures tend to yield incoherent and imprecise segmentation outcomes, particularly for small vesse… ▽ More Coronary microvascular disease constitutes a substantial risk to human health. Employing computer-aided analysis and diagnostic systems, medical professionals can intervene early in disease progression, with 3D vessel segmentation serving as a crucial component. Nevertheless, conventional U-Net architectures tend to yield incoherent and imprecise segmentation outcomes, particularly for small vessel structures. While models with attention mechanisms, such as Transformers and large convolutional kernels, demonstrate superior performance, their extensive computational demands during training and inference lead to increased time complexity. In this study, we leverage Fourier domain learning as a substitute for multi-scale convolutional kernels in 3D hierarchical segmentation models, which can reduce computational expenses while preserving global receptive fields within the network. Furthermore, a zero-parameter frequency domain fusion method is designed to improve the skip connections in U-Net architecture. Experimental results on a public dataset and an in-house dataset indicate that our novel Fourier transformation-based network achieves remarkable dice performance (84.37\% on ASACA500 and 80.32\% on ImageCAS) in tubular vessel segmentation tasks and substantially reduces computational requirements without compromising global receptive fields. △ Less

Submitted 11 January, 2024; originally announced January 2024.

arXiv:2401.02884 [pdf]

MsDC-DEQ-Net: Deep Equilibrium Model (DEQ) with Multi-scale Dilated Convolution for Image Compressive Sensing (CS)

Authors: Youhao Yu, Richard M. Dansereau

Abstract: Compressive sensing (CS) is a technique that enables the recovery of sparse signals using fewer measurements than traditional sampling methods. To address the computational challenges of CS reconstruction, our objective is to develop an interpretable and concise neural network model for reconstructing natural images using CS. We achieve this by mapping one step of the iterative shrinkage threshold… ▽ More Compressive sensing (CS) is a technique that enables the recovery of sparse signals using fewer measurements than traditional sampling methods. To address the computational challenges of CS reconstruction, our objective is to develop an interpretable and concise neural network model for reconstructing natural images using CS. We achieve this by mapping one step of the iterative shrinkage thresholding algorithm (ISTA) to a deep network block, representing one iteration of ISTA. To enhance learning ability and incorporate structural diversity, we integrate aggregated residual transformations (ResNeXt) and squeeze-and-excitation (SE) mechanisms into the ISTA block. This block serves as a deep equilibrium layer, connected to a semi-tensor product network (STP-Net) for convenient sampling and providing an initial reconstruction. The resulting model, called MsDC-DEQ-Net, exhibits competitive performance compared to state-of-the-art network-based methods. It significantly reduces storage requirements compared to deep unrolling methods, using only one iteration block instead of multiple iterations. Unlike deep unrolling models, MsDC-DEQ-Net can be iteratively used, gradually improving reconstruction accuracy while considering computation trade-offs. Additionally, the model benefits from multi-scale dilated convolutions, further enhancing performance. △ Less

Submitted 5 January, 2024; originally announced January 2024.

Comments: 15 pages, 8 figures, open access journal paper

arXiv:2401.00806 [pdf, other]

Noise-Aware and Equitable Urban Air Traffic Management: An Optimization Approach

Authors: Zhenyu Gao, Yue Yu, Qinshuang Wei, Ufuk Topcu, John-Paul Clarke

Abstract: Urban air mobility (UAM), a transformative concept for the transport of passengers and cargo, faces several integration challenges in complex urban environments. Community acceptance of aircraft noise is among the most noticeable of these challenges when launching or scaling up a UAM system. Properly managing community noise is fundamental to establishing a UAM system that is environmentally and s… ▽ More Urban air mobility (UAM), a transformative concept for the transport of passengers and cargo, faces several integration challenges in complex urban environments. Community acceptance of aircraft noise is among the most noticeable of these challenges when launching or scaling up a UAM system. Properly managing community noise is fundamental to establishing a UAM system that is environmentally and socially sustainable. In this work, we develop a holistic and equitable approach to manage UAM air traffic and its community noise impact in urban environments. The proposed approach is a hybrid approach that considers a mix of different noise mitigation strategies, including limiting the number of operations, cruising at higher altitudes, and ambient noise masking. We tackle the problem through the lens of network system control and formulate a multi-objective optimization model for managing traffic flow in a multi-layer UAM network while concurrently pursuing demand fulfillment, noise control, and energy saving. Further, we use a social welfare function in the optimization model as the basis for the efficiency-fairness trade-off in both demand fulfillment and noise control. We apply the proposed approach to a comprehensive case study in the city of Austin and perform design trade-offs through both visual and quantitative analyses. △ Less

Submitted 1 January, 2024; originally announced January 2024.

Comments: 30 pages, 15 figures

arXiv:2310.09603 [pdf, other]

B-Spine: Learning B-Spline Curve Representation for Robust and Interpretable Spinal Curvature Estimation

Authors: Hao Wang, Qiang Song, Ruofeng Yin, Rui Ma, Yizhou Yu, Yi Chang

Abstract: Spinal curvature estimation is important to the diagnosis and treatment of the scoliosis. Existing methods face several issues such as the need of expensive annotations on the vertebral landmarks and being sensitive to the image quality. It is challenging to achieve robust estimation and obtain interpretable results, especially for low-quality images which are blurry and hazy. In this paper, we pr… ▽ More Spinal curvature estimation is important to the diagnosis and treatment of the scoliosis. Existing methods face several issues such as the need of expensive annotations on the vertebral landmarks and being sensitive to the image quality. It is challenging to achieve robust estimation and obtain interpretable results, especially for low-quality images which are blurry and hazy. In this paper, we propose B-Spine, a novel deep learning pipeline to learn B-spline curve representation of the spine and estimate the Cobb angles for spinal curvature estimation from low-quality X-ray images. Given a low-quality input, a novel SegRefine network which employs the unpaired image-to-image translation is proposed to generate a high quality spine mask from the initial segmentation result. Next, a novel mask-based B-spline prediction model is proposed to predict the B-spline curve for the spine centerline. Finally, the Cobb angles are estimated by a hybrid approach which combines the curve slope analysis and a curve-based regression model. We conduct quantitative and qualitative comparisons with the representative and SOTA learning-based methods on the public AASCE2019 dataset and our new proposed CJUH-JLU dataset which contains more challenging low-quality images. The superior performance on both datasets shows our method can achieve both robustness and interpretability for spinal curvature estimation. △ Less

Submitted 14 October, 2023; originally announced October 2023.

arXiv:2310.05368 [pdf, other]

Measuring Acoustics with Collaborative Multiple Agents

Authors: Yinfeng Yu, Changan Chen, Lele Cao, Fangkai Yang, Fuchun Sun

Abstract: As humans, we hear sound every second of our life. The sound we hear is often affected by the acoustics of the environment surrounding us. For example, a spacious hall leads to more reverberation. Room Impulse Responses (RIR) are commonly used to characterize environment acoustics as a function of the scene geometry, materials, and source/receiver locations. Traditionally, RIRs are measured by set… ▽ More As humans, we hear sound every second of our life. The sound we hear is often affected by the acoustics of the environment surrounding us. For example, a spacious hall leads to more reverberation. Room Impulse Responses (RIR) are commonly used to characterize environment acoustics as a function of the scene geometry, materials, and source/receiver locations. Traditionally, RIRs are measured by setting up a loudspeaker and microphone in the environment for all source/receiver locations, which is time-consuming and inefficient. We propose to let two robots measure the environment's acoustics by actively moving and emitting/receiving sweep signals. We also devise a collaborative multi-agent policy where these two robots are trained to explore the environment's acoustics while being rewarded for wide exploration and accurate prediction. We show that the robots learn to collaborate and move to explore environment acoustics while minimizing the prediction error. To the best of our knowledge, we present the very first problem formulation and solution to the task of collaborative environment acoustics measurements with multiple agents. △ Less

Submitted 8 October, 2023; originally announced October 2023.

Comments: Main paper (9 pages and 5 figures and 2 tables) and appendix (16 pages and 13 figures and 10 tables). Accepted for publication by IJCAI 2023

arXiv:2310.00455 [pdf, other]

Music- and Lyrics-driven Dance Synthesis

Authors: Wenjie Yin, Qingyuan Yao, Yi Yu, Hang Yin, Danica Kragic, Mårten Björkman

Abstract: Lyrics often convey information about the songs that are beyond the auditory dimension, enriching the semantic meaning of movements and musical themes. Such insights are important in the dance choreography domain. However, most existing dance synthesis methods mainly focus on music-to-dance generation, without considering the semantic information. To complement it, we introduce JustLMD, a new mult… ▽ More Lyrics often convey information about the songs that are beyond the auditory dimension, enriching the semantic meaning of movements and musical themes. Such insights are important in the dance choreography domain. However, most existing dance synthesis methods mainly focus on music-to-dance generation, without considering the semantic information. To complement it, we introduce JustLMD, a new multimodal dataset of 3D dance motion with music and lyrics. To the best of our knowledge, this is the first dataset with triplet information including dance motion, music, and lyrics. Additionally, we showcase a cross-modal diffusion-based network designed to generate 3D dance motion conditioned on music and lyrics. The proposed JustLMD dataset encompasses 4.6 hours of 3D dance motion in 1867 sequences, accompanied by musical tracks and their corresponding English lyrics. △ Less

Submitted 30 September, 2023; originally announced October 2023.

arXiv:2309.15223 [pdf, other]

doi 10.1109/ASRU57964.2023.10389632

Low-rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition

Authors: Yu Yu, Chao-Han Huck Yang, Jari Kolehmainen, Prashanth G. Shivakumar, Yile Gu, Sungho Ryu, Roger Ren, Qi Luo, Aditya Gourav, I-Fan Chen, Yi-Chieh Liu, Tuan Dinh, Ankur Gandhe, Denis Filimonov, Shalini Ghosh, Andreas Stolcke, Ariya Rastow, Ivan Bulyko

Abstract: We propose a neural language modeling system based on low-rank adaptation (LoRA) for speech recognition output rescoring. Although pretrained language models (LMs) like BERT have shown superior performance in second-pass rescoring, the high computational cost of scaling up the pretraining stage and adapting the pretrained models to specific domains limit their practical use in rescoring. Here we p… ▽ More We propose a neural language modeling system based on low-rank adaptation (LoRA) for speech recognition output rescoring. Although pretrained language models (LMs) like BERT have shown superior performance in second-pass rescoring, the high computational cost of scaling up the pretraining stage and adapting the pretrained models to specific domains limit their practical use in rescoring. Here we present a method based on low-rank decomposition to train a rescoring BERT model and adapt it to new domains using only a fraction (0.08%) of the pretrained parameters. These inserted matrices are optimized through a discriminative training objective along with a correlation-based regularization loss. The proposed low-rank adaptation Rescore-BERT (LoRB) architecture is evaluated on LibriSpeech and internal datasets with decreased training times by factors between 5.4 and 3.6. △ Less

Submitted 10 October, 2023; v1 submitted 26 September, 2023; originally announced September 2023.

Comments: Accepted to IEEE ASRU 2023. Internal Review Approved. Revised 2nd version with Andreas and Huck. The first version is in Sep 29th. 8 pages

Journal ref: Proc. IEEE ASRU Workshop, Dec. 2023

arXiv:2309.09409 [pdf]

Improving Axial Resolution of Optical Resolution Photoacoustic Microscopy with Advanced Frequency Domain Eigenspace Based Minimum Variance Beamforming Method

Authors: Yu-Hsiang Yu, Meng-Lin Li

Abstract: Optical resolution photoacoustic microscopy (OR-PAM) leverages optical focusing and acoustic detection for microscopic optical absorption imaging. Intrinsically it owns high optical lateral resolution and poor acoustic axial resolution. Such anisometric resolution hinders good 3-D visualization; thus 2-D maximum amplitude projection images are commonly presented in the literature. Since its axial… ▽ More Optical resolution photoacoustic microscopy (OR-PAM) leverages optical focusing and acoustic detection for microscopic optical absorption imaging. Intrinsically it owns high optical lateral resolution and poor acoustic axial resolution. Such anisometric resolution hinders good 3-D visualization; thus 2-D maximum amplitude projection images are commonly presented in the literature. Since its axial resolution is limited by the bandwidth of acoustic detectors, ultrahigh frequency, and wideband detectors with Wiener deconvolution have been proposed to address this issue. Nonetheless, they also introduce other issues such as severe high-frequency attenuation and limited imaging depth. In this work, we view axial resolution improvement as an axial signal reconstruction problem, and the axial resolution degradation is caused by axial sidelobe interference. We propose an advanced frequency-domain eigenspace-based minimum variance (F-EIBMV) beamforming technique to suppress axial sidelobe interference and noises. This method can simultaneously enhance the axial resolution and contrast of OR-PAM. For a 25-MHz OR-PAM system, the full-width at half-maximum of an axial point spread function decreased significantly from 69.3 $μ$m to 16.89 $μ$m, indicating a significant improvement in axial resolution. △ Less

Submitted 17 September, 2023; originally announced September 2023.

arXiv:2308.13789 [pdf]

Sensiverse: A dataset for ISAC study

Authors: Jiajin Luo, Baojian Zhou, Yang Yu, Ping Zhang, Xiaohui Peng, Jianglei Ma, Peiying Zhu, Jianmin Lu, Wen Tong

Abstract: In order to address the lack of applicable channel models for ISAC research and evaluation, we release Sensiverse, a dataset that can be used for ISAC research. In this paper, we present the method of generating Sensiverse, including the acquisition and formatting of the 3D scene models, the generation of the channel data and associations with Tx/Rx deployment. The file structure and usage of the… ▽ More In order to address the lack of applicable channel models for ISAC research and evaluation, we release Sensiverse, a dataset that can be used for ISAC research. In this paper, we present the method of generating Sensiverse, including the acquisition and formatting of the 3D scene models, the generation of the channel data and associations with Tx/Rx deployment. The file structure and usage of the dataset are also described, and finally the use of the dataset is illustrated with examples through the evaluation of use cases such as 3D environment reconstruction and moving targets. △ Less

Submitted 26 August, 2023; originally announced August 2023.

arXiv:2308.11644 [pdf, other]

Synergistic Signal Denoising for Multimodal Time Series of Structure Vibration

Authors: Yang Yu, Han Chen

Abstract: Structural Health Monitoring (SHM) plays an indispensable role in ensuring the longevity and safety of infrastructure. With the rapid growth of sensor technology, the volume of data generated from various structures has seen an unprecedented surge, bringing forth challenges in efficient analysis and interpretation. This paper introduces a novel deep learning algorithm tailored for the complexities… ▽ More Structural Health Monitoring (SHM) plays an indispensable role in ensuring the longevity and safety of infrastructure. With the rapid growth of sensor technology, the volume of data generated from various structures has seen an unprecedented surge, bringing forth challenges in efficient analysis and interpretation. This paper introduces a novel deep learning algorithm tailored for the complexities inherent in multimodal vibration signals prevalent in SHM. By amalgamating convolutional and recurrent architectures, the algorithm adeptly captures both localized and prolonged structural behaviors. The pivotal integration of attention mechanisms further enhances the model's capability, allowing it to discern and prioritize salient structural responses from extraneous noise. Our results showcase significant improvements in predictive accuracy, early damage detection, and adaptability across multiple SHM scenarios. In light of the critical nature of SHM, the proposed approach not only offers a robust analytical tool but also paves the way for more transparent and interpretable AI-driven SHM solutions. Future prospects include real-time processing, integration with external environmental factors, and a deeper emphasis on model interpretability. △ Less

Submitted 16 August, 2023; originally announced August 2023.

arXiv:2308.08197 [pdf, other]

Self-Reference Deep Adaptive Curve Estimation for Low-Light Image Enhancement

Authors: Jianyu Wen, Chenhao Wu, Tong Zhang, Yixuan Yu, Piotr Swierczynski

Abstract: In this paper, we propose a 2-stage low-light image enhancement method called Self-Reference Deep Adaptive Curve Estimation (Self-DACE). In the first stage, we present an intuitive, lightweight, fast, and unsupervised luminance enhancement algorithm. The algorithm is based on a novel low-light enhancement curve that can be used to locally boost image brightness. We also propose a new loss function… ▽ More In this paper, we propose a 2-stage low-light image enhancement method called Self-Reference Deep Adaptive Curve Estimation (Self-DACE). In the first stage, we present an intuitive, lightweight, fast, and unsupervised luminance enhancement algorithm. The algorithm is based on a novel low-light enhancement curve that can be used to locally boost image brightness. We also propose a new loss function with a simplified physical model designed to preserve natural images' color, structure, and fidelity. We use a vanilla CNN to map each pixel through deep Adaptive Adjustment Curves (AAC) while preserving the local image structure. Secondly, we introduce the corresponding denoising scheme to remove the latent noise in the darkness. We approximately model the noise in the dark and deploy a Denoising-Net to estimate and remove the noise after the first stage. Exhaustive qualitative and quantitative analysis shows that our method outperforms existing state-of-the-art algorithms on multiple real-world datasets. △ Less

Submitted 10 September, 2023; v1 submitted 16 August, 2023; originally announced August 2023.

arXiv:2308.08017 [pdf, other]

Active Inverse Learning in Stackelberg Trajectory Games

Authors: Yue Yu, Jacob Levy, Negar Mehr, David Fridovich-Keil, Ufuk Topcu

Abstract: Game-theoretic inverse learning is the problem of inferring the players' objectives from their actions. We formulate an inverse learning problem in a Stackelberg game between a leader and a follower, where each player's action is the trajectory of a dynamical system. We propose an active inverse learning method for the leader to infer which hypothesis among a finite set of candidates describes the… ▽ More Game-theoretic inverse learning is the problem of inferring the players' objectives from their actions. We formulate an inverse learning problem in a Stackelberg game between a leader and a follower, where each player's action is the trajectory of a dynamical system. We propose an active inverse learning method for the leader to infer which hypothesis among a finite set of candidates describes the follower's objective function. Instead of using passively observed trajectories like existing methods, the proposed method actively maximizes the differences in the follower's trajectories under different hypotheses to accelerate the leader's inference. We demonstrate the proposed method in a receding-horizon repeated trajectory game. Compared with uniformly random inputs, the leader inputs provided by the proposed method accelerate the convergence of the probability of different hypotheses conditioned on the follower's trajectory by orders of magnitude. △ Less

Submitted 15 August, 2023; originally announced August 2023.

arXiv:2308.03769 [pdf, other]

Towards Integrated Traffic Control with Operating Decentralized Autonomous Organization

Authors: Shengyue Yao, Jingru Yu, Yi Yu, Jia Xu, Xingyuan Dai, Honghai Li, Fei-Yue Wang, Yilun Lin

Abstract: With a growing complexity of the intelligent traffic system (ITS), an integrated control of ITS that is capable of considering plentiful heterogeneous intelligent agents is desired. However, existing control methods based on the centralized or the decentralized scheme have not presented their competencies in considering the optimality and the scalability simultaneously. To address this issue, we p… ▽ More With a growing complexity of the intelligent traffic system (ITS), an integrated control of ITS that is capable of considering plentiful heterogeneous intelligent agents is desired. However, existing control methods based on the centralized or the decentralized scheme have not presented their competencies in considering the optimality and the scalability simultaneously. To address this issue, we propose an integrated control method based on the framework of Decentralized Autonomous Organization (DAO). The proposed method achieves a global consensus on energy consumption efficiency (ECE), meanwhile to optimize the local objectives of all involved intelligent agents, through a consensus and incentive mechanism. Furthermore, an operation algorithm is proposed regarding the issue of structural rigidity in DAO. Specifically, the proposed operation approach identifies critical agents to execute the smart contract in DAO, which ultimately extends the capability of DAO-based control. In addition, a numerical experiment is designed to examine the performance of the proposed method. The experiment results indicate that the controlled agents can achieve a consensus faster on the global objective with improved local objectives by the proposed method, compare to existing decentralized control methods. In general, the proposed method shows a great potential in developing an integrated control system in the ITS △ Less

Submitted 25 July, 2023; originally announced August 2023.

Comments: 6 pages, 6 figures. To be published in 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC)

arXiv:2308.02867 [pdf, other]

A Systematic Exploration of Joint-training for Singing Voice Synthesis

Authors: Yuning Wu, Yifeng Yu, Jiatong Shi, Tao Qian, Qin Jin

Abstract: There has been a growing interest in using end-to-end acoustic models for singing voice synthesis (SVS). Typically, these models require an additional vocoder to transform the generated acoustic features into the final waveform. However, since the acoustic model and the vocoder are not jointly optimized, a gap can exist between the two models, leading to suboptimal performance. Although a similar… ▽ More There has been a growing interest in using end-to-end acoustic models for singing voice synthesis (SVS). Typically, these models require an additional vocoder to transform the generated acoustic features into the final waveform. However, since the acoustic model and the vocoder are not jointly optimized, a gap can exist between the two models, leading to suboptimal performance. Although a similar problem has been addressed in the TTS systems by joint-training or by replacing acoustic features with a latent representation, adopting corresponding approaches to SVS is not an easy task. How to improve the joint-training of SVS systems has not been well explored. In this paper, we conduct a systematic investigation of how to better perform a joint-training of an acoustic model and a vocoder for SVS. We carry out extensive experiments and demonstrate that our joint-training strategy outperforms baselines, achieving more stable performance across different datasets while also increasing the interpretability of the entire framework. △ Less

Submitted 5 August, 2023; originally announced August 2023.

arXiv:2308.02166 [pdf, other]

Transformer-Based Denoising of Mechanical Vibration Signals

Authors: Han Chen, Yang Yu, Pengtao Li

Abstract: Mechanical vibration signal denoising is a pivotal task in various industrial applications, including system health monitoring and failure prediction. This paper introduces a novel deep learning transformer-based architecture specifically tailored for denoising mechanical vibration signals. The model leverages a Multi-Head Attention layer with 8 heads, processing input sequences of length 128, emb… ▽ More Mechanical vibration signal denoising is a pivotal task in various industrial applications, including system health monitoring and failure prediction. This paper introduces a novel deep learning transformer-based architecture specifically tailored for denoising mechanical vibration signals. The model leverages a Multi-Head Attention layer with 8 heads, processing input sequences of length 128, embedded into a 64-dimensional space. The architecture also incorporates Feed-Forward Neural Networks, Layer Normalization, and Residual Connections, resulting in enhanced recognition and extraction of essential features. Through a training process guided by the Mean Squared Error loss function and optimized using the Adam optimizer, the model demonstrates remarkable effectiveness in filtering out noise while preserving critical information related to mechanical vibrations. The specific design and choice of parameters offer a robust method adaptable to the complex nature of mechanical systems, with promising applications in industrial monitoring and maintenance. This work lays the groundwork for future exploration and optimization in the field of mechanical signal analysis and presents a significant step towards advanced and intelligent mechanical system diagnostics. △ Less

Submitted 4 August, 2023; originally announced August 2023.

Comments: 8 pages

arXiv:2307.07710 [pdf, other]

ExposureDiffusion: Learning to Expose for Low-light Image Enhancement

Authors: Yufei Wang, Yi Yu, Wenhan Yang, Lanqing Guo, Lap-Pui Chau, Alex C. Kot, Bihan Wen

Abstract: Previous raw image-based low-light image enhancement methods predominantly relied on feed-forward neural networks to learn deterministic mappings from low-light to normally-exposed images. However, they failed to capture critical distribution information, leading to visually undesirable results. This work addresses the issue by seamlessly integrating a diffusion model with a physics-based exposure… ▽ More Previous raw image-based low-light image enhancement methods predominantly relied on feed-forward neural networks to learn deterministic mappings from low-light to normally-exposed images. However, they failed to capture critical distribution information, leading to visually undesirable results. This work addresses the issue by seamlessly integrating a diffusion model with a physics-based exposure model. Different from a vanilla diffusion model that has to perform Gaussian denoising, with the injected physics-based exposure model, our restoration process can directly start from a noisy image instead of pure noise. As such, our method obtains significantly improved performance and reduced inference time compared with vanilla diffusion models. To make full use of the advantages of different intermediate steps, we further propose an adaptive residual layer that effectively screens out the side-effect in the iterative refinement when the intermediate results have been already well-exposed. The proposed framework can work with both real-paired datasets, SOTA noise models, and different backbone networks. Note that, the proposed framework is compatible with real-paired datasets, real/synthetic noise models, and different backbone networks. We evaluate the proposed method on various public benchmarks, achieving promising results with consistent improvements using different exposure models and backbones. Besides, the proposed method achieves better generalization capacity for unseen amplifying ratios and better performance than a larger feedforward neural model when few parameters are adopted. △ Less

Submitted 15 August, 2023; v1 submitted 15 July, 2023; originally announced July 2023.

Comments: accepted by ICCV2023

arXiv:2307.05884 [pdf, other]

Learning Koopman Operators with Control Using Bi-level Optimization

Authors: Daning Huang, Muhammad Bayu Prasetyo, Yin Yu, Junyi Geng

Abstract: The accurate modeling and control of nonlinear dynamical effects are crucial for numerous robotic systems. The Koopman formalism emerges as a valuable tool for linear control design in nonlinear systems within unknown environments. However, it still remains a challenging task to learn the Koopman operator with control from data, and in particular, the simultaneous identification of the Koopman lin… ▽ More The accurate modeling and control of nonlinear dynamical effects are crucial for numerous robotic systems. The Koopman formalism emerges as a valuable tool for linear control design in nonlinear systems within unknown environments. However, it still remains a challenging task to learn the Koopman operator with control from data, and in particular, the simultaneous identification of the Koopman linear dynamics and the mapping between the physical and Koopman states. Conventionally, the simultaneous learning of the dynamics and mapping is achieved via single-level optimization based on one-step or multi-step discrete-time predictions, but the learned model may lack model robustness, training efficiency, and/or long-term predictive accuracy. This paper presents a bi-level optimization framework that jointly learns the Koopman embedding mapping and Koopman dynamics with exact long-term dynamical constraints. Our formulation allows back-propagation in standard learning framework and the use of state-of-the-art optimizers, yielding more accurate and stable system prediction in long-time horizon over various applications compared to conventional methods. △ Less

Submitted 5 November, 2023; v1 submitted 11 July, 2023; originally announced July 2023.

Comments: Accepted by 2023 IEEE 62nd Conference on Decision and Control (CDC)

arXiv:2306.12058 [pdf, other]

Beyond Learned Metadata-based Raw Image Reconstruction

Authors: Yufei Wang, Yi Yu, Wenhan Yang, Lanqing Guo, Lap-Pui Chau, Alex C. Kot, Bihan Wen

Abstract: While raw images have distinct advantages over sRGB images, e.g., linearity and fine-grained quantization levels, they are not widely adopted by general users due to their substantial storage requirements. Very recent studies propose to compress raw images by designing sampling masks within the pixel space of the raw image. However, these approaches often leave space for pursuing more effective im… ▽ More While raw images have distinct advantages over sRGB images, e.g., linearity and fine-grained quantization levels, they are not widely adopted by general users due to their substantial storage requirements. Very recent studies propose to compress raw images by designing sampling masks within the pixel space of the raw image. However, these approaches often leave space for pursuing more effective image representations and compact metadata. In this work, we propose a novel framework that learns a compact representation in the latent space, serving as metadata, in an end-to-end manner. Compared with lossy image compression, we analyze the intrinsic difference of the raw image reconstruction task caused by rich information from the sRGB image. Based on the analysis, a novel backbone design with asymmetric and hybrid spatial feature resolutions is proposed, which significantly improves the rate-distortion performance. Besides, we propose a novel design of the context model, which can better predict the order masks of encoding/decoding based on both the sRGB image and the masks of already processed features. Benefited from the better modeling of the correlation between order masks, the already processed information can be better utilized. Moreover, a novel sRGB-guided adaptive quantization precision strategy, which dynamically assigns varying levels of quantization precision to different regions, further enhances the representation ability of the model. Finally, based on the iterative properties of the proposed context model, we propose a novel strategy to achieve variable bit rates using a single model. This strategy allows for the continuous convergence of a wide range of bit rates. Extensive experimental results demonstrate that the proposed method can achieve better reconstruction quality with a smaller metadata size. △ Less

Submitted 21 June, 2023; originally announced June 2023.

arXiv:2306.02613 [pdf, other]

Controllable Lyrics-to-Melody Generation

Authors: Zhe Zhang, Yi Yu, Atsuhiro Takasu

Abstract: Lyrics-to-melody generation is an interesting and challenging topic in AI music research field. Due to the difficulty of learning the correlations between lyrics and melody, previous methods suffer from low generation quality and lack of controllability. Controllability of generative models enables human interaction with models to generate desired contents, which is especially important in music g… ▽ More Lyrics-to-melody generation is an interesting and challenging topic in AI music research field. Due to the difficulty of learning the correlations between lyrics and melody, previous methods suffer from low generation quality and lack of controllability. Controllability of generative models enables human interaction with models to generate desired contents, which is especially important in music generation tasks towards human-centered AI that can facilitate musicians in creative activities. To address these issues, we propose a controllable lyrics-to-melody generation network, ConL2M, which is able to generate realistic melodies from lyrics in user-desired musical style. Our work contains three main novelties: 1) To model the dependencies of music attributes cross multiple sequences, inter-branch memory fusion (Memofu) is proposed to enable information flow between multi-branch stacked LSTM architecture; 2) Reference style embedding (RSE) is proposed to improve the quality of generation as well as control the musical style of generated melodies; 3) Sequence-level statistical loss (SeqLoss) is proposed to help the model learn sequence-level features of melodies given lyrics. Verified by evaluation metrics for music quality and controllability, initial study of controllable lyrics-to-melody generation shows better generation quality and the feasibility of interacting with users to generate the melodies in desired musical styles when given lyrics. △ Less

Submitted 5 June, 2023; originally announced June 2023.

arXiv:2306.00408 [pdf, other]

Pursuing Equilibrium of Medical Resources via Data Empowerment in Parallel Healthcare System

Authors: Yi Yu, Shengyue Yao, Kexin Wang, Yan Chen, Fei-Yue Wang, Yilun Lin

Abstract: The imbalance between the supply and demand of healthcare resources is a global challenge, which is particularly severe in developing countries. Governments and academic communities have made various efforts to increase healthcare supply and improve resource allocation. However, these efforts often remain passive and inflexible. Alongside these issues, the emergence of the parallel healthcare syst… ▽ More The imbalance between the supply and demand of healthcare resources is a global challenge, which is particularly severe in developing countries. Governments and academic communities have made various efforts to increase healthcare supply and improve resource allocation. However, these efforts often remain passive and inflexible. Alongside these issues, the emergence of the parallel healthcare system has the potential to solve these problems by unlocking the data value. The parallel healthcare system comprises Medicine-Oriented Operating Systems (MOOS), Medicine-Oriented Scenario Engineering (MOSE), and Medicine-Oriented Large Models (MOLMs), which could collect, circulate, and empower data. In this paper, we propose that achieving equilibrium in medical resource allocation is possible through parallel healthcare systems via data empowerment. The supply-demand relationship can be balanced in parallel healthcare systems by (1) increasing the supply provided by digital and robotic doctors in MOOS, (2) identifying individual and potential demands by proactive diagnosis and treatment in MOSE, and (3) improving supply-demand matching using large models in MOLMs. To illustrate the effectiveness of this approach, we present a case study optimizing resource allocation from the perspective of facility accessibility. Results demonstrate that the parallel healthcare system could result in up to 300% improvement in accessibility. △ Less

Submitted 1 June, 2023; originally announced June 2023.

arXiv:2305.07110 [pdf, other]

Dynamic Routing in Stochastic Urban Air Mobility Networks: A Markov Decision Process Approach

Authors: Qinshuang Wei, Yue Yu, Ufuk Topcu

Abstract: Urban air mobility (UAM) is an emerging concept in short-range aviation transportation, where the aircraft will take off, land, and charge their batteries at a set of vertistops, and travel only through a set of flight corridors connecting these vertistops. We study the problem of routing an electric aircraft from its origin vertistop to its destination vertistop with the minimal expected total tr… ▽ More Urban air mobility (UAM) is an emerging concept in short-range aviation transportation, where the aircraft will take off, land, and charge their batteries at a set of vertistops, and travel only through a set of flight corridors connecting these vertistops. We study the problem of routing an electric aircraft from its origin vertistop to its destination vertistop with the minimal expected total travel time. We first introduce a UAM network model that accounts for the limited battery capacity of aircraft, stochastic travel times of flight corridors, stochastic queueing delays, and a limited number of battery-charging stations at vertistops. Based on this model, we provide a sufficient condition for the existence of a routing strategy that avoids battery exhaustion. Furthermore, we show how to compute such a strategy by computing the optimal policy in a Markov decision process, a mathematical framework for decision-making in a stochastic dynamic environment. We illustrate our results using a case study with 29 vertistops and 137 flight corridors. △ Less

Submitted 11 May, 2023; originally announced May 2023.

Comments: 8 pages, 3 figures

Showing 1–50 of 231 results for author: Yu, Y