-
LV-UNet: A Lightweight and Vanilla Model for Medical Image Segmentation
Authors:
Juntao Jiang,
Mengmeng Wang,
Huizhong Tian,
Lingbo Cheng,
Yong Liu
Abstract:
Although the progress made by large models in computer vision, optimization challenges, the complexity of transformer models, computational limitations, and the requirements of practical applications call for simpler designs in model architecture for medical image segmentation, especially in mobile medical devices that require lightweight and deployable models with real-time performance. However,…
▽ More
Although the progress made by large models in computer vision, optimization challenges, the complexity of transformer models, computational limitations, and the requirements of practical applications call for simpler designs in model architecture for medical image segmentation, especially in mobile medical devices that require lightweight and deployable models with real-time performance. However, some of the current lightweight models exhibit poor robustness across different datasets, which hinders their broader adoption. This paper proposes a lightweight and vanilla model called LV-UNet, which effectively utilizes pre-trained MobileNetv3-Large models and introduces fusible modules. It can be trained using an improved deep training strategy and switched to deployment mode during inference, reducing both parameter count and computational load. Experiments are conducted on ISIC 2016, BUSI, CVC- ClinicDB, CVC-ColonDB, and Kvair-SEG datasets, achieving better performance compared to the state-of-the-art and classic models.
△ Less
Submitted 29 August, 2024;
originally announced August 2024.
-
SDI-Net: Toward Sufficient Dual-View Interaction for Low-light Stereo Image Enhancement
Authors:
Linlin Hu,
Ao Sun,
Shijie Hao,
Richang Hong,
Meng Wang
Abstract:
Currently, most low-light image enhancement methods only consider information from a single view, neglecting the correlation between cross-view information. Therefore, the enhancement results produced by these methods are often unsatisfactory. In this context, there have been efforts to develop methods specifically for low-light stereo image enhancement. These methods take into account the cross-v…
▽ More
Currently, most low-light image enhancement methods only consider information from a single view, neglecting the correlation between cross-view information. Therefore, the enhancement results produced by these methods are often unsatisfactory. In this context, there have been efforts to develop methods specifically for low-light stereo image enhancement. These methods take into account the cross-view disparities and enable interaction between the left and right views, leading to improved performance. However, these methods still do not fully exploit the interaction between left and right view information. To address this issue, we propose a model called Toward Sufficient Dual-View Interaction for Low-light Stereo Image Enhancement (SDI-Net). The backbone structure of SDI-Net is two encoder-decoder pairs, which are used to learn the mapping function from low-light images to normal-light images. Among the encoders and the decoders, we design a module named Cross-View Sufficient Interaction Module (CSIM), aiming to fully exploit the correlations between the binocular views via the attention mechanism. The quantitative and visual results on public datasets validate the superiority of our method over other related methods. Ablation studies also demonstrate the effectiveness of the key elements in our model.
△ Less
Submitted 20 August, 2024;
originally announced August 2024.
-
Robust Simultaneous Multislice MRI Reconstruction Using Deep Generative Priors
Authors:
Shoujin Huang,
Guanxiong Luo,
Yuwan Wang,
Kexin Yang,
Lingyan Zhang,
Jingzhe Liu,
Hua Guo,
Min Wang,
Mengye Lyu
Abstract:
Simultaneous multislice (SMS) imaging is a powerful technique for accelerating magnetic resonance imaging (MRI) acquisitions. However, SMS reconstruction remains challenging due to the complex signal interactions between and within the excited slices. This study presents a robust SMS MRI reconstruction method using deep generative priors. Starting from Gaussian noise, we leverage denoising diffusi…
▽ More
Simultaneous multislice (SMS) imaging is a powerful technique for accelerating magnetic resonance imaging (MRI) acquisitions. However, SMS reconstruction remains challenging due to the complex signal interactions between and within the excited slices. This study presents a robust SMS MRI reconstruction method using deep generative priors. Starting from Gaussian noise, we leverage denoising diffusion probabilistic models (DDPM) to gradually recover the individual slices through reverse diffusion iterations while imposing data consistency from the measured k-space under readout concatenation framework. The posterior sampling procedure is designed such that the DDPM training can be performed on single-slice images without special adjustments for SMS tasks. Additionally, our method integrates a low-frequency enhancement (LFE) module to address a practical issue that SMS-accelerated fast spin echo (FSE) and echo-planar imaging (EPI) sequences cannot easily embed autocalibration signals. Extensive experiments demonstrate that our approach consistently outperforms existing methods and generalizes well to unseen datasets. The code is available at https://github.com/Solor-pikachu/ROGER after the review process.
△ Less
Submitted 31 July, 2024;
originally announced July 2024.
-
Pose, Velocity and Landmark Position Estimation Using IMU and Bearing Measurements
Authors:
Miaomiao Wang,
Abdelhamid Tayebi
Abstract:
This paper investigates the estimation problem of the pose (orientation and position) and linear velocity of a rigid body, as well as the landmark positions, using an inertial measurement unit (IMU) and a monocular camera. First, we propose a globally exponentially stable (GES) linear time-varying (LTV) observer for the estimation of body-frame landmark positions and velocity, using IMU and monocu…
▽ More
This paper investigates the estimation problem of the pose (orientation and position) and linear velocity of a rigid body, as well as the landmark positions, using an inertial measurement unit (IMU) and a monocular camera. First, we propose a globally exponentially stable (GES) linear time-varying (LTV) observer for the estimation of body-frame landmark positions and velocity, using IMU and monocular bearing measurements. Thereafter, using the gyro measurements, some landmarks known in the inertial frame and the estimates from the LTV observer, we propose a nonlinear pose observer on $\SO(3)\times \mathbb{R}^3$. The overall estimation system is shown to be almost globally asymptotically stable (AGAS) using the notion of almost global input-to-state stability (ISS). Interestingly, we show that with the knowledge (in the inertial frame) of a small number of landmarks, we can recover (under some conditions) the unknown positions (in the inertial frame) of a large number of landmarks. Numerical simulation results are presented to illustrate the performance of the proposed estimation scheme.
△ Less
Submitted 25 July, 2024;
originally announced July 2024.
-
Interleaved Block-Sparse Transform
Authors:
Lei Liu,
Ming Wang,
Shufeng Li,
Yuhao Chi,
Ning Wei,
ZhaoYang Zhang
Abstract:
Low-complexity Bayes-optimal memory approximate message passing (MAMP) is an efficient signal estimation algorithm in compressed sensing and multicarrier modulation. However, achieving replica Bayes optimality with MAMP necessitates a large-scale right-unitarily invariant transformation, which is prohibitive in practical systems due to its high computational complexity and hardware costs. To solve…
▽ More
Low-complexity Bayes-optimal memory approximate message passing (MAMP) is an efficient signal estimation algorithm in compressed sensing and multicarrier modulation. However, achieving replica Bayes optimality with MAMP necessitates a large-scale right-unitarily invariant transformation, which is prohibitive in practical systems due to its high computational complexity and hardware costs. To solve this difficulty, this letter proposes a low-complexity interleaved block-sparse (IBS) transform, which consists of interleaved multiple low-dimensional transform matrices, aimed at reducing the hardware implementation scale while mitigating performance loss. Furthermore, an IBS cross-domain memory approximate message passing (IBS-CD-MAMP) estimator is developed, comprising a memory linear estimator in the IBS transform domain and a non-linear estimator in the source domain. Numerical results show that the IBS-CD-MAMP offers a reduced implementation scale and lower complexity with excellent performance in IBS-based compressed sensing and interleave frequency division multiplexing systems.
△ Less
Submitted 18 July, 2024;
originally announced July 2024.
-
Reconfigurable-Intelligent-Surface Assisted Orbital-Angular-Momentum Secure Communications
Authors:
Minmin Wang,
Liping Liang,
Wenchi Cheng,
Wei Zhang,
Ruirui Chen,
Hailin Zhang
Abstract:
As a kind of wavefront with helical phase, orbital angular momentum (OAM) shows the great potential to enhance the security results of wireless communications due to its unique orthogonality and central hollow electromagnetic wave structure. Therefore, in this paper we propose the reconfigurable-intelligent-surface (RIS) assisted OAM scheme, where RIS is deployed to weaken the information acquisit…
▽ More
As a kind of wavefront with helical phase, orbital angular momentum (OAM) shows the great potential to enhance the security results of wireless communications due to its unique orthogonality and central hollow electromagnetic wave structure. Therefore, in this paper we propose the reconfigurable-intelligent-surface (RIS) assisted OAM scheme, where RIS is deployed to weaken the information acquisition at eavesdroppers by adjusting the OAM beams pointed to the eavesdropper and artificial noise (AN) is applied to interfere with the eavesdropper, thus significantly increasing the secrecy rates of short-range secure communications. Aiming at obtaining the maximum secrecy rate, we develop the Riemannian manifold conjugate gradient (RMCG) based alternative optimization (AO) algorithm to assign much power to low-order OAM-modes and optimize the OAM beams direction with the programmable RIS, thus respectively enhancing and weakening the received signal strength at the legitimate receiver and the eavesdropper. Numerical results show that our proposed scheme outperforms the existing works in terms of the secrecy rate and the eavesdropper's bit error rate.
△ Less
Submitted 15 July, 2024;
originally announced July 2024.
-
FairDomain: Achieving Fairness in Cross-Domain Medical Image Segmentation and Classification
Authors:
Yu Tian,
Congcong Wen,
Min Shi,
Muhammad Muneeb Afzal,
Hao Huang,
Muhammad Osama Khan,
Yan Luo,
Yi Fang,
Mengyu Wang
Abstract:
Addressing fairness in artificial intelligence (AI), particularly in medical AI, is crucial for ensuring equitable healthcare outcomes. Recent efforts to enhance fairness have introduced new methodologies and datasets in medical AI. However, the fairness issue under the setting of domain transfer is almost unexplored, while it is common that clinics rely on different imaging technologies (e.g., di…
▽ More
Addressing fairness in artificial intelligence (AI), particularly in medical AI, is crucial for ensuring equitable healthcare outcomes. Recent efforts to enhance fairness have introduced new methodologies and datasets in medical AI. However, the fairness issue under the setting of domain transfer is almost unexplored, while it is common that clinics rely on different imaging technologies (e.g., different retinal imaging modalities) for patient diagnosis. This paper presents FairDomain, a pioneering systemic study into algorithmic fairness under domain shifts, employing state-of-the-art domain adaptation (DA) and generalization (DG) algorithms for both medical segmentation and classification tasks to understand how biases are transferred between different domains. We also introduce a novel plug-and-play fair identity attention (FIA) module that adapts to various DA and DG algorithms to improve fairness by using self-attention to adjust feature importance based on demographic attributes. Additionally, we curate the first fairness-focused dataset with two paired imaging modalities for the same patient cohort on medical segmentation and classification tasks, to rigorously assess fairness in domain-shift scenarios. Excluding the confounding impact of demographic distribution variation between source and target domains will allow clearer quantification of the performance of domain transfer models. Our extensive evaluations reveal that the proposed FIA significantly enhances both model performance accounted for fairness across all domain shift settings (i.e., DA and DG) with respect to different demographics, which outperforms existing methods on both segmentation and classification. The code and data can be accessed at https://ophai.hms.harvard.edu/datasets/harvard-fairdomain20k.
△ Less
Submitted 18 July, 2024; v1 submitted 11 July, 2024;
originally announced July 2024.
-
Exploiting Scale-Variant Attention for Segmenting Small Medical Objects
Authors:
Wei Dai,
Rui Liu,
Zixuan Wu,
Tianyi Wu,
Min Wang,
Junxian Zhou,
Yixuan Yuan,
Jun Liu
Abstract:
Early detection and accurate diagnosis can predict the risk of malignant disease transformation, thereby increasing the probability of effective treatment. Identifying mild syndrome with small pathological regions serves as an ominous warning and is fundamental in the early diagnosis of diseases. While deep learning algorithms, particularly convolutional neural networks (CNNs), have shown promise…
▽ More
Early detection and accurate diagnosis can predict the risk of malignant disease transformation, thereby increasing the probability of effective treatment. Identifying mild syndrome with small pathological regions serves as an ominous warning and is fundamental in the early diagnosis of diseases. While deep learning algorithms, particularly convolutional neural networks (CNNs), have shown promise in segmenting medical objects, analyzing small areas in medical images remains challenging. This difficulty arises due to information losses and compression defects from convolution and pooling operations in CNNs, which become more pronounced as the network deepens, especially for small medical objects. To address these challenges, we propose a novel scale-variant attention-based network (SvANet) for accurately segmenting small-scale objects in medical images. The SvANet consists of scale-variant attention, cross-scale guidance, Monte Carlo attention, and vision transformer, which incorporates cross-scale features and alleviates compression artifacts for enhancing the discrimination of small medical objects. Quantitative experimental results demonstrate the superior performance of SvANet, achieving 96.12%, 96.11%, 89.79%, 84.15%, 80.25%, 73.05%, and 72.58% in mean Dice coefficient for segmenting kidney tumors, skin lesions, hepatic tumors, polyps, surgical excision cells, retinal vasculatures, and sperms, which occupy less than 1% of the image areas in KiTS23, ISIC 2018, ATLAS, PolypGen, TissueNet, FIVES, and SpermHealth datasets, respectively.
△ Less
Submitted 5 August, 2024; v1 submitted 10 July, 2024;
originally announced July 2024.
-
Electrical Impedance Tomography Based Closed-loop Tumor Treating Fields in Dynamic Lung Tumors
Authors:
Minmin Wang,
Xu Xie,
Yuxi Guo,
Liying Zhu,
Yue Lan,
Haitang Yang,
Yun Pan,
Guangdi Chen,
Shaomin Zhang,
Maomao Zhang
Abstract:
Tumor Treating Fields (TTFields) is a non-invasive anticancer modality that utilizes alternating electric fields to disrupt cancer cell division and growth. While generally well-tolerated with minimal side effects, traditional TTFields therapy for lung tumors faces challenges due to the influence of respiratory motion. We design a novel closed-loop TTFields strategy for lung tumors by incorporatin…
▽ More
Tumor Treating Fields (TTFields) is a non-invasive anticancer modality that utilizes alternating electric fields to disrupt cancer cell division and growth. While generally well-tolerated with minimal side effects, traditional TTFields therapy for lung tumors faces challenges due to the influence of respiratory motion. We design a novel closed-loop TTFields strategy for lung tumors by incorporating electrical impedance tomography (EIT) for real-time respiratory phase monitoring and dynamic parameter adjustments. Furthermore, we conduct theoretical analysis to evaluate the performance of the proposed method using the lung motion model. Compared to conventional TTFields settings, we observed that variations in the electrical conductivity of lung during different respiratory phases led to a decrease in the average electric field intensity within lung tumors, transitioning from end-expiratory (1.08 V/cm) to end-inspiratory (0.87 V/cm) phases. Utilizing our proposed closed-Loop TTFields approach at the same dose setting (2400 mA, consistent with the traditional TTFields setting), we can achieve a higher and consistent average electric field strength at the tumor site (1.30 V/cm) across different respiratory stages. Our proposed closed-loop TTFields method has the potential to improved lung tumor therapy by mitigating the impact of respiratory motion.
△ Less
Submitted 9 July, 2024;
originally announced July 2024.
-
Machine-Type Communication Waveforms: An Exploration of New Dimensions
Authors:
Michael Wang,
Lei Wang,
Xiaohu You
Abstract:
This paper derives a generalized class of waveforms with an application to machine-type communication (MTC) while studying its underlying structural characteristics in relation to conventional modulation waveforms. First, a canonical waveform of frequency-error tolerance is identified for a unified preamble and traffic signal design, ideal for MTC use as a composite waveform, commonly known as a t…
▽ More
This paper derives a generalized class of waveforms with an application to machine-type communication (MTC) while studying its underlying structural characteristics in relation to conventional modulation waveforms. First, a canonical waveform of frequency-error tolerance is identified for a unified preamble and traffic signal design, ideal for MTC use as a composite waveform, commonly known as a transmission burst. It is shown that the most widely used modulation schemes for mIoT traffic signals, e.g., FSK and LoRa modulation, are simply subsets of the canonical waveform. The intrinsic characteristics and degrees of freedom the waveform offers are then explored. Most significantly, a new waveform dimension is uncovered and exploited as additional degrees of freedom for satisfying the MTC requirements, i.e., energy and resource efficiency and robustness. The corresponding benefits are evaluated analytically and numerically in AWGN, frequency-flat, and selective channels. We demonstrate that neither FSK nor LoRa can fully address the mIoT requirements since neither fully exploits the degrees of freedom from the perspective of the generalized waveform class. Finally, a solution is devised to optimize energy and resource efficiency under various deployment environments and practical constraints while maintaining the low-complexity property.
△ Less
Submitted 29 June, 2024;
originally announced July 2024.
-
Multi-service collaboration and composition of cloud manufacturing customized production based on problem decomposition
Authors:
Hao Yue,
Yingtao Wu,
Min Wang,
Hesuan Hu,
Weimin Wu,
Jihui Zhang
Abstract:
Cloud manufacturing system is a service-oriented and knowledge-based one, which can provide solutions for the large-scale customized production. The service resource allocation is the primary factor that restricts the production time and cost in the cloud manufacturing customized production (CMCP). In order to improve the efficiency and reduce the cost in CMCP, we propose a new framework which con…
▽ More
Cloud manufacturing system is a service-oriented and knowledge-based one, which can provide solutions for the large-scale customized production. The service resource allocation is the primary factor that restricts the production time and cost in the cloud manufacturing customized production (CMCP). In order to improve the efficiency and reduce the cost in CMCP, we propose a new framework which considers the collaboration among services with the same functionality. A mathematical evaluation formulation for the service composition and service usage scheme is constructed with the following critical indexes: completion time, cost, and number of selected services. Subsequently, a problem decomposition based genetic algorithm is designed to obtain the optimal service compositions with service usage schemes. A smart clothing customization case is illustrated so as to show the effectiveness and efficiency of the method proposed in this paper. Finally, the results of simulation experiments and comparisons show that these solutions obtained by our method are with the minimum time, a lower cost, and the fewer selected services.
△ Less
Submitted 27 June, 2024;
originally announced June 2024.
-
Subtractive Training for Music Stem Insertion using Latent Diffusion Models
Authors:
Ivan Villa-Renteria,
Mason L. Wang,
Zachary Shah,
Zhe Li,
Soohyun Kim,
Neelesh Ramachandran,
Mert Pilanci
Abstract:
We present Subtractive Training, a simple and novel method for synthesizing individual musical instrument stems given other instruments as context. This method pairs a dataset of complete music mixes with 1) a variant of the dataset lacking a specific stem, and 2) LLM-generated instructions describing how the missing stem should be reintroduced. We then fine-tune a pretrained text-to-audio diffusi…
▽ More
We present Subtractive Training, a simple and novel method for synthesizing individual musical instrument stems given other instruments as context. This method pairs a dataset of complete music mixes with 1) a variant of the dataset lacking a specific stem, and 2) LLM-generated instructions describing how the missing stem should be reintroduced. We then fine-tune a pretrained text-to-audio diffusion model to generate the missing instrument stem, guided by both the existing stems and the text instruction. Our results demonstrate Subtractive Training's efficacy in creating authentic drum stems that seamlessly blend with the existing tracks. We also show that we can use the text instruction to control the generation of the inserted stem in terms of rhythm, dynamics, and genre, allowing us to modify the style of a single instrument in a full song while keeping the remaining instruments the same. Lastly, we extend this technique to MIDI formats, successfully generating compatible bass, drum, and guitar parts for incomplete arrangements.
△ Less
Submitted 27 June, 2024;
originally announced June 2024.
-
Enhancing Diagnostic Reliability of Foundation Model with Uncertainty Estimation in OCT Images
Authors:
Yuanyuan Peng,
Aidi Lin,
Meng Wang,
Tian Lin,
Ke Zou,
Yinglin Cheng,
Tingkun Shi,
Xulong Liao,
Lixia Feng,
Zhen Liang,
Xinjian Chen,
Huazhu Fu,
Haoyu Chen
Abstract:
Inability to express the confidence level and detect unseen classes has limited the clinical implementation of artificial intelligence in the real-world. We developed a foundation model with uncertainty estimation (FMUE) to detect 11 retinal conditions on optical coherence tomography (OCT). In the internal test set, FMUE achieved a higher F1 score of 96.76% than two state-of-the-art algorithms, RE…
▽ More
Inability to express the confidence level and detect unseen classes has limited the clinical implementation of artificial intelligence in the real-world. We developed a foundation model with uncertainty estimation (FMUE) to detect 11 retinal conditions on optical coherence tomography (OCT). In the internal test set, FMUE achieved a higher F1 score of 96.76% than two state-of-the-art algorithms, RETFound and UIOS, and got further improvement with thresholding strategy to 98.44%. In the external test sets obtained from other OCT devices, FMUE achieved an accuracy of 88.75% and 92.73% before and after thresholding. Our model is superior to two ophthalmologists with a higher F1 score (95.17% vs. 61.93% &71.72%). Besides, our model correctly predicts high uncertainty scores for samples with ambiguous features, of non-target-category diseases, or with low-quality to prompt manual checks and prevent misdiagnosis. FMUE provides a trustworthy method for automatic retinal anomalies detection in the real-world clinical open set environment.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
Common and Rare Fundus Diseases Identification Using Vision-Language Foundation Model with Knowledge of Over 400 Diseases
Authors:
Meng Wang,
Tian Lin,
Aidi Lin,
Kai Yu,
Yuanyuan Peng,
Lianyu Wang,
Cheng Chen,
Ke Zou,
Huiyu Liang,
Man Chen,
Xue Yao,
Meiqin Zhang,
Binwei Huang,
Chaoxin Zheng,
Peixin Zhang,
Wei Chen,
Yilong Luo,
Yifan Chen,
Honghe Xia,
Tingkun Shi,
Qi Zhang,
Jinming Guo,
Xiaolin Chen,
Jingcheng Wang,
Yih Chung Tham
, et al. (24 additional authors not shown)
Abstract:
Previous foundation models for retinal images were pre-trained with limited disease categories and knowledge base. Here we introduce RetiZero, a vision-language foundation model that leverages knowledge from over 400 fundus diseases. To RetiZero's pre-training, we compiled 341,896 fundus images paired with text descriptions, sourced from public datasets, ophthalmic literature, and online resources…
▽ More
Previous foundation models for retinal images were pre-trained with limited disease categories and knowledge base. Here we introduce RetiZero, a vision-language foundation model that leverages knowledge from over 400 fundus diseases. To RetiZero's pre-training, we compiled 341,896 fundus images paired with text descriptions, sourced from public datasets, ophthalmic literature, and online resources, encompassing a diverse range of diseases across multiple ethnicities and countries. RetiZero exhibits superior performance in several downstream tasks, including zero-shot disease recognition, image-to-image retrieval, and internal- and cross-domain disease identification. In zero-shot scenarios, RetiZero achieves Top5 accuracy scores of 0.8430 for 15 fundus diseases and 0.7561 for 52 fundus diseases. For image retrieval, it achieves Top5 scores of 0.9500 and 0.8860 for the same disease sets, respectively. Clinical evaluations show that RetiZero's Top3 zero-shot performance surpasses the average of 19 ophthalmologists from Singapore, China and the United States. Furthermore, RetiZero significantly enhances clinicians' accuracy in diagnosing fundus disease. These findings underscore the value of integrating the RetiZero foundation model into clinical settings, where a variety of fundus diseases are encountered.
△ Less
Submitted 30 June, 2024; v1 submitted 13 June, 2024;
originally announced June 2024.
-
Traffic Signal Cycle Control with Centralized Critic and Decentralized Actors under Varying Intervention Frequencies
Authors:
Maonan Wang,
Yirong Chen,
Yuheng Kan,
Chengcheng Xu,
Michael Lepech,
Man-On Pun,
Xi Xiong
Abstract:
Traffic congestion in urban areas is a significant problem, leading to prolonged travel times, reduced efficiency, and increased environmental concerns. Effective traffic signal control (TSC) is a key strategy for reducing congestion. Unlike most TSC systems that rely on high-frequency control, this study introduces an innovative joint phase traffic signal cycle control method that operates effect…
▽ More
Traffic congestion in urban areas is a significant problem, leading to prolonged travel times, reduced efficiency, and increased environmental concerns. Effective traffic signal control (TSC) is a key strategy for reducing congestion. Unlike most TSC systems that rely on high-frequency control, this study introduces an innovative joint phase traffic signal cycle control method that operates effectively with varying control intervals. Our method features an adjust all phases action design, enabling simultaneous phase changes within the signal cycle, which fosters both immediate stability and sustained TSC effectiveness, especially at lower frequencies. The approach also integrates decentralized actors to handle the complexity of the action space, with a centralized critic to ensure coordinated phase adjusting. Extensive testing on both synthetic and real-world data across different intersection types and signal setups shows that our method significantly outperforms other popular techniques, particularly at high control intervals. Case studies of policies derived from traffic data further illustrate the robustness and reliability of our proposed method.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
Progress Towards Decoding Visual Imagery via fNIRS
Authors:
Michel Adamic,
Wellington Avelino,
Anna Brandenberger,
Bryan Chiang,
Hunter Davis,
Stephen Fay,
Andrew Gregory,
Aayush Gupta,
Raphael Hotter,
Grace Jiang,
Fiona Leng,
Stephen Polcyn,
Thomas Ribeiro,
Paul Scotti,
Michelle Wang,
Marley Xiong,
Jonathan Xu
Abstract:
We demonstrate the possibility of reconstructing images from fNIRS brain activity and start building a prototype to match the required specs. By training an image reconstruction model on downsampled fMRI data, we discovered that cm-scale spatial resolution is sufficient for image generation. We obtained 71% retrieval accuracy with 1-cm resolution, compared to 93% on the full-resolution fMRI, and 2…
▽ More
We demonstrate the possibility of reconstructing images from fNIRS brain activity and start building a prototype to match the required specs. By training an image reconstruction model on downsampled fMRI data, we discovered that cm-scale spatial resolution is sufficient for image generation. We obtained 71% retrieval accuracy with 1-cm resolution, compared to 93% on the full-resolution fMRI, and 20% with 2-cm resolution. With simulations and high-density tomography, we found that time-domain fNIRS can achieve 1-cm resolution, compared to 2-cm resolution for continuous-wave fNIRS. Lastly, we share designs for a prototype time-domain fNIRS device, consisting of a laser driver, a single photon detector, and a time-to-digital converter system.
△ Less
Submitted 22 June, 2024; v1 submitted 11 June, 2024;
originally announced June 2024.
-
Hearing Anything Anywhere
Authors:
Mason Wang,
Ryosuke Sawata,
Samuel Clarke,
Ruohan Gao,
Shangzhe Wu,
Jiajun Wu
Abstract:
Recent years have seen immense progress in 3D computer vision and computer graphics, with emerging tools that can virtualize real-world 3D environments for numerous Mixed Reality (XR) applications. However, alongside immersive visual experiences, immersive auditory experiences are equally vital to our holistic perception of an environment. In this paper, we aim to reconstruct the spatial acoustic…
▽ More
Recent years have seen immense progress in 3D computer vision and computer graphics, with emerging tools that can virtualize real-world 3D environments for numerous Mixed Reality (XR) applications. However, alongside immersive visual experiences, immersive auditory experiences are equally vital to our holistic perception of an environment. In this paper, we aim to reconstruct the spatial acoustic characteristics of an arbitrary environment given only a sparse set of (roughly 12) room impulse response (RIR) recordings and a planar reconstruction of the scene, a setup that is easily achievable by ordinary users. To this end, we introduce DiffRIR, a differentiable RIR rendering framework with interpretable parametric models of salient acoustic features of the scene, including sound source directivity and surface reflectivity. This allows us to synthesize novel auditory experiences through the space with any source audio. To evaluate our method, we collect a dataset of RIR recordings and music in four diverse, real environments. We show that our model outperforms state-ofthe-art baselines on rendering monaural and binaural RIRs and music at unseen locations, and learns physically interpretable parameters characterizing acoustic properties of the sound source and surfaces in the scene.
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
Double-RIS-Assisted Orbital Angular Momentum Near-Field Secure Communications
Authors:
Liping Liang,
Minmin Wang,
Wenchi Cheng,
Wei Zhang
Abstract:
To satisfy the various demands of growing devices and services, emerging high-frequency-based technologies promote near-field wireless communications. Therefore, near-field physical layer security has attracted much attention to facilitate the wireless information security against illegitimate eavesdropping. However, highly correlated channels between legitimate transceivers and eavesdroppers of e…
▽ More
To satisfy the various demands of growing devices and services, emerging high-frequency-based technologies promote near-field wireless communications. Therefore, near-field physical layer security has attracted much attention to facilitate the wireless information security against illegitimate eavesdropping. However, highly correlated channels between legitimate transceivers and eavesdroppers of existing multiple-input multiple-output (MIMO) based near-field secure technologies along with the low degrees of freedom significantly limit the enhancement of security results in wireless communications. To significantly increase the secrecy rates of near-field wireless communications, in this paper we propose the double-reconfigurable-intelligent-surface (RIS) assisted orbital angular momentum (OAM) secure scheme, where RISs with few reflecting elements are easily deployed to reconstruct the direct links blocked by obstacles between the legitimate transceivers, mitigate the inter-mode interference caused by the misalignment of legitimate transceivers, and adjust the OAM beams direction to interfere with eavesdroppers. Meanwhile, due to the unique orthogonality among OAM modes, the OAM-based joint index modulation and artificial noise scheme is proposed to weaken the information acquisition by eavesdroppers while increasing the achievable rate with the low cost of legitimate communications. To maximize the secrecy rate of our proposed scheme, we develop the Riemannian manifold conjugate gradient (RMCG)-based alternative optimization (AO) algorithm to jointly optimize the transmit power allocation of OAM modes and phase shifts of double RISs. Numerical results show that our proposed double-RIS-assisted OAM near-field secure scheme outperforms the existing works in terms of the secrecy rate and the eavesdropper's bit error rate.
△ Less
Submitted 9 June, 2024;
originally announced June 2024.
-
Fractal OAM Generation and Detection Schemes
Authors:
Runyu Lyu,
Wenchi Cheng,
Muyao Wang,
Wei Zhang
Abstract:
Orbital angular momentum (OAM) carried electromagnetic waves have the potential to improve spectrum efficiency in optical and radio-frequency communications due to the orthogonal wavefronts of different OAM modes. However, OAM beams are vortically hollow and divergent, which significantly decreases the capacity of OAM transmissions. In addition, unaligned transceivers in OAM transmissions can resu…
▽ More
Orbital angular momentum (OAM) carried electromagnetic waves have the potential to improve spectrum efficiency in optical and radio-frequency communications due to the orthogonal wavefronts of different OAM modes. However, OAM beams are vortically hollow and divergent, which significantly decreases the capacity of OAM transmissions. In addition, unaligned transceivers in OAM transmissions can result in a high bit error rate (BER). The Talbot effect is a self-imaging phenomenon that can be used to generate optical or radio-frequency OAM beams with periodic repeating structures at multiples of a certain distance along the propagation direction. These periodic structures make it unnecessary for the transceiver antennas to be perfectly aligned and can also alleviate the hollow divergence of OAM beams. In this paper, we propose Talbot-effect-based fractal OAM generation and detection schemes using a uniform circular array (UCA) to significantly improve capacity and BER performance in unaligned OAM transmissions. We first provide a brief overview of fractal OAM. Then, we propose the fractal OAM beam generation and detection schemes. Numerical analysis and simulations verify the effectiveness of our proposed fractal OAM generation scheme and also demonstrate improved capacity and BER performance compared to normal OAM transmissions. We also analyze how the receive UCA radius and the distance between the UCAs impact the capacity and BER performances.
△ Less
Submitted 8 June, 2024;
originally announced June 2024.
-
Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition
Authors:
Shijian Deng,
Erin E. Kosloski,
Siddhi Patel,
Zeke A. Barnett,
Yiyang Nan,
Alexander Kaplan,
Sisira Aarukapalli,
William T. Doan,
Matthew Wang,
Harsh Singh,
Pamela R. Rollins,
Yapeng Tian
Abstract:
In this article, we introduce a novel problem of audio-visual autism behavior recognition, which includes social behavior recognition, an essential aspect previously omitted in AI-assisted autism screening research. We define the task at hand as one that is audio-visual autism behavior recognition, which uses audio and visual cues, including any speech present in the audio, to recognize autism-rel…
▽ More
In this article, we introduce a novel problem of audio-visual autism behavior recognition, which includes social behavior recognition, an essential aspect previously omitted in AI-assisted autism screening research. We define the task at hand as one that is audio-visual autism behavior recognition, which uses audio and visual cues, including any speech present in the audio, to recognize autism-related behaviors. To facilitate this new research direction, we collected an audio-visual autism spectrum dataset (AV-ASD), currently the largest video dataset for autism screening using a behavioral approach. It covers an extensive range of autism-associated behaviors, including those related to social communication and interaction. To pave the way for further research on this new problem, we intensively explored leveraging foundation models and multimodal large language models across different modalities. Our experiments on the AV-ASD dataset demonstrate that integrating audio, visual, and speech modalities significantly enhances the performance in autism behavior recognition. Additionally, we explored the use of a post-hoc to ad-hoc pipeline in a multimodal large language model to investigate its potential to augment the model's explanatory capability during autism behavior recognition. We will release our dataset, code, and pre-trained models.
△ Less
Submitted 22 March, 2024;
originally announced June 2024.
-
Once-for-All: Controllable Generative Image Compression with Dynamic Granularity Adaption
Authors:
Anqi Li,
Yuxi Liu,
Huihui Bai,
Feng Li,
Runmin Cong,
Meng Wang,
Yao Zhao
Abstract:
Although recent generative image compression methods have demonstrated impressive potential in optimizing the rate-distortion-perception trade-off, they still face the critical challenge of flexible rate adaption to diverse compression necessities and scenarios. To overcome this challenge, this paper proposes a Controllable Generative Image Compression framework, Control-GIC, the first capable of…
▽ More
Although recent generative image compression methods have demonstrated impressive potential in optimizing the rate-distortion-perception trade-off, they still face the critical challenge of flexible rate adaption to diverse compression necessities and scenarios. To overcome this challenge, this paper proposes a Controllable Generative Image Compression framework, Control-GIC, the first capable of fine-grained bitrate adaption across a broad spectrum while ensuring high-fidelity and generality compression. We base Control-GIC on a VQGAN framework representing an image as a sequence of variable-length codes (i.e. VQ-indices), which can be losslessly compressed and exhibits a direct positive correlation with the bitrates. Therefore, drawing inspiration from the classical coding principle, we naturally correlate the information density of local image patches with their granular representations, to achieve dynamic adjustment of the code quantity following different granularity decisions. This implies we can flexibly determine a proper allocation of granularity for the patches to acquire desirable compression rates. We further develop a probabilistic conditional decoder that can trace back to historic encoded multi-granularity representations according to transmitted codes, and then reconstruct hierarchical granular features in the formalization of conditional probability, enabling more informative aggregation to improve reconstruction realism. Our experiments show that Control-GIC allows highly flexible and controllable bitrate adaption and even once compression on an entire dataset to fulfill constrained bitrate conditions. Experimental results demonstrate its superior performance over recent state-of-the-art methods.
△ Less
Submitted 5 June, 2024; v1 submitted 2 June, 2024;
originally announced June 2024.
-
Confidence-aware multi-modality learning for eye disease screening
Authors:
Ke Zou,
Tian Lin,
Zongbo Han,
Meng Wang,
Xuedong Yuan,
Haoyu Chen,
Changqing Zhang,
Xiaojing Shen,
Huazhu Fu
Abstract:
Multi-modal ophthalmic image classification plays a key role in diagnosing eye diseases, as it integrates information from different sources to complement their respective performances. However, recent improvements have mainly focused on accuracy, often neglecting the importance of confidence and robustness in predictions for diverse modalities. In this study, we propose a novel multi-modality evi…
▽ More
Multi-modal ophthalmic image classification plays a key role in diagnosing eye diseases, as it integrates information from different sources to complement their respective performances. However, recent improvements have mainly focused on accuracy, often neglecting the importance of confidence and robustness in predictions for diverse modalities. In this study, we propose a novel multi-modality evidential fusion pipeline for eye disease screening. It provides a measure of confidence for each modality and elegantly integrates the multi-modality information using a multi-distribution fusion perspective. Specifically, our method first utilizes normal inverse gamma prior distributions over pre-trained models to learn both aleatoric and epistemic uncertainty for uni-modality. Then, the normal inverse gamma distribution is analyzed as the Student's t distribution. Furthermore, within a confidence-aware fusion framework, we propose a mixture of Student's t distributions to effectively integrate different modalities, imparting the model with heavy-tailed properties and enhancing its robustness and reliability. More importantly, the confidence-aware multi-modality ranking regularization term induces the model to more reasonably rank the noisy single-modal and fused-modal confidence, leading to improved reliability and accuracy. Experimental results on both public and internal datasets demonstrate that our model excels in robustness, particularly in challenging scenarios involving Gaussian noise and modality missing conditions. Moreover, our model exhibits strong generalization capabilities to out-of-distribution data, underscoring its potential as a promising solution for multimodal eye disease screening.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
Timeliness of Status Update System: The Effect of Parallel Transmission Using Heterogeneous Updating Devices
Authors:
Zhengchuan Chen,
Kang Lang,
Nikolaos Pappas,
Howard H. Yang,
Min Wang,
Zhong Tian,
Tony Q. S. Quek
Abstract:
Timely status updating is the premise of emerging interaction-based applications in the Internet of Things (IoT). Using redundant devices to update the status of interest is a promising method to improve the timeliness of information. However, parallel status updating leads to out-of-order arrivals at the monitor, significantly challenging timeliness analysis. This work studies the Age of Informat…
▽ More
Timely status updating is the premise of emerging interaction-based applications in the Internet of Things (IoT). Using redundant devices to update the status of interest is a promising method to improve the timeliness of information. However, parallel status updating leads to out-of-order arrivals at the monitor, significantly challenging timeliness analysis. This work studies the Age of Information (AoI) of a multi-queue status update system where multiple devices monitor the same physical process. Specifically, two systems are considered: the Basic System, which only has type-1 devices that are ad hoc devices located close to the source, and the Hybrid System, which contains additional type-2 devices that are infrastructure-based devices located in fixed points compared to the Basic System. Using the Stochastic Hybrid Systems (SHS) framework, a mathematical model that combines discrete and continuous dynamics, we derive the expressions of the average AoI of the considered two systems in closed form. Numerical results verify the accuracy of the analysis. It is shown that when the number and parameters of the type-1 devices/type-2 devices are fixed, the logarithm of average AoI will linearly decrease with the logarithm of the total arrival rate of type-2 devices or that of the number of type-1 devices under specific condition. It has also been demonstrated that the proposed systems can significantly outperform the FCFS M/M/N status update system.
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
A Disease Labeler for Chinese Chest X-Ray Report Generation
Authors:
Mengwei Wang,
Ruixin Yan,
Zeyi Hou,
Ning Lang,
Xiuzhuang Zhou
Abstract:
In the field of medical image analysis, the scarcity of Chinese chest X-ray report datasets has hindered the development of technology for generating Chinese chest X-ray reports. On one hand, the construction of a Chinese chest X-ray report dataset is limited by the time-consuming and costly process of accurate expert disease annotation. On the other hand, a single natural language generation metr…
▽ More
In the field of medical image analysis, the scarcity of Chinese chest X-ray report datasets has hindered the development of technology for generating Chinese chest X-ray reports. On one hand, the construction of a Chinese chest X-ray report dataset is limited by the time-consuming and costly process of accurate expert disease annotation. On the other hand, a single natural language generation metric is commonly used to evaluate the similarity between generated and ground-truth reports, while the clinical accuracy and effectiveness of the generated reports rely on an accurate disease labeler (classifier). To address the issues, this study proposes a disease labeler tailored for the generation of Chinese chest X-ray reports. This labeler leverages a dual BERT architecture to handle diagnostic reports and clinical information separately and constructs a hierarchical label learning algorithm based on the affiliation between diseases and body parts to enhance text classification performance. Utilizing this disease labeler, a Chinese chest X-ray report dataset comprising 51,262 report samples was established. Finally, experiments and analyses were conducted on a subset of expert-annotated Chinese chest X-ray reports, validating the effectiveness of the proposed disease labeler.
△ Less
Submitted 18 March, 2024;
originally announced April 2024.
-
Autoencoder-assisted Feature Ensemble Net for Incipient Faults
Authors:
Mingxuan Gao,
Min Wang,
Maoyin Chen
Abstract:
Deep learning has shown the great power in the field of fault detection. However, for incipient faults with tiny amplitude, the detection performance of the current deep learning networks (DLNs) is not satisfactory. Even if prior information about the faults is utilized, DLNs can't successfully detect faults 3, 9 and 15 in Tennessee Eastman process (TEP). These faults are notoriously difficult to…
▽ More
Deep learning has shown the great power in the field of fault detection. However, for incipient faults with tiny amplitude, the detection performance of the current deep learning networks (DLNs) is not satisfactory. Even if prior information about the faults is utilized, DLNs can't successfully detect faults 3, 9 and 15 in Tennessee Eastman process (TEP). These faults are notoriously difficult to detect, lacking effective detection technologies in the field of fault detection. In this work, we propose Autoencoder-assisted Feature Ensemble Net (AE-FENet): a deep feature ensemble framework that uses the unsupervised autoencoder to conduct the feature transformation. Compared with the principle component analysis (PCA) technique adopted in the original Feature Ensemble Net (FENet), autoencoder can mine more exact features on incipient faults, which results in the better detection performance of AE-FENet. With same kinds of basic detectors, AE-FENet achieves a state-of-the-art average accuracy over 96% on faults 3, 9 and 15 in TEP, which represents a significant enhancement in performance compared to other methods. Plenty of experiments have been done to extend our framework, proving that DLNs can be utilized efficiently within this architecture.
△ Less
Submitted 22 April, 2024;
originally announced April 2024.
-
The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report
Authors:
Bin Ren,
Yawei Li,
Nancy Mehta,
Radu Timofte,
Hongyuan Yu,
Cheng Wan,
Yuxin Hong,
Bingnan Han,
Zhuoyuan Wu,
Yajun Zou,
Yuqing Liu,
Jizhe Li,
Keji He,
Chao Fan,
Heng Zhang,
Xiaolin Zhang,
Xuanwu Yin,
Kunlong Zuo,
Bohao Liao,
Peizhe Xia,
Long Peng,
Zhibo Du,
Xin Di,
Wangkai Li,
Yang Wang
, et al. (109 additional authors not shown)
Abstract:
This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such…
▽ More
This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such as runtime, parameters, and FLOPs, while still maintaining a peak signal-to-noise ratio (PSNR) of approximately 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. In addition, this challenge has 4 tracks including the main track (overall performance), sub-track 1 (runtime), sub-track 2 (FLOPs), and sub-track 3 (parameters). In the main track, all three metrics (ie runtime, FLOPs, and parameter count) were considered. The ranking of the main track is calculated based on a weighted sum-up of the scores of all other sub-tracks. In sub-track 1, the practical runtime performance of the submissions was evaluated, and the corresponding score was used to determine the ranking. In sub-track 2, the number of FLOPs was considered. The score calculated based on the corresponding FLOPs was used to determine the ranking. In sub-track 3, the number of parameters was considered. The score calculated based on the corresponding parameters was used to determine the ranking. RLFN is set as the baseline for efficiency measurement. The challenge had 262 registered participants, and 34 teams made valid submissions. They gauge the state-of-the-art in efficient single-image super-resolution. To facilitate the reproducibility of the challenge and enable other researchers to build upon these findings, the code and the pre-trained model of validated solutions are made publicly available at https://github.com/Amazingren/NTIRE2024_ESR/.
△ Less
Submitted 25 June, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
-
Distributed Federated Learning-Based Deep Learning Model for Privacy MRI Brain Tumor Detection
Authors:
Lisang Zhou,
Meng Wang,
Ning Zhou
Abstract:
Distributed training can facilitate the processing of large medical image datasets, and improve the accuracy and efficiency of disease diagnosis while protecting patient privacy, which is crucial for achieving efficient medical image analysis and accelerating medical research progress. This paper presents an innovative approach to medical image classification, leveraging Federated Learning (FL) to…
▽ More
Distributed training can facilitate the processing of large medical image datasets, and improve the accuracy and efficiency of disease diagnosis while protecting patient privacy, which is crucial for achieving efficient medical image analysis and accelerating medical research progress. This paper presents an innovative approach to medical image classification, leveraging Federated Learning (FL) to address the dual challenges of data privacy and efficient disease diagnosis. Traditional Centralized Machine Learning models, despite their widespread use in medical imaging for tasks such as disease diagnosis, raise significant privacy concerns due to the sensitive nature of patient data. As an alternative, FL emerges as a promising solution by allowing the training of a collective global model across local clients without centralizing the data, thus preserving privacy. Focusing on the application of FL in Magnetic Resonance Imaging (MRI) brain tumor detection, this study demonstrates the effectiveness of the Federated Learning framework coupled with EfficientNet-B0 and the FedAvg algorithm in enhancing both privacy and diagnostic accuracy. Through a meticulous selection of preprocessing methods, algorithms, and hyperparameters, and a comparative analysis of various Convolutional Neural Network (CNN) architectures, the research uncovers optimal strategies for image classification. The experimental results reveal that EfficientNet-B0 outperforms other models like ResNet in handling data heterogeneity and achieving higher accuracy and lower loss, highlighting the potential of FL in overcoming the limitations of traditional models. The study underscores the significance of addressing data heterogeneity and proposes further research directions for broadening the applicability of FL in medical image analysis.
△ Less
Submitted 15 April, 2024;
originally announced April 2024.
-
Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling
Authors:
Quanxiu Wang,
Hui Huang,
Mingjie Wang,
Yong Dai,
Jinzuomu Zhong,
Benlai Tang
Abstract:
Over the past decade, a series of unflagging efforts have been dedicated to developing highly expressive and controllable text-to-speech (TTS) systems. In general, the holistic TTS comprises two interconnected components: the frontend module and the backend module. The frontend excels in capturing linguistic representations from the raw text input, while the backend module converts linguistic cues…
▽ More
Over the past decade, a series of unflagging efforts have been dedicated to developing highly expressive and controllable text-to-speech (TTS) systems. In general, the holistic TTS comprises two interconnected components: the frontend module and the backend module. The frontend excels in capturing linguistic representations from the raw text input, while the backend module converts linguistic cues to speech. The research community has shown growing interest in the study of the frontend component, recognizing its pivotal role in text-to-speech systems, including Text Normalization (TN), Prosody Boundary Prediction (PBP), and Polyphone Disambiguation (PD). Nonetheless, the limitations posed by insufficient annotated textual data and the reliance on homogeneous text signals significantly undermine the effectiveness of its supervised learning. To evade this obstacle, a novel two-stage TTS frontend prediction pipeline, named TAP-FM, is proposed in this paper. Specifically, during the first learning phase, we present a Multi-scale Contrastive Text-audio Pre-training protocol (MC-TAP), which hammers at acquiring richer insights via multi-granularity contrastive pre-training in an unsupervised manner. Instead of mining homogeneous features in prior pre-training approaches, our framework demonstrates the ability to delve deep into both global and local text-audio semantic and acoustic representations. Furthermore, a parallelized TTS frontend model is delicately devised to execute TN, PD, and PBP prediction tasks, respectively in the second stage. Finally, extensive experiments illustrate the superiority of our proposed method, achieving state-of-the-art performance.
△ Less
Submitted 14 April, 2024;
originally announced April 2024.
-
Damage identification of offshore jacket platforms in a digital twin framework considering optimal sensor placement
Authors:
Mengmeng Wang,
Atilla Incecik,
Shizhe Feng,
M. K. Gupta,
Grzegorz Krlolczyk,
Z Li
Abstract:
A new digital twin (DT) framework with optimal sensor placement (OSP) is proposed to accurately calculate the modal responses and identify the damage ratios of the offshore jacket platforms. The proposed damage identification framework consists of two models (namely one OSP model and one damage identification model). The OSP model adopts the multi-objective Lichtenberg algorithm (MOLA) to perform…
▽ More
A new digital twin (DT) framework with optimal sensor placement (OSP) is proposed to accurately calculate the modal responses and identify the damage ratios of the offshore jacket platforms. The proposed damage identification framework consists of two models (namely one OSP model and one damage identification model). The OSP model adopts the multi-objective Lichtenberg algorithm (MOLA) to perform the sensor number/location optimization to make a good balance between the sensor cost and the modal calculation accuracy. In the damage identification model, the Markov Chain Monte Carlo (MCMC)-Bayesian method is developed to calculate the structural damage ratios based on the modal information obtained from the sensory measurements, where the uncertainties of the structural parameters are quantified. The proposed method is validated using an offshore jacket platform, and the analysis results demonstrate efficient identification of the structural damage location and severity.
△ Less
Submitted 26 March, 2024;
originally announced April 2024.
-
Sound event localization and classification using WASN in Outdoor Environment
Authors:
Dongzhe Zhang,
Jianfeng Chen,
Jisheng Bai,
Mou Wang
Abstract:
Deep learning-based sound event localization and classification is an emerging research area within wireless acoustic sensor networks. However, current methods for sound event localization and classification typically rely on a single microphone array, making them susceptible to signal attenuation and environmental noise, which limits their monitoring range. Moreover, methods using multiple microp…
▽ More
Deep learning-based sound event localization and classification is an emerging research area within wireless acoustic sensor networks. However, current methods for sound event localization and classification typically rely on a single microphone array, making them susceptible to signal attenuation and environmental noise, which limits their monitoring range. Moreover, methods using multiple microphone arrays often focus solely on source localization, neglecting the aspect of sound event classification. In this paper, we propose a deep learning-based method that employs multiple features and attention mechanisms to estimate the location and class of sound source. We introduce a Soundmap feature to capture spatial information across multiple frequency bands. We also use the Gammatone filter to generate acoustic features more suitable for outdoor environments. Furthermore, we integrate attention mechanisms to learn channel-wise relationships and temporal dependencies within the acoustic features. To evaluate our proposed method, we conduct experiments using simulated datasets with different levels of noise and size of monitoring areas, as well as different arrays and source positions. The experimental results demonstrate the superiority of our proposed method over state-of-the-art methods in both sound event classification and sound source localization tasks. And we provide further analysis to explain the reasons for the observed errors.
△ Less
Submitted 29 March, 2024;
originally announced March 2024.
-
IDF-CR: Iterative Diffusion Process for Divide-and-Conquer Cloud Removal in Remote-sensing Images
Authors:
Meilin Wang,
Yexing Song,
Pengxu Wei,
Xiaoyu Xian,
Yukai Shi,
Liang Lin
Abstract:
Deep learning technologies have demonstrated their effectiveness in removing cloud cover from optical remote-sensing images. Convolutional Neural Networks (CNNs) exert dominance in the cloud removal tasks. However, constrained by the inherent limitations of convolutional operations, CNNs can address only a modest fraction of cloud occlusion. In recent years, diffusion models have achieved state-of…
▽ More
Deep learning technologies have demonstrated their effectiveness in removing cloud cover from optical remote-sensing images. Convolutional Neural Networks (CNNs) exert dominance in the cloud removal tasks. However, constrained by the inherent limitations of convolutional operations, CNNs can address only a modest fraction of cloud occlusion. In recent years, diffusion models have achieved state-of-the-art (SOTA) proficiency in image generation and reconstruction due to their formidable generative capabilities. Inspired by the rapid development of diffusion models, we first present an iterative diffusion process for cloud removal (IDF-CR), which exhibits a strong generative capabilities to achieve component divide-and-conquer cloud removal. IDF-CR consists of a pixel space cloud removal module (Pixel-CR) and a latent space iterative noise diffusion network (IND). Specifically, IDF-CR is divided into two-stage models that address pixel space and latent space. The two-stage model facilitates a strategic transition from preliminary cloud reduction to meticulous detail refinement. In the pixel space stage, Pixel-CR initiates the processing of cloudy images, yielding a suboptimal cloud removal prior to providing the diffusion model with prior cloud removal knowledge. In the latent space stage, the diffusion model transforms low-quality cloud removal into high-quality clean output. We refine the Stable Diffusion by implementing ControlNet. In addition, an unsupervised iterative noise refinement (INR) module is introduced for diffusion model to optimize the distribution of the predicted noise, thereby enhancing advanced detail recovery. Our model performs best with other SOTA methods, including image reconstruction and optical remote-sensing cloud removal on the optical remote-sensing datasets.
△ Less
Submitted 18 March, 2024;
originally announced March 2024.
-
LLM-Assisted Light: Leveraging Large Language Model Capabilities for Human-Mimetic Traffic Signal Control in Complex Urban Environments
Authors:
Maonan Wang,
Aoyu Pang,
Yuheng Kan,
Man-On Pun,
Chung Shue Chen,
Bo Huang
Abstract:
Traffic congestion in metropolitan areas presents a formidable challenge with far-reaching economic, environmental, and societal ramifications. Therefore, effective congestion management is imperative, with traffic signal control (TSC) systems being pivotal in this endeavor. Conventional TSC systems, designed upon rule-based algorithms or reinforcement learning (RL), frequently exhibit deficiencie…
▽ More
Traffic congestion in metropolitan areas presents a formidable challenge with far-reaching economic, environmental, and societal ramifications. Therefore, effective congestion management is imperative, with traffic signal control (TSC) systems being pivotal in this endeavor. Conventional TSC systems, designed upon rule-based algorithms or reinforcement learning (RL), frequently exhibit deficiencies in managing the complexities and variabilities of urban traffic flows, constrained by their limited capacity for adaptation to unfamiliar scenarios. In response to these limitations, this work introduces an innovative approach that integrates Large Language Models (LLMs) into TSC, harnessing their advanced reasoning and decision-making faculties. Specifically, a hybrid framework that augments LLMs with a suite of perception and decision-making tools is proposed, facilitating the interrogation of both the static and dynamic traffic information. This design places the LLM at the center of the decision-making process, combining external traffic data with established TSC methods. Moreover, a simulation platform is developed to corroborate the efficacy of the proposed framework. The findings from our simulations attest to the system's adeptness in adjusting to a multiplicity of traffic environments without the need for additional training. Notably, in cases of Sensor Outage (SO), our approach surpasses conventional RL-based systems by reducing the average waiting time by $20.4\%$. This research signifies a notable advance in TSC strategies and paves the way for the integration of LLMs into real-world, dynamic scenarios, highlighting their potential to revolutionize traffic management. The related code is available at https://github.com/Traffic-Alpha/LLM-Assisted-Light.
△ Less
Submitted 12 June, 2024; v1 submitted 13 March, 2024;
originally announced March 2024.
-
Asymmetric Feature Fusion for Image Retrieval
Authors:
Hui Wu,
Min Wang,
Wengang Zhou,
Zhenbo Lu,
Houqiang Li
Abstract:
In asymmetric retrieval systems, models with different capacities are deployed on platforms with different computational and storage resources. Despite the great progress, existing approaches still suffer from a dilemma between retrieval efficiency and asymmetric accuracy due to the limited capacity of the lightweight query model. In this work, we propose an Asymmetric Feature Fusion (AFF) paradig…
▽ More
In asymmetric retrieval systems, models with different capacities are deployed on platforms with different computational and storage resources. Despite the great progress, existing approaches still suffer from a dilemma between retrieval efficiency and asymmetric accuracy due to the limited capacity of the lightweight query model. In this work, we propose an Asymmetric Feature Fusion (AFF) paradigm, which advances existing asymmetric retrieval systems by considering the complementarity among different features just at the gallery side. Specifically, it first embeds each gallery image into various features, e.g., local features and global features. Then, a dynamic mixer is introduced to aggregate these features into compact embedding for efficient search. On the query side, only a single lightweight model is deployed for feature extraction. The query model and dynamic mixer are jointly trained by sharing a momentum-updated classifier. Notably, the proposed paradigm boosts the accuracy of asymmetric retrieval without introducing any extra overhead to the query side. Exhaustive experiments on various landmark retrieval datasets demonstrate the superiority of our paradigm.
△ Less
Submitted 1 March, 2024;
originally announced March 2024.
-
Structure Similarity Preservation Learning for Asymmetric Image Retrieval
Authors:
Hui Wu,
Min Wang,
Wengang Zhou,
Houqiang Li
Abstract:
Asymmetric image retrieval is a task that seeks to balance retrieval accuracy and efficiency by leveraging lightweight and large models for the query and gallery sides, respectively. The key to asymmetric image retrieval is realizing feature compatibility between different models. Despite the great progress, most existing approaches either rely on classifiers inherited from gallery models or simpl…
▽ More
Asymmetric image retrieval is a task that seeks to balance retrieval accuracy and efficiency by leveraging lightweight and large models for the query and gallery sides, respectively. The key to asymmetric image retrieval is realizing feature compatibility between different models. Despite the great progress, most existing approaches either rely on classifiers inherited from gallery models or simply impose constraints at the instance level, ignoring the structure of embedding space. In this work, we propose a simple yet effective structure similarity preserving method to achieve feature compatibility between query and gallery models. Specifically, we first train a product quantizer offline with the image features embedded by the gallery model. The centroid vectors in the quantizer serve as anchor points in the embedding space of the gallery model to characterize its structure. During the training of the query model, anchor points are shared by the query and gallery models. The relationships between image features and centroid vectors are considered as structure similarities and constrained to be consistent. Moreover, our approach makes no assumption about the existence of any labeled training data and thus can be extended to an unlimited amount of data. Comprehensive experiments on large-scale landmark retrieval demonstrate the effectiveness of our approach. Our code is released at: https://github.com/MCC-WH/SSP.
△ Less
Submitted 1 March, 2024;
originally announced March 2024.
-
SingVisio: Visual Analytics of Diffusion Model for Singing Voice Conversion
Authors:
Liumeng Xue,
Chaoren Wang,
Mingxuan Wang,
Xueyao Zhang,
Jun Han,
Zhizheng Wu
Abstract:
In this study, we present SingVisio, an interactive visual analysis system that aims to explain the diffusion model used in singing voice conversion. SingVisio provides a visual display of the generation process in diffusion models, showcasing the step-by-step denoising of the noisy spectrum and its transformation into a clean spectrum that captures the desired singer's timbre. The system also fac…
▽ More
In this study, we present SingVisio, an interactive visual analysis system that aims to explain the diffusion model used in singing voice conversion. SingVisio provides a visual display of the generation process in diffusion models, showcasing the step-by-step denoising of the noisy spectrum and its transformation into a clean spectrum that captures the desired singer's timbre. The system also facilitates side-by-side comparisons of different conditions, such as source content, melody, and target timbre, highlighting the impact of these conditions on the diffusion generation process and resulting conversions. Through comprehensive evaluations, SingVisio demonstrates its effectiveness in terms of system design, functionality, explainability, and user-friendliness. It offers users of various backgrounds valuable learning experiences and insights into the diffusion model for singing voice conversion.
△ Less
Submitted 19 February, 2024;
originally announced February 2024.
-
Training-free image style alignment for self-adapting domain shift on handheld ultrasound devices
Authors:
Hongye Zeng,
Ke Zou,
Zhihao Chen,
Yuchong Gao,
Hongbo Chen,
Haibin Zhang,
Kang Zhou,
Meng Wang,
Rick Siow Mong Goh,
Yong Liu,
Chang Jiang,
Rui Zheng,
Huazhu Fu
Abstract:
Handheld ultrasound devices face usage limitations due to user inexperience and cannot benefit from supervised deep learning without extensive expert annotations. Moreover, the models trained on standard ultrasound device data are constrained by training data distribution and perform poorly when directly applied to handheld device data. In this study, we propose the Training-free Image Style Align…
▽ More
Handheld ultrasound devices face usage limitations due to user inexperience and cannot benefit from supervised deep learning without extensive expert annotations. Moreover, the models trained on standard ultrasound device data are constrained by training data distribution and perform poorly when directly applied to handheld device data. In this study, we propose the Training-free Image Style Alignment (TISA) framework to align the style of handheld device data to those of standard devices. The proposed TISA can directly infer handheld device images without extra training and is suited for clinical applications. We show that TISA performs better and more stably in medical detection and segmentation tasks for handheld device data. We further validate TISA as the clinical model for automatic measurements of spinal curvature and carotid intima-media thickness. The automatic measurements agree well with manual measurements made by human experts and the measurement errors remain within clinically acceptable ranges. We demonstrate the potential for TISA to facilitate automatic diagnosis on handheld ultrasound devices and expedite their eventual widespread use.
△ Less
Submitted 17 February, 2024;
originally announced February 2024.
-
Description on IEEE ICME 2024 Grand Challenge: Semi-supervised Acoustic Scene Classification under Domain Shift
Authors:
Jisheng Bai,
Mou Wang,
Haohe Liu,
Han Yin,
Yafei Jia,
Siwei Huang,
Yutong Du,
Dongzhe Zhang,
Dongyuan Shi,
Woon-Seng Gan,
Mark D. Plumbley,
Susanto Rahardja,
Bin Xiang,
Jianfeng Chen
Abstract:
Acoustic scene classification (ASC) is a crucial research problem in computational auditory scene analysis, and it aims to recognize the unique acoustic characteristics of an environment. One of the challenges of the ASC task is the domain shift between training and testing data. Since 2018, ASC challenges have focused on the generalization of ASC models across different recording devices. Althoug…
▽ More
Acoustic scene classification (ASC) is a crucial research problem in computational auditory scene analysis, and it aims to recognize the unique acoustic characteristics of an environment. One of the challenges of the ASC task is the domain shift between training and testing data. Since 2018, ASC challenges have focused on the generalization of ASC models across different recording devices. Although this task, in recent years, has achieved substantial progress in device generalization, the challenge of domain shift between different geographical regions, involving discrepancies such as time, space, culture, and language, remains insufficiently explored at present. In addition, considering the abundance of unlabeled acoustic scene data in the real world, it is important to study the possible ways to utilize these unlabelled data. Therefore, we introduce the task Semi-supervised Acoustic Scene Classification under Domain Shift in the ICME 2024 Grand Challenge. We encourage participants to innovate with semi-supervised learning techniques, aiming to develop more robust ASC models under domain shift.
△ Less
Submitted 28 February, 2024; v1 submitted 4 February, 2024;
originally announced February 2024.
-
Retrieval Augmented End-to-End Spoken Dialog Models
Authors:
Mingqiu Wang,
Izhak Shafran,
Hagen Soltau,
Wei Han,
Yuan Cao,
Dian Yu,
Laurent El Shafey
Abstract:
We recently developed SLM, a joint speech and language model, which fuses a pretrained foundational speech model and a large language model (LLM), while preserving the in-context learning capability intrinsic to the pretrained LLM. In this paper, we apply SLM to speech dialog applications where the dialog states are inferred directly from the audio signal.
Task-oriented dialogs often contain dom…
▽ More
We recently developed SLM, a joint speech and language model, which fuses a pretrained foundational speech model and a large language model (LLM), while preserving the in-context learning capability intrinsic to the pretrained LLM. In this paper, we apply SLM to speech dialog applications where the dialog states are inferred directly from the audio signal.
Task-oriented dialogs often contain domain-specific entities, i.e., restaurants, hotels, train stations, and city names, which are difficult to recognize, however, critical for the downstream applications. Inspired by the RAG (retrieval-augmented generation) paradigm, we propose a retrieval augmented SLM (ReSLM) that overcomes this weakness. We first train a speech retriever to retrieve text entities mentioned in the audio. The retrieved entities are then added as text inputs to the underlying SLM to bias model predictions. We evaluated ReSLM on speech MultiWoz task (DSTC-11 challenge), and found that this retrieval augmentation boosts model performance, achieving joint goal accuracy (38.6% vs 32.7%), slot error rate (20.6% vs 24.8%) and ASR word error rate (5.5% vs 6.7%). While demonstrated on dialog state tracking, our approach is broadly applicable to other speech tasks requiring contextual information or domain-specific entities, such as contextual ASR with biasing capability.
△ Less
Submitted 2 February, 2024;
originally announced February 2024.
-
A Robust Super-resolution Gridless Imaging Framework for UAV-borne SAR Tomography
Authors:
Silin Gao,
Wenlong Wang,
Muhan Wang,
Zhe Zhang,
Zai Yang,
Xiaolan Qiu,
Bingchen Zhang,
Yirong Wu
Abstract:
Synthetic aperture radar (SAR) tomography (TomoSAR) retrieves three-dimensional (3-D) information from multiple SAR images, effectively addresses the layover problem, and has become pivotal in urban mapping. Unmanned aerial vehicle (UAV) has gained popularity as a TomoSAR platform, offering distinct advantages such as the ability to achieve 3-D imaging in a single flight, cost-effectiveness, rapid…
▽ More
Synthetic aperture radar (SAR) tomography (TomoSAR) retrieves three-dimensional (3-D) information from multiple SAR images, effectively addresses the layover problem, and has become pivotal in urban mapping. Unmanned aerial vehicle (UAV) has gained popularity as a TomoSAR platform, offering distinct advantages such as the ability to achieve 3-D imaging in a single flight, cost-effectiveness, rapid deployment, and flexible trajectory planning. The evolution of compressed sensing (CS) has led to the widespread adoption of sparse reconstruction techniques in TomoSAR signal processing, with a focus on $\ell _1$ norm regularization and other grid-based CS methods. However, the discretization of illuminated scene along elevation introduces modeling errors, resulting in reduced reconstruction accuracy, known as the "off-grid" effect. Recent advancements have introduced gridless CS algorithms to mitigate this issue. This paper presents an innovative gridless 3-D imaging framework tailored for UAV-borne TomoSAR. Capitalizing on the pulse repetition frequency (PRF) redundancy inherent in slow UAV platforms, a multiple measurement vectors (MMV) model is constructed to enhance noise immunity without compromising azimuth-range resolution. Given the sparsely placed array elements due to mounting platform constraints, an atomic norm soft thresholding algorithm is proposed for partially observed MMV, offering gridless reconstruction capability and super-resolution. An efficient alternative optimization algorithm is also employed to enhance computational efficiency. Validation of the proposed framework is achieved through computer simulations and flight experiments, affirming its efficacy in UAV-borne TomoSAR applications.
△ Less
Submitted 2 February, 2024;
originally announced February 2024.
-
Towards Arbitrary-Scale Histopathology Image Super-resolution: An Efficient Dual-branch Framework via Implicit Self-texture Enhancement
Authors:
Minghong Duan,
Linhao Qu,
Zhiwei Yang,
Manning Wang,
Chenxi Zhang,
Zhijian Song
Abstract:
High-quality whole-slide scanners are expensive, complex, and time-consuming, thus limiting the acquisition and utilization of high-resolution pathology whole-slide images in daily clinical work. Deep learning-based single-image super-resolution techniques are an effective way to solve this problem by synthesizing high-resolution images from low-resolution ones. However, the existing super-resolut…
▽ More
High-quality whole-slide scanners are expensive, complex, and time-consuming, thus limiting the acquisition and utilization of high-resolution pathology whole-slide images in daily clinical work. Deep learning-based single-image super-resolution techniques are an effective way to solve this problem by synthesizing high-resolution images from low-resolution ones. However, the existing super-resolution models applied in pathology images can only work in fixed integer magnifications, significantly decreasing their applicability. Though methods based on implicit neural representation have shown promising results in arbitrary-scale super-resolution of natural images, applying them directly to pathology images is inadequate because they have unique fine-grained image textures different from natural images. Thus, we propose an Implicit Self-Texture Enhancement-based dual-branch framework (ISTE) for arbitrary-scale super-resolution of pathology images to address this challenge. ISTE contains a pixel learning branch and a texture learning branch, which first learn pixel features and texture features, respectively. Then, we design a two-stage texture enhancement strategy to fuse the features from the two branches to obtain the super-resolution results, where the first stage is feature-based texture enhancement, and the second stage is spatial-domain-based texture enhancement. Extensive experiments on three public datasets show that ISTE outperforms existing fixed-scale and arbitrary-scale algorithms at multiple magnifications and helps to improve downstream task performance. To the best of our knowledge, this is the first work to achieve arbitrary-scale super-resolution in pathology images. Codes will be available.
△ Less
Submitted 15 July, 2024; v1 submitted 28 January, 2024;
originally announced January 2024.
-
Sub-band and Full-band Interactive U-Net with DPRNN for Demixing Cross-talk Stereo Music
Authors:
Han Yin,
Mou Wang,
Jisheng Bai,
Dongyuan Shi,
Woon-Seng Gan,
Jianfeng Chen
Abstract:
This paper presents a detailed description of our proposed methods for the ICASSP 2024 Cadenza Challenge. Experimental results show that the proposed system can achieve better performance than official baselines.
This paper presents a detailed description of our proposed methods for the ICASSP 2024 Cadenza Challenge. Experimental results show that the proposed system can achieve better performance than official baselines.
△ Less
Submitted 10 January, 2024;
originally announced January 2024.
-
UCorrect: An Unsupervised Framework for Automatic Speech Recognition Error Correction
Authors:
Jiaxin Guo,
Minghan Wang,
Xiaosong Qiao,
Daimeng Wei,
Hengchao Shang,
Zongyao Li,
Zhengzhe Yu,
Yinglu Li,
Chang Su,
Min Zhang,
Shimin Tao,
Hao Yang
Abstract:
Error correction techniques have been used to refine the output sentences from automatic speech recognition (ASR) models and achieve a lower word error rate (WER). Previous works usually adopt end-to-end models and has strong dependency on Pseudo Paired Data and Original Paired Data. But when only pre-training on Pseudo Paired Data, previous models have negative effect on correction. While fine-tu…
▽ More
Error correction techniques have been used to refine the output sentences from automatic speech recognition (ASR) models and achieve a lower word error rate (WER). Previous works usually adopt end-to-end models and has strong dependency on Pseudo Paired Data and Original Paired Data. But when only pre-training on Pseudo Paired Data, previous models have negative effect on correction. While fine-tuning on Original Paired Data, the source side data must be transcribed by a well-trained ASR model, which takes a lot of time and not universal. In this paper, we propose UCorrect, an unsupervised Detector-Generator-Selector framework for ASR Error Correction. UCorrect has no dependency on the training data mentioned before. The whole procedure is first to detect whether the character is erroneous, then to generate some candidate characters and finally to select the most confident one to replace the error character. Experiments on the public AISHELL-1 dataset and WenetSpeech dataset show the effectiveness of UCorrect for ASR error correction: 1) it achieves significant WER reduction, achieves 6.83\% even without fine-tuning and 14.29\% after fine-tuning; 2) it outperforms the popular NAR correction models by a large margin with a competitive low latency; and 3) it is an universal method, as it reduces all WERs of the ASR model with different decoding strategies and reduces all WERs of ASR models trained on different scale datasets.
△ Less
Submitted 11 January, 2024;
originally announced January 2024.
-
Distance Guided Generative Adversarial Network for Explainable Binary Classifications
Authors:
Xiangyu Xiong,
Yue Sun,
Xiaohong Liu,
Wei Ke,
Chan-Tong Lam,
Jiangang Chen,
Mingfeng Jiang,
Mingwei Wang,
Hui Xie,
Tong Tong,
Qinquan Gao,
Hao Chen,
Tao Tan
Abstract:
Despite the potential benefits of data augmentation for mitigating the data insufficiency, traditional augmentation methods primarily rely on the prior intra-domain knowledge. On the other hand, advanced generative adversarial networks (GANs) generate inter-domain samples with limited variety. These previous methods make limited contributions to describing the decision boundaries for binary classi…
▽ More
Despite the potential benefits of data augmentation for mitigating the data insufficiency, traditional augmentation methods primarily rely on the prior intra-domain knowledge. On the other hand, advanced generative adversarial networks (GANs) generate inter-domain samples with limited variety. These previous methods make limited contributions to describing the decision boundaries for binary classification. In this paper, we propose a distance guided GAN (DisGAN) which controls the variation degrees of generated samples in the hyperplane space. Specifically, we instantiate the idea of DisGAN by combining two ways. The first way is vertical distance GAN (VerDisGAN) where the inter-domain generation is conditioned on the vertical distances. The second way is horizontal distance GAN (HorDisGAN) where the intra-domain generation is conditioned on the horizontal distances. Furthermore, VerDisGAN can produce the class-specific regions by mapping the source images to the hyperplane. Experimental results show that DisGAN consistently outperforms the GAN-based augmentation methods with explainable binary classification. The proposed method can apply to different classification architectures and has potential to extend to multi-class classification.
△ Less
Submitted 29 December, 2023;
originally announced December 2023.
-
Speech Translation with Large Language Models: An Industrial Practice
Authors:
Zhichao Huang,
Rong Ye,
Tom Ko,
Qianqian Dong,
Shanbo Cheng,
Mingxuan Wang,
Hang Li
Abstract:
Given the great success of large language models (LLMs) across various tasks, in this paper, we introduce LLM-ST, a novel and effective speech translation model constructed upon a pre-trained LLM. By integrating the large language model (LLM) with a speech encoder and employing multi-task instruction tuning, LLM-ST can produce accurate timestamped transcriptions and translations, even from long au…
▽ More
Given the great success of large language models (LLMs) across various tasks, in this paper, we introduce LLM-ST, a novel and effective speech translation model constructed upon a pre-trained LLM. By integrating the large language model (LLM) with a speech encoder and employing multi-task instruction tuning, LLM-ST can produce accurate timestamped transcriptions and translations, even from long audio inputs. Furthermore, our findings indicate that the implementation of Chain-of-Thought (CoT) prompting can yield advantages in the context of LLM-ST. Through rigorous experimentation on English and Chinese datasets, we showcase the exceptional performance of LLM-ST, establishing a new benchmark in the field of speech translation. Demo: https://speechtranslation.github.io/llm-st/.
△ Less
Submitted 21 December, 2023;
originally announced December 2023.
-
Amphion: An Open-Source Audio, Music and Speech Generation Toolkit
Authors:
Xueyao Zhang,
Liumeng Xue,
Yicheng Gu,
Yuancheng Wang,
Haorui He,
Chaoren Wang,
Xi Chen,
Zihao Fang,
Haopeng Chen,
Junan Zhang,
Tze Ying Tang,
Lexiao Zou,
Mingxuan Wang,
Jun Han,
Kai Chen,
Haizhou Li,
Zhizheng Wu
Abstract:
Amphion is an open-source toolkit for Audio, Music, and Speech Generation, targeting to ease the way for junior researchers and engineers into these fields. It presents a unified framework that is inclusive of diverse generation tasks and models, with the added bonus of being easily extendable for new incorporation. The toolkit is designed with beginner-friendly workflows and pre-trained models, a…
▽ More
Amphion is an open-source toolkit for Audio, Music, and Speech Generation, targeting to ease the way for junior researchers and engineers into these fields. It presents a unified framework that is inclusive of diverse generation tasks and models, with the added bonus of being easily extendable for new incorporation. The toolkit is designed with beginner-friendly workflows and pre-trained models, allowing both beginners and seasoned researchers to kick-start their projects with relative ease. Additionally, it provides interactive visualizations and demonstrations of classic models for educational purposes. The initial release of Amphion v0.1 supports a range of tasks including Text to Speech (TTS), Text to Audio (TTA), and Singing Voice Conversion (SVC), supplemented by essential components like data preprocessing, state-of-the-art vocoders, and evaluation metrics. This paper presents a high-level overview of Amphion.
△ Less
Submitted 22 February, 2024; v1 submitted 15 December, 2023;
originally announced December 2023.
-
UniTSA: A Universal Reinforcement Learning Framework for V2X Traffic Signal Control
Authors:
Maonan Wang,
Xi Xiong,
Yuheng Kan,
Chengcheng Xu,
Man-On Pun
Abstract:
Traffic congestion is a persistent problem in urban areas, which calls for the development of effective traffic signal control (TSC) systems. While existing Reinforcement Learning (RL)-based methods have shown promising performance in optimizing TSC, it is challenging to generalize these methods across intersections of different structures. In this work, a universal RL-based TSC framework is propo…
▽ More
Traffic congestion is a persistent problem in urban areas, which calls for the development of effective traffic signal control (TSC) systems. While existing Reinforcement Learning (RL)-based methods have shown promising performance in optimizing TSC, it is challenging to generalize these methods across intersections of different structures. In this work, a universal RL-based TSC framework is proposed for Vehicle-to-Everything (V2X) environments. The proposed framework introduces a novel agent design that incorporates a junction matrix to characterize intersection states, making the proposed model applicable to diverse intersections. To equip the proposed RL-based framework with enhanced capability of handling various intersection structures, novel traffic state augmentation methods are tailor-made for signal light control systems. Finally, extensive experimental results derived from multiple intersection configurations confirm the effectiveness of the proposed framework. The source code in this work is available at https://github.com/wmn7/Universal_Light
△ Less
Submitted 8 December, 2023;
originally announced December 2023.
-
Sequencing-enabled Hierarchical Cooperative CAV On-ramp Merging Control with Enhanced Stability and Feasibility
Authors:
Sixu Li,
Yang Zhou,
Xinyue Ye,
Jiwan Jiang,
Meng Wang
Abstract:
This paper develops a sequencing-enabled hierarchical connected automated vehicle (CAV) cooperative on-ramp merging control framework. The proposed framework consists of a two-layer design: the upper level control sequences the vehicles to harmonize the traffic density across mainline and on-ramp segments while enhancing lower-level control efficiency through a mixed-integer linear programming for…
▽ More
This paper develops a sequencing-enabled hierarchical connected automated vehicle (CAV) cooperative on-ramp merging control framework. The proposed framework consists of a two-layer design: the upper level control sequences the vehicles to harmonize the traffic density across mainline and on-ramp segments while enhancing lower-level control efficiency through a mixed-integer linear programming formulation. Subsequently, the lower-level control employs a longitudinal distributed model predictive control (MPC) supplemented by a virtual car-following (CF) concept to ensure asymptotic local stability, l_2 norm string stability, and safety. Proofs of asymptotic local stability and l_2 norm string stability are mathematically derived. Compared to other prevalent asymptotic local-stable MPC controllers, the proposed distributed MPC controller greatly expands the initial feasible set. Additionally, an auxiliary lateral control is developed to maintain lane-keeping and merging smoothness while accommodating ramp geometric curvature. To validate the proposed framework, multiple numerical experiments are conducted. Results indicate a notable outperformance of our upper-level controller against a distance-based sequencing method. Furthermore, the lower-level control effectively ensures smooth acceleration, safe merging with adequate spacing, adherence to proven longitudinal local and string stability, and rapid regulation of lateral deviations.
△ Less
Submitted 25 May, 2024; v1 submitted 24 November, 2023;
originally announced November 2023.
-
Interactive Dual-Conformer with Scene-Inspired Mask for Soft Sound Event Detection
Authors:
Han Yin,
Jisheng Bai,
Mou Wang,
Dongyuan Shi,
Woon-Seng Gan,
Jianfeng Chen
Abstract:
Traditional binary hard labels for sound event detection (SED) lack details about the complexity and variability of sound event distributions. Recently, a novel annotation workflow is proposed to generate fine-grained non-binary soft labels, resulting in a new real-life dataset named MAESTRO Real for SED. In this paper, we first propose an interactive dual-conformer (IDC) module, in which a cross-…
▽ More
Traditional binary hard labels for sound event detection (SED) lack details about the complexity and variability of sound event distributions. Recently, a novel annotation workflow is proposed to generate fine-grained non-binary soft labels, resulting in a new real-life dataset named MAESTRO Real for SED. In this paper, we first propose an interactive dual-conformer (IDC) module, in which a cross-interaction mechanism is applied to effectively exploit the information from soft labels. In addition, a novel scene-inspired mask (SIM) based on soft labels is incorporated for more precise SED predictions. The SIM is initially generated through a statistical approach, referred as SIM-V1. However, the fixed artificial mask may mismatch the SED model, resulting in limited effectiveness. Therefore, we further propose SIM-V2, which employs a word embedding model for adaptive SIM estimation. Experimental results show that the proposed IDC module can effectively utilize the information from soft labels, and the integration of SIM-V1 can further improve the accuracy. In addition, the impact of different word embedding dimensions on SIM-V2 is explored, and the results show that the appropriate dimension can enable SIM-V2 achieve superior performance than SIM-V1. In DCASE 2023 Challenge Task4B, the proposed system achieved the top ranking performance on the evaluation dataset of MAESTRO Real.
△ Less
Submitted 7 December, 2023; v1 submitted 23 November, 2023;
originally announced November 2023.
-
AudioLog: LLMs-Powered Long Audio Logging with Hybrid Token-Semantic Contrastive Learning
Authors:
Jisheng Bai,
Han Yin,
Mou Wang,
Dongyuan Shi,
Woon-Seng Gan,
Jianfeng Chen,
Susanto Rahardja
Abstract:
Previous studies in automated audio captioning have faced difficulties in accurately capturing the complete temporal details of acoustic scenes and events within long audio sequences. This paper presents AudioLog, a large language models (LLMs)-powered audio logging system with hybrid token-semantic contrastive learning. Specifically, we propose to fine-tune the pre-trained hierarchical token-sema…
▽ More
Previous studies in automated audio captioning have faced difficulties in accurately capturing the complete temporal details of acoustic scenes and events within long audio sequences. This paper presents AudioLog, a large language models (LLMs)-powered audio logging system with hybrid token-semantic contrastive learning. Specifically, we propose to fine-tune the pre-trained hierarchical token-semantic audio Transformer by incorporating contrastive learning between hybrid acoustic representations. We then leverage LLMs to generate audio logs that summarize textual descriptions of the acoustic environment. Finally, we evaluate the AudioLog system on two datasets with both scene and event annotations. Experiments show that the proposed system achieves exceptional performance in acoustic scene classification and sound event detection, surpassing existing methods in the field. Further analysis of the prompts to LLMs demonstrates that AudioLog can effectively summarize long audio sequences. To the best of our knowledge, this approach is the first attempt to leverage LLMs for summarizing long audio sequences.
△ Less
Submitted 4 January, 2024; v1 submitted 21 November, 2023;
originally announced November 2023.
-
NOMA Enabled Multi-Access Edge Computing: A Joint MU-MIMO Precoding and Computation Offloading Design
Authors:
Deyou Zhang,
Meng Wang,
Shuo Shi,
Ming Xiao
Abstract:
This letter investigates computation offloading and transmit precoding co-design for multi-access edge computing (MEC), where multiple MEC users (MUs) equipped with multiple antennas access the MEC server in a non-orthogonal multiple access manner. We aim to minimize the total energy consumption of all MUs while satisfying the latency constraints by jointly optimizing the computational frequency,…
▽ More
This letter investigates computation offloading and transmit precoding co-design for multi-access edge computing (MEC), where multiple MEC users (MUs) equipped with multiple antennas access the MEC server in a non-orthogonal multiple access manner. We aim to minimize the total energy consumption of all MUs while satisfying the latency constraints by jointly optimizing the computational frequency, offloading ratio, and precoding matrix of each MU. For tractability, we first decompose the original problem into three subproblems and then solve these subproblems iteratively until convergence. Simulation results validate the convergence of the proposed method and demonstrate its superiority over baseline algorithms.
△ Less
Submitted 7 November, 2023;
originally announced November 2023.