Search | arXiv e-print repository

Adversarial Contrastive Learning Based Physics-Informed Temporal Networks for Cuffless Blood Pressure Estimation

Authors: Rui Wang, Mengshi Qi, Yingxia Shao, Anfu Zhou, Huadong Ma

Abstract: Time series data mining is immensely important in extensive applications, such as traffic, medical, and e-commerce. In this paper, we focus on medical temporal variation modeling, i.e., cuffless blood pressure (BP) monitoring, which has great value in cardiovascular healthcare. Although they provide a comfortable user experience, such methods suffer from the demand for a significant amount of realistic data to train an individual model for each subject, especially considering the invasive or obtrusive BP ground-truth measurements. To tackle this challenge, we introduce a novel physics-informed temporal network (PITN) with adversarial contrastive learning to enable precise BP estimation with very limited data. Specifically, we first enhance the physics-informed neural network (PINN) with the temporal block for investigating BP dynamics' multi-periodicity for personal cardiovascular cycle modeling and temporal variation. We then employ adversarial training to generate extra physiological time series data, improving PITN's robustness in the face of sparse subject-specific training data. Furthermore, we utilize contrastive learning to capture the discriminative variations of cardiovascular physiologic phenomena. This approach aggregates physiological signals with similar blood pressure values in latent space while separating clusters of samples with dissimilar blood pressure values. Experiments on three widely-adopted datasets with different modalities (i.e., bioimpedance, PPG, millimeter-wave) demonstrate the superiority and effectiveness of the proposed methods over previous state-of-the-art approaches. The code is available at https://github.com/Zest86/ACL-PITN.

Submitted 15 August, 2024; originally announced August 2024.

Comments: 14 pages, 8 figures

arXiv:2407.02616 [pdf]

Deep Learning Based Apparent Diffusion Coefficient Map Generation from Multi-parametric MR Images for Patients with Diffuse Gliomas

Authors: Zach Eidex, Mojtaba Safari, Jacob Wynne, Richard L. J. Qiu, Tonghe Wang, David Viar Hernandez, Hui-Kuo Shu, Hui Mao, Xiaofeng Yang

Abstract: Purpose: Apparent diffusion coefficient (ADC) maps derived from diffusion weighted (DWI) MRI provide functional measurements about the water molecules in tissues. However, DWI is time consuming and very susceptible to image artifacts, leading to inaccurate ADC measurements. This study aims to develop a deep learning framework to synthesize ADC maps from multi-parametric MR images. Methods: We proposed the multiparametric residual vision transformer model (MPR-ViT) that leverages the long-range context of ViT layers along with the precision of convolutional operators. Residual blocks throughout the network significantly increase the representational power of the model. The MPR-ViT model was applied to T1w and T2-fluid attenuated inversion recovery (T2-FLAIR) images of 501 glioma cases from a publicly available dataset including preprocessed ADC maps. Selected patients were divided into training (N=400), validation (N=50) and test (N=51) sets, respectively. Using the preprocessed ADC maps as ground truth, model performance was evaluated and compared against the Vision Convolutional Transformer (VCT) and residual vision transformer (ResViT) models. Results: The results are as follows using T1w + T2-FLAIR MRI as inputs: MPR-ViT - PSNR: 31.0 +/- 2.1, MSE: 0.009 +/- 0.0005, SSIM: 0.950 +/- 0.015. In addition, ablation studies showed the relative impact on performance of each input sequence. Both qualitative and quantitative results indicate that the proposed MR-ViT model performs favorably against the ground truth data. Conclusion: We show that high-quality ADC maps can be synthesized from structural MRI using a MPR-VCT model. Our predicted images show better conformality to the ground truth volume than ResViT and VCT predictions. These high-quality synthetic ADC maps would be particularly useful for disease diagnosis and intervention, especially when ADC maps have artifacts or are unavailable.
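The abstract above reports MSE and PSNR without stating how intensities were normalized or how the metrics were averaged over cases, so the following is only a minimal sketch of the standard definitions, not the authors' evaluation code. It assumes images flattened to lists of floats scaled to [0, 1]; the function names and the `max_value` parameter are illustrative. SSIM is omitted because it needs windowed local statistics.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length flattened images."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    err = mse(pred, target)
    if err == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / err)


# Toy example (made-up values, not data from the paper).
synthetic = [0.0, 0.5, 1.0]
reference = [0.0, 0.5, 0.9]
print(round(psnr(synthetic, reference), 2))
```

Because PSNR depends on the assumed peak value and on whether it is computed per slice or per volume, a per-case average PSNR generally cannot be recovered from an average MSE alone.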

Submitted 4 July, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

Comments: arXiv admin note: text overlap with arXiv:2311.15044

arXiv:2406.16896 [pdf, other]

f-GAN: A frequency-domain-constrained generative adversarial network for PPG to ECG synthesis

Authors: Nathan C. L. Kong, Dae Lee, Huyen Do, Dae Hoon Park, Cong Xu, Hongda Mao, Jonathan Chung

Abstract: Electrocardiograms (ECGs) and photoplethysmograms (PPGs) are generally used to monitor an individual's cardiovascular health. In clinical settings, ECGs and fingertip PPGs are the main signals used for assessing cardiovascular health, but the equipment necessary for their collection precludes their use in daily monitoring. Although PPGs obtained from wrist-worn devices are susceptible to noise due to motion, they have been widely used to continuously monitor cardiovascular health because of their convenience. Therefore, we would like to combine the ease with which PPGs can be collected with the information that ECGs provide about cardiovascular health by developing models to synthesize ECG signals from paired PPG signals. We tackled this problem using generative adversarial networks (GANs) and found that models trained using the original GAN formulations can be successfully used to synthesize ECG signals from which heart rate can be extracted using standard signal processing pipelines. Incorporating a frequency-domain constraint into model training improved the stability of model performance and also the performance on heart rate estimation.

Submitted 15 May, 2024; originally announced June 2024.

arXiv:2406.04737 [pdf, other]

doi 10.1109/TWC.2024.3425473

Fast-Fading Channel and Power Optimization of the Magnetic Inductive Cellular Network

Authors: Honglei Ma, Erwu Liu, Zhijun Fang, Rui Wang, Yongbin Gao, Wenjun Yu, Dongming Zhang

Abstract: The cellular network of magnetic induction (MI) communication holds promise in long-distance underground environments. In traditional MI communication, there is no fast-fading channel since the MI channel is treated as a quasi-static channel. However, for vehicle (mobile) MI (VMI) communication, the unpredictable antenna vibration introduces remarkable fast fading. As such fast fading cannot be modeled by the central limit theorem, it differs radically from other wireless fast-fading channels. Unfortunately, few studies focus on this phenomenon. In this paper, using a novel space modeling based on the electromagnetic field theorem, we propose a three-dimensional model of the VMI antenna vibration. By proposing "conjugate pseudo-piecewise functions" and boundary $p(x)$ distribution, we derive the cumulative distribution function (CDF), probability density function (PDF) and the expectation of the VMI fast-fading channel. We also theoretically analyze the effects of the VMI fast fading on the network throughput, including the VMI outage probability, which can be ignored in the traditional MI channel study. We draw several intriguing conclusions different from those in wireless fast-fading studies. For instance, the fast fading brings more uniformly distributed channel coefficients. Finally, we propose a power control algorithm using non-cooperative game and multiagent Q-learning methods to optimize the throughput of the cellular VMI network. Simulations validate the derivation and the proposed algorithm.

Submitted 7 July, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

Comments: This work has been accepted by the IEEE TWC for publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2405.18731 [pdf, other]

VBIM-Net: Variational Born Iterative Network for Inverse Scattering Problems

Authors: Ziqing Xing, Zhaoyang Zhang, Zirui Chen, Yusong Wang, Haoran Ma, Zhun Wei, Gang Bao

Abstract: Recently, studies have shown the potential of integrating field-type iterative methods with deep learning (DL) techniques in solving inverse scattering problems (ISPs). In this article, we propose a novel Variational Born Iterative Network, namely, VBIM-Net, to solve the full-wave ISPs with significantly improved flexibility and inversion quality. The proposed VBIM-Net emulates the alternating updates of the total electric field and the contrast in the variational Born iterative method (VBIM) by multiple layers of subnetworks. We embed the calculation of the contrast variation into each of the subnetworks, converting the scattered field residual into an approximate contrast variation and then enhancing it by a U-Net, thus avoiding the requirement of matched measurement dimension and grid resolution as in existing approaches. The total field and contrast of each layer's output are supervised in the loss function of VBIM-Net, which guarantees the physical interpretability of variables of the subnetworks. In addition, we design a training scheme with extra noise to enhance the model's stability. Extensive numerical results on synthetic and experimental data both verify the inversion quality, generalization ability, and robustness of the proposed VBIM-Net. This work may provide some new inspiration for the design of efficient field-type DL schemes.

Submitted 28 May, 2024; originally announced May 2024.

Comments: 14 pages, 21 figures

arXiv:2405.17702 [pdf]

A Two-sided Model for EV Market Dynamics and Policy Implications

Authors: Haoxuan Ma, Brian Yueshuai He, Tomas Kaljevic, Jiaqi Ma

Abstract: The diffusion of Electric Vehicles (EVs) plays a pivotal role in mitigating greenhouse gas emissions, particularly in the U.S., where ambitious zero-emission and carbon neutrality objectives have been set. In pursuit of these goals, many states have implemented a range of incentive policies aimed at stimulating EV adoption and charging infrastructure development, especially public EV charging stations (EVCS). This study examines the indirect network effect observed between EV adoption and EVCS deployment within urban landscapes. We developed a two-sided log-log regression model with historical data on EV purchases and EVCS development to quantify this effect. To test the robustness, we then conducted a case study of the EV market in Los Angeles (LA) County, which suggests that a 1% increase in EVCS correlates with a 0.35% increase in EV sales. Additionally, we forecasted the future EV market dynamics in LA County, revealing a notable disparity between current policies and the targeted 80% EV market share for private cars by 2045. To bridge this gap, we proposed a combined policy recommendation that enhances EV incentives by 60% and EVCS rebates by 66%, facilitating the achievement of future EV market objectives.
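In a log-log regression, the slope coefficient can be read as an elasticity, which is how a figure such as "a 1% increase in EVCS correlates with a 0.35% increase in EV sales" arises. The sketch below is only an illustration of that reading, assuming a single-regressor form log(sales) = a + b * log(stations) with b = 0.35 taken from the abstract; the function name is hypothetical and the paper's actual two-sided specification is not reproduced here.

```python
def pct_change_in_sales(elasticity: float, pct_change_in_stations: float) -> float:
    """Exact percentage change in sales implied by log(sales) = a + elasticity * log(stations).

    Sales scale by (1 + pct/100) ** elasticity, so for small changes the
    result is close to elasticity * pct.
    """
    ratio = (1.0 + pct_change_in_stations / 100.0) ** elasticity
    return (ratio - 1.0) * 100.0


# A 1% increase in charging stations with elasticity 0.35 (value from the abstract).
print(round(pct_change_in_sales(0.35, 1.0), 3))

# The linear approximation degrades for larger changes, e.g. a 50% increase.
print(round(pct_change_in_sales(0.35, 50.0), 2))
```

The elasticity reading is a statement about association in the fitted model; as the abstract's wording ("correlates with") indicates, it does not by itself establish a causal effect.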

Submitted 27 May, 2024; originally announced May 2024.

Comments: Conference preprint, 8 pages, 3 figures

arXiv:2405.05787 [pdf, other]

Autonomous Robotic Ultrasound System for Liver Follow-up Diagnosis: Pilot Phantom Study

Authors: Tianpeng Zhang, Sekeun Kim, Jerome Charton, Haitong Ma, Kyungsang Kim, Na Li, Quanzheng Li

Abstract: The paper introduces a novel autonomous robot ultrasound (US) system targeting liver follow-up scans for outpatients in local communities. Given a computed tomography (CT) image with specific target regions of interest, the proposed system carries out the autonomous follow-up scan in three steps: (i) initial robot contact to surface, (ii) coordinate mapping between CT image and robot, and (iii) target US scan. Utilizing 3D US-CT registration and deep learning-based segmentation networks, we can achieve precise imaging of 3D hepatic veins, facilitating accurate coordinate mapping between CT and the robot. This enables the automatic localization of follow-up targets within the CT image, allowing the robot to navigate precisely to the target's surface. Evaluation of the ultrasound phantom confirms the quality of the US-CT registration and shows the robot reliably locates the targets in repeated trials. The proposed framework holds the potential to significantly reduce time and costs for healthcare providers, clinicians, and follow-up patients, thereby addressing the increasing healthcare burden associated with chronic disease in local communities.

Submitted 9 May, 2024; originally announced May 2024.

arXiv:2404.11313 [pdf, other]

NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results

Authors: Xin Li, Kun Yuan, Yajing Pei, Yiting Lu, Ming Sun, Chao Zhou, Zhibo Chen, Radu Timofte, Wei Sun, Haoning Wu, Zicheng Zhang, Jun Jia, Zhichao Zhang, Linhan Cao, Qiubo Chen, Xiongkuo Min, Weisi Lin, Guangtao Zhai, Jianhui Sun, Tianyi Wang, Lei Li, Han Kong, Wenxuan Wang, Bing Li, Cheng Luo, et al. (43 additional authors not shown)

Abstract: This paper reviews the NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment (S-UGC VQA), where various excellent solutions were submitted and evaluated on the collected dataset KVQ from a popular short-form video platform, i.e., the Kuaishou/Kwai Platform. The KVQ database is divided into three parts, including 2926 videos for training, 420 videos for validation, and 854 videos for testing. The purpose is to build new benchmarks and advance the development of S-UGC VQA. The competition had 200 participants, and 13 teams submitted valid solutions for the final testing phase. The proposed solutions achieved state-of-the-art performances for S-UGC VQA. The project can be found at https://github.com/lixinustc/KVQChallenge-CVPR-NTIRE2024.

Submitted 17 April, 2024; originally announced April 2024.

Comments: Accepted by CVPR2024 Workshop. The challenge report for CVPR NTIRE2024 Short-form UGC Video Quality Assessment Challenge

arXiv:2403.14059 [pdf]

PE-GPT: A Physics-Informed Interactive Large Language Model for Power Converter Modulation Design

Authors: Fanfan Lin, Junhua Liu, Xinze Li, Shuai Zhao, Bohui Zhao, Hao Ma, Xin Zhang

Abstract: This paper proposes PE-GPT, a custom-tailored large language model uniquely adapted for power converter modulation design. By harnessing in-context learning and specialized tiered physics-informed neural networks, PE-GPT guides users through text-based dialogues, recommending actionable modulation parameters. The effectiveness of PE-GPT is validated through a practical design case involving dual active bridge converters, supported by hardware experimentation. This research underscores the transformative potential of large language models in power converter modulation design, offering enhanced accessibility, explainability, and efficiency, thereby setting a new paradigm in the field.

Submitted 20 March, 2024; originally announced March 2024.

arXiv:2403.05937 [pdf, other]

Wavelet-Like Transform-Based Technology in Response to the Call for Proposals on Neural Network-Based Image Coding

Authors: Cunhui Dong, Haichuan Ma, Haotian Zhang, Changsheng Gao, Li Li, Dong Liu

Abstract: Neural network-based image coding has been developing rapidly since its birth. By 2022, its performance had surpassed that of the best-performing traditional image coding framework -- H.266/VVC. Witnessing such success, the IEEE 1857.11 working subgroup initiated a neural network-based image coding standard project and issued a corresponding call for proposals (CfP). In response to the CfP, this paper introduces a novel wavelet-like transform-based end-to-end image coding framework -- iWaveV3. iWaveV3 incorporates many new features such as affine wavelet-like transform, perceptual-friendly quality metric, and more advanced training and online optimization strategies into our previous wavelet-like transform-based framework iWave++. While preserving the features of supporting lossy and lossless compression simultaneously, iWaveV3 also achieves state-of-the-art compression efficiency for objective quality and is very competitive for perceptual quality. As a result, iWaveV3 is adopted as a candidate scheme for developing the IEEE Standard for neural-network-based image coding.

Submitted 9 March, 2024; originally announced March 2024.

arXiv:2402.17455 [pdf, ps, other]

CLAPSep: Leveraging Contrastive Pre-trained Model for Multi-Modal Query-Conditioned Target Sound Extraction

Authors: Hao Ma, Zhiyuan Peng, Xu Li, Mingjie Shao, Xixin Wu, Ju Liu

Abstract: Universal sound separation (USS) aims to extract arbitrary types of sounds from real-world recordings. This can be achieved by language-queried target sound extraction (TSE), which typically consists of two components: a query network that converts user queries into conditional embeddings, and a separation network that extracts the target sound accordingly. Existing methods commonly train models from scratch. As a consequence, substantial data and computational resources are required to improve the models' performance and generalizability. In this paper, we propose to integrate pre-trained models into TSE models to address the above issue. To be specific, we tailor and adapt the powerful contrastive language-audio pre-trained model (CLAP) for USS, denoted as CLAPSep. CLAPSep also accepts flexible user inputs, taking both positive and negative user prompts of uni- and/or multi-modalities for target sound extraction. These key features of CLAPSep can not only enhance the extraction performance but also improve the versatility of its application. We provide extensive experiments on 5 diverse datasets to demonstrate the superior performance and zero- and few-shot generalizability of our proposed CLAPSep with fast training convergence, surpassing previous methods by a significant margin. Full code and some audio examples are released for reproduction and evaluation.

Submitted 8 May, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

arXiv:2401.13220 [pdf, other]

Segment Any Cell: A SAM-based Auto-prompting Fine-tuning Framework for Nuclei Segmentation

Authors: Saiyang Na, Yuzhi Guo, Feng Jiang, Hehuan Ma, Junzhou Huang

Abstract: In the rapidly evolving field of AI research, foundational models like BERT and GPT have significantly advanced language and vision tasks. The advent of pretrain-prompting models such as ChatGPT and Segment Anything Model (SAM) has further revolutionized image segmentation. However, their applications in specialized areas, particularly in nuclei segmentation within medical imaging, reveal a key challenge: the generation of high-quality, informative prompts is as crucial as applying state-of-the-art (SOTA) fine-tuning techniques on foundation models. To address this, we introduce Segment Any Cell (SAC), an innovative framework that enhances SAM specifically for nuclei segmentation. SAC integrates a Low-Rank Adaptation (LoRA) within the attention layer of the Transformer to improve the fine-tuning process, outperforming existing SOTA methods. It also introduces an innovative auto-prompt generator that produces effective prompts to guide segmentation, a critical factor in handling the complexities of nuclei segmentation in biomedical imaging. Our extensive experiments demonstrate the superiority of SAC in nuclei segmentation tasks, proving its effectiveness as a tool for pathologists and researchers. Our contributions include a novel prompt generation strategy, automated adaptability for diverse segmentation tasks, the innovative application of Low-Rank Attention Adaptation in SAM, and a versatile framework for semantic segmentation challenges.
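The abstract above says only that LoRA is placed within the Transformer's attention layer; it does not state which projections are adapted, the rank, or the scaling. As a rough illustration of the general LoRA idea (a frozen weight W plus a trainable low-rank update B·A), here is a minimal pure-Python sketch. The shapes, the `alpha` scaling, and all names are assumptions for illustration, not SAC's implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(w, a, b, x, alpha=1.0):
    """Compute h = W x + (alpha / r) * B (A x).

    w: frozen weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    """
    rank = len(a)  # r is the number of rows of A
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [h + scale * u for h, u in zip(base, update)]


# Hypothetical 2x2 projection with a rank-1 adapter.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]
x = [2.0, 3.0]

# With B initialised to zero the adapted layer matches the frozen layer.
print(lora_forward(w, a, [[0.0], [0.0]], x))

# After training, a non-zero B shifts the output by a rank-1 term.
print(lora_forward(w, a, [[1.0], [2.0]], x))
```

Only A and B would be updated during fine-tuning, which is why the number of trainable parameters scales with the rank r rather than with the full size of W.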

Submitted 23 January, 2024; originally announced January 2024.

arXiv:2401.01496 [pdf, other]

From Pixel to Slide image: Polarization Modality-based Pathological Diagnosis Using Representation Learning

Authors: Jia Dong, Yao Yao, Yang Dong, Hui Ma

Abstract: Thyroid cancer is the most common endocrine malignancy, and accurately distinguishing between benign and malignant thyroid tumors is crucial for developing effective treatment plans in clinical practice. Pathologically, thyroid tumors pose diagnostic challenges due to improper specimen sampling. In this study, we have designed a three-stage model using representation learning to integrate pixel-level and slice-level annotations for distinguishing thyroid tumors. This structure includes a pathology structure recognition method to predict structures related to thyroid tumors, an encoder-decoder network to extract pixel-level annotation information by learning the feature representations of image blocks, and an attention-based learning mechanism for the final classification task. This mechanism learns the importance of different image blocks in a pathological region, globally considering the information from each block. In the third stage, all information from the image blocks in a region is aggregated using attention mechanisms, followed by classification to determine the category of the region. Experimental results demonstrate that our proposed method can predict microscopic structures more accurately. After color-coding, the method achieves results on unstained pathology slides that approximate the quality of Hematoxylin and eosin staining, reducing the need for stained pathology slides. Furthermore, by leveraging the concept of indirect measurement and extracting polarized features from structures correlated with lesions, the proposed method can also classify samples where membrane structures cannot be obtained through sampling, providing a potential objective and highly accurate indirect diagnostic technique for thyroid tumors.

Submitted 2 January, 2024; originally announced January 2024.

arXiv:2312.16607 [pdf, other]

A Polarization and Radiomics Feature Fusion Network for the Classification of Hepatocellular Carcinoma and Intrahepatic Cholangiocarcinoma

Authors: Jia Dong, Yao Yao, Liyan Lin, Yang Dong, Jiachen Wan, Ran Peng, Chao Li, Hui Ma

Abstract: Classifying hepatocellular carcinoma (HCC) and intrahepatic cholangiocarcinoma (ICC) is a critical step in treatment selection and prognosis evaluation for patients with liver diseases. Traditional histopathological diagnosis poses challenges in this context. In this study, we introduce a novel polarization and radiomics feature fusion network, which combines polarization features obtained from Mueller matrix images of liver pathological samples with radiomics features derived from corresponding pathological images to classify HCC and ICC. Our fusion network integrates a two-tier fusion approach, comprising early feature-level fusion and late classification-level fusion. By harnessing the strengths of polarization imaging techniques and image feature-based machine learning, our proposed fusion network significantly enhances classification accuracy. Notably, even at reduced imaging resolutions, the fusion network maintains robust performance due to the additional information provided by polarization features, which may not align with human visual perception. Our experimental results underscore the potential of this fusion network as a powerful tool for computer-aided diagnosis of HCC and ICC, showcasing the benefits and prospects of integrating polarization imaging techniques into the current image-intensive digital pathological diagnosis. We aim to contribute this innovative approach to top-tier journals, offering fresh insights and valuable tools in the fields of medical imaging and cancer diagnosis. By introducing polarization imaging into liver cancer classification, we demonstrate its interdisciplinary potential in addressing challenges in medical image analysis, promising advancements in medical imaging and cancer diagnosis.

Submitted 27 December, 2023; originally announced December 2023.

arXiv:2312.15863 [pdf, other]

PDiT: Interleaving Perception and Decision-making Transformers for Deep Reinforcement Learning

Authors: Hangyu Mao, Rui Zhao, Ziyue Li, Zhiwei Xu, Hao Chen, Yiqun Chen, Bin Zhang, Zhen Xiao, Junge Zhang, Jiangjin Yin

Abstract: Designing better deep networks and better reinforcement learning (RL) algorithms are both important for deep RL. This work studies the former. Specifically, the Perception and Decision-making Interleaving Transformer (PDiT) network is proposed, which cascades two Transformers in a very natural way: the perceiving one focuses on the environmental perception by processing the observation at the patch level, whereas the deciding one pays attention to the decision-making by conditioning on the history of the desired returns, the perceiver's outputs, and the actions. Such a network design is generally applicable to a lot of deep RL settings, e.g., both the online and offline RL algorithms under environments with either image observations, proprioception observations, or hybrid image-language observations. Extensive experiments show that PDiT can not only achieve performance superior to strong baselines in different settings but also extract explainable feature representations. Our code is available at https://github.com/maohangyu/PDiT.

Submitted 25 December, 2023; originally announced December 2023.

Comments: Proc. of the 23rd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2024, full paper with oral presentation). Cover our preliminary study: arXiv:2212.14538

arXiv:2312.08079 [pdf, other]

Extending Whisper with prompt tuning to target-speaker ASR

Authors: Hao Ma, Zhiyuan Peng, Mingjie Shao, Jing Li, Ju Liu

Abstract: Target-speaker automatic speech recognition (ASR) aims to transcribe the desired speech of a target speaker from multi-talker overlapped utterances. Most of the existing target-speaker ASR (TS-ASR) methods involve either training from scratch or fully fine-tuning a pre-trained model, leading to significant training costs and becoming inapplicable to large foundation models. This work leverages prompt tuning, a parameter-efficient fine-tuning approach, to extend Whisper, a large-scale single-talker ASR model, to TS-ASR. Variants of prompt tuning approaches along with their configurations are explored and optimized for TS-ASR. Experimental results show that prompt tuning can achieve performance comparable to state-of-the-art full training approaches while only requiring about 1% of task-specific model parameters. Notably, the original Whisper's features, such as inverse text normalization and timestamp tagging, are retained in target-speaker ASR, keeping the generated transcriptions natural and informative.

Submitted 11 January, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

Comments: ICASSP 2024

arXiv:2310.00413 [pdf, other]

SSIF: Learning Continuous Image Representation for Spatial-Spectral Super-Resolution

Authors: Gengchen Mai, Ni Lao, Weiwei Sun, Yuchi Ma, Jiaming Song, Chenlin Meng, Hongxu Ma, Jinmeng Rao, Ziyuan Li, Stefano Ermon

Abstract: Existing digital sensors capture images at fixed spatial and spectral resolutions (e.g., RGB, multispectral, and hyperspectral images), and each combination requires bespoke machine learning models. Neural Implicit Functions partially overcome the spatial resolution challenge by representing an image in a resolution-independent way. However, they still operate at fixed, pre-defined spectral resolutions. To address this challenge, we propose Spatial-Spectral Implicit Function (SSIF), a neural implicit model that represents an image as a function of both continuous pixel coordinates in the spatial domain and continuous wavelengths in the spectral domain. We empirically demonstrate the effectiveness of SSIF on two challenging spatio-spectral super-resolution benchmarks. We observe that SSIF consistently outperforms state-of-the-art baselines even when the baselines are allowed to train separate models at each spectral resolution. We show that SSIF generalizes well to both unseen spatial resolutions and spectral resolutions. Moreover, SSIF can generate high-resolution images that improve the performance of downstream tasks (e.g., land use classification) by 1.7%-7%.

Submitted 30 September, 2023; originally announced October 2023.

MSC Class: 68T07; 68T45 ACM Class: I.4.10; I.2.10; I.4.6

arXiv:2309.02259 [pdf, ps, other]

Design of a New CIM-DCSK-Based Ambient Backscatter Communication System

Authors: Ruipeng Yang, Yi Fang, Pingping Chen, Huan Ma

Abstract: To improve the data rate in differential chaos shift keying (DCSK) based ambient backscatter communication (AmBC) systems, we propose a new AmBC system based on code index modulation (CIM), referred to as the CIM-DCSK-AmBC system. In the proposed system, the CIM-DCSK signal transmitted in the direct link is used as the radio frequency source of the backscatter link. The signal format in the backscatter link is designed to increase the data rate as well as eliminate the interference of the direct link signal. As such, the direct link signal and the backscatter link signal can be received and demodulated simultaneously. Moreover, we derive and validate the theoretical bit error rate (BER) expressions of the CIM-DCSK-AmBC system over multipath Rayleigh fading channels. Regarding the short reference DCSK-based AmBC (SR-DCSK-AmBC) system as a benchmark system, numerical results reveal that the CIM-DCSK-AmBC system can achieve better BER performance in the direct link and higher throughput in the backscatter link than the benchmark system.

Submitted 5 September, 2023; originally announced September 2023.

arXiv:2308.14562 [pdf, other]

Data-Efficient Online Learning of Ball Placement in Robot Table Tennis

Authors: Philip Tobuschat, Hao Ma, Dieter Büchler, Bernhard Schölkopf, Michael Muehlebach

Abstract: We present an implementation of an online optimization algorithm for hitting a predefined target when returning ping-pong balls with a table tennis robot. The online algorithm optimizes over so-called interception policies, which define the manner in which the robot arm intercepts the ball. In our case, these are composed of the state of the robot arm (position and velocity) at interception time. Gradient information is provided to the optimization algorithm via the mapping from the interception policy to the landing point of the ball on the table, which is approximated with a black-box and a grey-box approach. Our algorithm is applied to a robotic arm with four degrees of freedom that is driven by pneumatic artificial muscles. As a result, the robot arm is able to return the ball onto any predefined target on the table after about 2-5 iterations. We highlight the robustness of our approach by showing rapid convergence with both the black-box and the grey-box gradients. In addition, the small number of iterations required to reach close proximity to the target also underlines the sample efficiency. A demonstration video can be found here: https://youtu.be/VC3KJoCss0k.

Submitted 28 August, 2023; originally announced August 2023.

Comments: 7 pages, 6 figures, to be published in proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2023

arXiv:2308.13072 [pdf]

Full-dose Whole-body PET Synthesis from Low-dose PET Using High-efficiency Denoising Diffusion Probabilistic Model: PET Consistency Model

Authors: Shaoyan Pan, Elham Abouei, Junbo Peng, Joshua Qian, Jacob F Wynne, Tonghe Wang, Chih-Wei Chang, Justin Roper, Jonathon A Nye, Hui Mao, Xiaofeng Yang

Abstract: Objective: Positron Emission Tomography (PET) has been a commonly used imaging modality in broad clinical applications. One of the most important tradeoffs in PET imaging is between image quality and radiation dose: high image quality comes with high radiation exposure. Improving image quality is desirable for all clinical applications while minimizing radiation exposure is needed to reduce risk to patients. Approach: We introduce PET Consistency Model (PET-CM), an efficient diffusion-based method for generating high-quality full-dose PET images from low-dose PET images. It employs a two-step process, adding Gaussian noise to full-dose PET images in the forward diffusion, and then denoising them using a PET Shifted-window Vision Transformer (PET-VIT) network in the reverse diffusion. The PET-VIT network learns a consistency function that enables direct denoising of Gaussian noise into clean full-dose PET images. PET-CM achieves state-of-the-art image quality while requiring significantly less computation time than other methods. Results: In experiments comparing eighth-dose to full-dose images, PET-CM demonstrated impressive performance with NMAE of 1.278+/-0.122%, PSNR of 33.783+/-0.824dB, SSIM of 0.964+/-0.009, NCC of 0.968+/-0.011, HRS of 4.543, and SUV Error of 0.255+/-0.318%, with an average generation time of 62 seconds per patient. This is a significant improvement compared to the state-of-the-art diffusion-based model, with PET-CM reaching this result 12x faster. Similarly, in the quarter-dose to full-dose image experiments, PET-CM delivered competitive outcomes, achieving an NMAE of 0.973+/-0.066%, PSNR of 36.172+/-0.801dB, SSIM of 0.984+/-0.004, NCC of 0.990+/-0.005, HRS of 4.428, and SUV Error of 0.151+/-0.192% using the same generation process, underlining its high quantitative and clinical precision in both denoising scenarios.

Submitted 16 April, 2024; v1 submitted 24 August, 2023; originally announced August 2023.

arXiv:2307.07434 [pdf, other]

Combining multitemporal optical and SAR data for LAI imputation with BiLSTM network

Authors: W. Zhao, F. Yin, H. Ma, Q. Wu, J. Gomez-Dans, P. Lewis

Abstract: The Leaf Area Index (LAI) is vital for predicting winter wheat yield. Acquisition of crop conditions via Sentinel-2 remote sensing images can be hindered by persistent clouds, affecting yield predictions. Synthetic Aperture Radar (SAR) provides all-weather imagery, and the ratio between its cross- and co-polarized channels (C-band) shows a high correlation with time series LAI over winter wheat regions. This study evaluates the use of time series Sentinel-1 VH/VV for LAI imputation, aiming to increase spatial-temporal density. We utilize a bidirectional LSTM (BiLSTM) network to impute time series LAI and use half mean squared error for each time step as the loss function. We trained models on data from southern Germany and the North China Plain using only LAI data generated by Sentinel-1 VH/VV and Sentinel-2. Experimental results show BiLSTM outperforms traditional regression methods, capturing nonlinear dynamics between multiple time series. It proves robust in various growing conditions and is effective even with limited Sentinel-2 images. BiLSTM's performance surpasses that of LSTM, particularly over the senescence period. Therefore, BiLSTM can be used to impute LAI with time-series Sentinel-1 VH/VV and Sentinel-2 data, and this method could be applied to other time-series imputation issues.
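
Illustrative note (not part of the abstract): the "half mean squared error for each time step" loss mentioned above could look roughly like the sketch below. The function name `half_mse_per_step` and the list-of-lists data layout are assumptions made for illustration; the abstract does not say how the authors implement or aggregate the loss.

```python
def half_mse_per_step(predicted, observed):
    """Return 0.5 * mean squared error separately for each time step.

    Both arguments are lists with one entry per time step; each entry is a
    list of values (one per sample). This layout is an assumption.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must cover the same time steps")
    losses = []
    for pred_t, obs_t in zip(predicted, observed):
        # Squared error of every sample at this time step.
        squared = [(p - o) ** 2 for p, o in zip(pred_t, obs_t)]
        losses.append(0.5 * sum(squared) / len(squared))
    return losses


# Two samples observed at a single time step.
print(half_mse_per_step([[1.0, 3.0]], [[0.0, 1.0]]))
```

The factor 0.5 only rescales the gradient; it does not change which parameters minimise the loss.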

Submitted 14 July, 2023; originally announced July 2023.

arXiv:2306.12085 [pdf, other]

HSR-Diff: Hyperspectral Image Super-Resolution via Conditional Diffusion Models

Authors: Chanyue Wu, Dong Wang, Hanyu Mao, Ying Li

Abstract: Despite the proven significance of hyperspectral images (HSIs) in performing various computer vision tasks, their potential is adversely affected by the low-resolution (LR) property in the spatial domain, resulting from multiple physical factors. Inspired by recent advancements in deep generative models, we propose an HSI Super-resolution (SR) approach with Conditional Diffusion Models (HSR-Diff) that merges a high-resolution (HR) multispectral image (MSI) with the corresponding LR-HSI. HSR-Diff generates an HR-HSI via repeated refinement, in which the HR-HSI is initialized with pure Gaussian noise and iteratively refined. At each iteration, the noise is removed with a Conditional Denoising Transformer (CDFormer) that is trained on denoising at different noise levels, conditioned on the hierarchical feature maps of HR-MSI and LR-HSI. In addition, a progressive learning strategy is employed to exploit the global information of full-resolution images. Systematic experiments have been conducted on four public datasets, demonstrating that HSR-Diff outperforms state-of-the-art methods.

Submitted 21 June, 2023; originally announced June 2023.

arXiv:2305.19467 [pdf]

Synthetic CT Generation from MRI using 3D Transformer-based Denoising Diffusion Model

Authors: Shaoyan Pan, Elham Abouei, Jacob Wynne, Tonghe Wang, Richard L. J. Qiu, Yuheng Li, Chih-Wei Chang, Junbo Peng, Justin Roper, Pretesh Patel, David S. Yu, Hui Mao, Xiaofeng Yang

Abstract: Magnetic resonance imaging (MRI)-based synthetic computed tomography (sCT) simplifies radiation therapy treatment planning by eliminating the need for CT simulation and error-prone image registration, ultimately reducing patient radiation dose and setup uncertainty. We propose an MRI-to-CT transformer-based denoising diffusion probabilistic model (MC-DDPM) to transform MRI into high-quality sCT to facilitate radiation treatment planning. MC-DDPM implements diffusion processes with a shifted-window transformer network to generate sCT from MRI. The proposed model consists of two processes: a forward process which adds Gaussian noise to real CT scans, and a reverse process in which a shifted-window transformer V-net (Swin-Vnet) denoises the noisy CT scans conditioned on the MRI from the same patient to produce noise-free CT scans. With an optimally trained Swin-Vnet, the reverse diffusion process was used to generate sCT scans matching MRI anatomy. We evaluated the proposed method by generating sCT from MRI on a brain dataset and a prostate dataset. Quantitative evaluation was performed using the mean absolute error (MAE) of Hounsfield unit (HU), peak signal to noise ratio (PSNR), multi-scale Structure Similarity index (MS-SSIM) and normalized cross correlation (NCC) indexes between ground truth CTs and sCTs. MC-DDPM generated brain sCTs with state-of-the-art quantitative results with MAE 43.317 HU, PSNR 27.046 dB, SSIM 0.965, and NCC 0.983. For the prostate dataset, MC-DDPM achieved MAE 59.953 HU, PSNR 26.920 dB, SSIM 0.849, and NCC 0.948. In conclusion, we have developed and validated a novel approach for generating CT images from routine MRIs using a transformer-based DDPM. This model effectively captures the complex relationship between CT and MRI images, allowing for robust and high-quality synthetic CT (sCT) images to be generated in minutes.
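
Illustrative note (not part of the abstract): the forward process "which adds Gaussian noise to real CT scans" is commonly written as below in the standard DDPM formulation. The symbols $x_0$ (clean CT), $x_t$ (noisy CT at step $t$), and the schedule $\beta_t$ are the conventional ones and are assumptions here; the abstract does not specify the schedule MC-DDPM uses.

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\right),
\qquad
\bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s).
```

The closed form on the right lets a noisy sample at any step be drawn directly from the clean scan, which is what makes training the denoiser at randomly chosen steps practical.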

Submitted 30 May, 2023; originally announced May 2023.

arXiv:2305.15189 [pdf, other]

Black-Box vs. Gray-Box: A Case Study on Learning Table Tennis Ball Trajectory Prediction with Spin and Impacts

Authors: Jan Achterhold, Philip Tobuschat, Hao Ma, Dieter Buechler, Michael Muehlebach, Joerg Stueckler

Abstract: In this paper, we present a method for table tennis ball trajectory filtering and prediction. Our gray-box approach builds on a physical model. At the same time, we use data to learn parameters of the dynamics model, of an extended Kalman filter, and of a neural model that infers the ball's initial condition. We demonstrate superior prediction performance of our approach over two black-box approaches, which are not supplied with physical prior knowledge. We demonstrate that initializing the spin from parameters of the ball launcher using a neural network drastically improves long-time prediction performance over estimating the spin purely from measured ball positions. An accurate prediction of the ball trajectory is crucial for successful returns. We therefore evaluate the return performance with a pneumatic artificial muscular robot and achieve a return rate of 29/30 (97.7%).

Submitted 12 June, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: Accepted for publication at the 5th Annual Conference on Learning for Dynamics and Control (L4DC) 2023 (camera-ready). With supplementary material

arXiv:2305.10000 [pdf, other]

Over-the-Air Federated Learning in MIMO Cloud-RAN Systems

Authors: Haoming Ma, Xiaojun Yuan, Zhi Ding

Abstract: To address the limitations of traditional over-the-air federated learning (OA-FL) such as limited server coverage and low resource utilization, we propose an OA-FL in MIMO cloud radio access network (MIMO Cloud-RAN) framework, where edge devices upload (or download) model parameters to the cloud server (CS) through access points (APs). Specifically, in every training round, there are three stages: edge aggregation; global aggregation; and model updating and broadcasting. To better utilize the correlation among APs, called inter-AP correlation, we propose modeling the global aggregation stage as a lossy distributed source coding (L-DSC) problem to make analysis from the perspective of rate-distortion theory. We further analyze the performance of the proposed OA-FL in MIMO Cloud-RAN framework. Based on the analysis, we formulate a communication-learning optimization problem to improve the system performance by considering the inter-AP correlation. To solve this problem, we develop an algorithm by using alternating optimization (AO) and majorization-minimization (MM), which effectively improves the FL learning performance. Furthermore, we propose a practical design that demonstrates the utilization of inter-AP correlation. The numerical results show that the proposed practical design effectively leverages inter-AP correlation, and outperforms other baseline schemes.

Submitted 17 May, 2023; originally announced May 2023.

arXiv:2305.05548 [pdf, ps, other]

CIT-EmotionNet: CNN Interactive Transformer Network for EEG Emotion Recognition

Authors: Wei Lu, Hua Ma, Tien-Ping Tan

Abstract: Emotion recognition using Electroencephalogram (EEG) signals has emerged as a significant research challenge in affective computing and intelligent interaction. However, effectively combining global and local features of EEG signals to improve performance in emotion recognition is still a difficult task. In this study, we propose a novel CNN Interactive Transformer Network for EEG Emotion Recognition, known as CIT-EmotionNet, which efficiently integrates global and local features of EEG signals. Initially, we convert raw EEG signals into spatial-frequency representations, which serve as inputs. Then, we integrate Convolutional Neural Network (CNN) and Transformer within a single framework in a parallel manner. Finally, we design a CNN interactive Transformer module, which facilitates the interaction and fusion of local and global features, thereby enhancing the model's ability to extract both types of features from EEG spatial-frequency representations. The proposed CIT-EmotionNet outperforms state-of-the-art methods, achieving an average recognition accuracy of 98.57% and 92.09% on two publicly available datasets, SEED and SEED-IV, respectively.

Submitted 7 May, 2023; originally announced May 2023.

Comments: 10 pages, 3 tables

arXiv:2305.05433 [pdf, other]

Tomography of Quantum States from Structured Measurements via quantum-aware transformer

Authors: Hailan Ma, Zhenhong Sun, Daoyi Dong, Chunlin Chen, Herschel Rabitz

Abstract: Quantum state tomography (QST) is the process of reconstructing the state of a quantum system (mathematically described as a density matrix) through a series of different measurements, which can be solved by learning a parameterized function to translate experimentally measured statistics into physical density matrices. However, the specific structure of quantum measurements for characterizing a quantum state has been neglected in previous work. In this paper, we explore the similarity between highly structured sentences in natural language and intrinsically structured measurements in QST. To fully leverage the intrinsic quantum characteristics involved in QST, we design a quantum-aware transformer (QAT) model to capture the complex relationship between measured frequencies and density matrices. In particular, we query quantum operators in the architecture to facilitate informative representations of quantum data and integrate the Bures distance into the loss function to evaluate quantum state fidelity, thereby enabling the reconstruction of quantum states from measured data with high fidelity. Extensive simulations and experiments (on IBM quantum computers) demonstrate the superiority of the QAT in reconstructing quantum states with favorable robustness against experimental noise.

Submitted 17 November, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

arXiv:2305.00385 [pdf]

Cross-Shaped Windows Transformer with Self-supervised Pretraining for Clinically Significant Prostate Cancer Detection in Bi-parametric MRI

Authors: Yuheng Li, Jacob Wynne, Jing Wang, Richard L. J. Qiu, Justin Roper, Shaoyan Pan, Ashesh B. Jani, Tian Liu, Pretesh R. Patel, Hui Mao, Xiaofeng Yang

Abstract: Biparametric magnetic resonance imaging (bpMRI) has demonstrated promising results in prostate cancer (PCa) detection using convolutional neural networks (CNNs). Recently, transformers have achieved competitive performance compared to CNNs in computer vision. Large-scale transformers need abundant annotated data for training, which are difficult to obtain in medical imaging. Self-supervised learning (SSL) utilizes unlabeled data to generate meaningful semantic representations without the need for costly annotations, enhancing model performance on tasks with limited labeled data. We introduce a novel end-to-end Cross-Shaped windows (CSwin) transformer UNet model, CSwin UNet, to detect clinically significant prostate cancer (csPCa) in prostate bi-parametric MR imaging (bpMRI) and demonstrate the effectiveness of our proposed self-supervised pre-training framework. Using a large prostate bpMRI dataset with 1500 patients, we first pretrain CSwin transformer using multi-task self-supervised learning to improve data-efficiency and network generalizability. We then finetune using lesion annotations to perform csPCa detection. Five-fold cross validation shows that self-supervised CSwin UNet achieves 0.888 AUC and 0.545 Average Precision (AP), significantly outperforming four comparable models (Swin UNETR, DynUNet, Attention UNet, UNet). Using a separate bpMRI dataset with 158 patients, we evaluate our method's robustness to external hold-out data. Self-supervised CSwin UNet achieves 0.79 AUC and 0.45 AP, still outperforming all other comparable methods and demonstrating good generalization to external data.

Submitted 17 March, 2024; v1 submitted 30 April, 2023; originally announced May 2023.

arXiv:2305.00042 [pdf]

Cycle-guided Denoising Diffusion Probability Model for 3D Cross-modality MRI Synthesis

Authors: Shaoyan Pan, Chih-Wei Chang, Junbo Peng, Jiahan Zhang, Richard L. J. Qiu, Tonghe Wang, Justin Roper, Tian Liu, Hui Mao, Xiaofeng Yang

Abstract: This study aims to develop a novel Cycle-guided Denoising Diffusion Probability Model (CG-DDPM) for cross-modality MRI synthesis. The CG-DDPM deploys two DDPMs that condition each other to generate synthetic images from two different MRI pulse sequences. The two DDPMs exchange random latent noise in the reverse processes, which helps to regularize both DDPMs and generate matching images in two modalities. This improves image-to-image translation accuracy. We evaluated the CG-DDPM quantitatively using mean absolute error (MAE), multi-scale structural similarity index measure (MSSIM), and peak signal-to-noise ratio (PSNR), as well as the network synthesis consistency, on the BraTS2020 dataset. Our proposed method showed high accuracy and reliable consistency for MRI synthesis. In addition, we compared the CG-DDPM with several other state-of-the-art networks and demonstrated statistically significant improvements in the image quality of synthetic MRIs. The proposed method enhances the capability of current multimodal MRI synthesis approaches, which could contribute to more accurate diagnosis and better treatment planning for patients by synthesizing additional MRI modalities.

Submitted 28 April, 2023; originally announced May 2023.

arXiv:2304.13471 [pdf, other]

OPDN: Omnidirectional Position-aware Deformable Network for Omnidirectional Image Super-Resolution

Authors: Xiaopeng Sun, Weiqi Li, Zhenyu Zhang, Qiufang Ma, Xuhan Sheng, Ming Cheng, Haoyu Ma, Shijie Zhao, Jian Zhang, Junlin Li, Li Zhang

Abstract: 360° omnidirectional images have gained research attention due to their immersive and interactive experience, particularly in AR/VR applications. However, they suffer from lower angular resolution because they are captured by fisheye lenses with the same sensor size used for capturing planar images. To solve this issue, we propose a two-stage framework for 360° omnidirectional image super-resolution. The first stage employs two branches: model A, which incorporates omnidirectional position-aware deformable blocks (OPDB) and Fourier upsampling, and model B, which adds a spatial frequency fusion module (SFF) to model A. Model A aims to enhance the feature extraction ability of 360° image positional information, while Model B further focuses on the high-frequency information of 360° images. The second stage performs same-resolution enhancement based on the structure of model A with a pixel unshuffle operation. In addition, we collected data from YouTube to improve the fitting ability of the transformer, and created pseudo low-resolution images using a degradation network. Our proposed method achieves superior performance and wins the NTIRE 2023 challenge of 360° omnidirectional image super-resolution.

Submitted 26 April, 2023; originally announced April 2023.

Comments: Accepted to CVPRW 2023

arXiv:2304.11374 [pdf, ps, other]

Towards Carbon-Neutral Edge Computing: Greening Edge AI by Harnessing Spot and Future Carbon Markets

Authors: Huirong Ma, Zhi Zhou, Xiaoxi Zhang, Xu Chen

Abstract: Provisioning dynamic machine learning (ML) inference as a service for artificial intelligence (AI) applications of edge devices faces many challenges, including the trade-off among accuracy loss, carbon emission, and unknown future costs. Besides, many governments are launching carbon emission rights (CER) for operators to reduce carbon emissions further to reverse climate change. Facing these challenges, to achieve carbon-aware ML task offloading under limited carbon emission rights and thus green edge AI, we establish a joint ML task offloading and CER purchasing problem, intending to minimize the accuracy loss under the long-term time-averaged cost budget of purchasing the required CER. However, considering the uncertainty of the resource prices, the CER purchasing prices, the carbon intensity of sites, and ML tasks' arrivals, it is hard to decide the optimal policy online over a long-running period of time. To overcome this difficulty, we leverage the two-timescale Lyapunov optimization technique, of which the $T$-slot drift-plus-penalty methodology inspires us to propose an online algorithm that purchases CER in multiple timescales (on-preserved in the carbon future market and on-demand in the carbon spot market) and makes decisions about where to offload ML tasks. Considering the NP-hardness of the $T$-slot problems, we further propose the resource-restricted randomized dependent rounding algorithm to obtain a near-optimal solution without any future information. Our theoretical analysis and extensive simulation results driven by the real carbon intensity trace show the superior performance of the proposed algorithms.
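
Illustrative note (not part of the abstract): the drift-plus-penalty idea behind Lyapunov optimization can be sketched in its basic single-timescale form as below. This is not the authors' two-timescale $T$-slot algorithm; the names `update_virtual_queue` and `choose_action`, the weight `v`, and the `(name, penalty, cost)` tuples are hypothetical.

```python
def update_virtual_queue(queue, cost, budget):
    """Virtual queue tracking how far cumulative cost exceeds the per-slot budget."""
    return max(queue + cost - budget, 0.0)


def choose_action(actions, queue, v):
    """Pick the action minimising v * penalty + queue * cost for the current slot.

    `actions` is a list of (name, penalty, cost) tuples; here the penalty would
    stand for accuracy loss and the cost for spending on carbon emission rights.
    """
    return min(actions, key=lambda a: v * a[1] + queue * a[2])


# Hypothetical options: an accurate but costly site and a cheaper, less accurate one.
actions = [("accurate", 0.1, 5.0), ("cheap", 0.4, 1.0)]
queue = 0.0
for _ in range(3):
    name, penalty, cost = choose_action(actions, queue, v=10.0)
    queue = update_virtual_queue(queue, cost, budget=2.0)
    print(name, queue)
```

A larger `v` favours low penalty (accuracy loss), while a growing queue pushes decisions toward low-cost actions until spending falls back within budget.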

Submitted 22 April, 2023; originally announced April 2023.

Comments: Accepted by IEEE Internet of Things Journal, 2023

arXiv:2303.15944 [pdf, other]

doi 10.1109/LSP.2023.3280851

Cluster-Guided Unsupervised Domain Adaptation for Deep Speaker Embedding

Authors: Haiquan Mao, Feng Hong, Man-wai Mak

Abstract: Recent studies have shown that pseudo labels can contribute to unsupervised domain adaptation (UDA) for speaker verification. Inspired by the self-training strategies that use an existing classifier to label the unlabeled data for retraining, we propose a cluster-guided UDA framework that labels the target domain data by clustering and combines the labeled source domain data and pseudo-labeled target domain data to train a speaker embedding network. To improve the cluster quality, we train a speaker embedding network dedicated to clustering by minimizing the contrastive center loss. The goal is to reduce the distance between an embedding and its assigned cluster center while enlarging the distance between the embedding and the other cluster centers. Using VoxCeleb2 as the source domain and CN-Celeb1 as the target domain, we demonstrate that the proposed method can achieve an equal error rate (EER) of 8.10% on the CN-Celeb1 evaluation set without using any labels from the target domain. This result outperforms the supervised baseline by 39.6% and is the state-of-the-art UDA performance on this corpus.
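
Illustrative note (not part of the abstract): one common way to write a contrastive center loss is as the ratio of the distance to the assigned center over the summed distances to all other centers. The sketch below uses that ratio form for a single embedding; the function names, the constant `delta`, and the choice of squared Euclidean distance are assumptions, and the authors' exact formulation may differ.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def contrastive_center_loss(embedding, centers, assigned, delta=1.0):
    """Ratio-form contrastive center loss for one embedding.

    Small when the embedding is close to centers[assigned] and far from the
    other centers; `delta` keeps the denominator away from zero.
    """
    intra = squared_distance(embedding, centers[assigned])
    inter = sum(
        squared_distance(embedding, center)
        for index, center in enumerate(centers)
        if index != assigned
    )
    return 0.5 * intra / (inter + delta)


centers = [[1.0, 0.0], [0.0, 2.0]]
print(contrastive_center_loss([0.0, 0.0], centers, assigned=0))
print(contrastive_center_loss([0.0, 0.0], centers, assigned=1))
```

Assigning the embedding to the nearer center gives the smaller loss, which is the behaviour the abstract describes.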

Submitted 28 March, 2023; originally announced March 2023.
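
The abstract above describes the contrastive center loss only in words: pull an embedding toward its assigned cluster center and push it away from the other centers. A minimal sketch of one common formulation (a ratio of the own-center squared distance to the summed other-center squared distances) is shown below. The function names, the `delta` stabilizer, and the batch averaging are assumptions made for illustration; the paper's exact loss may differ.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def contrastive_center_loss(embeddings, assignments, centers, delta=1.0):
    """Illustrative contrastive center loss (assumed formulation).

    embeddings:  list of vectors
    assignments: cluster index for each embedding
    centers:     list of cluster-center vectors
    delta:       hypothetical constant that keeps the denominator positive
    """
    total = 0.0
    for emb, k in zip(embeddings, assignments):
        # Distance to the assigned center (to be minimized).
        own = squared_distance(emb, centers[k])
        # Summed distance to every other center (to be maximized).
        others = sum(
            squared_distance(emb, c) for j, c in enumerate(centers) if j != k
        )
        total += own / (others + delta)
    # Average over the batch; the factor 0.5 follows the usual convention.
    return 0.5 * total / len(embeddings)


centers = [[0.0, 0.0], [10.0, 0.0]]
embeddings = [[0.1, 0.0], [9.9, 0.0]]
good = contrastive_center_loss(embeddings, [0, 1], centers)
bad = contrastive_center_loss(embeddings, [1, 0], centers)
```

With these toy values, `good` (each embedding assigned to its nearby center) is far smaller than `bad` (assignments swapped), which is the behavior the abstract describes.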

arXiv:2303.11543 [pdf, other]

DeepMA: End-to-end Deep Multiple Access for Wireless Image Transmission in Semantic Communication

Authors: Wenyu Zhang, Kaiyuan Bai, Sherali Zeadally, Haijun Zhang, Hua Shao, Hui Ma, Victor C. M. Leung

Abstract: Semantic communication is a new paradigm that exploits deep learning models to enable end-to-end communications processes, and recent studies have shown that it can achieve better noise resiliency compared with traditional communication schemes in a low signal-to-noise ratio (SNR) regime. To achieve multiple access in semantic communication, we propose a deep learning-based multiple access (DeepMA) method by training semantic communication models with the abilities of joint source-channel coding (JSCC) and orthogonal signal modulation. DeepMA is achieved by a DeepMA network (DMANet), which is comprised of several independent encoder-decoder pairs (EDPs), and the DeepMA encoders can encode the input data as mutually orthogonal semantic symbol vectors (SSVs) such that the DeepMA decoders can detect and recover their own target data from a received mixed SSV (MSSV) superposed by multiple SSV components transmitted from different encoders. We describe frameworks of DeepMA in wireless device-to-device (D2D), downlink, and uplink channel multiplexing scenarios, along with the training algorithm. We evaluate the performance of the proposed DeepMA in wireless image transmission tasks and compare its performance with the attention module-based deep JSCC (ADJSCC) method and conventional communication schemes using better portable graphics (BPG) and low-density parity-check (LDPC) codes. The results obtained show that the proposed DeepMA can achieve an effective, flexible, and privacy-preserving channel multiplexing process, and demonstrate that our proposed DeepMA approach can yield comparable bandwidth efficiency compared with conventional multiple access schemes.

Submitted 27 June, 2023; v1 submitted 20 March, 2023; originally announced March 2023.
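
The abstract above relies on the fact that mutually orthogonal vectors can be superposed and still separated at the receiver. The toy sketch below illustrates only that principle with two fixed, hand-picked orthonormal signature vectors and a noiseless channel. It is not the learned DeepMA encoder or decoder, and every name in it is hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Two fixed orthonormal signatures (assumption: stand-ins for learned SSVs).
SIGNATURES = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
]


def encode(value, signature):
    """Spread a scalar message along one signature vector."""
    return [value * s for s in signature]


def superpose(vectors):
    """Element-wise sum, modelling a noiseless shared channel."""
    return [sum(column) for column in zip(*vectors)]


def decode(mixed, signature):
    """Project the mixed vector onto one signature to recover its message."""
    return dot(mixed, signature)


messages = [3.0, -1.5]
mixed = superpose([encode(m, s) for m, s in zip(messages, SIGNATURES)])
recovered = [decode(mixed, s) for s in SIGNATURES]
```

Because the signatures are orthonormal, each projection cancels the other user's contribution, so `recovered` matches `messages` in this noiseless setting.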

arXiv:2303.08671 [pdf, other]

A Dual-Cluster-Head Based Medium Access Control for Large-Scale UAV Ad-Hoc Networks

Authors: Xinru Zhao, Zhiqing Wei, Yingying Zou, Hao Ma, Yanpeng Cui, Zhiyong Feng

Abstract: Unmanned Aerial Vehicle (UAV) ad hoc networks have achieved significant growth in recent years owing to their flexibility, extensibility, and high deployability. Applying a clustering scheme to UAV ad hoc networks is imperative to enhance throughput and energy efficiency. In conventional clustering schemes, a single cluster head (CH) is always assigned in each cluster. However, this method has some weaknesses, such as overload and premature death of the CH when the number of UAVs increases. In order to solve this problem, we propose a dual-cluster-head based medium access control (DCHMAC) scheme for large-scale UAV networks. In DCHMAC, two CHs are elected to manage resource allocation and data forwarding cooperatively. Specifically, the two CHs work on different channels: one CH is used for intra-cluster communication and the other for inter-cluster communication. A Markov chain model is developed to analyse the throughput of the network. Simulation results show that, compared with the flying ad hoc networks multi-channel MAC (FM-MAC), DCHMAC improves the throughput by approximately 20%-50% and prolongs the network lifetime by approximately 40%.

Submitted 26 February, 2023; originally announced March 2023.

Comments: 10 pages, 12 figures, journal

arXiv:2302.14312 [pdf, other]

Auxiliary Task-based Deep Reinforcement Learning for Quantum Control

Authors: Shumin Zhou, Hailan Ma, Sen Kuang, Daoyi Dong

Abstract: Due to its property of not requiring prior knowledge of the environment, reinforcement learning has significant potential for quantum control problems. In this work, we investigate the effectiveness of continuous control policies based on deep deterministic policy gradient. To address the sparse reward signal in quantum learning control problems, we propose an auxiliary task-based deep reinforcement learning (AT-DRL) method for quantum control. In particular, we first design a guided reward function based on the fidelity of quantum states that enables incremental fidelity improvement. Then, we introduce the concept of an auxiliary task whose network shares parameters with the main network to predict the reward provided by the environment (called the main task). The auxiliary task learns synchronously with the main task, allowing one to select the most relevant features of the environment, thus aiding the agent in comprehending how to achieve the desired state. The numerical simulations demonstrate that the proposed AT-DRL can provide a solution to the sparse reward in quantum systems, and has great potential in designing control pulses that achieve efficient quantum state preparation.

Submitted 28 February, 2023; originally announced February 2023.

Comments: 13 pages, 11 figures

arXiv:2212.07009 [pdf]

Piston sensing for sparse aperture systems via all-optical diffractive neural network

Authors: Xiafei Ma, Zongliang Xie, Haotong Ma, Ge Ren

Abstract: Realizing real-time piston correction is a crucial issue in sparse aperture imaging. This paper introduces an optical diffractive neural network-based piston sensing method, which can achieve light-speed sensing. By using detectable intensity to represent pistons, the proposed method is capable of converting the complex amplitude distribution of the imaging optical field into piston values directly. Differing from the electrical neural network, this intensity representation enables the method to obtain the predicted pistons without image acquisition or electrical processing. The simulations demonstrate the feasibility of the method for a point source, and high accuracies are achieved for both monochromatic light and broadband light. This method can greatly improve the real-time performance of piston sensing and contribute to the development of the sparse aperture system.

Submitted 29 June, 2023; v1 submitted 13 December, 2022; originally announced December 2022.

Comments: 5 pages, 6 figures

arXiv:2211.13119 [pdf, other]

Performance of Cooperative Detection in Joint Communication-Sensing Vehicular Network: A Data Analytic and Stochastic Geometry Approach

Authors: Hao Ma, Zhiqing Wei, Zening Li, Fan Ning, Xu Chen, Zhiyong Feng

Abstract: The increasing complexity of urban environments introduces additional uncertainty to the deployment of the autonomous vehicular network. A novel road infrastructure cooperative detection model using Joint Communication and Sensing (JCS) technology is proposed in this article to simultaneously achieve highly efficient communication and obstacle detection for urban autonomous vehicles. To suppress the performance fluctuation caused by shadowing and obstruction to the JCS signals, we first derive the statistics of road obstacles from the Geographic Information System (GIS). Then, the analysis of JCS channel characteristics and shadowing factors is presented using Line-of-Sight and Non-Line-of-Sight (LoS and NLoS) channel models under the complex urban scenario. A stochastic geometry approach is applied to analyze the interference factors and the probability distribution of successful JCS detection and communication. Simulations have been made to verify the cooperative detection model by probability analysis based on LoS and NLoS channels, and the numerical results demonstrate several different optimization methods for the deployment of JCS road infrastructures. Finally, we simulate and analyze a deployment optimization method for JCS road infrastructures that complies with the standard of urban traffic-spot structure placement.

Submitted 2 November, 2022; originally announced November 2022.

arXiv:2211.12080 [pdf, other]

Robust Training for Speaker Verification against Noisy Labels

Authors: Zhihua Fang, Liang He, Hanhan Ma, Xiaochen Guo, Lin Li

Abstract: The deep learning models used for speaker verification rely heavily on large amounts of data and correct labeling. However, noisy (incorrect) labels often occur, which degrades the performance of the system. In this paper, we propose a novel two-stage learning method to filter out noisy labels from speaker datasets. Since a DNN will first fit data with clean labels, we first train the model with all data for several epochs. Then, based on this model, the model predictions are compared with the labels using our proposed OR-Gate with top-k mechanism to select the data with clean labels, and the selected data is used to train the model. This process is iterated until the training is completed. We have demonstrated the effectiveness of this method in filtering noisy labels through extensive experiments and have achieved excellent performance on VoxCeleb (1 and 2) with different added noise rates.

Submitted 25 May, 2023; v1 submitted 22 November, 2022; originally announced November 2022.

Comments: Accepted by INTERSPEECH 2023

arXiv:2211.06769 [pdf, other]

Realistic Bokeh Effect Rendering on Mobile GPUs, Mobile AI & AIM 2022 challenge: Report

Authors: Andrey Ignatov, Radu Timofte, Jin Zhang, Feng Zhang, Gaocheng Yu, Zhe Ma, Hongbin Wang, Minsu Kwon, Haotian Qian, Wentao Tong, Pan Mu, Ziping Wang, Guangjing Yan, Brian Lee, Lei Fei, Huaijin Chen, Hyebin Cho, Byeongjun Kwon, Munchurl Kim, Mingyang Qian, Huixin Ma, Yanan Li, Xiaotao Wang, Lei Lei

Abstract: As mobile cameras with compact optics are unable to produce a strong bokeh effect, lots of interest is now devoted to deep learning-based solutions for this task. In this Mobile AI challenge, the target was to develop an efficient end-to-end AI-based bokeh effect rendering approach that can run on modern smartphone GPUs using TensorFlow Lite. The participants were provided with a large-scale EBB! bokeh dataset consisting of 5K shallow / wide depth-of-field image pairs captured using the Canon 7D DSLR camera. The runtime of the resulting models was evaluated on the Kirin 9000's Mali GPU that provides excellent acceleration results for the majority of common deep learning ops. A detailed description of all models developed in this challenge is provided in this paper.

Submitted 7 November, 2022; originally announced November 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:2211.03885; text overlap with arXiv:2105.07809, arXiv:2211.04470, arXiv:2211.05256, arXiv:2211.05910

arXiv:2211.06073 [pdf, other]

SceneFake: An Initial Dataset and Benchmarks for Scene Fake Audio Detection

Authors: Jiangyan Yi, Chenglong Wang, Jianhua Tao, Chu Yuan Zhang, Cunhang Fan, Zhengkun Tian, Haoxin Ma, Ruibo Fu

Abstract: Many datasets have been designed to further the development of fake audio detection. However, fake utterances in previous datasets are mostly generated by altering timbre, prosody, linguistic content or channel noise of original audio. These datasets leave out a scenario in which the acoustic scene of an original audio is manipulated with a forged one. It will pose a major threat to our society if some people misuse the manipulated audio for malicious purposes. This motivates us to fill in the gap. This paper proposes such a dataset for scene fake audio detection named SceneFake, where a manipulated audio is generated by only tampering with the acoustic scene of a real utterance by using speech enhancement technologies. Some scene fake audio detection benchmark results on the SceneFake dataset are reported in this paper. In addition, an analysis of fake attacks with different speech enhancement technologies and signal-to-noise ratios is presented in this paper. The results indicate that scene fake utterances cannot be reliably detected by baseline models trained on the ASVspoof 2019 dataset. Although these models perform well on the SceneFake training set and seen testing set, their performance is poor on the unseen test set. The dataset (https://zenodo.org/record/7663324#.Y_XKMuPYuUk) and benchmark source codes (https://github.com/ADDchallenge/SceneFake) are publicly available.

Submitted 4 April, 2024; v1 submitted 11 November, 2022; originally announced November 2022.

Comments: Accepted by Pattern Recognition, 1 April 2024

arXiv:2210.15122 [pdf]

Experimental Comparison of SNR and RSSI for LoRa-ESL Based on Machine Clustering and Arithmetic Distribution

Authors: Malak Abid Ali Khan, Hongbin Ma, Syed Muhammad Aamir, Cekderi Anil Baris

Abstract: LoRa lacks the capability to sense channel status. The received signal strength indicator (RSSI) decreases due to collision, interference, and the near-far effect, while for signal-to-noise ratio (SNR), the packets are rejected by decreasing the transmission power (TP) at a higher spreading factor (SF). To overcome these challenges in the case of electric shelf labels (ESL) and to minimize the dependency on retransmission and acknowledgment, the end devices (EDs) are allocated around gateways (GWs) based on machine clustering, with dynamic SF for SNR and dynamic TP for RSSI. The experimental results show that the RSSI approach is more dominant than SNR because it determines the exact locality of the ED, which diminishes the capture effect. Arithmetic distribution of EDs for various GWs in different clusters helps to reduce the near-far effect. The resultant received power (RP) at each cluster is higher than the threshold RP for most of the connected EDs.

Submitted 13 December, 2022; v1 submitted 26 October, 2022; originally announced October 2022.

arXiv:2210.07553 [pdf, other]

Safe Model-Based Reinforcement Learning with an Uncertainty-Aware Reachability Certificate

Authors: Dongjie Yu, Wenjun Zou, Yujie Yang, Haitong Ma, Shengbo Eben Li, Jingliang Duan, Jianyu Chen

Abstract: Safe reinforcement learning (RL) that solves for constraint-satisfying policies provides a promising way to broaden the safety-critical applications of RL in real-world problems such as robotics. Among all safe RL approaches, model-based methods further reduce training-time violations due to their high sample efficiency. However, the lack of safety robustness against model uncertainties remains an issue in safe model-based RL, especially for training-time safety. In this paper, we propose a distributional reachability certificate (DRC) and its Bellman equation to address model uncertainties and characterize robust persistently safe states. Furthermore, we build a safe RL framework to resolve constraints required by the DRC and its corresponding shield policy. We also devise a line search method to maintain safety and reach higher returns simultaneously while leveraging the shield policy. Comprehensive experiments on classical benchmarks such as constrained tracking and navigation indicate that the proposed algorithm achieves comparable returns with far fewer constraint violations during training.

Submitted 14 October, 2022; originally announced October 2022.

Comments: 12 pages, 6 figures

arXiv:2209.06823 [pdf]

DEANet: Decomposition Enhancement and Adjustment Network for Low-Light Image Enhancement

Authors: Yonglong Jiang, Liangliang Li, Yuan Xue, Hongbing Ma

Abstract: Low-light conditions seriously degrade the quality of the images obtained. Solving the problem of poor low-light image quality can effectively improve the visual quality of images and the usability of computer vision. In addition, it has very important applications in many fields. This paper proposes a DEANet based on Retinex for low-light image enhancement. It combines the frequency information and content information of the image into three sub-networks: decomposition network, enhancement network and adjustment network. These three sub-networks are respectively used for decomposition, denoising, contrast enhancement and detail preservation, adjustment, and image generation. Our model produces robust results for all low-light images. The model is trained on the public data set LOL, and the experimental results show that our method is better than the existing state-of-the-art methods in terms of vision and quality.

Submitted 13 September, 2022; originally announced September 2022.

Comments: 8 pages, 7 figures

arXiv:2209.02604 [pdf, other]

Make Acoustic and Visual Cues Matter: CH-SIMS v2.0 Dataset and AV-Mixup Consistent Module

Authors: Yihe Liu, Ziqi Yuan, Huisheng Mao, Zhiyun Liang, Wanqiuyue Yang, Yuanzhe Qiu, Tie Cheng, Xiaoteng Li, Hua Xu, Kai Gao

Abstract: Multimodal sentiment analysis (MSA), which is supposed to improve text-based sentiment analysis with associated acoustic and visual modalities, is an emerging research area due to its potential applications in Human-Computer Interaction (HCI). However, existing research observes that the acoustic and visual modalities contribute much less than the textual modality, a situation termed text-predominant. Under such circumstances, in this work, we emphasize making non-verbal cues matter for the MSA task. Firstly, from the resource perspective, we present the CH-SIMS v2.0 dataset, an extension and enhancement of the CH-SIMS. Compared with the original dataset, the CH-SIMS v2.0 doubles its size with another 2121 refined video segments with both unimodal and multimodal annotations and collects 10161 unlabelled raw video segments with rich acoustic and visual emotion-bearing context to highlight non-verbal cues for sentiment prediction. Secondly, from the model perspective, benefiting from the unimodal annotations and the unsupervised data in the CH-SIMS v2.0, the Acoustic Visual Mixup Consistent (AV-MC) framework is proposed. The designed modality mixup module can be regarded as an augmentation, which mixes the acoustic and visual modalities from different videos. Through drawing unobserved multimodal context along with the text, the model can learn to be aware of different non-verbal contexts for sentiment prediction. Our evaluations demonstrate that both CH-SIMS v2.0 and the AV-MC framework enable further research for discovering emotion-bearing acoustic and visual cues and pave the path to interpretable end-to-end HCI applications for real-world scenarios.

Submitted 21 August, 2022; originally announced September 2022.

Comments: 16 pages, 7 figures, accepted by ICMI 2022

arXiv:2208.09646 [pdf, other]

doi 10.1145/3552466.3556525

An Initial Investigation for Detecting Vocoder Fingerprints of Fake Audio

Authors: Xinrui Yan, Jiangyan Yi, Jianhua Tao, Chenglong Wang, Haoxin Ma, Tao Wang, Shiming Wang, Ruibo Fu

Abstract: Many effective attempts have been made for fake audio detection. However, they can only provide detection results but no countermeasures to curb this harm. For many related practical applications, it is also necessary to know what model or algorithm generated the fake audio. Therefore, we propose a new problem for detecting vocoder fingerprints of fake audio. Experiments are conducted on the datasets synthesized by eight state-of-the-art vocoders. We have preliminarily explored the features and model architectures. The t-SNE visualization shows that different vocoders generate distinct vocoder fingerprints.

Submitted 20 August, 2022; originally announced August 2022.

Comments: Accepted by ACM Multimedia 2022 Workshop: First International Workshop on Deepfake Detection for Audio Multimedia

arXiv:2208.09618 [pdf, other]

Fully Automated End-to-End Fake Audio Detection

Authors: Chenglong Wang, Jiangyan Yi, Jianhua Tao, Haiyang Sun, Xun Chen, Zhengkun Tian, Haoxin Ma, Cunhang Fan, Ruibo Fu

Abstract: The existing fake audio detection systems often rely on expert experience to design the acoustic features or manually design the hyperparameters of the network structure. However, manual adjustment of the parameters can have a relatively obvious influence on the results. It is almost impossible to manually set the best set of parameters. Therefore, this paper proposes a fully automated end-to-end fake audio detection method. We first use a wav2vec pre-trained model to obtain a high-level representation of the speech. Furthermore, for the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS. It learns deep speech representations while automatically learning and optimizing complex neural structures consisting of convolutional operations and residual blocks. The experimental results on the ASVspoof 2019 LA dataset show that our proposed system achieves an equal error rate (EER) of 1.08%, which outperforms the state-of-the-art single system.

Submitted 20 August, 2022; originally announced August 2022.

arXiv:2208.05163 [pdf, other]

Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization

Authors: Zhengang Li, Mengshu Sun, Alec Lu, Haoyu Ma, Geng Yuan, Yanyue Xie, Hao Tang, Yanyu Li, Miriam Leeser, Zhangyang Wang, Xue Lin, Zhenman Fang

Abstract: Vision transformers (ViTs) are emerging with significantly improved accuracy in computer vision tasks. However, their complex architecture and enormous computation/storage demand impose urgent needs for new hardware accelerator design methodology. This work proposes an FPGA-aware automatic ViT acceleration framework based on the proposed mixed-scheme quantization. To the best of our knowledge, this is the first FPGA-based ViT acceleration framework exploring model quantization. Compared with state-of-the-art ViT quantization work (algorithmic approach only without hardware acceleration), our quantization achieves 0.47% to 1.36% higher Top-1 accuracy under the same bit-width. Compared with the 32-bit floating-point baseline FPGA accelerator, our accelerator achieves around 5.6x improvement on the frame rate (i.e., 56.8 FPS vs. 10.0 FPS) with 0.71% accuracy drop on ImageNet dataset for DeiT-base.

Submitted 10 August, 2022; originally announced August 2022.

Comments: Published in FPL2022

arXiv:2207.12308 [pdf, other]

CFAD: A Chinese Dataset for Fake Audio Detection

Authors: Haoxin Ma, Jiangyan Yi, Chenglong Wang, Xinrui Yan, Jianhua Tao, Tao Wang, Shiming Wang, Ruibo Fu

Abstract: Fake audio detection is a growing concern and some relevant datasets have been designed for research. However, there is no standard public Chinese dataset under complex conditions. In this paper, we aim to fill in the gap and design a Chinese fake audio detection dataset (CFAD) for studying more generalized detection methods. Twelve mainstream speech-generation techniques are used to generate fake audio. To simulate real-life scenarios, three noise datasets are selected for noise adding at five different signal-to-noise ratios, and six codecs are considered for audio transcoding (format conversion). The CFAD dataset can be used not only for fake audio detection but also for detecting the algorithms of fake utterances for audio forensics. Baseline results are presented with analysis. The results show that fake audio detection with generalization remains challenging. The CFAD dataset is publicly available at: https://zenodo.org/record/8122764.

Submitted 18 July, 2023; v1 submitted 12 July, 2022; originally announced July 2022.

Comments: FAD renamed as CFAD

arXiv:2206.07997 [pdf, ps, other]

Reconfigurable Intelligent Surface-aided $M$-ary FM-DCSK System: a New Design for Noncoherent Chaos-based Communication

Authors: Huan Ma, Yi Fang, Pingping Chen, Yonghui Li

Abstract: In this paper, we propose two reconfigurable intelligent surface-aided $M$-ary frequency-modulated differential chaos shift keying (RIS-$M$-FM-DCSK) schemes. In scheme I, the RIS is regarded as a transmitter at the source to incorporate the $M$-ary phase-shift-keying ($M$-PSK) symbols into the FM chaotic signal and to reflect the resultant $M$-ary FM chaotic signal toward the destination. The information bits of the source are carried by both the positive/negative state of the FM chaotic signal and the $M$-PSK symbols. In scheme II, the RIS is treated as a relay so that both the source and relay can simultaneously transmit their information bits to the destination. The information bits of the source and relay are carried by the positive/negative state of the FM chaotic signal and $M$-PSK symbols generated by the RIS, respectively. The proposed RIS-$M$-FM-DCSK system has the attractive advantage that it does not require channel state information for detection, thus avoiding complex channel estimation. Moreover, we derive the theoretical expressions for bit error rates (BERs) of the proposed RIS-$M$-FM-DCSK system with both scheme I and scheme II over multipath Rayleigh fading channels. Simulation results not only verify the accuracy of the theoretical derivations, but also demonstrate the superiority of the proposed system. The proposed RIS-$M$-FM-DCSK system is a promising low-cost, low-power, and high-reliability alternative for wireless communication networks.

Submitted 16 June, 2022; originally announced June 2022.

arXiv:2205.06971 [pdf, ps, other]

doi 10.1109/TCOMM.2023.3239647

Design of a Reconfigurable Intelligent Surface-Assisted FM-DCSK-SWIPT Scheme with Non-linear Energy Harvesting Model

Authors: Yi Fang, Yiwei Tao, Huan Ma, Yonghui Li, Mohsen Guizani

Abstract: In this paper, we propose a reconfigurable intelligent surface (RIS)-assisted frequency-modulated (FM) differential chaos shift keying (DCSK) scheme with simultaneous wireless information and power transfer (SWIPT), called the RIS-FM-DCSK-SWIPT scheme, for low-power, low-cost, and high-reliability wireless communication networks. In particular, the proposed scheme is developed under a non-linear energy-harvesting (EH) model which can accurately characterize the practical situation. The proposed RIS-FM-DCSK-SWIPT scheme has the appealing feature that it does not require channel state information, thus avoiding complex channel estimation. We further derive the closed-form theoretical expressions for the energy shortage probability and bit error rate (BER) of the proposed scheme over the multipath Rayleigh fading channel. In addition, we investigate the influence of key parameters on the performance of the proposed transmission scheme in two different scenarios, i.e., RIS-assisted access point (RIS-AP) and dual-hop communication (RIS-DH). Finally, we carry out various Monte-Carlo experiments to verify the accuracy of the theoretical derivation, illustrate the performance advantage of the proposed scheme, and give some design insights for future study.

Submitted 14 March, 2023; v1 submitted 14 May, 2022; originally announced May 2022.

Comments: accepted by IEEE Transactions on Communications

Showing 1–50 of 94 results for author: Ma, H