Search | arXiv e-print repository

Electrically Reconfigurable Non-Volatile On-Chip Bragg Filter with Multilevel Operation

Authors: Amged Alquliah, Jay Ke-Chieh Sun, Christopher Mekhiel, Chengkuan Gao, Guli Gulinihali, Yeshaiahu Fainman, Abdoulaye Ndao

Abstract: Photonic integrated circuits (PICs) demand tailored spectral responses for various applications. On-chip Bragg filters offer a promising solution, yet their static nature hampers scalability. Current tunable filters rely on volatile switching mechanisms plagued by high static power consumption and thermal crosstalk. Here, we introduce, for the first time, a non-volatile, electrically programmable… ▽ More Photonic integrated circuits (PICs) demand tailored spectral responses for various applications. On-chip Bragg filters offer a promising solution, yet their static nature hampers scalability. Current tunable filters rely on volatile switching mechanisms plagued by high static power consumption and thermal crosstalk. Here, we introduce, for the first time, a non-volatile, electrically programmable on-chip Bragg filter. This device incorporates a nanoscale layer of wide-bandgap phase change material (Sb2S3) atop a periodically structured silicon waveguide. The reversible phase transitions and drastic refractive index modulation of Sb2S3 enable dynamic spectral tuning via foundry-compatible microheaters. Our design surpasses traditional passive Bragg gratings and active volatile filters by offering electrically controlled, reconfigurable spectral responses in a non-volatile manner. The proposed filter achieves a peak reflectivity exceeding 99% and a high tuning range ($Δλ$=20 nm) when transitioning between the amorphous and crystalline states of Sb2S3. Additionally, we demonstrate quasi-continuous spectral control of the filter stopband by modulating the amorphous/crystalline distribution within Sb2S3. Our approach offers substantial benefits for low-power, programmable PICs, thereby laying the groundwork for prospective applications in optical communications, optical interconnects, microwave photonics, optical signal processing, and adaptive multi-parameter sensing. △ Less

Submitted 19 August, 2024; originally announced August 2024.

Comments: 20 pages, 4 figures,

arXiv:2407.11172 [pdf]

Micro-Ring Modulator Linearity Enhancement for Analog and Digital Optical Links

Authors: Sumilak Chaudhury, Karl Johnson, Chengkuan Gao, Bill Lin, Yeshaiahu Fainman, Tzu-Chien Hsueh

Abstract: An energy/area-efficient low-cost broadband linearity enhancement technique for electro-optic micro-ring modulators (MRM) is proposed to achieve 6.1-dB dynamic linearity improvement in spurious-free-dynamic-range with intermodulation distortions (IMD) and 17.9-dB static linearity improvement in integral nonlinearity over a conventional notch-filter MRM within a 4.8-dB extinction-ratio (ER) full-sc… ▽ More An energy/area-efficient low-cost broadband linearity enhancement technique for electro-optic micro-ring modulators (MRM) is proposed to achieve 6.1-dB dynamic linearity improvement in spurious-free-dynamic-range with intermodulation distortions (IMD) and 17.9-dB static linearity improvement in integral nonlinearity over a conventional notch-filter MRM within a 4.8-dB extinction-ratio (ER) full-scale range based on rapid silicon-photonics fabrication results for the emerging applications of various analog and digital optical communication systems. △ Less

Submitted 15 July, 2024; originally announced July 2024.

Comments: 4 pages, 5 figures

arXiv:2407.08681 [pdf, other]

Hardware Neural Control of CartPole and F1TENTH Race Car

Authors: Marcin Paluch, Florian Bolli, Xiang Deng, Antonio Rios Navarro, Chang Gao, Tobi Delbruck

Abstract: Nonlinear model predictive control (NMPC) has proven to be an effective control method, but it is expensive to compute. This work demonstrates the use of hardware FPGA neural network controllers trained to imitate NMPC with supervised learning. We use these Neural Controllers (NCs) implemented on inexpensive embedded FPGA hardware for high frequency control on physical cartpole and F1TENTH race ca… ▽ More Nonlinear model predictive control (NMPC) has proven to be an effective control method, but it is expensive to compute. This work demonstrates the use of hardware FPGA neural network controllers trained to imitate NMPC with supervised learning. We use these Neural Controllers (NCs) implemented on inexpensive embedded FPGA hardware for high frequency control on physical cartpole and F1TENTH race car. Our results show that the NCs match the control performance of the NMPCs in simulation and outperform it in reality, due to the faster control rate that is afforded by the quick FPGA NC inference. We demonstrate kHz control rates for a physical cartpole and offloading control to the FPGA hardware on the F1TENTH car. Code and hardware implementation for this paper are available at https:// github.com/SensorsINI/Neural-Control-Tools. △ Less

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2407.04051 [pdf, other]

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

Authors: Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang , et al. (8 additional authors not shown)

Abstract: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, sp… ▽ More This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels in multi-lingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning codes released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology. Demos are available at https://fun-audio-llm.github.io, and the code can be accessed at https://github.com/FunAudioLLM. △ Less

Submitted 10 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

Comments: Work in progress. Authors are listed in alphabetical order by family name

arXiv:2407.03361 [pdf, ps, other]

PianoBART: Symbolic Piano Music Generation and Understanding with Large-Scale Pre-Training

Authors: Xiao Liang, Zijian Zhao, Weichao Zeng, Yutong He, Fupeng He, Yiyi Wang, Chengying Gao

Abstract: Learning musical structures and composition patterns is necessary for both music generation and understanding, but current methods do not make uniform use of learned features to generate and comprehend music simultaneously. In this paper, we propose PianoBART, a pre-trained model that uses BART for both symbolic piano music generation and understanding. We devise a multi-level object selection str… ▽ More Learning musical structures and composition patterns is necessary for both music generation and understanding, but current methods do not make uniform use of learned features to generate and comprehend music simultaneously. In this paper, we propose PianoBART, a pre-trained model that uses BART for both symbolic piano music generation and understanding. We devise a multi-level object selection strategy for different pre-training tasks of PianoBART, which can prevent information leakage or loss and enhance learning ability. The musical semantics captured in pre-training are fine-tuned for music generation and understanding tasks. Experiments demonstrate that PianoBART efficiently learns musical patterns and achieves outstanding performance in generating high-quality coherent pieces and comprehending music. Our code and supplementary material are available at https://github.com/RS2002/PianoBart. △ Less

Submitted 25 June, 2024; originally announced July 2024.

arXiv:2405.03905 [pdf, other]

A 65nm 36nJ/Decision Bio-inspired Temporal-Sparsity-Aware Digital Keyword Spotting IC with 0.6V Near-Threshold SRAM

Authors: Qinyu Chen, Kwantae Kim, Chang Gao, Sheng Zhou, Taekwang Jang, Tobi Delbruck, Shih-Chii Liu

Abstract: This paper introduces, to the best of the authors' knowledge, the first fine-grained temporal sparsity-aware keyword spotting (KWS) IC leveraging temporal similarities between neighboring feature vectors extracted from input frames and network hidden states, eliminating unnecessary operations and memory accesses. This KWS IC, featuring a bio-inspired delta-gated recurrent neural network (ΔRNN) cla… ▽ More This paper introduces, to the best of the authors' knowledge, the first fine-grained temporal sparsity-aware keyword spotting (KWS) IC leveraging temporal similarities between neighboring feature vectors extracted from input frames and network hidden states, eliminating unnecessary operations and memory accesses. This KWS IC, featuring a bio-inspired delta-gated recurrent neural network (ΔRNN) classifier, achieves an 11-class Google Speech Command Dataset (GSCD) KWS accuracy of 90.5% and energy consumption of 36nJ/decision. At 87% temporal sparsity, computing latency and energy per inference are reduced by 2.4$\times$/3.4$\times$, respectively. The 65nm design occupies 0.78mm$^2$ and features two additional blocks, a compact 0.084mm$^2$ digital infinite-impulse-response (IIR)-based band-pass filter (BPF) audio feature extractor (FEx) and a 24kB 0.6V near-Vth weight SRAM with 6.6$\times$ lower read power compared to the standard SRAM. △ Less

Submitted 6 May, 2024; originally announced May 2024.

arXiv:2404.15364 [pdf, other]

doi 10.1109/LMWT.2024.3386330

MP-DPD: Low-Complexity Mixed-Precision Neural Networks for Energy-Efficient Digital Predistortion of Wideband Power Amplifiers

Authors: Yizhuo Wu, Ang Li, Mohammadreza Beikmirza, Gagan Deep Singh, Qinyu Chen, Leo C. N. de Vreede, Morteza Alavi, Chang Gao

Abstract: Digital Pre-Distortion (DPD) enhances signal quality in wideband RF power amplifiers (PAs). As signal bandwidths expand in modern radio systems, DPD's energy consumption increasingly impacts overall system efficiency. Deep Neural Networks (DNNs) offer promising advancements in DPD, yet their high complexity hinders their practical deployment. This paper introduces open-source mixed-precision (MP)… ▽ More Digital Pre-Distortion (DPD) enhances signal quality in wideband RF power amplifiers (PAs). As signal bandwidths expand in modern radio systems, DPD's energy consumption increasingly impacts overall system efficiency. Deep Neural Networks (DNNs) offer promising advancements in DPD, yet their high complexity hinders their practical deployment. This paper introduces open-source mixed-precision (MP) neural networks that employ quantized low-precision fixed-point parameters for energy-efficient DPD. This approach reduces computational complexity and memory footprint, thereby lowering power consumption without compromising linearization efficacy. Applied to a 160MHz-BW 1024-QAM OFDM signal from a digital RF PA, MP-DPD gives no performance loss against 32-bit floating-point precision DPDs, while achieving -43.75 (L)/-45.27 (R) dBc in Adjacent Channel Power Ratio (ACPR) and -38.72 dB in Error Vector Magnitude (EVM). A 16-bit fixed-point-precision MP-DPD enables a 2.8X reduction in estimated inference power. The PyTorch learning and testing code is publicly available at \url{https://github.com/lab-emi/OpenDPD}. △ Less

Submitted 18 April, 2024; originally announced April 2024.

Comments: Accepted to IEEE Microwave and Wireless Technology Letters (MWTL)

arXiv:2404.01479 [pdf, other]

Information Processing in Hybrid Photonic Electrical Reservoir Computing

Authors: Prabhav Gaur, Chengkuan Gao, Karl Johnson, Shimon Rubin, Yeshaiahu Fainman, Tzu-Chien Hsueh

Abstract: Physical Reservoir Computing (PRC) is a recently developed variant of Neuromorphic Computing, where a pertinent physical system effectively projects information encoded in the input signal into a higher-dimensional space. While various physical hardware has demonstrated promising results for Reservoir Computing (RC), systems allowing tunability of their dynamical regimes have not received much att… ▽ More Physical Reservoir Computing (PRC) is a recently developed variant of Neuromorphic Computing, where a pertinent physical system effectively projects information encoded in the input signal into a higher-dimensional space. While various physical hardware has demonstrated promising results for Reservoir Computing (RC), systems allowing tunability of their dynamical regimes have not received much attention regarding how to optimize relevant system parameters. In this work we employ hybrid photonic-electronic (HPE) system offering both parallelism inherent to light propagation, and electronic memory and programmable feedback allowing to induce nonlinear dynamics and tunable encoding of the photonic signal to realize HPE-RC. Specifically, we experimentally and theoretically analyze performance of integrated silicon photonic on-chip Mach-Zehnder interferometer and ring resonators with heaters acting as programmable phase modulators, controlled by detector and the feedback unit capable of realizing complex temporal dynamics of the photonic signal. Furthermore, we present an algorithm capable of predicting optimal parameters for RC by analyzing the corresponding Lyapunov exponent of the output signal and mutual information of reservoir nodes. By implementing the derived optimal parameters, we demonstrate that the corresponding resulting error of RC can be lowered by several orders of magnitude compared to a reservoir operating with randomly chosen set of parameters. △ Less

Submitted 1 April, 2024; originally announced April 2024.

arXiv:2403.18992 [pdf]

Tractography with T1-weighted MRI and associated anatomical constraints on clinical quality diffusion MRI

Authors: Tian Yu, Yunhe Li, Michael E. Kim, Chenyu Gao, Qi Yang, Leon Y. Cai, Susane M. Resnick, Lori L. Beason-Held, Daniel C. Moyer, Kurt G. Schilling, Bennett A. Landman

Abstract: Diffusion MRI (dMRI) streamline tractography, the gold standard for in vivo estimation of brain white matter (WM) pathways, has long been considered indicative of macroscopic relationships with WM microstructure. However, recent advances in tractography demonstrated that convolutional recurrent neural networks (CoRNN) trained with a teacher-student framework have the ability to learn and propagate… ▽ More Diffusion MRI (dMRI) streamline tractography, the gold standard for in vivo estimation of brain white matter (WM) pathways, has long been considered indicative of macroscopic relationships with WM microstructure. However, recent advances in tractography demonstrated that convolutional recurrent neural networks (CoRNN) trained with a teacher-student framework have the ability to learn and propagate streamlines directly from T1 and anatomical contexts. Training for this network has previously relied on high-resolution dMRI. In this paper, we generalize the training mechanism to traditional clinical resolution data, which allows generalizability across sensitive and susceptible study populations. We train CoRNN on a small subset of the Baltimore Longitudinal Study of Aging (BLSA), which better resembles clinical protocols. Then, we define a metric, termed the epsilon ball seeding method, to compare T1 tractography and traditional diffusion tractography at the streamline level. Under this metric, T1 tractography generated by CoRNN reproduces diffusion tractography with approximately two millimeters of error. △ Less

Submitted 27 March, 2024; originally announced March 2024.

arXiv:2403.05937 [pdf, other]

Wavelet-Like Transform-Based Technology in Response to the Call for Proposals on Neural Network-Based Image Coding

Authors: Cunhui Dong, Haichuan Ma, Haotian Zhang, Changsheng Gao, Li Li, Dong Liu

Abstract: Neural network-based image coding has been developing rapidly since its birth. Until 2022, its performance has surpassed that of the best-performing traditional image coding framework -- H.266/VVC. Witnessing such success, the IEEE 1857.11 working subgroup initializes a neural network-based image coding standard project and issues a corresponding call for proposals (CfP). In response to the CfP, t… ▽ More Neural network-based image coding has been developing rapidly since its birth. Until 2022, its performance has surpassed that of the best-performing traditional image coding framework -- H.266/VVC. Witnessing such success, the IEEE 1857.11 working subgroup initializes a neural network-based image coding standard project and issues a corresponding call for proposals (CfP). In response to the CfP, this paper introduces a novel wavelet-like transform-based end-to-end image coding framework -- iWaveV3. iWaveV3 incorporates many new features such as affine wavelet-like transform, perceptual-friendly quality metric, and more advanced training and online optimization strategies into our previous wavelet-like transform-based framework iWave++. While preserving the features of supporting lossy and lossless compression simultaneously, iWaveV3 also achieves state-of-the-art compression efficiency for objective quality and is very competitive for perceptual quality. As a result, iWaveV3 is adopted as a candidate scheme for developing the IEEE Standard for neural-network-based image coding. △ Less

Submitted 9 March, 2024; originally announced March 2024.

arXiv:2402.09424 [pdf, other]

Epilepsy Seizure Detection and Prediction using an Approximate Spiking Convolutional Transformer

Authors: Qinyu Chen, Congyi Sun, Chang Gao, Shih-Chii Liu

Abstract: Epilepsy is a common disease of the nervous system. Timely prediction of seizures and intervention treatment can significantly reduce the accidental injury of patients and protect the life and health of patients. This paper presents a neuromorphic Spiking Convolutional Transformer, named Spiking Conformer, to detect and predict epileptic seizure segments from scalped long-term electroencephalogram… ▽ More Epilepsy is a common disease of the nervous system. Timely prediction of seizures and intervention treatment can significantly reduce the accidental injury of patients and protect the life and health of patients. This paper presents a neuromorphic Spiking Convolutional Transformer, named Spiking Conformer, to detect and predict epileptic seizure segments from scalped long-term electroencephalogram (EEG) recordings. We report evaluation results from the Spiking Conformer model using the Boston Children's Hospital-MIT (CHB-MIT) EEG dataset. By leveraging spike-based addition operations, the Spiking Conformer significantly reduces the classification computational cost compared to the non-spiking model. Additionally, we introduce an approximate spiking neuron layer to further reduce spike-triggered neuron updates by nearly 38% without sacrificing accuracy. Using raw EEG data as input, the proposed Spiking Conformer achieved an average sensitivity rate of 94.9% and a specificity rate of 99.3% for the seizure detection task, and 96.8%, 89.5% for the seizure prediction task, and needs >10x fewer operations compared to the non-spiking equivalent model. △ Less

Submitted 21 January, 2024; originally announced February 2024.

Comments: To be published at the 2024 IEEE International Symposium on Circuits and Systems (ISCAS), Singapore

arXiv:2402.00803 [pdf, other]

Signal Quality Auditing for Time-series Data

Authors: Chufan Gao, Nicholas Gisolfi, Artur Dubrawski

Abstract: Signal quality assessment (SQA) is required for monitoring the reliability of data acquisition systems, especially in AI-driven Predictive Maintenance (PMx) application contexts. SQA is vital for addressing "silent failures" of data acquisition hardware and software, which when unnoticed, misinform the users of data, creating the risk for incorrect decisions with unintended or even catastrophic co… ▽ More Signal quality assessment (SQA) is required for monitoring the reliability of data acquisition systems, especially in AI-driven Predictive Maintenance (PMx) application contexts. SQA is vital for addressing "silent failures" of data acquisition hardware and software, which when unnoticed, misinform the users of data, creating the risk for incorrect decisions with unintended or even catastrophic consequences. We have developed an open-source software implementation of signal quality indices (SQIs) for the analysis of time-series data. We codify a range of SQIs, demonstrate them using established benchmark data, and show that they can be effective for signal quality assessment. We also study alternative approaches to denoising time-series data in an attempt to improve the quality of the already degraded signal, and evaluate them empirically on relevant real-world data. To our knowledge, our software toolkit is the first to provide an open source implementation of a broad range of signal quality assessment and improvement techniques validated on publicly available benchmark data for ease of reproducibility. The generality of our framework can be easily extended to assessing reliability of arbitrary time-series measurements in complex systems, especially when morphological patterns of the waveform shapes and signal periodicity are of key interest in downstream analyses. △ Less

Submitted 1 February, 2024; originally announced February 2024.

arXiv:2401.12440 [pdf, ps, other]

doi 10.1109/ICASSP48485.2024.10447161

Post-Training Embedding Alignment for Decoupling Enrollment and Runtime Speaker Recognition Models

Authors: Chenyang Gao, Brecht Desplanques, Chelsea J. -T. Ju, Aman Chadha, Andreas Stolcke

Abstract: Automated speaker identification (SID) is a crucial step for the personalization of a wide range of speech-enabled services. Typical SID systems use a symmetric enrollment-verification framework with a single model to derive embeddings both offline for voice profiles extracted from enrollment utterances, and online from runtime utterances. Due to the distinct circumstances of enrollment and runtim… ▽ More Automated speaker identification (SID) is a crucial step for the personalization of a wide range of speech-enabled services. Typical SID systems use a symmetric enrollment-verification framework with a single model to derive embeddings both offline for voice profiles extracted from enrollment utterances, and online from runtime utterances. Due to the distinct circumstances of enrollment and runtime, such as different computation and latency constraints, several applications would benefit from an asymmetric enrollment-verification framework that uses different models for enrollment and runtime embedding generation. To support this asymmetric SID where each of the two models can be updated independently, we propose using a lightweight neural network to map the embeddings from the two independent models to a shared speaker embedding space. Our results show that this approach significantly outperforms cosine scoring in a shared speaker logit space for models that were trained with a contrastive loss on large datasets with many speaker identities. This proposed Neural Embedding Speaker Space Alignment (NESSA) combined with an asymmetric update of only one of the models delivers at least 60% of the performance gain achieved by updating both models in the standard symmetric SID approach. △ Less

Submitted 22 January, 2024; originally announced January 2024.

Comments: Accepted to ICASSP 2024

arXiv:2401.08318 [pdf, other]

OpenDPD: An Open-Source End-to-End Learning & Benchmarking Framework for Wideband Power Amplifier Modeling and Digital Pre-Distortion

Authors: Yizhuo Wu, Gagan Deep Singh, Mohammadreza Beikmirza, Leo C. N. de Vreede, Morteza Alavi, Chang Gao

Abstract: With the rise in communication capacity, deep neural networks (DNN) for digital pre-distortion (DPD) to correct non-linearity in wideband power amplifiers (PAs) have become prominent. Yet, there is a void in open-source and measurement-setup-independent platforms for fast DPD exploration and objective DPD model comparison. This paper presents an open-source framework, OpenDPD, crafted in PyTorch,… ▽ More With the rise in communication capacity, deep neural networks (DNN) for digital pre-distortion (DPD) to correct non-linearity in wideband power amplifiers (PAs) have become prominent. Yet, there is a void in open-source and measurement-setup-independent platforms for fast DPD exploration and objective DPD model comparison. This paper presents an open-source framework, OpenDPD, crafted in PyTorch, with an associated dataset for PA modeling and DPD learning. We introduce a Dense Gated Recurrent Unit (DGRU)-DPD, trained via a novel end-to-end learning architecture, outperforming previous DPD models on a digital PA (DPA) in the new digital transmitter (DTX) architecture with unconventional transfer characteristics compared to analog PAs. Measurements show our DGRU-DPD achieves an ACPR of -44.69/-44.47 dBc and an EVM of -35.22 dB for 200 MHz OFDM signals. OpenDPD code, datasets, and documentation are publicly available at https://github.com/lab-emi/OpenDPD. △ Less

Submitted 24 January, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

Comments: To be published at the 2024 IEEE International Symposium on Circuits and Systems (ISCAS), Singapore

arXiv:2401.06798 [pdf]

Evaluation of Mean Shift, ComBat, and CycleGAN for Harmonizing Brain Connectivity Matrices Across Sites

Authors: Hanliang Xu, Nancy R. Newlin, Michael E. Kim, Chenyu Gao, Praitayini Kanakaraj, Aravind R. Krishnan, Lucas W. Remedios, Nazirah Mohd Khairi, Kimberly Pechman, Derek Archer, Timothy J. Hohman, Angela L. Jefferson, The BIOCARD Study Team, Ivana Isgum, Yuankai Huo, Daniel Moyer, Kurt G. Schilling, Bennett A. Landman

Abstract: Connectivity matrices derived from diffusion MRI (dMRI) provide an interpretable and generalizable way of understanding the human brain connectome. However, dMRI suffers from inter-site and between-scanner variation, which impedes analysis across datasets to improve robustness and reproducibility of results. To evaluate different harmonization approaches on connectivity matrices, we compared graph… ▽ More Connectivity matrices derived from diffusion MRI (dMRI) provide an interpretable and generalizable way of understanding the human brain connectome. However, dMRI suffers from inter-site and between-scanner variation, which impedes analysis across datasets to improve robustness and reproducibility of results. To evaluate different harmonization approaches on connectivity matrices, we compared graph measures derived from these matrices before and after applying three harmonization techniques: mean shift, ComBat, and CycleGAN. The sample comprises 168 age-matched, sex-matched normal subjects from two studies: the Vanderbilt Memory and Aging Project (VMAP) and the Biomarkers of Cognitive Decline Among Normal Individuals (BIOCARD). First, we plotted the graph measures and used coefficient of variation (CoV) and the Mann-Whitney U test to evaluate different methods' effectiveness in removing site effects on the matrices and the derived graph measures. ComBat effectively eliminated site effects for global efficiency and modularity and outperformed the other two methods. However, all methods exhibited poor performance when harmonizing average betweenness centrality. Second, we tested whether our harmonization methods preserved correlations between age and graph measures. All methods except for CycleGAN in one direction improved correlations between age and global efficiency and between age and modularity from insignificant to significant with p-values less than 0.05. △ Less

Submitted 24 January, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

Comments: 11 pages, 5 figures, to be published in SPIE Medical Imaging 2024: Image Processing

arXiv:2312.16987 [pdf]

Image Quality, Uniformity and Computation Improvement of Compressive Light Field Displays with U-Net

Authors: Chen Gao, Haifeng Li, Xu Liu, Xiaodi Tan

Abstract: We apply the U-Net model for compressive light field synthesis. Compared to methods based on stacked CNN and iterative algorithms, this method offers better image quality, uniformity and less computation. We apply the U-Net model for compressive light field synthesis. Compared to methods based on stacked CNN and iterative algorithms, this method offers better image quality, uniformity and less computation. △ Less

Submitted 28 December, 2023; originally announced December 2023.

Comments: 4 pages, 6 figures, conference

MSC Class: 78-06 ACM Class: I.3.7

arXiv:2312.05187 [pdf, other]

Seamless: Multilingual Expressive and Streaming Speech Translation

Authors: Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek , et al. (40 additional authors not shown)

Abstract: Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4… ▽ More Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model-SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. SeamlessM4T v2 provides the foundation on which our next two models are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one's voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. The contributions to this work are publicly released and accessible at https://github.com/facebookresearch/seamless_communication △ Less

Submitted 8 December, 2023; originally announced December 2023.

arXiv:2312.03284 [pdf]

Adaptive Multi-band Modulation for Robust and Low-complexity Faster-than-Nyquist Non-Orthogonal FDM IM-DD System

Authors: Peiji Song, Zhouyi Hu, Yizhan Dai, Yuan Liu, Chao Gao, Chun-Kit Chan

Abstract: Faster-than-Nyquist non-orthogonal frequency-division multiplexing (FTN-NOFDM) is robust against the steep frequency roll-off by saving signal bandwidth. Among the FTN-NOFDM techniques, the non-orthogonal matrix precoding (NOM-p) based FTN has high compatibility with the conventional orthogonal frequency division multiplexing (OFDM), in terms of the advanced digital signal processing already used… ▽ More Faster-than-Nyquist non-orthogonal frequency-division multiplexing (FTN-NOFDM) is robust against the steep frequency roll-off by saving signal bandwidth. Among the FTN-NOFDM techniques, the non-orthogonal matrix precoding (NOM-p) based FTN has high compatibility with the conventional orthogonal frequency division multiplexing (OFDM), in terms of the advanced digital signal processing already used in OFDM. In this work, by dividing the single band into multiple sub-bands in the NOM-p-based FTN-NOFDM system, we propose a novel FTN-NOFDM scheme with adaptive multi-band modulation. The proposed scheme assigns different quadrature amplitude modulation (QAM) levels to different sub-bands, effectively utilizing the low-pass-like channel and reducing the complexity. The impacts of sub-band number and bandwidth compression factor on the bit-error-rate (BER) performance and implementation complexity are experimentally analyzed with a 32.23-Gb/s and 20-km intensity modulation-direct detection (IM-DD) optical transmission system. Results show that the proposed scheme with proper sub-band numbers can lower BER and greatly reduce the complexity compared to the conventional single-band way. △ Less

Submitted 5 December, 2023; originally announced December 2023.

arXiv:2311.12199 [pdf, other]

Improving Label Assignments Learning by Dynamic Sample Dropout Combined with Layer-wise Optimization in Speech Separation

Authors: Chenyang Gao, Yue Gu, Ivan Marsic

Abstract: In supervised speech separation, permutation invariant training (PIT) is widely used to handle label ambiguity by selecting the best permutation to update the model. Despite its success, previous studies showed that PIT is plagued by excessive label assignment switching in adjacent epochs, impeding the model to learn better label assignments. To address this issue, we propose a novel training stra… ▽ More In supervised speech separation, permutation invariant training (PIT) is widely used to handle label ambiguity by selecting the best permutation to update the model. Despite its success, previous studies showed that PIT is plagued by excessive label assignment switching in adjacent epochs, impeding the model to learn better label assignments. To address this issue, we propose a novel training strategy, dynamic sample dropout (DSD), which considers previous best label assignments and evaluation metrics to exclude the samples that may negatively impact the learned label assignments during training. Additionally, we include layer-wise optimization (LO) to improve the performance by solving layer-decoupling. Our experiments showed that combining DSD and LO outperforms the baseline and solves excessive label assignment switching and layer-decoupling issues. The proposed DSD and LO approach is easy to implement, requires no extra training sets or steps, and shows generality to various speech separation tasks. △ Less

Submitted 20 November, 2023; originally announced November 2023.

Comments: Accepted by INTERSPEECH 2023

arXiv:2311.03500 [pdf]

Predicting Age from White Matter Diffusivity with Residual Learning

Authors: Chenyu Gao, Michael E. Kim, Ho Hin Lee, Qi Yang, Nazirah Mohd Khairi, Praitayini Kanakaraj, Nancy R. Newlin, Derek B. Archer, Angela L. Jefferson, Warren D. Taylor, Brian D. Boyd, Lori L. Beason-Held, Susan M. Resnick, The BIOCARD Study Team, Yuankai Huo, Katherine D. Van Schaik, Kurt G. Schilling, Daniel Moyer, Ivana Išgum, Bennett A. Landman

Abstract: Imaging findings inconsistent with those expected at specific chronological age ranges may serve as early indicators of neurological disorders and increased mortality risk. Estimation of chronological age, and deviations from expected results, from structural MRI data has become an important task for developing biomarkers that are sensitive to such deviations. Complementary to structural analysis,… ▽ More Imaging findings inconsistent with those expected at specific chronological age ranges may serve as early indicators of neurological disorders and increased mortality risk. Estimation of chronological age, and deviations from expected results, from structural MRI data has become an important task for developing biomarkers that are sensitive to such deviations. Complementary to structural analysis, diffusion tensor imaging (DTI) has proven effective in identifying age-related microstructural changes within the brain white matter, thereby presenting itself as a promising additional modality for brain age prediction. Although early studies have sought to harness DTI's advantages for age estimation, there is no evidence that the success of this prediction is owed to the unique microstructural and diffusivity features that DTI provides, rather than the macrostructural features that are also available in DTI data. Therefore, we seek to develop white-matter-specific age estimation to capture deviations from normal white matter aging. Specifically, we deliberately disregard the macrostructural information when predicting age from DTI scalar images, using two distinct methods. The first method relies on extracting only microstructural features from regions of interest. The second applies 3D residual neural networks (ResNets) to learn features directly from the images, which are non-linearly registered and warped to a template to minimize macrostructural variations. When tested on unseen data, the first method yields mean absolute error (MAE) of 6.11 years for cognitively normal participants and MAE of 6.62 years for cognitively impaired participants, while the second method achieves MAE of 4.69 years for cognitively normal participants and MAE of 4.96 years for cognitively impaired participants. We find that the ResNet model captures subtler, non-macrostructural features for brain age prediction. △ Less

Submitted 21 January, 2024; v1 submitted 6 November, 2023; originally announced November 2023.

Comments: SPIE Medical Imaging: Image Processing. San Diego, CA. February 2024 (accepted as poster presentation)

arXiv:2311.02842 [pdf, other]

An invariant feature extraction for multi-modal images matching

Authors: Chenzhong Gao, Wei Li

Abstract: This paper aims at providing an effective multi-modal images invariant feature extraction and matching algorithm for the application of multi-source data analysis. Focusing on the differences and correlation of multi-modal images, a feature-based matching algorithm is implemented. The key technologies include phase congruency (PC) and Shi-Tomasi feature point for keypoints detection, LogGabor filt… ▽ More This paper aims at providing an effective multi-modal images invariant feature extraction and matching algorithm for the application of multi-source data analysis. Focusing on the differences and correlation of multi-modal images, a feature-based matching algorithm is implemented. The key technologies include phase congruency (PC) and Shi-Tomasi feature point for keypoints detection, LogGabor filter and a weighted partial main orientation map (WPMOM) for feature extraction, and a multi-scale process to deal with scale differences and optimize matching results. The experimental results on practical data from multiple sources prove that the algorithm has effective performances on multi-modal images, which achieves accurate spatial alignment, showing practical application value and good generalization. △ Less

Submitted 5 November, 2023; originally announced November 2023.

arXiv:2310.17190 [pdf, other]

Lookup Table meets Local Laplacian Filter: Pyramid Reconstruction Network for Tone Mapping

Authors: Feng Zhang, Ming Tian, Zhiqiang Li, Bin Xu, Qingbo Lu, Changxin Gao, Nong Sang

Abstract: Tone mapping aims to convert high dynamic range (HDR) images to low dynamic range (LDR) representations, a critical task in the camera imaging pipeline. In recent years, 3-Dimensional LookUp Table (3D LUT) based methods have gained attention due to their ability to strike a favorable balance between enhancement performance and computational efficiency. However, these methods often fail to deliver… ▽ More Tone mapping aims to convert high dynamic range (HDR) images to low dynamic range (LDR) representations, a critical task in the camera imaging pipeline. In recent years, 3-Dimensional LookUp Table (3D LUT) based methods have gained attention due to their ability to strike a favorable balance between enhancement performance and computational efficiency. However, these methods often fail to deliver satisfactory results in local areas since the look-up table is a global operator for tone mapping, which works based on pixel values and fails to incorporate crucial local information. To this end, this paper aims to address this issue by exploring a novel strategy that integrates global and local operators by utilizing closed-form Laplacian pyramid decomposition and reconstruction. Specifically, we employ image-adaptive 3D LUTs to manipulate the tone in the low-frequency image by leveraging the specific characteristics of the frequency information. Furthermore, we utilize local Laplacian filters to refine the edge details in the high-frequency components in an adaptive manner. Local Laplacian filters are widely used to preserve edge details in photographs, but their conventional usage involves manual tuning and fixed implementation within camera imaging pipelines or photo editing tools. We propose to learn parameter value maps progressively for local Laplacian filters from annotated data using a lightweight network. Our model achieves simultaneous global tone manipulation and local edge detail preservation in an end-to-end manner. Extensive experimental results on two benchmark datasets demonstrate that the proposed method performs favorably against state-of-the-art methods. △ Less

Submitted 3 January, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

Comments: 12 pages, 6 figures, accepted by NeurlPS 2023

arXiv:2310.09071 [pdf, other]

Online Relocating and Matching of Ride-Hailing Services: A Model-Based Modular Approach

Authors: Chang Gao, Xi Lin, Fang He, Xindi Tang

Abstract: This study proposes an innovative model-based modular approach (MMA) to dynamically optimize order matching and vehicle relocation in a ride-hailing platform. MMA utilizes a two-layer and modular modeling structure. The upper layer determines the spatial transfer patterns of vehicle flow within the system to maximize the total revenue of the current and future stages. With the guidance provided by… ▽ More This study proposes an innovative model-based modular approach (MMA) to dynamically optimize order matching and vehicle relocation in a ride-hailing platform. MMA utilizes a two-layer and modular modeling structure. The upper layer determines the spatial transfer patterns of vehicle flow within the system to maximize the total revenue of the current and future stages. With the guidance provided by the upper layer, the lower layer performs rapid vehicle-to-order matching and vehicle relocation. MMA is interpretable, and equipped with the customized and polynomial-time algorithm, which, as an online order-matching and vehicle-relocation algorithm, can scale past thousands of vehicles. We theoretically prove that the proposed algorithm can achieve the global optimum in stylized networks, while the numerical experiments based on both the toy network and realistic dataset demonstrate that MMA is capable of achieving superior systematic performance compared to batch matching and reinforcement-learning based methods. Moreover, its modular and lightweight modeling structure further enables it to achieve a high level of robustness against demand variation while maintaining a relatively low computational cost. △ Less

Submitted 13 October, 2023; originally announced October 2023.

arXiv:2309.12953 [pdf]

Inter-vendor harmonization of Computed Tomography (CT) reconstruction kernels using unpaired image translation

Authors: Aravind R. Krishnan, Kaiwen Xu, Thomas Li, Chenyu Gao, Lucas W. Remedios, Praitayini Kanakaraj, Ho Hin Lee, Shunxing Bao, Kim L. Sandler, Fabien Maldonado, Ivana Isgum, Bennett A. Landman

Abstract: The reconstruction kernel in computed tomography (CT) generation determines the texture of the image. Consistency in reconstruction kernels is important as the underlying CT texture can impact measurements during quantitative image analysis. Harmonization (i.e., kernel conversion) minimizes differences in measurements due to inconsistent reconstruction kernels. Existing methods investigate harmoni… ▽ More The reconstruction kernel in computed tomography (CT) generation determines the texture of the image. Consistency in reconstruction kernels is important as the underlying CT texture can impact measurements during quantitative image analysis. Harmonization (i.e., kernel conversion) minimizes differences in measurements due to inconsistent reconstruction kernels. Existing methods investigate harmonization of CT scans in single or multiple manufacturers. However, these methods require paired scans of hard and soft reconstruction kernels that are spatially and anatomically aligned. Additionally, a large number of models need to be trained across different kernel pairs within manufacturers. In this study, we adopt an unpaired image translation approach to investigate harmonization between and across reconstruction kernels from different manufacturers by constructing a multipath cycle generative adversarial network (GAN). We use hard and soft reconstruction kernels from the Siemens and GE vendors from the National Lung Screening Trial dataset. We use 50 scans from each reconstruction kernel and train a multipath cycle GAN. To evaluate the effect of harmonization on the reconstruction kernels, we harmonize 50 scans each from Siemens hard kernel, GE soft kernel and GE hard kernel to a reference Siemens soft kernel (B30f) and evaluate percent emphysema. We fit a linear model by considering the age, smoking status, sex and vendor and perform an analysis of variance (ANOVA) on the emphysema scores. Our approach minimizes differences in emphysema measurement and highlights the impact of age, sex, smoking status and vendor on emphysema quantification. △ Less

Submitted 26 January, 2024; v1 submitted 22 September, 2023; originally announced September 2023.

Comments: 10 pages, 6 figures, 1 table, Submitted to SPIE Medical Imaging : Image Processing. San Diego, CA. February 2024

arXiv:2307.16508 [pdf, other]

Towards General Low-Light Raw Noise Synthesis and Modeling

Authors: Feng Zhang, Bin Xu, Zhiqiang Li, Xinran Liu, Qingbo Lu, Changxin Gao, Nong Sang

Abstract: Modeling and synthesizing low-light raw noise is a fundamental problem for computational photography and image processing applications. Although most recent works have adopted physics-based models to synthesize noise, the signal-independent noise in low-light conditions is far more complicated and varies dramatically across camera sensors, which is beyond the description of these models. To addres… ▽ More Modeling and synthesizing low-light raw noise is a fundamental problem for computational photography and image processing applications. Although most recent works have adopted physics-based models to synthesize noise, the signal-independent noise in low-light conditions is far more complicated and varies dramatically across camera sensors, which is beyond the description of these models. To address this issue, we introduce a new perspective to synthesize the signal-independent noise by a generative model. Specifically, we synthesize the signal-dependent and signal-independent noise in a physics- and learning-based manner, respectively. In this way, our method can be considered as a general model, that is, it can simultaneously learn different noise characteristics for different ISO levels and generalize to various sensors. Subsequently, we present an effective multi-scale discriminator termed Fourier transformer discriminator (FTD) to distinguish the noise distribution accurately. Additionally, we collect a new low-light raw denoising (LRD) dataset for training and benchmarking. Qualitative validation shows that the noise generated by our proposed noise model can be highly similar to the real noise in terms of distribution. Furthermore, extensive denoising experiments demonstrate that our method performs favorably against state-of-the-art methods on different sensors. △ Less

Submitted 17 August, 2023; v1 submitted 31 July, 2023; originally announced July 2023.

Comments: 11 pages, 7 figures. Accepted by ICCV 2023

Journal ref: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 10820-10830

arXiv:2307.02953 [pdf, other]

SegNetr: Rethinking the local-global interactions and skip connections in U-shaped networks

Authors: Junlong Cheng, Chengrui Gao, Fengjie Wang, Min Zhu

Abstract: Recently, U-shaped networks have dominated the field of medical image segmentation due to their simple and easily tuned structure. However, existing U-shaped segmentation networks: 1) mostly focus on designing complex self-attention modules to compensate for the lack of long-term dependence based on convolution operation, which increases the overall number of parameters and computational complexit… ▽ More Recently, U-shaped networks have dominated the field of medical image segmentation due to their simple and easily tuned structure. However, existing U-shaped segmentation networks: 1) mostly focus on designing complex self-attention modules to compensate for the lack of long-term dependence based on convolution operation, which increases the overall number of parameters and computational complexity of the network; 2) simply fuse the features of encoder and decoder, ignoring the connection between their spatial locations. In this paper, we rethink the above problem and build a lightweight medical image segmentation network, called SegNetr. Specifically, we introduce a novel SegNetr block that can perform local-global interactions dynamically at any stage and with only linear complexity. At the same time, we design a general information retention skip connection (IRSC) to preserve the spatial location information of encoder features and achieve accurate fusion with the decoder features. We validate the effectiveness of SegNetr on four mainstream medical image segmentation datasets, with 59\% and 76\% fewer parameters and GFLOPs than vanilla U-Net, while achieving segmentation performance comparable to state-of-the-art methods. Notably, the components proposed in this paper can be applied to other U-shaped networks to improve their segmentation performance. △ Less

Submitted 21 July, 2023; v1 submitted 6 July, 2023; originally announced July 2023.

arXiv:2306.07505 [pdf]

Deep learning radiomics for assessment of gastroesophageal varices in people with compensated advanced chronic liver disease

Authors: Lan Wang, Ruiling He, Lili Zhao, Jia Wang, Zhengzi Geng, Tao Ren, Guo Zhang, Peng Zhang, Kaiqiang Tang, Chaofei Gao, Fei Chen, Liting Zhang, Yonghe Zhou, Xin Li, Fanbin He, Hui Huan, Wenjuan Wang, Yunxiao Liang, Juan Tang, Fang Ai, Tingyu Wang, Liyun Zheng, Zhongwei Zhao, Jiansong Ji, Wei Liu , et al. (22 additional authors not shown)

Abstract: Objective: Bleeding from gastroesophageal varices (GEV) is a medical emergency associated with high mortality. We aim to construct an artificial intelligence-based model of two-dimensional shear wave elastography (2D-SWE) of the liver and spleen to precisely assess the risk of GEV and high-risk gastroesophageal varices (HRV). Design: A prospective multicenter study was conducted in patients with… ▽ More Objective: Bleeding from gastroesophageal varices (GEV) is a medical emergency associated with high mortality. We aim to construct an artificial intelligence-based model of two-dimensional shear wave elastography (2D-SWE) of the liver and spleen to precisely assess the risk of GEV and high-risk gastroesophageal varices (HRV). Design: A prospective multicenter study was conducted in patients with compensated advanced chronic liver disease. 305 patients were enrolled from 12 hospitals, and finally 265 patients were included, with 1136 liver stiffness measurement (LSM) images and 1042 spleen stiffness measurement (SSM) images generated by 2D-SWE. We leveraged deep learning methods to uncover associations between image features and patient risk, and thus conducted models to predict GEV and HRV. Results: A multi-modality Deep Learning Risk Prediction model (DLRP) was constructed to assess GEV and HRV, based on LSM and SSM images, and clinical information. Validation analysis revealed that the AUCs of DLRP were 0.91 for GEV (95% CI 0.90 to 0.93, p < 0.05) and 0.88 for HRV (95% CI 0.86 to 0.89, p < 0.01), which were significantly and robustly better than canonical risk indicators, including the value of LSM and SSM. Moreover, DLPR was better than the model using individual parameters, including LSM and SSM images. In HRV prediction, the 2D-SWE images of SSM outperform LSM (p < 0.01). Conclusion: DLRP shows excellent performance in predicting GEV and HRV over canonical risk indicators LSM and SSM. Additionally, the 2D-SWE images of SSM provided more information for better accuracy in predicting HRV than the LSM. △ Less

Submitted 12 June, 2023; originally announced June 2023.

arXiv:2304.11316 [pdf, other]

Iterative fluctuation ghost imaging

Authors: Huan Zhao, Xiao-Qian Wang, Chao Gao, Zhuo Yu, Hong Wang, Yu Wang, Li-Dan Gou, Zhi-Hai Yao

Abstract: We present a new technique, iterative fluctuation ghost imaging (IFGI) which dramatically enhances the resolution of ghost imaging (GI). It is shown that, by the fluctuation characteristics of the second-order correlation function, the imaging information with the narrower point spread function (PSF) than the original information can be got. The effects arising from the PSF and the iteration times… ▽ More We present a new technique, iterative fluctuation ghost imaging (IFGI) which dramatically enhances the resolution of ghost imaging (GI). It is shown that, by the fluctuation characteristics of the second-order correlation function, the imaging information with the narrower point spread function (PSF) than the original information can be got. The effects arising from the PSF and the iteration times also be discussed. △ Less

Submitted 22 April, 2023; originally announced April 2023.

arXiv:2303.17867 [pdf, other]

CAP-VSTNet: Content Affinity Preserved Versatile Style Transfer

Authors: Linfeng Wen, Chengying Gao, Changqing Zou

Abstract: Content affinity loss including feature and pixel affinity is a main problem which leads to artifacts in photorealistic and video style transfer. This paper proposes a new framework named CAP-VSTNet, which consists of a new reversible residual network and an unbiased linear transform module, for versatile style transfer. This reversible residual network can not only preserve content affinity but n… ▽ More Content affinity loss including feature and pixel affinity is a main problem which leads to artifacts in photorealistic and video style transfer. This paper proposes a new framework named CAP-VSTNet, which consists of a new reversible residual network and an unbiased linear transform module, for versatile style transfer. This reversible residual network can not only preserve content affinity but not introduce redundant information as traditional reversible networks, and hence facilitate better stylization. Empowered by Matting Laplacian training loss which can address the pixel affinity loss problem led by the linear transform, the proposed framework is applicable and effective on versatile style transfer. Extensive experiments show that CAP-VSTNet can produce better qualitative and quantitative results in comparison with the state-of-the-art methods. △ Less

Submitted 31 March, 2023; originally announced March 2023.

Comments: CVPR 2023

arXiv:2303.04255 [pdf, other]

Self-supervised speech representation learning for keyword-spotting with light-weight transformers

Authors: Chenyang Gao, Yue Gu, Francesco Caliva, Yuzong Liu

Abstract: Self-supervised speech representation learning (S3RL) is revolutionizing the way we leverage the ever-growing availability of data. While S3RL related studies typically use large models, we employ light-weight networks to comply with tight memory of compute-constrained devices. We demonstrate the effectiveness of S3RL on a keyword-spotting (KS) problem by using transformers with 330k parameters an… ▽ More Self-supervised speech representation learning (S3RL) is revolutionizing the way we leverage the ever-growing availability of data. While S3RL related studies typically use large models, we employ light-weight networks to comply with tight memory of compute-constrained devices. We demonstrate the effectiveness of S3RL on a keyword-spotting (KS) problem by using transformers with 330k parameters and propose a mechanism to enhance utterance-wise distinction, which proves crucial for improving performance on classification tasks. On the Google speech commands v2 dataset, the proposed method applied to the Auto-Regressive Predictive Coding S3RL led to a 1.2% accuracy improvement compared to training from scratch. On an in-house KS dataset with four different keywords, it provided 6% to 23.7% relative false accept improvement at fixed false reject rate. We argue this demonstrates the applicability of S3RL approaches to light-weight models for KS and confirms S3RL is a powerful alternative to traditional supervised learning for resource-constrained applications. △ Less

Submitted 7 March, 2023; originally announced March 2023.

arXiv:2302.13222 [pdf, other]

Speech Corpora Divergence Based Unsupervised Data Selection for ASR

Authors: Changfeng Gao, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan

Abstract: Selecting application scenarios matching data is important for the automatic speech recognition (ASR) training, but it is difficult to measure the matching degree of the training corpus. This study proposes a unsupervised target-aware data selection method based on speech corpora divergence (SCD), which can measure the similarity between two speech corpora. We first use the self-supervised Hubert… ▽ More Selecting application scenarios matching data is important for the automatic speech recognition (ASR) training, but it is difficult to measure the matching degree of the training corpus. This study proposes a unsupervised target-aware data selection method based on speech corpora divergence (SCD), which can measure the similarity between two speech corpora. We first use the self-supervised Hubert model to discretize the speech corpora into label sequence and calculate the N-gram probability distribution. Then we calculate the Kullback-Leibler divergence between the N-grams as the SCD. Finally, we can choose the subset which has minimum SCD to the target corpus for annotation and training. Compared to previous data selection method, the SCD data selection method can focus on more acoustic details and guarantee the diversity of the selected set. We evaluate our method on different accents from Common Voice. Experiments show that the proposed SCD data selection can realize 14.8% relative improvements to the random selection, comparable or even superior to the result of supervised selection. △ Less

Submitted 25 February, 2023; originally announced February 2023.

arXiv:2209.09244 [pdf, other]

Flexible Neural Image Compression via Code Editing

Authors: Chenjian Gao, Tongda Xu, Dailan He, Hongwei Qin, Yan Wang

Abstract: Neural image compression (NIC) has outperformed traditional image codecs in rate-distortion (R-D) performance. However, it usually requires a dedicated encoder-decoder pair for each point on R-D curve, which greatly hinders its practical deployment. While some recent works have enabled bitrate control via conditional coding, they impose strong prior during training and provide limited flexibility.… ▽ More Neural image compression (NIC) has outperformed traditional image codecs in rate-distortion (R-D) performance. However, it usually requires a dedicated encoder-decoder pair for each point on R-D curve, which greatly hinders its practical deployment. While some recent works have enabled bitrate control via conditional coding, they impose strong prior during training and provide limited flexibility. In this paper we propose Code Editing, a highly flexible coding method for NIC based on semi-amortized inference and adaptive quantization. Our work is a new paradigm for variable bitrate NIC. Furthermore, experimental results show that our method surpasses existing variable-rate methods, and achieves ROI coding and multi-distortion trade-off with a single decoder. △ Less

Submitted 19 September, 2022; originally announced September 2022.

Comments: NeurIPS 2022

arXiv:2208.00693 [pdf, other]

doi 10.1109/JSSC.2022.3195610

A 23 $μ$W Keyword Spotting IC with Ring-Oscillator-Based Time-Domain Feature Extraction

Authors: Kwantae Kim, Chang Gao, Rui Graça, Ilya Kiselev, Hoi-Jun Yoo, Tobi Delbruck, Shih-Chii Liu

Abstract: This article presents the first keyword spotting (KWS) IC which uses a ring-oscillator-based time-domain processing technique for its analog feature extractor (FEx). Its extensive usage of time-encoding schemes allows the analog audio signal to be processed in a fully time-domain manner except for the voltage-to-time conversion stage of the analog front-end. Benefiting from fundamental building bl… ▽ More This article presents the first keyword spotting (KWS) IC which uses a ring-oscillator-based time-domain processing technique for its analog feature extractor (FEx). Its extensive usage of time-encoding schemes allows the analog audio signal to be processed in a fully time-domain manner except for the voltage-to-time conversion stage of the analog front-end. Benefiting from fundamental building blocks based on digital logic gates, it offers a better technology scalability compared to conventional voltage-domain designs. Fabricated in a 65 nm CMOS process, the prototyped KWS IC occupies 2.03mm$^{2}$ and dissipates 23 $μ$W power consumption including analog FEx and digital neural network classifier. The 16-channel time-domain FEx achieves 54.89 dB dynamic range for 16 ms frame shift size while consuming 9.3 $μ$W. The measurement result verifies that the proposed IC performs a 12-class KWS task on the Google Speech Command Dataset (GSCD) with >86% accuracy and 12.4 ms latency. △ Less

Submitted 1 August, 2022; originally announced August 2022.

Comments: 14 pages, 21 figures, 2 tables

arXiv:2206.07219 [pdf, ps, other]

A Projection-Based K-space Transformer Network for Undersampled Radial MRI Reconstruction with Limited Training Subjects

Authors: Chang Gao, Shu-Fu Shih, J. Paul Finn, Xiaodong Zhong

Abstract: The recent development of deep learning combined with compressed sensing enables fast reconstruction of undersampled MR images and has achieved state-of-the-art performance for Cartesian k-space trajectories. However, non-Cartesian trajectories such as the radial trajectory need to be transformed onto a Cartesian grid in each iteration of the network training, slowing down the training process and… ▽ More The recent development of deep learning combined with compressed sensing enables fast reconstruction of undersampled MR images and has achieved state-of-the-art performance for Cartesian k-space trajectories. However, non-Cartesian trajectories such as the radial trajectory need to be transformed onto a Cartesian grid in each iteration of the network training, slowing down the training process and posing inconvenience and delay during training. Multiple iterations of nonuniform Fourier transform in the networks offset the deep learning advantage of fast inference. Current approaches typically either work on image-to-image networks or grid the non-Cartesian trajectories before the network training to avoid the repeated gridding process. However, the image-to-image networks cannot ensure the k-space data consistency in the reconstructed images and the pre-processing of non-Cartesian k-space leads to gridding errors which cannot be compensated by the network training. Inspired by the Transformer network to handle long-range dependencies in sequence transduction tasks, we propose to rearrange the radial spokes to sequential data based on the chronological order of acquisition and use the Transformer to predict unacquired radial spokes from acquired ones. We propose novel data augmentation methods to generate a large amount of training data from a limited number of subjects. The network can be generated to different anatomical structures. Experimental results show superior performance of the proposed framework compared to state-of-the-art deep neural networks. △ Less

Submitted 25 July, 2022; v1 submitted 14 June, 2022; originally announced June 2022.

Comments: Accepted at MICCAI 2022

arXiv:2206.06127 [pdf, other]

SyntheX: Scaling Up Learning-based X-ray Image Analysis Through In Silico Experiments

Authors: Cong Gao, Benjamin D. Killeen, Yicheng Hu, Robert B. Grupp, Russell H. Taylor, Mehran Armand, Mathias Unberath

Abstract: Artificial intelligence (AI) now enables automated interpretation of medical images for clinical use. However, AI's potential use for interventional images (versus those involved in triage or diagnosis), such as for guidance during surgery, remains largely untapped. This is because surgical AI systems are currently trained using post hoc analysis of data collected during live surgeries, which has… ▽ More Artificial intelligence (AI) now enables automated interpretation of medical images for clinical use. However, AI's potential use for interventional images (versus those involved in triage or diagnosis), such as for guidance during surgery, remains largely untapped. This is because surgical AI systems are currently trained using post hoc analysis of data collected during live surgeries, which has fundamental and practical limitations, including ethical considerations, expense, scalability, data integrity, and a lack of ground truth. Here, we demonstrate that creating realistic simulated images from human models is a viable alternative and complement to large-scale in situ data collection. We show that training AI image analysis models on realistically synthesized data, combined with contemporary domain generalization or adaptation techniques, results in models that on real data perform comparably to models trained on a precisely matched real data training set. Because synthetic generation of training data from human-based models scales easily, we find that our model transfer paradigm for X-ray image analysis, which we refer to as SyntheX, can even outperform real data-trained models due to the effectiveness of training on a larger dataset. We demonstrate the potential of SyntheX on three clinical tasks: Hip image analysis, surgical robotic tool detection, and COVID-19 lung lesion segmentation. SyntheX provides an opportunity to drastically accelerate the conception, design, and evaluation of intelligent systems for X-ray-based medicine. In addition, simulated image environments provide the opportunity to test novel instrumentation, design complementary surgical approaches, and envision novel techniques that improve outcomes, save time, or mitigate human error, freed from the ethical and practical considerations of live human data collection. △ Less

Submitted 13 June, 2022; originally announced June 2022.

arXiv:2205.14501 [pdf, other]

PO-ELIC: Perception-Oriented Efficient Learned Image Coding

Authors: Dailan He, Ziming Yang, Hongjiu Yu, Tongda Xu, Jixiang Luo, Yuan Chen, Chenjian Gao, Xinjie Shi, Hongwei Qin, Yan Wang

Abstract: In the past years, learned image compression (LIC) has achieved remarkable performance. The recent LIC methods outperform VVC in both PSNR and MS-SSIM. However, the low bit-rate reconstructions of LIC suffer from artifacts such as blurring, color drifting and texture missing. Moreover, those varied artifacts make image quality metrics correlate badly with human perceptual quality. In this paper, w… ▽ More In the past years, learned image compression (LIC) has achieved remarkable performance. The recent LIC methods outperform VVC in both PSNR and MS-SSIM. However, the low bit-rate reconstructions of LIC suffer from artifacts such as blurring, color drifting and texture missing. Moreover, those varied artifacts make image quality metrics correlate badly with human perceptual quality. In this paper, we propose PO-ELIC, i.e., Perception-Oriented Efficient Learned Image Coding. To be specific, we adapt ELIC, one of the state-of-the-art LIC models, with adversarial training techniques. We apply a mixture of losses including hinge-form adversarial loss, Charbonnier loss, and style loss, to finetune the model towards better perceptual quality. Experimental results demonstrate that our method achieves comparable perceptual quality with HiFiC with much lower bitrate. △ Less

Submitted 28 May, 2022; originally announced May 2022.

Comments: CVPR2022 Workshop, 5-th CLIC Image Compression Track

arXiv:2204.08797 [pdf, other]

doi 10.1109/TMI.2021.3124217

Two-Stream Graph Convolutional Network for Intra-oral Scanner Image Segmentation

Authors: Yue Zhao, Lingming Zhang, Yang Liu, Deyu Meng, Zhiming Cui, Chenqiang Gao, Xinbo Gao, Chunfeng Lian, Dinggang Shen

Abstract: Precise segmentation of teeth from intra-oral scanner images is an essential task in computer-aided orthodontic surgical planning. The state-of-the-art deep learning-based methods often simply concatenate the raw geometric attributes (i.e., coordinates and normal vectors) of mesh cells to train a single-stream network for automatic intra-oral scanner image segmentation. However, since different ra… ▽ More Precise segmentation of teeth from intra-oral scanner images is an essential task in computer-aided orthodontic surgical planning. The state-of-the-art deep learning-based methods often simply concatenate the raw geometric attributes (i.e., coordinates and normal vectors) of mesh cells to train a single-stream network for automatic intra-oral scanner image segmentation. However, since different raw attributes reveal completely different geometric information, the naive concatenation of different raw attributes at the (low-level) input stage may bring unnecessary confusion in describing and differentiating between mesh cells, thus hampering the learning of high-level geometric representations for the segmentation task. To address this issue, we design a two-stream graph convolutional network (i.e., TSGCN), which can effectively handle inter-view confusion between different raw attributes to more effectively fuse their complementary information and learn discriminative multi-view geometric representations. Specifically, our TSGCN adopts two input-specific graph-learning streams to extract complementary high-level geometric representations from coordinates and normal vectors, respectively. Then, these single-view representations are further fused by a self-attention module to adaptively balance the contributions of different views in learning more discriminative multi-view representations for accurate and fully automatic tooth segmentation. We have evaluated our TSGCN on a real-patient dataset of dental (mesh) models acquired by 3D intraoral scanners. Experimental results show that our TSGCN significantly outperforms state-of-the-art methods in 3D tooth (surface) segmentation. Github: https://github.com/ZhangLingMing1/TSGCNet. △ Less

Submitted 19 April, 2022; originally announced April 2022.

Comments: 11 pages, 6 figures. arXiv admin note: text overlap with arXiv:2012.13697

Journal ref: IEEE Transactions on Medical Images, 41(4): 826-835, 2022

arXiv:2204.00260 [pdf, other]

doi 10.1109/TGRS.2022.3193109

MS-HLMO: Multi-scale Histogram of Local Main Orientation for Remote Sensing Image Registration

Authors: Chenzhong Gao, Wei Li, Ran Tao, Qian Du

Abstract: Multi-source image registration is challenging due to intensity, rotation, and scale differences among the images. Considering the characteristics and differences of multi-source remote sensing images, a feature-based registration algorithm named Multi-scale Histogram of Local Main Orientation (MS-HLMO) is proposed. Harris corner detection is first adopted to generate feature points. The HLMO feat… ▽ More Multi-source image registration is challenging due to intensity, rotation, and scale differences among the images. Considering the characteristics and differences of multi-source remote sensing images, a feature-based registration algorithm named Multi-scale Histogram of Local Main Orientation (MS-HLMO) is proposed. Harris corner detection is first adopted to generate feature points. The HLMO feature of each Harris feature point is extracted on a Partial Main Orientation Map (PMOM) with a Generalized Gradient Location and Orientation Histogram-like (GGLOH) feature descriptor, which provides high intensity, rotation, and scale invariance. The feature points are matched through a multi-scale matching strategy. Comprehensive experiments on 17 multi-source remote sensing scenes demonstrate that the proposed MS-HLMO and its simplified version MS-HLMO$^+$ outperform other competitive registration algorithms in terms of effectiveness and generalization. △ Less

Submitted 1 April, 2022; originally announced April 2022.

arXiv:2202.10916 [pdf, other]

Remaining Useful Life Prediction Using Temporal Deep Degradation Network for Complex Machinery with Attention-based Feature Extraction

Authors: Yuwen Qin, Ningbo Cai, Chen Gao, Yadong Zhang, Yonghong Cheng, Xin Chen

Abstract: The precise estimate of remaining useful life (RUL) is vital for the prognostic analysis and predictive maintenance that can significantly reduce failure rate and maintenance costs. The degradation-related features extracted from the sensor streaming data with neural networks can dramatically improve the accuracy of the RUL prediction. The Temporal deep degradation network (TDDN) model is proposed… ▽ More The precise estimate of remaining useful life (RUL) is vital for the prognostic analysis and predictive maintenance that can significantly reduce failure rate and maintenance costs. The degradation-related features extracted from the sensor streaming data with neural networks can dramatically improve the accuracy of the RUL prediction. The Temporal deep degradation network (TDDN) model is proposed to make the RUL prediction with the degradation-related features given by the one-dimensional convolutional neural network (1D CNN) feature extraction and attention mechanism. 1D CNN is used to extract the temporal features from the streaming sensor data. Temporal features have monotonic degradation trends from the fluctuating raw sensor streaming data. Attention mechanism can improve the RUL prediction performance by capturing the fault characteristics and the degradation development with the attention weights. The performance of the TDDN model is evaluated on the public C-MAPSS dataset and compared with the existing methods. The results show that the TDDN model can achieve the best RUL prediction accuracy in complex conditions compared to current machine learning models. The degradation-related features extracted from the high-dimension sensor streaming data demonstrate the clear degradation trajectories and degradation stages that enable TDDN to predict the turbofan-engine RUL accurately and efficiently. △ Less

Submitted 21 February, 2022; originally announced February 2022.

arXiv:2202.06707 [pdf, other]

doi 10.1109/TCSI.2022.3150165

Spiking Cochlea with System-level Local Automatic Gain Control

Authors: Ilya Kiselev, Chang Gao, Shih-Chii Liu

Abstract: Including local automatic gain control (AGC) circuitry into a silicon cochlea design has been challenging because of transistor mismatch and model complexity. To address this, we present an alternative system-level algorithm that implements channel-specific AGC in a silicon spiking cochlea by measuring the output spike activity of individual channels. The bandpass filter gain of a channel is adapt… ▽ More Including local automatic gain control (AGC) circuitry into a silicon cochlea design has been challenging because of transistor mismatch and model complexity. To address this, we present an alternative system-level algorithm that implements channel-specific AGC in a silicon spiking cochlea by measuring the output spike activity of individual channels. The bandpass filter gain of a channel is adapted dynamically to the input amplitude so that the average output spike rate stays within a defined range. Because this AGC mechanism only needs counting and adding operations, it can be implemented at low hardware cost in a future design. We evaluate the impact of the local AGC algorithm on a classification task where the input signal varies over 32 dB input range. Two classifier types receiving cochlea spike features were tested on a speech versus noise classification task. The logistic regression classifier achieves an average of 6% improvement and 40.8% relative improvement in accuracy when the AGC is enabled. The deep neural network classifier shows a similar improvement for the AGC case and achieves a higher mean accuracy of 96% compared to the best accuracy of 91% from the logistic regression classifier. △ Less

Submitted 14 February, 2022; originally announced February 2022.

Comments: Accepted for publication at the IEEE Transactions on Circuits and Systems I - Regular Papers, 2022

arXiv:2112.12522 [pdf, other]

Multi-Variant Consistency based Self-supervised Learning for Robust Automatic Speech Recognition

Authors: Changfeng Gao, Gaofeng Cheng, Pengyuan Zhang

Abstract: Automatic speech recognition (ASR) has shown rapid advances in recent years but still degrades significantly in far-field and noisy environments. The recent development of self-supervised learning (SSL) technology can improve the ASR performance by pre-training the model with additional unlabeled speech and the SSL pre-trained model has achieved the state-of-the-art result on several speech benchm… ▽ More Automatic speech recognition (ASR) has shown rapid advances in recent years but still degrades significantly in far-field and noisy environments. The recent development of self-supervised learning (SSL) technology can improve the ASR performance by pre-training the model with additional unlabeled speech and the SSL pre-trained model has achieved the state-of-the-art result on several speech benchmarks. Nevertheless, most of the previous SSL methods ignore the influence of the background noise or reverberation, which is crucial to deploying ASR systems in real-world speech applications. This study addresses the robust ASR by introducing a multi-variant consistency (MVC) based SSL method that adapts to different environments. The MVC-SSL is a robust SSL pre-training method designed for noisy and distant-talking speech in real-world applications. Compared to the previous SSL method, the MVC-SSL can calculate the contrastive loss among audios from different acoustic conditions or channels and can learn invariant representations with the change in the environment or the recording equipment. We also explore different SSL training pipelines to balance the noisy distant-talking speech and extra high resource clean speech. We evaluate the proposed method on the commercially-motivated dataset, CHiME-4, and the meeting dataset, AMI. With the help of the MVC-SSL and appropriate training pipeline, we can achieve up to 30% relative word error rate reductions over the baseline wav2vec2.0, one of the most successful SSL methods for ASR. △ Less

Submitted 4 May, 2022; v1 submitted 23 December, 2021; originally announced December 2021.

Comments: 6 pages, 3 figures

arXiv:2112.01766 [pdf, other]

Unsupervised Low-Light Image Enhancement via Histogram Equalization Prior

Authors: Feng Zhang, Yuanjie Shao, Yishi Sun, Kai Zhu, Changxin Gao, Nong Sang

Abstract: Deep learning-based methods for low-light image enhancement typically require enormous paired training data, which are impractical to capture in real-world scenarios. Recently, unsupervised approaches have been explored to eliminate the reliance on paired training data. However, they perform erratically in diverse real-world scenarios due to the absence of priors. To address this issue, we propose… ▽ More Deep learning-based methods for low-light image enhancement typically require enormous paired training data, which are impractical to capture in real-world scenarios. Recently, unsupervised approaches have been explored to eliminate the reliance on paired training data. However, they perform erratically in diverse real-world scenarios due to the absence of priors. To address this issue, we propose an unsupervised low-light image enhancement method based on an effective prior termed histogram equalization prior (HEP). Our work is inspired by the interesting observation that the feature maps of histogram equalization enhanced image and the ground truth are similar. Specifically, we formulate the HEP to provide abundant texture and luminance information. Embedded into a Light Up Module (LUM), it helps to decompose the low-light images into illumination and reflectance maps, and the reflectance maps can be regarded as restored images. However, the derivation based on Retinex theory reveals that the reflectance maps are contaminated by noise. We introduce a Noise Disentanglement Module (NDM) to disentangle the noise and content in the reflectance maps with the reliable aid of unpaired clean images. Guided by the histogram equalization prior and noise disentanglement, our method can recover finer details and is more capable to suppress noise in real-world low-light scenarios. Extensive experiments demonstrate that our method performs favorably against the state-of-the-art unsupervised low-light enhancement algorithms and even matches the state-of-the-art supervised algorithms. △ Less

Submitted 3 December, 2021; originally announced December 2021.

Comments: submitted to IEEE Transactions on Image Processing

arXiv:2110.14484 [pdf]

PL-Net: Progressive Learning Network for Medical Image Segmentation

Authors: Junlong Cheng, Chengrui Gao, Hongchun Lu, Zhangqiang Ming, Yong Yang, Min Zhu

Abstract: In recent years, segmentation methods based on deep convolutional neural networks (CNNs) have made state-of-the-art achievements for many medical analysis tasks. However, most of these approaches improve performance by optimizing the structure or adding new functional modules of the U-Net, which ignoring the complementation and fusion of the coarse-grained and fine-grained semantic information. To… ▽ More In recent years, segmentation methods based on deep convolutional neural networks (CNNs) have made state-of-the-art achievements for many medical analysis tasks. However, most of these approaches improve performance by optimizing the structure or adding new functional modules of the U-Net, which ignoring the complementation and fusion of the coarse-grained and fine-grained semantic information. To solve the above problems, we propose a medical image segmentation framework called progressive learning network (PL-Net), which includes internal progressive learning (IPL) and external progressive learning (EPL). PL-Net has the following advantages: (1) IPL divides feature extraction into two "steps", which can mix different size receptive fields and capture semantic information from coarse to fine granularity without introducing additional parameters; (2) EPL divides the training process into two "stages" to optimize parameters, and realizes the fusion of coarse-grained information in the previous stage and fine-grained information in the latter stage. We evaluate our method in different medical image analysis tasks, and the results show that the segmentation performance of PL-Net is better than the state-of-the-art methods of U-Net and its variants. △ Less

Submitted 29 August, 2022; v1 submitted 27 October, 2021; originally announced October 2021.

arXiv:2110.10593 [pdf]

Progressive Learning for Stabilizing Label Selection in Speech Separation with Mapping-based Method

Authors: Chenyang Gao, Yue Gu, Ivan Marsic

Abstract: Speech separation has been studied in time domain because of lower latency and higher performance compared to time-frequency domain. The masking-based method has been mostly used in time domain, and the other common method (mapping-based) has been inadequately studied. We investigate the use of the mapping-based method in the time domain and show that it can perform better on a large training set… ▽ More Speech separation has been studied in time domain because of lower latency and higher performance compared to time-frequency domain. The masking-based method has been mostly used in time domain, and the other common method (mapping-based) has been inadequately studied. We investigate the use of the mapping-based method in the time domain and show that it can perform better on a large training set than the masking-based method. We also investigate the frequent label-switching problem in permutation invariant training (PIT), which results in suboptimal training because the labels selected by PIT differ across training epochs. Our experiment results showed that PIT works well in a shallow separation model, and the label switching occurs for a deeper model. We inferred that layer decoupling may be the reason for the frequent label switching. Therefore, we propose a training strategy based on progressive learning. This approach significantly reduced inconsistent label assignment without added computational complexity or training corpus. By combining this training strategy with the mapping-based method, we significantly improved the separation performance compared to the baseline. △ Less

Submitted 21 March, 2022; v1 submitted 20 October, 2021; originally announced October 2021.

Comments: Submitted to Interspeech 2022

arXiv:2108.06054 [pdf, other]

Local Patch Network with Global Attention for Infrared Small Target Detection

Authors: Fang Chen, Chenqiang Gao, Fangcen Liu, Yue Zhao, Yuxi Zhou, Deyu Meng, Wangmeng Zuo

Abstract: Infrared small target detection plays an important role in the infrared search and tracking applications. In recent years, deep learning techniques were introduced to this task and achieved noteworthy effects. Following general object segmentation methods, existing deep learning methods usually processed the image from the global view. However, the imaging locality of small targets and extreme cla… ▽ More Infrared small target detection plays an important role in the infrared search and tracking applications. In recent years, deep learning techniques were introduced to this task and achieved noteworthy effects. Following general object segmentation methods, existing deep learning methods usually processed the image from the global view. However, the imaging locality of small targets and extreme class-imbalance between the target and background pixels were not well-considered by these deep learning methods, which causes the low-efficiency on training and high-dependence on numerous data. A local patch network (LPNet) with global attention is proposed in this paper to detect small targets by jointly considering the global and local properties of infrared small target images. From the global view, a supervised attention module trained by the small target spread map is proposed to suppress most background pixels irrelevant with small target features. From the local view, local patches are split from global features and share the same convolution weights with each other in a patch net. By leveraging both the global and local properties, the data-driven framework proposed in this paper has fused multi-scale features for small target detection. Extensive synthetic and real data experiments show that the proposed method achieves the state-of-the-art performance compared with existing both conventional and deep learning methods. △ Less

Submitted 29 September, 2021; v1 submitted 13 August, 2021; originally announced August 2021.

Comments: 11 pages, 7 figures

arXiv:2107.05254 [pdf, ps, other]

Asymptotic analysis of V-BLAST MIMO for coherent optical wireless communications in Gamma-Gamma turbulence

Authors: Yiming Li, Chao Gao, Mark S. Leeson, Xiaofeng Li

Abstract: This paper investigates the asymptotic BER performance of coherent optical wireless communication systems in Gamma-Gamma turbulence when applying the V-BLAST MIMO scheme. A new method is proposed to quantify the performance of the system and mathematical solutions for asymptotic BER performance are derived. Counterintuitive results are shown since the diversity gain of the V-BLAST MIMO system is e… ▽ More This paper investigates the asymptotic BER performance of coherent optical wireless communication systems in Gamma-Gamma turbulence when applying the V-BLAST MIMO scheme. A new method is proposed to quantify the performance of the system and mathematical solutions for asymptotic BER performance are derived. Counterintuitive results are shown since the diversity gain of the V-BLAST MIMO system is equal to the number of the receivers. As a consequence, it is shown that when applying the V-BLAST MIMO scheme, the symbol rate per transmission can be equal to the number of transmitters with some cost to diversity gain. This means that we can simultaneously exploit the spatial multiplexing and diversity properties of the MIMO system to achieve a higher data rate than existing schemes in a channel that displays severe turbulence and moderate attenuation. △ Less

Submitted 12 July, 2021; originally announced July 2021.

arXiv:2104.12572 [pdf, other]

Multi-scale PIIFD for Registration of Multi-source Remote Sensing Images

Authors: Chenzhong Gao, Wei Li

Abstract: This paper aims at providing multi-source remote sensing images registered in geometric space for image fusion. Focusing on the characteristics and differences of multi-source remote sensing images, a feature-based registration algorithm is implemented. The key technologies include image scale-space for implementing multi-scale properties, Harris corner detection for keypoints extraction, and part… ▽ More This paper aims at providing multi-source remote sensing images registered in geometric space for image fusion. Focusing on the characteristics and differences of multi-source remote sensing images, a feature-based registration algorithm is implemented. The key technologies include image scale-space for implementing multi-scale properties, Harris corner detection for keypoints extraction, and partial intensity invariant feature descriptor (PIIFD) for keypoints description. Eventually, a multi-scale Harris-PIIFD image registration algorithm framework is proposed. The experimental results of four sets of representative real data show that the algorithm has excellent, stable performance in multi-source remote sensing image registration, and can achieve accurate spatial alignment, which has strong practical application value and certain generalization ability. △ Less

Submitted 26 April, 2021; originally announced April 2021.

arXiv:2009.02528 [pdf, other]

Structured Sparsity Modeling for Improved Multivariate Statistical Analysis based Fault Isolation

Authors: Wei Chen, Jiusun Zeng, Xiaobin Xu, Shihua Luo, Chuanhou Gao

Abstract: In order to improve the fault diagnosis capability of multivariate statistical methods, this article introduces a fault isolation framework based on structured sparsity modeling. The developed method relies on the reconstruction based contribution analysis and the process structure information can be incorporated into the reconstruction objective function in the form of structured sparsity regular… ▽ More In order to improve the fault diagnosis capability of multivariate statistical methods, this article introduces a fault isolation framework based on structured sparsity modeling. The developed method relies on the reconstruction based contribution analysis and the process structure information can be incorporated into the reconstruction objective function in the form of structured sparsity regularization terms. The structured sparsity terms allow selection of fault variables over structures like blocks or networks of process variables, hence more accurate fault isolation can be achieved. Four structured sparsity terms corresponding to different kinds of process information are considered, namely, partially known sparse support, block sparsity, clustered sparsity and tree-structured sparsity. The optimization problems involving the structured sparsity terms can be solved using the Alternating Direction Method of Multipliers (ADMM) algorithm, which is fast and efficient. Through a simulation example and an application study to a coal-fired power plant, it is verified that the proposed method can better isolate faulty variables by incorporating process structure information. △ Less

Submitted 21 December, 2020; v1 submitted 5 September, 2020; originally announced September 2020.

Comments: 36 pages, 12 figures

arXiv:2008.06262 [pdf, other]

doi 10.1007/s11548-020-02249-1

A Learning-based Method for Online Adjustment of C-arm Cone-Beam CT Source Trajectories for Artifact Avoidance

Authors: Mareike Thies, Jan-Nico Zäch, Cong Gao, Russell Taylor, Nassir Navab, Andreas Maier, Mathias Unberath

Abstract: During spinal fusion surgery, screws are placed close to critical nerves suggesting the need for highly accurate screw placement. Verifying screw placement on high-quality tomographic imaging is essential. C-arm Cone-beam CT (CBCT) provides intraoperative 3D tomographic imaging which would allow for immediate verification and, if needed, revision. However, the reconstruction quality attainable wit… ▽ More During spinal fusion surgery, screws are placed close to critical nerves suggesting the need for highly accurate screw placement. Verifying screw placement on high-quality tomographic imaging is essential. C-arm Cone-beam CT (CBCT) provides intraoperative 3D tomographic imaging which would allow for immediate verification and, if needed, revision. However, the reconstruction quality attainable with commercial CBCT devices is insufficient, predominantly due to severe metal artifacts in the presence of pedicle screws. These artifacts arise from a mismatch between the true physics of image formation and an idealized model thereof assumed during reconstruction. Prospectively acquiring views onto anatomy that are least affected by this mismatch can, therefore, improve reconstruction quality. We propose to adjust the C-arm CBCT source trajectory during the scan to optimize reconstruction quality with respect to a certain task, i.e. verification of screw placement. Adjustments are performed on-the-fly using a convolutional neural network that regresses a quality index for possible next views given the current x-ray image. Adjusting the CBCT trajectory to acquire the recommended views results in non-circular source orbits that avoid poor images, and thus, data inconsistencies. We demonstrate that convolutional neural networks trained on realistically simulated data are capable of predicting quality metrics that enable scene-specific adjustments of the CBCT source trajectory. Using both realistically simulated data and real CBCT acquisitions of a semi-anthropomorphic phantom, we show that tomographic reconstructions of the resulting scene-specific CBCT acquisitions exhibit improved image quality particularly in terms of metal artifacts. Since the optimization objective is implicitly encoded in a neural network, the proposed approach overcomes the need for 3D information at run-time. △ Less

Submitted 14 August, 2020; originally announced August 2020.

Comments: 12 pages

Journal ref: Int. J. CARS 15 (2020) 1787-1796

arXiv:2006.01435 [pdf, other]

Recapture as You Want

Authors: Chen Gao, Si Liu, Ran He, Shuicheng Yan, Bo Li

Abstract: With the increasing prevalence and more powerful camera systems of mobile devices, people can conveniently take photos in their daily life, which naturally brings the demand for more intelligent photo post-processing techniques, especially on those portrait photos. In this paper, we present a portrait recapture method enabling users to easily edit their portrait to desired posture/view, body figur… ▽ More With the increasing prevalence and more powerful camera systems of mobile devices, people can conveniently take photos in their daily life, which naturally brings the demand for more intelligent photo post-processing techniques, especially on those portrait photos. In this paper, we present a portrait recapture method enabling users to easily edit their portrait to desired posture/view, body figure and clothing style, which are very challenging to achieve since it requires to simultaneously perform non-rigid deformation of human body, invisible body-parts reasoning and semantic-aware editing. We decompose the editing procedure into semantic-aware geometric and appearance transformation. In geometric transformation, a semantic layout map is generated that meets user demands to represent part-level spatial constraints and further guides the semantic-aware appearance transformation. In appearance transformation, we design two novel modules, Semantic-aware Attentive Transfer (SAT) and Layout Graph Reasoning (LGR), to conduct intra-part transfer and inter-part reasoning, respectively. SAT module produces each human part by paying attention to the semantically consistent regions in the source portrait. It effectively addresses the non-rigid deformation issue and well preserves the intrinsic structure/appearance with rich texture details. LGR module utilizes body skeleton knowledge to construct a layout graph that connects all relevant part features, where graph reasoning mechanism is used to propagate information among part nodes to mine their relations. In this way, LGR module infers invisible body parts and guarantees global coherence among all the parts. Extensive experiments on DeepFashion, Market-1501 and in-the-wild photos demonstrate the effectiveness and superiority of our approach. Video demo is at: \url{https://youtu.be/vTyq9HL6jgw}. △ Less

Submitted 2 June, 2020; originally announced June 2020.

Comments: 14 pages

Showing 1–50 of 61 results for author: Gao, C