Search | arXiv e-print repository

On Fixed-Time Stability for a Class of Singularly Perturbed Systems using Composite Lyapunov Functions

Authors: Michael Tang, Miroslav Krstic, Jorge Poveda

Abstract: Fixed-time stable dynamical systems are capable of achieving exact convergence to an equilibrium point within a fixed time that is independent of the initial conditions of the system. This property makes them highly appealing for designing control, estimation, and optimization algorithms in applications with stringent performance requirements. However, the set of tools available for analyzing the interconnection of fixed-time stable systems is rather limited compared to their asymptotic counterparts. In this paper, we address some of these limitations by exploiting the emergence of multiple time scales in nonlinear singularly perturbed dynamical systems, where the fast dynamics and the slow dynamics are fixed-time stable on their own. By extending the so-called composite Lyapunov method from asymptotic stability to the context of fixed-time stability, we provide a novel class of Lyapunov-based sufficient conditions to certify fixed-time stability in a class of singularly perturbed dynamical systems. The results are illustrated, analytically and numerically, using a fixed-time gradient flow system interconnected with a fixed-time plant and an additional high-order example.

Submitted 29 August, 2024; originally announced August 2024.
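
The fixed-time property described in the abstract above can be illustrated with a scalar toy problem. The sketch below is not taken from the paper: it assumes the standard textbook Lyapunov inequality V' = -a*V^p - b*V^q with 0 < p < 1 < q, whose settling time is bounded by 1/(a(1-p)) + 1/(b(q-1)) for every initial value. The gains, exponents, step size, and function names are all illustrative choices.

```python
def settling_time(v0, a=1.0, b=1.0, p=0.5, q=1.5, dt=1e-4):
    """Forward-Euler time for V' = -a*V**p - b*V**q to reach zero from v0 > 0."""
    v, t = float(v0), 0.0
    while v > 0.0:
        # The V**p term dominates near zero (finite-time arrival);
        # the V**q term dominates far away (uniform bound over v0).
        v -= dt * (a * v**p + b * v**q)
        t += dt
    return t


def fixed_time_bound(a=1.0, b=1.0, p=0.5, q=1.5):
    """Standard settling-time bound, independent of the initial value."""
    return 1.0 / (a * (1.0 - p)) + 1.0 / (b * (q - 1.0))


# Initial values spanning eight orders of magnitude (hypothetical test points).
initial_values = [1e-2, 1.0, 1e2, 1e4, 1e6]
times = [settling_time(v0) for v0 in initial_values]
bound = fixed_time_bound()

for v0, t in zip(initial_values, times):
    print(f"V(0) = {v0:>9g}  settles at t = {t:.3f}  (bound {bound:.1f})")
```

Every settling time stays below the same bound even as the initial value grows, which is the behavior that distinguishes fixed-time stability from finite-time stability.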

arXiv:2408.03647 [pdf, other]

Real-time Event Recognition of Long-distance Distributed Vibration Sensing with Knowledge Distillation and Hardware Acceleration

Authors: Zhongyao Luo, Hao Wu, Zhao Ge, Ming Tang

Abstract: Fiber-optic sensing, especially distributed optical fiber vibration (DVS) sensing, is gaining importance in internet of things (IoT) applications, such as industrial safety monitoring and intrusion detection. Despite their wide application, existing post-processing methods that rely on deep learning models for event recognition in DVS systems face challenges with real-time processing of large sample data volumes, particularly in long-distance applications. To address this issue, we propose to use a four-layer convolutional neural network (CNN) with ResNet as the teacher model for knowledge distillation. This results in a significant improvement in accuracy, from 83.41% to 95.39%, on data from previously untrained environments. Additionally, we propose a novel hardware design based on field-programmable gate arrays (FPGA) to further accelerate model inference. This design replaces multiplication with binary shift operations and quantizes model weights, enabling high parallelism and low latency. Our implementation achieves an inference time of 0.083 ms for a spatial-temporal sample covering a 12.5 m fiber length and 0.256 s time frame. This performance enables real-time signal processing over approximately 38.55 km of fiber, about $2.14\times$ the capability of an Nvidia GTX 4090 GPU. The proposed method greatly enhances the efficiency of vibration pattern recognition, promoting the use of DVS as a smart IoT system. The data and code are available at https://github.com/HUST-IOF/Efficient-DVS.

Submitted 22 August, 2024; v1 submitted 7 August, 2024; originally announced August 2024.

Comments: 9 pages, 10 figures
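
The 38.55 km figure in the abstract above can be reproduced from the other quoted numbers. The short calculation below assumes that samples are processed one after another and that "real-time" means every 12.5 m segment must be classified before the next 0.256 s frame arrives; that reading is an assumption, not something the abstract states.

```python
# Figures quoted in the abstract.
inference_time_s = 0.083e-3   # 0.083 ms per spatial-temporal sample
frame_s = 0.256               # time span covered by one sample
segment_m = 12.5              # fiber length covered by one sample

# Assumed model: samples are classified sequentially within one frame.
samples_per_frame = frame_s / inference_time_s
fiber_km = samples_per_frame * segment_m / 1000.0

print(round(fiber_km, 2))  # 38.55
```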

arXiv:2407.06373 [pdf]

Enhancing super-resolution ultrasound localisation through multi-frame deconvolution exploiting spatiotemporal coherence

Authors: Su Yan, Clotilde Vié, Marcelo Lerendegui, Herman Verinaz-Jadan, Jipeng Yan, Martina Tashkova, James Burn, Bingxue Wang, Gary Frost, Kevin G. Murphy, Meng-Xing Tang

Abstract: Super-resolution ultrasound imaging through microbubble (MB) localisation and tracking, also known as ultrasound localisation microscopy, allows non-invasive sub-diffraction resolution imaging of microvasculature in animals and humans. The number of MBs localised from the acquired contrast-enhanced ultrasound (CEUS) images and the localisation precision directly influence the quality of the resulting super-resolution microvasculature images. However, non-negligible noise present in the CEUS images can make localising MBs challenging. To enhance the MB localisation performance, we propose a Multi-Frame Deconvolution (MF-Decon) framework that can exploit the spatiotemporal coherence inherent in the CEUS data, with new spatial and temporal regularisers designed based on total variation (TV) and regularisation by denoising (RED). Based on the MF-Decon framework, we introduce two novel methods: MF-Decon with spatial and temporal TVs (MF-Decon+3DTV) and MF-Decon with spatial RED and temporal TV (MF-Decon+RED+TV). Results from in silico simulations indicate that our methods outperform two widely used methods using deconvolution or normalised cross-correlation across all evaluation metrics, including precision, recall, $F_1$ score, mean and standard localisation errors. In particular, our methods improve MB localisation precision by up to 39% and recall by up to 12%. Super-resolution microvasculature maps generated with our methods on a publicly available in vivo rat brain dataset show less noise, better contrast, higher resolution and more vessel structures.

Submitted 8 July, 2024; originally announced July 2024.

Comments: 26 pages, 1 table, 7 figures

arXiv:2407.05168 [pdf, other]

Deception in Nash Equilibrium Seeking

Authors: Michael Tang, Umar Javed, Xudong Chen, Miroslav Krstic, Jorge I. Poveda

Abstract: In socio-technical multi-agent systems, deception exploits privileged information to induce false beliefs in "victims," keeping them oblivious and leading to outcomes detrimental to them or advantageous to the deceiver. We consider model-free Nash-equilibrium-seeking for non-cooperative games with asymmetric information and introduce model-free deceptive algorithms with stability guarantees. In the simplest algorithm, the deceiver includes in his action policy the victim's exploration signal, with an amplitude tuned by an integrator of the regulation error between the deceiver's actual and desired payoff. The integral feedback drives the deceiver's payoff to the payoff's reference value, while the victim is led to adopt a suboptimal action, at which the pseudogradient of the deceiver's payoff is zero. The deceiver's and victim's actions turn out to constitute a "deceptive" Nash equilibrium of a different game, whose structure is managed - in real time - by the deceiver. We examine quadratic, aggregative, and more general games and provide conditions for a successful deception, mutual and benevolent deception, and immunity to deception. Stability results are established using techniques based on averaging and singular perturbations.
Among the examples in the paper is a microeconomic duopoly in which the deceiver induces in the victim a belief that the buyers disfavor the deceiver more than they actually do, leading the victim to increase the price above the Nash price, and resulting in an increased profit for the deceiver and a decreased profit for the victim. A study of the deceiver's integral feedback for the desired profit reveals that, in duopolies with equal marginal costs, a deceiver that is greedy for very high profit can attain any such profit, and pursue this with arbitrarily high integral gain (impatiently), irrespective of the market preference for the victim.

Submitted 6 July, 2024; originally announced July 2024.

arXiv:2406.18009 [pdf, other]

E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

Authors: Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, Yanqing Liu, Sheng Zhao, Naoyuki Kanda

Abstract: This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-autoregressive zero-shot text-to-speech system that offers human-level naturalness and state-of-the-art speaker similarity and intelligibility. In the E2 TTS framework, the text input is converted into a character sequence with filler tokens. The flow-matching-based mel spectrogram generator is then trained based on the audio infilling task. Unlike many previous works, it does not require additional components (e.g., duration model, grapheme-to-phoneme) or complex techniques (e.g., monotonic alignment search). Despite its simplicity, E2 TTS achieves state-of-the-art zero-shot TTS capabilities that are comparable to or surpass previous works, including Voicebox and NaturalSpeech 3. The simplicity of E2 TTS also allows for flexibility in the input representation. We propose several variants of E2 TTS to improve usability during inference. See https://aka.ms/e2tts/ for demo samples.

Submitted 25 June, 2024; originally announced June 2024.

arXiv:2406.16317 [pdf]

SNR-Progressive Model with Harmonic Compensation for Low-SNR Speech Enhancement

Authors: Zhongshu Hou, Tong Lei, Qinwen Hu, Zhanzhong Cao, Ming Tang, Jing Lu

Abstract: Despite significant progress made in the last decade, deep neural network (DNN) based speech enhancement (SE) still faces the challenge of notable degradation in the quality of recovered speech under low signal-to-noise ratio (SNR) conditions. In this letter, we propose an SNR-progressive speech enhancement model with harmonic compensation for low-SNR SE. Reliable pitch estimation is obtained from the intermediate output, which has the benefit of retaining more speech components than the coarse estimate while possessing a significantly higher SNR than the input noisy speech. An effective harmonic compensation mechanism is introduced for better harmonic recovery. Extensive experiments demonstrate the advantage of our proposed model. A multi-modal speech extraction system based on the proposed backbone model ranks first in the ICASSP 2024 MISP Challenge: https://mispchallenge.github.io/mispchallenge2023/index.html.

Submitted 18 August, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.15885 [pdf, other]

The Music Maestro or The Musically Challenged, A Massive Music Evaluation Benchmark for Large Language Models

Authors: Jiajia Li, Lu Yang, Mingni Tang, Cong Chen, Zuchao Li, Ping Wang, Hai Zhao

Abstract: Benchmarks play a pivotal role in assessing the advancements of large language models (LLMs). While numerous benchmarks have been proposed to evaluate LLMs' capabilities, there is a notable absence of a dedicated benchmark for assessing their musical abilities. To address this gap, we present ZIQI-Eval, a comprehensive and large-scale music benchmark specifically designed to evaluate the music-related capabilities of LLMs. ZIQI-Eval encompasses a wide range of questions, covering 10 major categories and 56 subcategories, resulting in over 14,000 meticulously curated data entries. By leveraging ZIQI-Eval, we conduct a comprehensive evaluation over 16 LLMs to evaluate and analyze LLMs' performance in the domain of music. Results indicate that all LLMs perform poorly on the ZIQI-Eval benchmark, suggesting significant room for improvement in their musical capabilities. With ZIQI-Eval, we aim to provide a standardized and robust evaluation framework that facilitates a comprehensive assessment of LLMs' music-related abilities. The dataset is available at GitHub (https://github.com/zcli-charlie/ZIQI-Eval) and HuggingFace (https://huggingface.co/datasets/MYTH-Lab/ZIQI-Eval).

Submitted 22 June, 2024; originally announced June 2024.

Comments: Accepted to ACL-Findings 2024

arXiv:2406.05699 [pdf, ps, other]

An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS

Authors: Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Hemin Yang, Zirun Zhu, Min Tang, Yufei Xia, Jinzhu Li, Sheng Zhao, Jinyu Li, Naoyuki Kanda

Abstract: Recently, zero-shot text-to-speech (TTS) systems, capable of synthesizing any speaker's voice from a short audio prompt, have made rapid advancements. However, the quality of the generated speech significantly deteriorates when the audio prompt contains noise, and limited research has been conducted to address this issue. In this paper, we explored various strategies to enhance the quality of audio generated from noisy audio prompts within the context of flow-matching-based zero-shot TTS. Our investigation includes comprehensive training strategies: unsupervised pre-training with masked speech denoising, multi-speaker detection and DNSMOS-based data filtering on the pre-training data, and fine-tuning with random noise mixing. The results of our experiments demonstrate significant improvements in intelligibility, speaker similarity, and overall audio quality compared to the approach of applying speech enhancement to the audio prompt.

Submitted 9 June, 2024; originally announced June 2024.

Comments: Accepted to Interspeech 2024

arXiv:2406.04281 [pdf, other]

Total-Duration-Aware Duration Modeling for Text-to-Speech Systems

Authors: Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Chung-Hsien Tsai, Canrun Li, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Jinyu Li, Sheng Zhao, Naoyuki Kanda

Abstract: Accurate control of the total duration of generated speech by adjusting the speech rate is crucial for various text-to-speech (TTS) applications. However, the impact of adjusting the speech rate on speech quality, such as intelligibility and speaker characteristics, has been underexplored. In this work, we propose a novel total-duration-aware (TDA) duration model for TTS, where phoneme durations are predicted not only from the text input but also from an additional input of the total target duration. We also propose a MaskGIT-based duration model that enhances the diversity and quality of the predicted phoneme durations. Our results demonstrate that the proposed TDA duration models achieve better intelligibility and speaker similarity for various speech rate configurations compared to the baseline models. We also show that the proposed MaskGIT-based model can generate phoneme durations with higher quality and diversity compared to its regression or flow-matching counterparts.

Submitted 6 June, 2024; originally announced June 2024.

Comments: Accepted to Interspeech 2024

arXiv:2405.19685 [pdf]

Identifying Functional Brain Networks of Spatiotemporal Wide-Field Calcium Imaging Data via a Long Short-Term Memory Autoencoder

Authors: Xiaohui Zhang, Eric C Landsness, Lindsey M Brier, Wei Chen, Michelle J. Tang, Hanyang Miao, Jin-Moo Lee, Mark A. Anastasio, Joseph P. Culver

Abstract: Wide-field calcium imaging (WFCI) that records neural calcium dynamics allows for identification of functional brain networks (FBNs) in mice that express genetically encoded calcium indicators. Estimating FBNs from WFCI data is commonly achieved by use of seed-based correlation (SBC) analysis and independent component analysis (ICA). These two methods are conceptually distinct and each possesses limitations. Recent success of unsupervised representation learning in neuroimage analysis motivates the investigation of such methods to identify FBNs. In this work, a novel approach, referred to as LSTM-AER, is proposed in which a long short-term memory (LSTM) autoencoder (AE) is employed to learn spatial-temporal latent embeddings from WFCI data, followed by an ordinary least squares regression (R) to estimate FBNs. The goal of this study is to elucidate and illustrate, qualitatively and quantitatively, the FBNs identified by use of the LSTM-AER method and compare them to those from traditional SBC and ICA. It was observed that spatial FBN maps produced from LSTM-AER resembled those derived by SBC and ICA while better accounting for intra-subject variation, data from a single hemisphere, shorter epoch lengths and tunable number of latent components. The results demonstrate the potential of unsupervised deep learning-based approaches to identifying and mapping FBNs.

Submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.04253 [pdf]

Fermat Number Transform Based Chromatic Dispersion Compensation and Adaptive Equalization Algorithm

Authors: Siyu Chen, Zheli Liu, Weihao Li, Zihe Hu, Mingming Zhang, Sheng Cui, Ming Tang

Abstract: By introducing the Fermat number transform into chromatic dispersion compensation and adaptive equalization, the computational complexity has been reduced by 68% compared with the conventional implementation. Experimental results validate its transmission performance with only 0.8 dB receiver sensitivity penalty in a 75 km-40 GBaud-PDM-16QAM system.

Submitted 7 May, 2024; originally announced May 2024.
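
For readers unfamiliar with the Fermat number transform named in the entry above, the following is a minimal sketch of the general technique, not the authors' implementation. It assumes the small Fermat prime 257 = 2^8 + 1, a transform length of 16, and the root 2 (which has order 16 modulo 257); all names and the toy input values are illustrative. Because the root is a power of two, the twiddle multiplications reduce to bit shifts in hardware, although this sketch uses plain modular arithmetic for clarity.

```python
MOD = 257   # Fermat number F_3 = 2**8 + 1 (prime)
N = 16      # transform length
ROOT = 2    # 2**8 = 256 = -1 (mod 257), so 2 has order 16


def fnt(x, root=ROOT):
    """Naive O(N^2) number-theoretic transform modulo a Fermat number."""
    return [sum(x[n] * pow(root, n * k, MOD) for n in range(N)) % MOD
            for k in range(N)]


def ifnt(X):
    """Inverse transform: use the inverse root and scale by N^-1 (mod MOD)."""
    inv_n = pow(N, -1, MOD)        # modular inverse, Python 3.8+
    inv_root = pow(ROOT, -1, MOD)
    return [(inv_n * v) % MOD for v in fnt(X, inv_root)]


def cyclic_convolve(a, b):
    """Cyclic convolution via pointwise products in the transform domain."""
    A, B = fnt(a), fnt(b)
    return ifnt([(u * v) % MOD for u, v in zip(A, B)])


def direct_cyclic_convolve(a, b):
    """Reference result computed directly from the definition."""
    return [sum(a[m] * b[(n - m) % N] for m in range(N)) % MOD
            for n in range(N)]


# Toy integer signal and filter taps (hypothetical values, zero-padded to N).
signal = [3, 1, 4, 1, 5, 9, 2, 6] + [0] * 8
taps = [1, 2, 1] + [0] * 13

print(cyclic_convolve(signal, taps))
```

The result is exact integer arithmetic modulo 257, so outputs match the true convolution only while every value stays below the modulus; practical designs pick a larger Fermat number to fit the signal's dynamic range.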

arXiv:2403.19996 [pdf, other]

DeepHeteroIoT: Deep Local and Global Learning over Heterogeneous IoT Sensor Data

Authors: Muhammad Sakib Khan Inan, Kewen Liao, Haifeng Shen, Prem Prakash Jayaraman, Dimitrios Georgakopoulos, Ming Jian Tang

Abstract: Internet of Things (IoT) sensor data or readings evince variations in timestamp range, sampling frequency, geographical location, unit of measurement, etc. Such presented sequence data heterogeneity makes it difficult for traditional time series classification algorithms to perform well. Therefore, addressing the heterogeneity challenge demands learning not only the sub-patterns (local features) but also the overall pattern (global feature). To address the challenge of classifying heterogeneous IoT sensor data (e.g., categorizing sensor data types like temperature and humidity), we propose a novel deep learning model that incorporates both Convolutional Neural Network and Bi-directional Gated Recurrent Unit to learn local and global features respectively, in an end-to-end manner. Through rigorous experimentation on heterogeneous IoT sensor datasets, we validate the effectiveness of our proposed model, which outperforms recent state-of-the-art classification methods as well as several machine learning and deep learning baselines. In particular, the model achieves an average absolute improvement of 3.37% in Accuracy and 2.85% in F1-Score across datasets.

Submitted 29 March, 2024; originally announced March 2024.

Comments: Accepted for Publication and Presented in EAI MobiQuitous 2023 - 20th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services

arXiv:2402.07383 [pdf, other]

Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like

Authors: Naoyuki Kanda, Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Hemin Yang, Zirun Zhu, Min Tang, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Yufei Xia, Jinzhu Li, Yanqing Liu, Sheng Zhao, Michael Zeng

Abstract: Laughter is one of the most expressive and natural aspects of human speech, conveying emotions, social cues, and humor. However, most text-to-speech (TTS) systems lack the ability to produce realistic and appropriate laughter sounds, limiting their applications and user experience. While there have been prior works to generate natural laughter, they fell short in terms of controlling the timing and variety of the laughter to be generated. In this work, we propose ELaTE, a zero-shot TTS that can generate natural laughing speech of any speaker based on a short audio prompt with precise control of laughter timing and expression. Specifically, ELaTE works on the audio prompt to mimic the voice characteristic, the text prompt to indicate the contents of the generated speech, and the input to control the laughter expression, which can be either the start and end times of laughter, or the additional audio prompt that contains laughter to be mimicked. We develop our model based on the foundation of conditional flow-matching-based zero-shot TTS, and fine-tune it with frame-level representation from a laughter detector as additional conditioning. With a simple scheme to mix small-scale laughter-conditioned data with large-scale pre-training data, we demonstrate that a pre-trained zero-shot TTS model can be readily fine-tuned to generate natural laughter with precise controllability, without losing any quality of the pre-trained zero-shot TTS model.
Through objective and subjective evaluations, we show that ELaTE can generate laughing speech with significantly higher quality and controllability compared to conventional models. See https://aka.ms/elate/ for demo samples.

Submitted 4 March, 2024; v1 submitted 11 February, 2024; originally announced February 2024.

Comments: See https://aka.ms/elate/ for demo samples, v2: subjective evaluation has been added

arXiv:2401.08887 [pdf, ps, other]

NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription

Authors: Alon Vinnikov, Amir Ivry, Aviv Hurvitz, Igor Abramovski, Sharon Koubi, Ilya Gurvich, Shai Pe'er, Xiong Xiao, Benjamin Martinez Elizalde, Naoyuki Kanda, Xiaofei Wang, Shalev Shaer, Stav Yagev, Yossi Asher, Sunit Sivasankaran, Yifan Gong, Min Tang, Huaming Wang, Eyal Krupka

Abstract: We introduce the first Natural Office Talkers in Settings of Far-field Audio Recordings ("NOTSOFAR-1") Challenge alongside datasets and baseline system. The challenge focuses on distant speaker diarization and automatic speech recognition (DASR) in far-field meeting scenarios, with single-channel and known-geometry multi-channel tracks, and serves as a launch platform for two new datasets: First, a benchmarking dataset of 315 meetings, averaging 6 minutes each, capturing a broad spectrum of real-world acoustic conditions and conversational dynamics. It is recorded across 30 conference rooms, featuring 4-8 attendees and a total of 35 unique speakers. Second, a 1000-hour simulated training dataset, synthesized with enhanced authenticity for real-world generalization, incorporating 15,000 real acoustic transfer functions. The tasks focus on single-device DASR, where multi-channel devices always share the same known geometry. This is aligned with common setups in actual conference rooms, and avoids technical complexities associated with multi-device tasks. It also allows for the development of geometry-specific solutions. The NOTSOFAR-1 Challenge aims to advance research in the field of distant conversational speech recognition, providing key resources to unlock the potential of data-driven methods, which we believe are currently constrained by the absence of comprehensive high-quality training and benchmarking datasets.

Submitted 16 January, 2024; originally announced January 2024.

Comments: preprint

arXiv:2401.08098 [pdf]

Attention-Based CNN-BiLSTM for Sleep State Classification of Spatiotemporal Wide-Field Calcium Imaging Data

Authors: Xiaohui Zhang, Eric C. Landsness, Hanyang Miao, Wei Chen, Michelle Tang, Lindsey M. Brier, Joseph P. Culver, Jin-Moo Lee, Mark A. Anastasio

Abstract: Background: Wide-field calcium imaging (WFCI) with genetically encoded calcium indicators allows for spatiotemporal recordings of neuronal activity in mice. When applied to the study of sleep, WFCI data are manually scored into the sleep states of wakefulness, non-REM (NREM) and REM by use of adjunct EEG and EMG recordings. However, this process is time-consuming, invasive and often suffers from low inter- and intra-rater reliability. Therefore, an automated sleep state classification method that operates on spatiotemporal WFCI data is desired. New Method: A hybrid network architecture consisting of a convolutional neural network (CNN) to extract spatial features of image frames and a bidirectional long short-term memory network (BiLSTM) with attention mechanism to identify temporal dependencies among different time points was proposed to classify WFCI data into states of wakefulness, NREM and REM sleep. Results: Sleep states were classified with an accuracy of 84% and Cohen's kappa of 0.64. Gradient-weighted class activation maps revealed that the frontal region of the cortex carries more importance when classifying WFCI data into NREM sleep while posterior area contributes most to the identification of wakefulness. The attention scores indicated that the proposed network focuses on short- and long-range temporal dependency in a state-specific manner. Comparison with Existing Method: On a 3-hour WFCI recording, the CNN-BiLSTM achieved a kappa of 0.67, comparable to a kappa of 0.65 corresponding to the human EEG/EMG-based scoring.
Conclusions: The CNN-BiLSTM effectively classifies sleep states from spatiotemporal WFCI data and will enable broader application of WFCI in sleep.

Submitted 15 January, 2024; originally announced January 2024.
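
The entry above reports both accuracy and Cohen's kappa, and the two measure different things: kappa discounts the agreement expected by chance. The sketch below shows the standard kappa formula on a three-class confusion matrix. The counts are invented purely for illustration and are not data from the paper; the function name is likewise hypothetical.

```python
def cohens_kappa(confusion):
    """Return (observed agreement, Cohen's kappa) for a square confusion matrix."""
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of samples on the diagonal (plain accuracy).
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    # Chance agreement: product of the two raters' marginal class frequencies.
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(col) for col in zip(*confusion)]
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total**2
    return observed, (observed - expected) / (1 - expected)


# Invented counts: rows = reference label, columns = predicted label,
# in the order wake, NREM, REM.
toy_confusion = [
    [30, 5, 5],
    [5, 30, 5],
    [5, 5, 30],
]

accuracy, kappa = cohens_kappa(toy_confusion)
print(accuracy, kappa)  # 0.75 0.625
```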

arXiv:2312.10418 [pdf, other]

Fractional Deep Reinforcement Learning for Age-Minimal Mobile Edge Computing

Authors: Lyudong Jin, Ming Tang, Meng Zhang, Hao Wang

Abstract: Mobile edge computing (MEC) is a promising paradigm for real-time applications with intensive computational needs (e.g., autonomous driving), as it can reduce the processing delay. In this work, we focus on the timeliness of computation-intensive updates, measured by Age-of-Information (AoI), and study how to jointly optimize the task updating and offloading policies for AoI with fractional form. Specifically, we consider edge load dynamics and formulate a task scheduling problem to minimize the expected time-average AoI. The uncertain edge load dynamics, the nature of the fractional objective, and hybrid continuous-discrete action space (due to the joint optimization) make this problem challenging and existing approaches not directly applicable. To this end, we propose a fractional reinforcement learning (RL) framework and prove its convergence. We further design a model-free fractional deep RL (DRL) algorithm, where each device makes scheduling decisions with the hybrid action space without knowing the system dynamics and decisions of other devices. Experimental results show that our proposed algorithms reduce the average AoI by up to 57.6% compared with several non-fractional benchmarks.

Submitted 19 December, 2023; v1 submitted 16 December, 2023; originally announced December 2023.

arXiv:2311.08823 [pdf, other]

Ultrafast 3-D Super Resolution Ultrasound using Row-Column Array specific Coherence-based Beamforming and Rolling Acoustic Sub-aperture Processing: In Vitro, In Vivo and Clinical Study

Authors: Joseph Hansen-Shearer, Jipeng Yan, Marcelo Lerendegui, Biao Huang, Matthieu Toulemonde, Kai Riemer, Qingyuan Tan, Johanna Tonko, Peter D. Weinberg, Chris Dunsby, Meng-Xing Tang

Abstract: The row-column addressed array is an emerging probe for ultrafast 3-D ultrasound imaging. It achieves this with far fewer independent electronic channels and a wider field of view than traditional 2-D matrix arrays of the same channel count, making it a good candidate for clinical translation. However, the image quality of row-column arrays is generally poor, particularly when investigating tissue. Ultrasound localisation microscopy allows for the production of super-resolution images even when the initial image resolution is not high. Unfortunately, the row-column probe can suffer from imaging artefacts that can degrade the quality of super-resolution images, as 'secondary' lobes from bright microbubbles can be mistaken for microbubble events, particularly when operated using plane wave imaging. These false events move through the image in a physiologically realistic way, so they can be challenging to remove via tracking, leading to the production of 'false vessels'. Here, a new type of rolling window image reconstruction procedure was developed, which integrated a row-column array-specific coherence-based beamforming technique with acoustic sub-aperture processing for the purposes of reducing 'secondary' lobe artefacts, reducing noise and increasing the effective frame rate. Using an in vitro cross tube, it was found that the procedure reduced the percentage of 'false' locations from ~26% to ~15% compared to traditional orthogonal plane wave compounding. Additionally, it was found that the noise could be reduced by ~7 dB and that the effective frame rate could be increased to over 4000 fps. Subsequently, in vivo ultrasound localisation microscopy was used to produce images non-invasively of a rabbit kidney and a human thyroid.

Submitted 15 November, 2023; originally announced November 2023.

arXiv:2308.13575 [pdf]

FrFT based estimation of linear and nonlinear impairments using Vision Transformer

Authors: Ting Jiang, Zheng Gao, Yizhao Chen, Zihe Hu, Ming Tang

Abstract: To comprehensively assess optical fiber communication system conditions, it is essential to implement joint estimation of the following four critical impairments: nonlinear signal-to-noise ratio (SNRNL), optical signal-to-noise ratio (OSNR), chromatic dispersion (CD) and differential group delay (DGD). However, current studies identify only a limited number of impairments within a narrow range, due to limitations in network capabilities and the lack of a unified representation of impairments. To address these challenges, we adopt time-frequency signal processing based on the fractional Fourier transform (FrFT) to achieve a unified representation of impairments, while employing a Transformer-based neural network (NN) to break through network performance limitations. To verify the effectiveness of the proposed estimation method, a numerical simulation is carried out on a 5-channel polarization-division-multiplexed quadrature phase shift keying (PDM-QPSK) long-haul optical transmission system with a symbol rate of 50 GBaud per channel. The mean absolute error (MAE) for SNRNL, OSNR, CD, and DGD estimation is 0.091 dB, 0.058 dB, 117 ps/nm, and 0.38 ps, and the monitoring window ranges over 0 to 20 dB, 10 to 30 dB, 0 to 51000 ps/nm, and 0 to 100 ps, respectively. Our proposed method achieves accurate estimation of linear and nonlinear impairments over a broad range, representing a significant advancement in the field of optical performance monitoring (OPM).

Submitted 25 August, 2023; originally announced August 2023.

Comments: 15 pages, 10 figures

arXiv:2308.06873 [pdf, other]

SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

Authors: Xiaofei Wang, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, Takuya Yoshioka

Abstract: Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks involving transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, dealing with both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, enabling unified and extensible modeling and providing a consistent way for leveraging textual input in speech enhancement and transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving comparable or superior performance to specialized models across tasks. See https://aka.ms/speechx for demo samples.

Submitted 25 June, 2024; v1 submitted 13 August, 2023; originally announced August 2023.

Comments: To appear in TASLP. See https://aka.ms/speechx for demo samples

arXiv:2308.04683 [pdf, other]

Real-time FPGA Implementation of CNN-based Distributed Fiber Optic Vibration Event Recognition Method

Authors: Zhongyao Luo, Zhao Ge, Hao Wu, Ming Tang

Abstract: Utilizing optical fibers to detect and pinpoint vibrations, Distributed Optical Fiber Vibration Sensing (DVS) technology provides real-time monitoring and surveillance of wide-reaching areas. This field has been leveraging Convolutional Neural Networks (CNN). Recently, a study has accomplished end-to-end vibration event recognition, enabling the use of CNN-based DVS algorithms as real-time embedded systems for edge computing in practical application situations. Considering the power consumption of the central processing unit (CPU) and graphics processing unit (GPU), and the inflexibility of the application-specific integrated circuit (ASIC), the field-programmable gate array (FPGA) is the optimal computing platform for the system. This paper proposes to compress a pre-trained network and adopt a novel hardware structure to design a fully on-chip, pipelined inference accelerator for a CNN-based DVS algorithm, without fine-tuning or re-training. This design allows for real-time processing with low power consumption and low system requirements. The design has been evaluated on an existing DVS algorithm based on a 40-layer CNN model comprising 2.7 million parameters. It is completely implemented on-chip, pipelined, with no reduction in accuracy.

Submitted 8 August, 2023; originally announced August 2023.

Comments: 5 pages, 6 figures

arXiv:2308.04013 [pdf, other]

Distributed Target Tracking with Fading Channels over Underwater Wireless Sensor Networks

Authors: Miaoyi Tang, Meiqin Liu, Senlin Zhang, Ronghao Zheng, Shanling Dong

Abstract: This paper investigates the problem of distributed target tracking via underwater wireless sensor networks (UWSNs) with fading channels. The degradation of signal quality due to wireless channel fading can significantly impact network reliability and subsequently reduce the tracking accuracy. To address this issue, we propose a modified distributed unscented Kalman filter (DUKF) named DUKF-Fc, which takes into account the effects of measurement fluctuation and transmission failure induced by channel fading. The channel estimation error is also considered when designing the estimator and a sufficient condition is established to ensure the stochastic boundedness of the estimation error. The proposed filtering scheme is versatile and possesses wide applicability to numerous real-world scenarios, e.g., tracking a maneuvering underwater target with acoustic sensors. Simulation results demonstrate the effectiveness of the proposed filtering algorithm. In addition, considering the constraints of network energy resources, the issue of investigating a trade-off between tracking performance and energy consumption is discussed accordingly.

Submitted 7 August, 2023; originally announced August 2023.

Comments: 12 pages, 6 figures, 6 tables

arXiv:2304.12783 [pdf, other]

On the Use of Singular Value Decomposition as a Clutter Filter for Ultrasound Flow Imaging

Authors: Kai Riemer, Marcelo Lerendegui, Matthieu Toulemonde, Jiaqi Zhu, Christopher Dunsby, Peter D. Weinberg, Meng-Xing Tang

Abstract: Filtering based on Singular Value Decomposition (SVD) provides substantial separation of clutter, flow and noise in high frame rate ultrasound flow imaging. The use of SVD as a clutter filter has greatly improved techniques such as vector flow imaging, functional ultrasound and super-resolution ultrasound localization microscopy. The removal of clutter and noise relies on the assumption that tissue, flow and noise are each represented by different subsets of singular values, so that their signals are uncorrelated and lie in orthogonal subspaces. This assumption fails in the presence of tissue motion, for near-wall or microvascular flow, and can be influenced by an incorrect choice of singular value thresholds. Consequently, separation of flow, clutter and noise is imperfect, which can lead to image artefacts not present in the original data. Temporal and spatial fluctuations in intensity are the commonest artefacts, which vary in appearance and strength. Ghosting and splitting artefacts are observed in the microvasculature where the flow signal is sparsely distributed. Singular value threshold selection, tissue motion, frame rate, flow signal amplitude and acquisition length affect the prevalence of these artefacts. Understanding what causes artefacts due to SVD clutter and noise removal is necessary for their interpretation.

Submitted 25 April, 2023; originally announced April 2023.

Comments: 10 pages, 7 figures

arXiv:2304.00819 [pdf, other]

Acceleration-Based Kalman Tracking for Super-Resolution Ultrasound Imaging in vivo

Authors: Biao Huang, Jipeng Yan, Megan Morris, Victoria Sinnett, Navita Somaiah, Meng-Xing Tang

Abstract: Super-resolution ultrasound can image microvascular structure and flow at sub-wave-diffraction resolution based on localising and tracking microbubbles. Currently, tracking microbubbles accurately under limited imaging frame rates and high microbubble concentrations remains a challenge, especially under the effect of cardiac pulsatility and in highly curved vessels. In this study, an acceleration-incorporated microbubble motion model is introduced into a Kalman tracking framework. The tracking performance was evaluated using simulated microvasculature with different microbubble motion parameters and acquisition frame rates, and in vivo human breast tumour ultrasound datasets. The simulation results show that the acceleration-based method outperformed the non-acceleration-based method at different levels of acceleration and acquisition frame rates and achieved significant improvement in true positive rate (up to 10.03%), false negative rate (up to 28.61%) and correctly pairing fraction (up to 170.14%). The proposed method can also reduce errors in vasculature reconstruction via the acceleration-based nonlinear interpolation, compared with linear interpolation (up to 19 µm). The tracking results from temporally downsampled low frame rate in vivo datasets from human breast tumours show that the proposed method has better microbubble tracking performance than the baseline method, if using results from the initial high frame data as reference. Finally, the acceleration estimated from tracking results also provides a spatial speed gradient map that may contain extra valuable diagnostic information.

Submitted 3 April, 2023; originally announced April 2023.

Comments: 15 pages, 10 figures

arXiv:2303.14003 [pdf]

Transthoracic super-resolution ultrasound localisation microscopy of myocardial vasculature in patients

Authors: Jipeng Yan, Biao Huang, Johanna Tonko, Matthieu Toulemonde, Joseph Hansen-Shearer, Qingyuan Tan, Kai Riemer, Konstantinos Ntagiantas, Rasheda A Chowdhury, Pier Lambiase, Roxy Senior, Meng-Xing Tang

Abstract: Micro-vascular flow in the myocardium is of significant importance clinically but remains poorly understood. Up to 25% of patients with symptoms of coronary heart diseases have no obstructive coronary arteries and have suspected microvascular diseases. However, such microvasculature is difficult to image in vivo with existing modalities due to the lack of resolution and sensitivity. Here, we demonstrate the feasibility of transthoracic super-resolution ultrasound localisation microscopy (SRUS/ULM) of myocardial microvasculature and hemodynamics in a large animal model and in patients, using a cardiac phased array probe with a customised data acquisition and processing pipeline. A multi-level motion correction strategy was proposed. A tracking framework incorporating multiple features and automatic parameter initialisations was developed to reconstruct microcirculation. In two patients with impaired myocardial function, we have generated SRUS images of myocardial vascular structure and flow with a resolution that is beyond the wave-diffraction limit (half a wavelength), using data acquired within a breath hold. Myocardial SRUS/ULM has potential to improve the understanding of myocardial microcirculation and the management of patients with cardiac microvascular diseases.

Submitted 28 March, 2023; v1 submitted 24 March, 2023; originally announced March 2023.

Comments: 22 pages, 10 figures

arXiv:2303.11510 [pdf, other]

ICASSP 2023 Deep Noise Suppression Challenge

Authors: Harishchandra Dubey, Ashkan Aazami, Vishak Gopal, Babak Naderi, Sebastian Braun, Ross Cutler, Alex Ju, Mehdi Zohourian, Min Tang, Hannes Gamper, Mehrsa Golestaneh, Robert Aichner

Abstract: The Deep Speech Enhancement Challenge is the 5th edition of the deep noise suppression (DNS) challenges, organized at the ICASSP 2023 Signal Processing Grand Challenges. DNS challenges were organized during 2019-2023 to stimulate research in deep speech enhancement (DSE). Previous DNS challenges were organized at INTERSPEECH 2020, ICASSP 2021, INTERSPEECH 2021, and ICASSP 2022. From prior editions, we learnt that improving signal quality (SIG) is challenging, particularly in the presence of simultaneously active interfering talkers and noise. This challenge aims to develop models for joint denoising, dereverberation and suppression of interfering talkers. When the primary talker wears a headphone, certain acoustic properties of their speech, such as direct-to-reverberation ratio (DRR) and signal-to-noise ratio (SNR), make it possible to suppress neighboring talkers even without enrollment data for the primary talker. This motivated us to create two tracks for this challenge: (i) Track-1 Headset; (ii) Track-2 Speakerphone. Both tracks have fullband (48 kHz) training data and test sets, and each test clip has corresponding enrollment data (10-30 s duration) for the primary talker. Each track invited submissions of personalized and non-personalized models, all of which are evaluated through the same subjective evaluation. Most models submitted to the challenge were personalized models. The same team won both tracks, with the best models improving the challenge's Score by 0.145 and 0.141 over the noisy blind test set.

Submitted 8 May, 2023; v1 submitted 20 March, 2023; originally announced March 2023.

Comments: 6 pages, 1 figure. arXiv admin note: text overlap with arXiv:2202.13288

arXiv:2303.07005 [pdf, other]

Real-Time Audio-Visual End-to-End Speech Enhancement

Authors: Zirun Zhu, Hemin Yang, Min Tang, Ziyi Yang, Sefik Emre Eskimez, Huaming Wang

Abstract: Audio-visual speech enhancement (AV-SE) methods utilize auxiliary visual cues to enhance speakers' voices. Therefore, technically they should be able to outperform the audio-only speech enhancement (SE) methods. However, there are few works in the literature on an AV-SE system that can work in real time on a CPU. In this paper, we propose a low-latency real-time audio-visual end-to-end enhancement (AV-E3Net) model based on the recently proposed end-to-end enhancement network (E3Net). Our main contribution includes two aspects: 1) We employ a dense connection module to solve the performance degradation caused by the deep model structure. This module significantly improves the model's performance on the AV-SE task. 2) We propose a multi-stage gating-and-summation (GS) fusion module to merge audio and visual cues. Our results show that the proposed model provides better perceptual quality and intelligibility than the baseline E3Net model with a negligible computational cost increase.

Submitted 13 March, 2023; originally announced March 2023.

Comments: Accepted by ICASSP 2023

arXiv:2211.09988 [pdf, ps, other]

Exploring WavLM on Speech Enhancement

Authors: Hyungchan Song, Sanyuan Chen, Zhuo Chen, Yu Wu, Takuya Yoshioka, Min Tang, Jong Won Shin, Shujie Liu

Abstract: There has been a surge of interest in self-supervised learning approaches for end-to-end speech encoding in recent years, as they have achieved great success. In particular, WavLM has shown state-of-the-art performance on various speech processing tasks. To better understand the efficacy of self-supervised learning models for speech enhancement, in this work, we design and conduct a series of experiments with three resource conditions by combining WavLM and two high-quality speech enhancement systems. Also, we propose a regression-based WavLM training objective and a noise-mixing data configuration to further boost the downstream enhancement performance. The experiments on the DNS challenge dataset and a simulation dataset show that WavLM benefits the speech enhancement task in terms of both speech quality and speech recognition accuracy, especially for low fine-tuning resources. For the high fine-tuning resource condition, only the word error rate is substantially improved.

Submitted 17 November, 2022; originally announced November 2022.

Comments: Accepted by IEEE SLT 2022

arXiv:2211.02773 [pdf, other]

Real-Time Joint Personalized Speech Enhancement and Acoustic Echo Cancellation

Authors: Sefik Emre Eskimez, Takuya Yoshioka, Alex Ju, Min Tang, Tanel Parnamaa, Huaming Wang

Abstract: Personalized speech enhancement (PSE) is a real-time SE approach utilizing a speaker embedding of a target person to remove background noise, reverberation, and interfering voices. To deploy a PSE model for full duplex communications, the model must be combined with acoustic echo cancellation (AEC), although such a combination has been less explored. This paper proposes a series of methods that are applicable to various model architectures to develop efficient causal models that can handle the tasks of PSE, AEC, and joint PSE-AEC. We present extensive evaluation results using both simulated data and real recordings, covering various acoustic conditions and evaluation metrics. The results show the effectiveness of the proposed methods for two different model architectures. Our best joint PSE-AEC model comes close to the expert models optimized for individual tasks of PSE and AEC in their respective scenarios and significantly outperforms the expert models for the combined PSE-AEC task.

Submitted 25 May, 2023; v1 submitted 4 November, 2022; originally announced November 2022.

Comments: Accepted to Interspeech 2023

arXiv:2211.00754 [pdf, other]

BUbble Flow Field: a Simulation Framework for Evaluating Ultrasound Localization Microscopy Algorithms

Authors: Marcelo Lerendegui, Kai Riemer, Bingxue Wang, Christopher Dunsby, Meng-Xing Tang

Abstract: Ultrasound contrast enhanced imaging has seen widespread uptake in research and clinical diagnostic imaging. This includes applications such as vector flow imaging, functional ultrasound and super-resolution Ultrasound Localization Microscopy (ULM). All of these require testing and validation during development of new algorithms with ground truth data. In this work we present a comprehensive simulation platform, BUbble Flow Field (BUFF), that generates contrast enhanced ultrasound images in vascular tree geometries with realistic flow characteristics and validation algorithms for ULM. BUFF allows complex micro-vascular network generation of random and user-defined vascular networks. Blood flow is simulated with a fast Computational Fluid Dynamics (CFD) solver and allows arbitrary input and output positions and custom pressures. The acoustic field simulation is combined with non-linear Microbubble (MB) dynamics and simulates a range of point spread functions based on user-defined MB characteristics. The validation combines both binary and quantitative metrics. BUFF's capacity to generate and validate user-defined networks is demonstrated through its implementation in the Ultrasound Localisation and TRacking Algorithms for Super Resolution (ULTRA-SR) Challenge at the International Ultrasonics Symposium (IUS) 2022 of the Institute of Electrical and Electronics Engineers (IEEE). The ability to produce ULM images, and the availability of a ground truth in localisation and tracking, enables objective and quantitative evaluation of the large number of localisation and tracking algorithms developed in the field. BUFF can also benefit deep learning based methods by automatically generating datasets for training. BUFF is a fully comprehensive simulation platform for testing and validation of novel ULM techniques and is open source.

Submitted 1 November, 2022; originally announced November 2022.

Comments: 10 Pages, 9 Figures

arXiv:2210.00801 [pdf, other]

Design of the PID temperature controller for an alkaline electrolysis system with time delays

Authors: Ruomei Qi, Jiarong Li, Jin Lin, Yonghua Song, Jiepeng Wang, Qiangqiang Cui, Yiwei Qiu, Ming Tang, Jian Wang

Abstract: Electrolysis systems use proportional-integral-derivative (PID) temperature controllers to maintain stack temperatures around set points. However, heat transfer delays in electrolysis systems cause manual tuning of PID temperature controllers to be time-consuming, and temperature oscillations often occur. This paper focuses on the design of the PID temperature controller for an alkaline electrolysis system to achieve fast and stable temperature control. A thermal dynamic model of an electrolysis system is established in the frequency-domain for controller designs. Based on this model, the temperature stability is analysed by the root distribution, and the PID parameters are optimized considering both the temperature overshoot and the settling time. The performance of the optimal PID controllers is verified through experiments. Furthermore, the simulation results show that the before-stack temperature should be used as the feedback variable for small lab-scale systems to suppress stack temperature fluctuations, and the after-stack temperature should be used for larger systems to improve the economy. This study is helpful in ensuring the temperature stability and control of electrolysis systems.

Submitted 3 October, 2022; originally announced October 2022.

arXiv:2209.10382 [pdf, other]

Robust Information Bottleneck for Task-Oriented Communication with Digital Modulation

Authors: Songjie Xie, Shuai Ma, Ming Ding, Yuanming Shi, Mingjian Tang, Youlong Wu

Abstract: Task-oriented communications, mostly using learning-based joint source-channel coding (JSCC), aim to design a communication-efficient edge inference system by transmitting task-relevant information to the receiver. However, only transmitting task-relevant information without introducing any redundancy may cause robustness issues in learning due to the channel variations, and the JSCC which directly maps the source data into continuous channel input symbols poses compatibility issues on existing digital communication systems. In this paper, we address these two issues by first investigating the inherent tradeoff between the informativeness of the encoded representations and the robustness to information distortion in the received representations, and then proposing a task-oriented communication scheme with digital modulation, named discrete task-oriented JSCC (DT-JSCC), where the transmitter encodes the features into a discrete representation and transmits it to the receiver with the digital modulation scheme. In the DT-JSCC scheme, we develop a robust encoding framework, named robust information bottleneck (RIB), to improve the communication robustness to the channel variations, and derive a tractable variational upper bound of the RIB objective function using the variational approximation to overcome the computational intractability of mutual information. The experimental results demonstrate that the proposed DT-JSCC achieves better inference performance than the baseline methods with low communication latency, and exhibits robustness to channel variations due to the applied RIB framework.

Submitted 9 May, 2023; v1 submitted 21 September, 2022; originally announced September 2022.

arXiv:2208.12176 [pdf, other]

doi 10.1109/TBME.2023.3263369

3D Super-Resolution Ultrasound with Adaptive Weight-Based Beamforming

Authors: Jipeng Yan, Bingxue Wang, Kai Riemer, Joseph Hansen-Shearer, Marcelo Lerendegui, Matthieu Toulemonde, Christopher J Rowlands, Peter D. Weinberg, Meng-Xing Tang

Abstract: Super-resolution ultrasound (SRUS) imaging through localising and tracking sparse microbubbles has been shown to reveal microvascular structure and flow beyond the wave diffraction limit. Most SRUS studies use standard delay and sum (DAS) beamforming, where a large main lobe and significant side lobes make separation and localisation of densely distributed bubbles challenging, particularly in 3D due to the typically small aperture of matrix array probes. This study aims to improve 3D SRUS by implementing a low-cost 3D coherence beamformer based on channel signal variance, as well as two other adaptive weight-based coherence beamformers: nonlinear beamforming with p-th root compression and coherence factor. The 3D coherence beamformers, together with DAS, are compared in computer simulation, on a microflow phantom, and in vivo. Simulation results demonstrate that the adaptive weight-based beamformers can significantly narrow the main lobe and suppress the side lobes for modest computational cost. Significantly improved 3D SR images of a microflow phantom and a rabbit kidney are obtained through the adaptive weight-based beamformers. The proposed variance-based beamformer performs best in simulations and experiments.

Submitted 25 August, 2022; originally announced August 2022.

Comments: Ultrasound localisation microscopy (ULM), super-resolution, contrast-enhanced ultrasound, 3D beamforming

arXiv:2207.07303 [pdf, other]

doi 10.1007/978-3-030-92273-3_45

Towards Better Dermoscopic Image Feature Representation Learning for Melanoma Classification

Authors: ChengHui Yu, MingKang Tang, ShengGe Yang, MingQing Wang, Zhe Xu, JiangPeng Yan, HanMo Chen, Yu Yang, Xiao-Jun Zeng, Xiu Li

Abstract: Deep learning-based melanoma classification with dermoscopic images has recently shown great potential in automatic early-stage melanoma diagnosis. However, limited by significant data imbalance and obvious extraneous artifacts, i.e., hair and ruler markings, discriminative feature extraction from dermoscopic images is very challenging. In this study, we seek to resolve these problems respectively, towards better representation learning for lesion features. Specifically, a GAN-based data augmentation (GDA) strategy is adapted to generate synthetic melanoma-positive images, in conjunction with the proposed implicit hair denoising (IHD) strategy, wherein the hair-related representations are implicitly disentangled via an auxiliary classifier network and reversely sent to the melanoma-feature extraction backbone for better melanoma-specific representation learning. Furthermore, to train the IHD module, the hair noises are additionally labeled on the ISIC2020 dataset, making it the first large-scale dermoscopic dataset with annotation of hair-like artifacts. Extensive experiments demonstrate the superiority of the proposed framework as well as the effectiveness of each component. The improved dataset is publicly available at https://github.com/kirtsy/DermoscopicDataset.

Submitted 15 July, 2022; originally announced July 2022.

Comments: ICONIP 2021 conference

arXiv:2206.03912 [pdf, other]

Volumetric Image Projection Super-Resolution Ultrasound (VIP-SR) with a 1D Unfocused Linear Array

Authors: B. Wang, K. Riemer, M. Toulemonde, J. Yan, X. Zhou, M. Tang

Abstract: Super-Resolution Ultrasound (SRUS) through localizing spatially isolated microbubbles has been demonstrated to overcome the wave diffraction limit and reveal the microvascular structure and flow information at the microscopic scale. However, 3D SRUS imaging remains a challenge due to the fabrication and computational complexity of 2D matrix array probes and connections. Inspired by X-ray radiography, which can present volumetric information in a single projection image with much simpler hardware than X-ray CT, this study investigates the feasibility of volumetric image projection super-resolution (VIP-SR) ultrasound using a 1D unfocused linear array. Both simulation and experiments were conducted on 3D microvessel phantoms using a 1D linear array with or without an elevational focus, and a 2D matrix array as the reference. Results show that VIP-SR, using an unfocused 1D array probe, can capture significantly more volumetric information than the conventional 1D elevational focused probe. Compared with the 2D projection image of the full 3D SRUS results using the 2D array probe with the same aperture size, VIP-SR has similar volumetric coverage using 32-fold fewer independent elements. The impact of bubble concentration and vascular density on VIP-SR ultrasound was also investigated. This study demonstrates the ability of VIP-SR to provide high-resolution volumetric imaging of microvascular structures at significantly reduced cost.

Submitted 8 June, 2022; originally announced June 2022.

Comments: 19 pages, 9 figures

arXiv:2203.09461 [pdf]

Beyond the Limitation of Pulse Width in Optical Time-domain Reflectometry

Authors: Hao Wu, Ming Tang

Abstract: Optical time-domain reflectometry (OTDR) is the basis for distributed time-domain optical fiber sensing techniques. By injecting pulsed light into an optical fiber, the distance information of an event can be obtained based on the time of flight of the light. The minimum distinguishable event separation along the fiber length is called the spatial resolution, which is determined by the optical pulse width. By reducing the pulse width, the spatial resolution can be improved. However, at the same time, the signal-to-noise ratio of the system is degraded, and higher speed equipment is required. To solve this problem, data processing methods such as iterative subdivision, deconvolution, and neural networks have been proposed. However, they all have some shortcomings and thus have not been widely applied. Here, we propose and experimentally demonstrate an OTDR deconvolution neural network based on deep convolutional neural networks. A simplified OTDR model is built to generate a large amount of training data. By optimizing the network structure and training data, an effective OTDR deconvolution is achieved. The simulation and experimental results show that the proposed neural network can achieve more accurate deconvolution than the conventional deconvolution algorithm with a higher signal-to-noise ratio.

Submitted 13 March, 2022; originally announced March 2022.

arXiv:2203.04263 [pdf, other]

doi 10.1109/TMI.2022.3223554

Fast and selective super-resolution ultrasound in vivo with sono-switchable nanodroplets

Authors: Kai Riemer, Matthieu Toulemonde, Jipeng Yan, Marcelo Lerendegui, Eleanor Stride, Peter D. Weinberg, Christopher Dunsby, Meng-Xing Tang

Abstract: Perfusion by the microcirculation is key to the development, maintenance and pathology of tissue. Its measurement with high spatiotemporal resolution is consequently valuable but remains a challenge in deep tissue. Ultrasound Localization Microscopy (ULM) provides very high spatiotemporal resolution but the use of microbubbles requires low contrast agent concentrations, a long acquisition time, and gives little control over the spatial and temporal distribution of the bubbles. The present study is the first to demonstrate Acoustic Wave Sparsely-Activated Localization Microscopy (AWSALM) and fast-AWSALM for in vivo super-resolution ultrasound imaging, offering contrast on demand and vascular selectivity. Three different formulations of sono-switchable contrast agents were tested. We demonstrate their use with ultrasound mechanical indices well within recommended safety limits to enable fast on-demand sparse switching at very high agent concentrations. We produce super-localization maps of the rabbit renal vasculature with acquisition times between 5.5 s and 0.25 s, and a 4-fold improvement in spatial resolution. We present the unique selectivity of AWSALM in visualizing specific vascular branches and downstream microvasculature, and we show super-localized kidney structures in systole and diastole with fast-AWSALM. In conclusion, we demonstrate the feasibility of fast and selective measurement of microvascular dynamics in vivo with subwavelength resolution using ultrasound and sono-switchable nanodroplets.

Submitted 8 March, 2022; originally announced March 2022.

Comments: phase-change contrast agent, low-boiling point nanodroplet, acoustic vaporization, droplet activation, microcirculation, contrast enhanced ultrasound, plane wave

arXiv:2202.13422 [pdf, other]

Thermal Modelling and Controller Design of an Alkaline Electrolysis System under Dynamic Operating Conditions

Authors: Ruomei Qi, Jiarong Li, Jin Lin, Yonghua Song, Jiepeng Wang, Qiangqiang Cui, Yiwei Qiu, Ming Tang, Jian Wang

Abstract: Thermal management is vital for the efficient and safe operation of alkaline electrolysis systems. Traditional alkaline electrolysis systems use simple proportional-integral-derivative (PID) controllers to maintain the stack temperature near the rated value. However, in renewable-to-hydrogen scenarios, the stack temperature is disturbed by load fluctuations, and temperature overshoot occurs, which can exceed the upper limit and harm the stack. This paper focuses on the thermal modelling and controller design of an alkaline electrolysis system under dynamic operating conditions. A control-oriented thermal model is established in the form of a third-order time-delay process, which is used for simulation and controller design. Based on this model, we propose two novel controllers to reduce temperature overshoot: one is a current feed-forward PID controller (PID-I), the other is a model predictive controller (MPC). Their performance is tested on a lab-scale system and the experimental results are satisfactory: the temperature overshoot is reduced by 2.2 degrees with the PID-I controller, and no obvious overshoot is observed with the MPC controller. Furthermore, the thermal dynamic performance of an MW-scale alkaline electrolysis system is analyzed by simulation, which shows that the temperature overshoot phenomenon is more general in large systems. The proposed method allows for higher temperature set points, which can improve system efficiency by 1%.

Submitted 27 February, 2022; originally announced February 2022.

arXiv:2110.05745 [pdf, other]

VarArray: Array-Geometry-Agnostic Continuous Speech Separation

Authors: Takuya Yoshioka, Xiaofei Wang, Dongmei Wang, Min Tang, Zirun Zhu, Zhuo Chen, Naoyuki Kanda

Abstract: Continuous speech separation using a microphone array was shown to be promising in dealing with the speech overlap problem in natural conversation transcription. This paper proposes VarArray, an array-geometry-agnostic speech separation neural network model. The proposed model is applicable to any number of microphones without retraining while leveraging the nonlinear correlation between the input channels. The proposed method adapts different elements that were proposed before separately, including transform-average-concatenate, conformer speech separation, and inter-channel phase differences, and combines them in an efficient and cohesive way. Large-scale evaluation was performed with two real meeting transcription tasks by using a fully developed transcription system requiring no prior knowledge such as reference segmentations, which allowed us to measure the impact that the continuous speech separation system could have in realistic settings. The proposed model outperformed a previous approach to array-geometry-agnostic modeling for all of the geometry configurations considered, achieving asclite-based speaker-agnostic word error rates of 17.5% and 20.4% for the AMI development and evaluation sets, respectively, in the end-to-end setting using no ground-truth segmentations.

Submitted 26 October, 2021; v1 submitted 12 October, 2021; originally announced October 2021.

Comments: 5 pages, 1 figure, 3 tables, submitted to ICASSP 2022; updated reference information of [33]

arXiv:2110.03345 [pdf, other]

doi 10.1016/j.cmpb.2022.106855

Stride: a flexible platform for high-performance ultrasound computed tomography

Authors: Carlos Cueto, Oscar Bates, George Strong, Javier Cudeiro, Fabio Luporini, Oscar Calderon Agudo, Gerard Gorman, Lluis Guasch, Meng-Xing Tang

Abstract: Advanced ultrasound computed tomography techniques like full-waveform inversion are mathematically challenging and orders of magnitude more computationally expensive than conventional ultrasound imaging methods. This computational and algorithmic complexity, and a lack of open-source libraries in this field, represent a barrier preventing the generalised adoption of these techniques, slowing the pace of research and hindering reproducibility. Consequently, we have developed Stride, an open-source Python library for the solution of large-scale ultrasound tomography problems. On one hand, Stride provides high-level interfaces and tools for expressing the types of optimisation problems encountered in medical ultrasound tomography. On the other, these high-level abstractions seamlessly integrate with high-performance wave-equation solvers and with scalable parallelisation routines. The wave-equation solvers are generated automatically using Devito, a domain specific language, and the parallelisation routines are provided through the custom actor-based library Mosaic. Through a series of examples, we show how Stride can handle realistic tomographic problems, in 2D and 3D, providing intuitive and flexible interfaces that scale from a local multi-processing environment to a multi-node high-performance cluster.

Submitted 18 May, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

Journal ref: Computer Methods and Programs in Biomedicine, 221, 2022

arXiv:2109.10349 [pdf]

Enabling variable high spatial resolution retrieval from a long pulse BOTDA sensor

Authors: Zhao Ge, Li Shen, Can Zhao, Hao Wu, Zhiyong Zhao, Ming Tang

Abstract: In the field of the Internet of Things, there is an urgent need for sensors with large-scale sensing capability for scenarios such as intelligent monitoring of production lines and urban infrastructure. Brillouin optical time domain analysis (BOTDA) sensors, which can monitor thousands of continuous points simultaneously, show great advantages in these applications. We propose a convolutional neural network (CNN) to process the data of conventional BOTDA sensors, which achieves an unprecedented performance improvement that allows higher spatial resolution (SR) to be retrieved directly from a sensing system that uses long pump pulses. By using the simulated Brillouin gain spectra (BGSs) as the CNN input and the corresponding high-SR Brillouin frequency shift (BFS) as the output target, the trained CNN is able to obtain an SR higher than the theoretical value determined by the pump pulse width. In the experiment, the CNN accurately retrieves 0.5-m hotspots from the measured BGS with pump pulses from 20 to 50 ns, and the acquired BFS is in good agreement with 45/40 ns differential pulse-width pair (DPP) measurement results. Compared with the DPP technique, the proposed CNN demonstrates a 2-fold improvement in BFS uncertainty with only half the measurement time. In addition, by changing the training datasets, the proposed CNN can obtain tunable high-SR retrieval based on conventional BOTDA sensors that use long pulses, without requiring any hardware modifications. The proposed data post-processing approach paves the way for novel high spatial resolution BOTDA sensors, bringing substantial improvement over state-of-the-art techniques in terms of system complexity, measurement time, and reliability.

Submitted 8 September, 2021; originally announced September 2021.

Comments: 7 pages, 6 figures

MSC Class: 78A15; ACM Class: I.2.1

arXiv:2108.05096 [pdf]

doi 10.1364/OL.440660

Omnidirectional ghost imaging system and unwrapping-free panoramic ghost imaging

Authors: Huan Cui, Jie Cao, Qun Hao, Dong Zhou, Mingyuan Tang, Kaiyu Zhang, Yingqiang Zhang

Abstract: Ghost imaging (GI) is a novel imaging method that reconstructs object information from light intensity correlation measurements. However, at present, the field of view (FOV) is limited to the illuminating range of the light patterns. To enlarge the FOV of GI efficiently, we propose the omnidirectional ghost imaging system (OGIS), which can achieve a 360° omnidirectional FOV in a single shot simply by adding a curved mirror. Moreover, by designing retina-like annular patterns with log-polar patterns, OGIS can obtain unwrapping-free, undistorted panoramic images with uniform resolution, which opens up a new way for the application of GI.

Submitted 11 August, 2021; originally announced August 2021.

arXiv:2106.02896 [pdf, other]

Human Listening and Live Captioning: Multi-Task Training for Speech Enhancement

Authors: Sefik Emre Eskimez, Xiaofei Wang, Min Tang, Hemin Yang, Zirun Zhu, Zhuo Chen, Huaming Wang, Takuya Yoshioka

Abstract: With the surge of online meetings, it has become more critical than ever to provide high-quality speech audio and live captioning under various noise conditions. However, most monaural speech enhancement (SE) models introduce processing artifacts and thus degrade the performance of downstream tasks, including automatic speech recognition (ASR). This paper proposes a multi-task training framework to make the SE models unharmful to ASR. Because most ASR training samples do not have corresponding clean signal references, we alternately perform two model update steps called SE-step and ASR-step. The SE-step uses clean and noisy signal pairs and a signal-based loss function. The ASR-step applies a pre-trained ASR model to training signals enhanced with the SE model. A cross-entropy loss between the ASR output and reference transcriptions is calculated to update the SE model parameters. Experimental results with realistic large-scale settings using ASR models trained on 75,000-hour data show that the proposed framework improves the word error rate for the SE output by 11.82% with little compromise in the SE quality. Performance analysis is also carried out by changing the ASR model, the data used for the ASR-step, and the schedule of the two update steps.

Submitted 5 June, 2021; originally announced June 2021.

Comments: Accepted to INTERSPEECH2021

arXiv:2103.10722 [pdf, other]

doi 10.1109/TUFFC.2021.3104342

Spatial response identification enables robust experimental ultrasound computed tomography

Authors: Carlos Cueto, Lluis Guasch, Javier Cudeiro, Oscar Calderon Agudo, Oscar Bates, George Strong, Meng-Xing Tang

Abstract: Ultrasound computed tomography techniques have the potential to provide clinicians with 3D, quantitative and high-resolution information of both soft and hard tissues such as the breast or the adult human brain. Their practical application requires accurate modelling of the acquisition setup: the spatial location, orientation, and impulse response of each ultrasound transducer. However, existing calibration methods fail to accurately characterise these transducers unless their size can be considered negligible when compared to the dominant wavelength, which reduces signal-to-noise ratios below usable levels in the presence of high-contrast tissues such as the skull. In this paper, we introduce a methodology that can simultaneously estimate the location, orientation, and impulse response of the ultrasound transducers in a single calibration. We do this by extending spatial response identification, an algorithm that we have recently proposed to estimate transducer impulse responses. Our proposed methodology replaces the transducers in the acquisition device with a surrogate model whose effective response matches the experimental data by fitting a numerical model of wave propagation. This results in a flexible and robust calibration procedure that can accurately predict the behaviour of the ultrasound acquisition device without ever having to know where the real transducers are or their individual impulse response. Experimental results using a ring acquisition system show that spatial response identification produces calibrations of significantly higher quality than standard methodologies across all transducers, both in transmission and in reception. Experimental full-waveform inversion reconstructions of a tissue-mimicking phantom demonstrate that spatial response identification generates more accurate reconstructions than those produced with standard calibration techniques.

Submitted 5 April, 2021; v1 submitted 19 March, 2021; originally announced March 2021.

Journal ref: IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 69 (1) 27-37, 2022

arXiv:2102.07746 [pdf, other]

High contrast Ultrafast 3D Ultrasound Imaging using Row Column specific Frame Multiply and Sum

Authors: Joseph Hansen-Shearer, Marcelo Lerendegui, Matthieu Toulemonde, Meng-Xing Tang

Abstract: Row-column arrays have been shown to be able to generate 3D ultrafast ultrasound images with an order of magnitude fewer independent electronic channels than classic 2D matrix arrays. Unfortunately, row-column array images suffer from major imaging artefacts due to the high side lobes. This paper proposes a row-column specific beamforming technique that exploits the incoherent nature of certain row-column array artefacts. The geometric mean of the data from each row and column pair is taken prior to summation in beamforming, thus drastically reducing incoherent imaging artefacts compared to traditional coherent compounding. The effectiveness of this technique was demonstrated in silico, and the results show an average fivefold reduction in side-lobe levels. Significantly improved contrast was demonstrated, with tissue-to-noise ratio increasing from $\sim$10 dB to $\sim$30 dB and tissue contrast ratio increasing from $\sim$21 dB to $\sim$42 dB when using the proposed new method compared to delay and sum. These new techniques allowed for high-quality 3D imaging whilst maintaining high frame rate potential.

Submitted 15 February, 2021; originally announced February 2021.

arXiv:2102.04799 [pdf, other]

Multi-scale GCN-assisted two-stage network for joint segmentation of retinal layers and disc in peripapillary OCT images

Authors: Jiaxuan Li, Peiyao Jin, Jianfeng Zhu, Haidong Zou, Xun Xu, Min Tang, Minwen Zhou, Yu Gan, Jiangnan He, Yuye Ling, Yikai Su

Abstract: An accurate and automated tissue segmentation algorithm for retinal optical coherence tomography (OCT) images is crucial for the diagnosis of glaucoma. However, due to the presence of the optic disc, the anatomical structure of the peripapillary region of the retina is complicated and is challenging for segmentation. To address this issue, we developed a novel graph convolutional network (GCN)-assisted two-stage framework to simultaneously label the nine retinal layers and the optic disc. Specifically, a multi-scale global reasoning module is inserted between the encoder and decoder of a U-shape neural network to exploit anatomical prior knowledge and perform spatial reasoning. We conducted experiments on human peripapillary retinal OCT images. The Dice score of the proposed segmentation network is 0.820$\pm$0.001 and the pixel accuracy is 0.830$\pm$0.002, both of which outperform those from other state-of-the-art techniques.

Submitted 9 February, 2021; originally announced February 2021.

arXiv:2011.05122 [pdf]

Scannerless non-line-of-sight three dimensional imaging with a 32x32 SPAD array

Authors: Chenfei Jin, Meng Tang, Legeng Jia, Xiaorui Tian, Jie Yang, Kai Qiao, Siqi Zhang

Abstract: We develop a scannerless non-line-of-sight three-dimensional imaging system based on a commercial 32x32 SPAD camera combined with a 70 ps pulsed laser. In our experiment, 1024 time histograms can be obtained synchronously in 3 s with an average time resolution of about 165 ps. The result with filtered back projection shows a discernible reconstruction, while the result using the virtual wave field demonstrates better quality, similar to that achieved by earlier scanning imaging systems with a single-pixel SPAD. Comparatively, our system has large potential advantages in frame frequency, power requirements, compactness and robustness. The research results will pave a path for scannerless non-line-of-sight three-dimensional imaging applications.

Submitted 10 November, 2020; originally announced November 2020.

Comments: 10 pages, 8 figures

arXiv:2009.08804 [pdf]

doi 10.1109/JLT.2020.3047504

Improving the spatial resolution of a BOTDA sensor using deconvolution algorithm

Authors: Li Shen, Zhiyong Zhao, Can Zhao, Hao Wu, Chao Lu, Ming Tang

Abstract: A spatial resolution improvement method based on the total variation deconvolution algorithm is developed for Brillouin optical time domain analysis (BOTDA) systems, operating on a measurement acquired with long pulses. The frequency dependency of the Brillouin gain temporal envelope is investigated by simulation, and its impact on the recovered results of the deconvolution algorithm is thoroughly analyzed. To implement a reliable deconvolution process, the differential pulse-width pair (DPP) technique is utilized to effectively eliminate the systematic Brillouin frequency shift (BFS) distortion stemming from the frequency dependency of the temporal envelope. The width of the pulse pairs should be larger than 40 ns, as analyzed theoretically and verified experimentally. It has been demonstrated that the proposed method can realize flexible adjustment of spatial resolution with enhanced signal-to-noise ratio (SNR) from an established measurement with a long pump pulse. In the experiment, the spatial resolution is improved to 0.5 m and 1 m with high measurement accuracy by applying the deconvolution algorithm to the measurement of 60/40 ns DPP signals. Compared with the raw DPP results with the same spatial resolution, 9.2 dB and 8.4 dB SNR improvements are obtained for 0.5 m and 1 m spatial resolution respectively, thanks to the denoising capability of the total variation deconvolution algorithm. The impact of sampling rate on the recovery results is also studied. The proposed sensing system allows for distortion-free Brillouin distributed sensing with higher spatial resolution and enhanced SNR from the conventional DPP setup with long pulse pairs.

Submitted 15 September, 2020; originally announced September 2020.

arXiv:2005.07796 [pdf, other]

FuSSI-Net: Fusion of Spatio-temporal Skeletons for Intention Prediction Network

Authors: Francesco Piccoli, Rajarathnam Balakrishnan, Maria Jesus Perez, Moraldeepsingh Sachdeo, Carlos Nunez, Matthew Tang, Kajsa Andreasson, Kalle Bjurek, Ria Dass Raj, Ebba Davidsson, Colin Eriksson, Victor Hagman, Jonas Sjoberg, Ying Li, L. Srikar Muppirisetty, Sohini Roychowdhury

Abstract: Pedestrian intention recognition is very important to develop robust and safe autonomous driving (AD) and advanced driver assistance systems (ADAS) functionalities for urban driving. In this work, we develop an end-to-end pedestrian intention framework that performs well on day- and night-time scenarios. Our framework relies on object detection bounding boxes combined with skeletal features of human pose. We study early, late, and combined (early and late) fusion mechanisms to exploit the skeletal features and reduce false positives as well as to improve the intention prediction performance. The early fusion mechanism results in an AP of 0.89 and precision/recall of 0.79/0.89 for pedestrian intention classification. Furthermore, we propose three new metrics to properly evaluate pedestrian intention systems. Under these new evaluation metrics for the intention prediction, the proposed end-to-end network offers accurate pedestrian intention up to half a second ahead of the actual risky maneuver.

Submitted 15 May, 2020; originally announced May 2020.

Comments: 5 pages, 6 figures, 5 tables, IEEE Asilomar SSC

arXiv:2004.04362 [pdf, other]

doi 10.1109/TMI.2020.3030047

Detecting Dynamic Community Structure in Functional Brain Networks Across Individuals: A Multilayer Approach

Authors: Chee-Ming Ting, S. Balqis Samdin, Meini Tang, Hernando Ombao

Abstract: We present a unified statistical framework for characterizing community structure of brain functional networks that captures variation across individuals and evolution over time. Existing methods for community detection focus only on single-subject analysis of dynamic networks, while recent extensions to multiple-subject analysis are limited to static networks. To overcome these limitations, we propose a multi-subject, Markov-switching stochastic block model (MSS-SBM) to identify state-related changes in brain community organization over a group of individuals. We first formulate a multilayer extension of SBM to describe the time-dependent, multi-subject brain networks. We develop a novel procedure for fitting the multilayer SBM that builds on multislice modularity maximization, which can uncover a common community partition of all layers (subjects) simultaneously. By augmenting with a dynamic Markov switching process, our proposed method is able to capture a set of distinct, recurring temporal states with respect to inter-community interactions over subjects and the change points between them. Simulation shows accurate community recovery and tracking of dynamic community regimes over multilayer networks by the MSS-SBM. Application to task fMRI reveals meaningful non-assortative brain community motifs, e.g., core-periphery structure at the group level, that are associated with language comprehension and motor functions, suggesting their putative role in complex information integration. Our approach detected dynamic reconfiguration of modular connectivity elicited by varying task demands and identified unique profiles of intra- and inter-community connectivity across different task conditions. The proposed multilayer network representation provides a principled way of detecting synchronous, dynamic modularity in brain networks across subjects.

Submitted 16 October, 2020; v1 submitted 9 April, 2020; originally announced April 2020.

Comments: Main paper: 12 pages, 13 figures. Supplemental file: 16 pages. Accepted for IEEE Trans Medical Imaging

Journal ref: IEEE Trans Medical Imaging, vol. 40, no. 2 (2021) 468 - 480

arXiv:2001.03030 [pdf]

Distributed Brillouin frequency shift extraction via a convolutional neural network

Authors: Yiqing Chang, Hao Wu, Can Zhao, Li Shen, Songnian Fu, Ming Tang

Abstract: Distributed optical fiber Brillouin sensors detect the temperature and strain along a fiber according to the local Brillouin frequency shift, which is usually calculated from the measured Brillouin spectrum using Lorentzian curve fitting. In addition, cross-correlation, principal component analysis, and machine learning methods have been proposed for the more efficient extraction of Brillouin frequency shifts. However, existing methods only process each Brillouin spectrum individually, ignoring the correlation in the time domain, indicating that there is still room for improvement. Here, we propose and experimentally demonstrate a full convolution neural network to extract the distributed Brillouin frequency shift directly from the measured two-dimensional data. Simulated ideal Brillouin spectra with various parameters are used to train the network. Both the simulation and experimental results show that the extraction accuracy of the network is better than that of the traditional curve fitting algorithm with a much shorter processing time. This network has good universality and robustness and can effectively improve the performance of existing Brillouin sensors.

Submitted 9 January, 2020; originally announced January 2020.

Showing 1–50 of 58 results for author: Tang, M