Dynamic Data Pruning for Automatic Speech Recognition

Authors: Qiao Xiao, Pingchuan Ma, Adriana Fernandez-Lopez, Boqian Wu, Lu Yin, Stavros Petridis, Mykola Pechenizkiy, Maja Pantic, Decebal Constantin Mocanu, Shiwei Liu

Abstract: The recent success of Automatic Speech Recognition (ASR) is largely attributed to the ever-growing amount of training data. However, this trend has made model training prohibitively costly and imposed computational demands. While data pruning has been proposed to mitigate this issue by identifying a small subset of relevant data, its application in ASR has been barely explored, and existing works often entail significant overhead to achieve meaningful results. To fill this gap, this paper presents the first investigation of dynamic data pruning for ASR, finding that we can reach the full-data performance by dynamically selecting 70% of data. Furthermore, we introduce Dynamic Data Pruning for ASR (DDP-ASR), which offers several fine-grained pruning granularities specifically tailored for speech-related datasets, going beyond the conventional pruning of entire time sequences. Our intensive experiments show that DDP-ASR can save up to 1.6x training time with negligible performance loss.

Submitted 26 June, 2024; originally announced June 2024.

Comments: Accepted to Interspeech 2024
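The abstract above says only that 70% of the data is selected dynamically; it does not state the selection criterion. The following is a minimal Python sketch of the general idea of re-selecting a subset every epoch, assuming a hypothetical per-utterance score (here a stand-in for the most recent training loss). The names `select_subset`, `train_with_dynamic_pruning`, and `score_fn` are illustrative and are not taken from DDP-ASR.

```python
import random


def select_subset(scores, keep_fraction=0.7, rng=None):
    """Return sorted indices of the examples to train on this epoch.

    Assumption: a higher score means "more informative" (e.g. a higher loss).
    Ties are broken randomly so that equal scores do not favour low indices.
    """
    rng = rng or random.Random(0)
    n_keep = max(1, round(keep_fraction * len(scores)))
    order = sorted(
        range(len(scores)),
        key=lambda i: (scores[i], rng.random()),
        reverse=True,
    )
    return sorted(order[:n_keep])


def train_with_dynamic_pruning(num_examples, num_epochs, score_fn,
                               keep_fraction=0.7, seed=0):
    """Re-select the training subset at the start of every epoch.

    score_fn(index, epoch) is a hypothetical stand-in for one training step
    that returns the example's updated score.
    """
    rng = random.Random(seed)
    # Unseen examples start at +inf so they are selected before seen ones.
    scores = [float("inf")] * num_examples
    history = []
    for epoch in range(num_epochs):
        kept = select_subset(scores, keep_fraction, rng)
        for i in kept:
            scores[i] = score_fn(i, epoch)
        history.append(kept)
    return history
```

Because the subset is recomputed each epoch, an utterance that is skipped in one epoch can return in a later one, which is what distinguishes dynamic pruning from selecting a fixed subset once before training.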

arXiv:2406.10724 [pdf, other]

Beyond the Visible: Jointly Attending to Spectral and Spatial Dimensions with HSI-Diffusion for the FINCH Spacecraft

Authors: Ian Vyse, Rishit Dagli, Dav Vrat Chadha, John P. Ma, Hector Chen, Isha Ruparelia, Prithvi Seran, Matthew Xie, Eesa Aamer, Aidan Armstrong, Naveen Black, Ben Borstein, Kevin Caldwell, Orrin Dahanaggamaarachchi, Joe Dai, Abeer Fatima, Stephanie Lu, Maxime Michet, Anoushka Paul, Carrie Ann Po, Shivesh Prakash, Noa Prosser, Riddhiman Roy, Mirai Shinjo, Iliya Shofman, et al. (4 additional authors not shown)

Abstract: Satellite remote sensing missions have gained popularity over the past fifteen years due to their ability to cover large swaths of land at regular intervals, making them ideal for monitoring environmental trends. The FINCH mission, a 3U+ CubeSat equipped with a hyperspectral camera, aims to monitor crop residue cover in agricultural fields. Although hyperspectral imaging captures both spectral and spatial information, it is prone to various types of noise, including random noise, stripe noise, and dead pixels. Effective denoising of these images is crucial for downstream scientific tasks. Traditional methods, including hand-crafted techniques encoding strong priors, learned 2D image denoising methods applied across different hyperspectral bands, or diffusion generative models applied independently on bands, often struggle with varying noise strengths across spectral bands, leading to significant spectral distortion. This paper presents a novel approach to hyperspectral image denoising using latent diffusion models that integrate spatial and spectral information. We particularly do so by building a 3D diffusion model and presenting a 3-stage training approach on real and synthetically crafted datasets. The proposed method preserves image structure while reducing noise. Evaluations on both popular hyperspectral denoising datasets and synthetically crafted datasets for the FINCH mission demonstrate the effectiveness of this approach.

Submitted 15 June, 2024; originally announced June 2024.

Comments: To appear in 38th Annual Small Satellite Conference

arXiv:2405.02191 [pdf]

Non-Destructive Peat Analysis using Hyperspectral Imaging and Machine Learning

Authors: Yijun Yan, Jinchang Ren, Barry Harrison, Oliver Lewis, Yinhe Li, Ping Ma

Abstract: Peat, a crucial component in whisky production, imparts distinctive and irreplaceable flavours to the final product. However, the extraction of peat disrupts ancient ecosystems and releases significant amounts of carbon, contributing to climate change. This paper aims to address this issue by conducting a feasibility study on enhancing peat use efficiency in whisky manufacturing through non-destructive analysis using hyperspectral imaging. Results show that short-wave infrared (SWIR) data is more effective for analyzing peat samples and predicting total phenol levels, with accuracies up to 99.81%.

Submitted 3 May, 2024; originally announced May 2024.

Comments: 4 pages, 4 figures

arXiv:2405.01362 [pdf, other]

Wideband Penetration Loss through Building Materials and Partitions at 6.75 GHz in FR1(C) and 16.95 GHz in the FR3 Upper Mid-band spectrum

Authors: Dipankar Shakya, Mingjun Ying, Theodore S. Rappaport, Hitesh Poddar, Peijie Ma, Yanbo Wang, Idris Al-Wazani

Abstract: The 4–8 GHz FR1(C) and 7–24 GHz upper mid-band FR3 spectrum are promising new 6G spectrum allocations being considered by the International Telecommunications Union (ITU) and major governments around the world. There is an urgent need to understand the propagation behavior and radio coverage, outage, and material penetration for the global mobile wireless industry in both indoor and outdoor environments in these emerging frequency bands. This work presents measurements and models that describe the penetration loss in co-polarized and cross-polarized antenna configurations, exhibited by common materials found inside buildings and on building perimeters, including concrete, low-emissivity glass, wood, doors, drywall, and whiteboard at 6.75 GHz and 16.95 GHz. Measurement results show consistently lower penetration loss at 6.75 GHz compared to 16.95 GHz for all ten materials measured for co and cross-polarized antennas at incidence. For instance, the low-emissivity glass wall presents 33.7 dB loss at 6.75 GHz, while presenting 42.3 dB loss at 16.95 GHz. Penetration loss at these frequencies is contrasted with measurements at sub-6 GHz, mmWave and sub-THz frequencies along with 3GPP material penetration loss models. The results provide critical knowledge for future 5G and 6G cellular system deployments as well as refinements for the 3GPP material penetration models.

Submitted 2 May, 2024; originally announced May 2024.

Comments: 6 pages, 4 figures, 2 tables, IEEE GLOBECOM 2024

arXiv:2405.01358 [pdf, other]

Propagation measurements and channel models in Indoor Environment at 6.75 GHz FR1(C) and 16.95 GHz FR3 Upper-mid band Spectrum for 5G and 6G

Authors: Dipankar Shakya, Mingjun Ying, Theodore S. Rappaport, Hitesh Poddar, Peijie Ma, Yanbo Wang, Idris Al-Wazani

Abstract: New spectrum allocations in the 4–8 GHz FR1(C) and 7–24 GHz FR3 mid-band frequency spectrum are being considered for 5G/6G cellular deployments. This paper presents results from the world's first comprehensive indoor hotspot (InH) propagation measurement campaign at 6.75 GHz and 16.95 GHz in the NYU WIRELESS Research Center using a 1 GHz wideband channel sounder system over distances from 11 to 97 m in line-of-sight (LOS) and non-LOS (NLOS). Analysis of directional and omnidirectional path loss (PL) using the close-in free space 1 m reference distance model shows a familiar waveguiding effect in LOS with an omnidirectional path loss exponent (PLE) of 1.40 at 6.75 GHz and 1.32 at 16.95 GHz. Compared to mmWave frequencies, the directional NLOS PLEs are lower at FR3 and FR1(C), while omnidirectional NLOS PLEs are similar, suggesting better propagation distances at lower frequencies for links with omnidirectional antennas at both ends of the links, but also, importantly, showing that higher gain antennas will offer better coverage at higher frequencies when antenna apertures are kept the same over all frequencies. Comparison of the omnidirectional and directional RMS delay spread (DS) at FR1(C) and FR3 with mmWave frequencies indicates a clear decrease with increasing frequency. The mean spatial lobe and omnidirectional RMS angular spread (AS) is found to be wider at 6.75 GHz compared to 16.95 GHz indicating more multipath components are found in the azimuthal spatial domain at lower frequencies.

Submitted 6 May, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

Comments: 6 pages, 7 figures, 4 tables, IEEE GLOBECOM 2024
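The close-in (CI) free space reference distance model named in the abstract above is commonly written as PL(f, d) = FSPL(f, 1 m) + 10 n log10(d / 1 m) plus a shadow-fading term, where n is the path loss exponent. The Python sketch below evaluates the mean path loss only (shadow fading omitted), using the omnidirectional LOS exponents quoted in the abstract. The function names are illustrative, and the example distance of 50 m is an arbitrary choice inside the 11 to 97 m measurement range.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def fspl_db(freq_hz, dist_m=1.0):
    """Free-space path loss in dB at the given frequency and distance."""
    return 20.0 * math.log10(4.0 * math.pi * dist_m * freq_hz / SPEED_OF_LIGHT)


def ci_path_loss_db(freq_hz, dist_m, ple):
    """Mean CI path loss in dB with a 1 m reference distance.

    The shadow-fading term of the full model is deliberately left out.
    """
    return fspl_db(freq_hz, 1.0) + 10.0 * ple * math.log10(dist_m)


if __name__ == "__main__":
    # Omnidirectional LOS path loss exponents quoted in the abstract.
    for freq_hz, ple in ((6.75e9, 1.40), (16.95e9, 1.32)):
        loss = ci_path_loss_db(freq_hz, 50.0, ple)
        print(f"{freq_hz / 1e9:.2f} GHz, 50 m, n={ple}: {loss:.1f} dB")
```

An exponent below 2 means the loss grows more slowly with distance than in free space, which is the waveguiding effect the abstract refers to.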

arXiv:2404.09200 [pdf, other]

Tube-RRT*: Efficient Homotopic Path Planning for Swarm Robotics Passing-Through Large-Scale Obstacle Environments

Authors: Pengda Mao, Quan Quan

Abstract: Recently, the concept of optimal virtual tube has emerged as a novel solution to the challenging task of navigating obstacle-dense environments for swarm robotics, offering a wide range of applications. However, it lacks an efficient homotopic path planning method in obstacle-dense environments. This paper introduces Tube-RRT*, an innovative homotopic path planning method that builds upon and improves the Rapidly-exploring Random Tree (RRT) algorithm. Tube-RRT* is specifically designed to generate homotopic paths for the trajectories in the virtual tube, strategically considering opening volume and tube length to mitigate swarm congestion and ensure agile navigation. Through comprehensive comparative simulations conducted within complex, large-scale obstacle environments, we demonstrate the effectiveness of Tube-RRT*.

Submitted 14 April, 2024; originally announced April 2024.

Comments: 8 pages, 8 figures, submitted to RA-L

arXiv:2404.06784 [pdf]

Statistical evaluation of 571 GaAs quantum point contact transistors showing the 0.7 anomaly in quantized conductance using millikelvin cryogenic on-chip multiplexing

Authors: Pengcheng Ma, Kaveh Delfanazari, Reuben K. Puddy, Jiahui Li, Moda Cao, Teng Yi, Jonathan P. Griffiths, Harvey E. Beere, David A. Ritchie, Michael J. Kelly, Charles G. Smith

Abstract: The mass production and the practical number of cryogenic quantum devices producible in a single chip are limited by the number of electrical contact pads and wiring of the cryostat or dilution refrigerator. It is, therefore, beneficial to contrast the measurements of hundreds of devices fabricated in a single chip in one cooldown process to promote the scalability, integrability, reliability, and reproducibility of quantum devices and to save evaluation time, cost and energy. Here, we use a cryogenic on-chip multiplexer architecture and investigate the statistics of the 0.7 anomaly observed on the first three plateaus of the quantized conductance of semiconductor quantum point contact (QPC) transistors. Our single chips contain 256 split gate field effect QPC transistors (QFET) each, with two 16-branch multiplexed source-drain and gate pads, allowing individual transistors to be selected, addressed and controlled through an electrostatic gate voltage process. A total of 1280 quantum transistors with nano-scale dimensions are patterned in 5 different chips of GaAs heterostructures. From the measurements of 571 functioning QPCs taken at temperatures T = 1.4 K and T = 40 mK, it is found that the spontaneous polarisation model and Kondo effect do not fit our results. Furthermore, some of the features in our data largely agree with the van Hove model with short-range interactions. Our approach provides further insight into the quantum mechanical properties and microscopic origin of the 0.7 anomaly in QPCs, paving the way for the development of semiconducting quantum circuits and integrated cryogenic electronics, for scalable quantum logic control, readout, synthesis, and processing applications.

Submitted 10 April, 2024; originally announced April 2024.
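The device counts in the abstract above are consistent with a 16 x 16 addressing scheme: two 16-branch multiplexers select one of 256 transistors per chip, and 5 chips give 1280 devices. The Python sketch below works through that arithmetic. The mapping from a device index to a (source-drain branch, gate branch) pair is a hypothetical illustration, not the chip's documented wiring.

```python
BRANCHES = 16          # branches per multiplexer, from the abstract
DEVICES_PER_CHIP = BRANCHES * BRANCHES
NUM_CHIPS = 5
FUNCTIONING = 571      # functioning QPCs reported in the abstract


def device_address(index):
    """Map a device index on one chip to (source_drain_branch, gate_branch).

    Assumed row-major layout, chosen for illustration only.
    """
    if not 0 <= index < DEVICES_PER_CHIP:
        raise ValueError("device index out of range for one chip")
    return divmod(index, BRANCHES)


def functioning_fraction():
    """Fraction of all patterned devices that were functioning."""
    return FUNCTIONING / (NUM_CHIPS * DEVICES_PER_CHIP)


if __name__ == "__main__":
    print(DEVICES_PER_CHIP, NUM_CHIPS * DEVICES_PER_CHIP)
    print(device_address(37))
    print(f"{functioning_fraction():.1%}")
```

Under these figures, roughly 45% of the patterned transistors contributed measurements to the statistics.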

arXiv:2401.17575 [pdf, other]

Can We Improve Channel Reciprocity via Loop-back Compensation for RIS-assisted Physical Layer Key Generation

Authors: Ningya Xu, Guoshun Nan, Xiaofeng Tao, Na Li, Pengxuan Mao, Tianyuan Yang

Abstract: Reconfigurable intelligent surface (RIS) facilitates the extraction of unpredictable channel features for physical layer key generation (PKG), securing communications among legitimate users with symmetric keys. Previous works have demonstrated that channel reciprocity plays a crucial role in generating symmetric keys in PKG systems, whereas, in reality, reciprocity is greatly affected by hardware interference and RIS-based jamming attacks. This motivates us to propose LoCKey, a novel approach that aims to improve channel reciprocity by mitigating interferences and attacks with a loop-back compensation scheme, thus maximizing the secrecy performance of the PKG system. Specifically, our proposed LoCKey is capable of effectively compensating for the CSI non-reciprocity by the combination of transmit-back signal value and error minimization module. Firstly, we introduce the entire flowchart of our method and provide an in-depth discussion of each step. Following that, we delve into a theoretical analysis of the performance optimizations when our LoCKey is applied for CSI reciprocity enhancement. Finally, we conduct experiments to verify the effectiveness of the proposed LoCKey in improving channel reciprocity under various interferences for RIS-assisted wireless communications. The results demonstrate a significant improvement in both the rate of key generation assisted by the RIS and the consistency of the generated keys, showing great potential for the practical deployment of our LoCKey in future wireless systems.

Submitted 13 August, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

Comments: Accepted by ICC 2024

arXiv:2310.17864 [pdf, other]

TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch

Authors: Jeff Hwang, Moto Hira, Caroline Chen, Xiaohui Zhang, Zhaoheng Ni, Guangzhi Sun, Pingchuan Ma, Ruizhe Huang, Vineel Pratap, Yuekai Zhang, Anurag Kumar, Chin-Yun Yu, Chuang Zhu, Chunxi Liu, Jacob Kahn, Mirco Ravanelli, Peng Sun, Shinji Watanabe, Yangyang Shi, Yumeng Tao, Robin Scheibler, Samuele Cornell, Sean Kim, Stavros Petridis

Abstract: TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of audio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by developing impactful features. Here, we survey TorchAudio's development principles and contents and highlight key features we include in its latest version (2.1): self-supervised learning pre-trained pipelines and training recipes, high-performance CTC decoders, speech recognition models and training recipes, advanced media I/O capabilities, and tools for performing forced alignment, multi-channel speech enhancement, and reference-less speech assessment. For a selection of these features, through empirical studies, we demonstrate their efficacy and show that they achieve competitive or state-of-the-art performance.

Submitted 26 October, 2023; originally announced October 2023.

arXiv:2303.17200 [pdf, other]

SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision

Authors: Xubo Liu, Egor Lakomkin, Konstantinos Vougioukas, Pingchuan Ma, Honglie Chen, Ruiming Xie, Morrie Doulaty, Niko Moritz, Jáchym Kolář, Stavros Petridis, Maja Pantic, Christian Fuegen

Abstract: Recently reported state-of-the-art results in visual speech recognition (VSR) often rely on increasingly large amounts of video data, while the publicly available transcribed video datasets are limited in size. In this paper, for the first time, we study the potential of leveraging synthetic visual data for VSR. Our method, termed SynthVSR, substantially improves the performance of VSR systems with synthetic lip movements. The key idea behind SynthVSR is to leverage a speech-driven lip animation model that generates lip movements conditioned on the input speech. The speech-driven lip animation model is trained on an unlabeled audio-visual dataset and could be further optimized towards a pre-trained VSR model when labeled videos are available. As plenty of transcribed acoustic data and face images are available, we are able to generate large-scale synthetic data using the proposed lip animation model for semi-supervised VSR training. We evaluate the performance of our approach on the largest public VSR benchmark - Lip Reading Sentences 3 (LRS3). SynthVSR achieves a WER of 43.3% with only 30 hours of real labeled data, outperforming off-the-shelf approaches using thousands of hours of video. The WER is further reduced to 27.9% when using all 438 hours of labeled data from LRS3, which is on par with the state-of-the-art self-supervised AV-HuBERT method. Furthermore, when combined with large-scale pseudo-labeled audio-visual data SynthVSR yields a new state-of-the-art VSR WER of 16.9% using publicly available data only, surpassing the recent state-of-the-art approaches trained with 29 times more non-public machine-transcribed video data (90,000 hours). Finally, we perform extensive ablation studies to understand the effect of each component in our proposed method.

Submitted 3 April, 2023; v1 submitted 30 March, 2023; originally announced March 2023.

Comments: IEEE/CVF CVPR 2023
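Several entries in this listing report word error rate (WER). As a reminder of what the metric measures, here is a minimal Python sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a hypothesis and a reference, divided by the number of reference words. It splits on whitespace and applies no text normalisation, which published evaluations typically do; the function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the words of ref seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.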

arXiv:2303.14307 [pdf, other]

doi 10.1109/ICASSP49357.2023.10096889

Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels

Authors: Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic

Abstract: Audio-visual speech recognition has received a lot of attention due to its robustness against acoustic noise. Recently, the performance of automatic, visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR, respectively) has been substantially improved, mainly due to the use of larger models and training sets. However, accurate labelling of datasets is time-consuming and expensive. Hence, in this work, we investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size. For this purpose, we use publicly-available pre-trained ASR models to automatically transcribe unlabelled datasets such as AVSpeech and VoxCeleb2. Then, we train ASR, VSR and AV-ASR models on the augmented training set, which consists of the LRS2 and LRS3 datasets as well as the additional automatically-transcribed data. We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced WER despite using noisy transcriptions. The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3. In particular, it achieves a WER of 0.9% on LRS3, a relative improvement of 30% over the current state-of-the-art approach, and outperforms methods that have been trained on non-publicly available datasets with 26 times more training data.

Submitted 28 June, 2023; v1 submitted 24 March, 2023; originally announced March 2023.

Comments: Accepted to ICASSP 2023

arXiv:2303.09455 [pdf, other]

Learning Cross-lingual Visual Speech Representations

Authors: Andreas Zinonos, Alexandros Haliassos, Pingchuan Ma, Stavros Petridis, Maja Pantic

Abstract: Cross-lingual self-supervised learning has been a growing research topic in the last few years. However, current works only explored the use of audio signals to create representations. In this work, we study cross-lingual self-supervised visual representation learning. We use the recently-proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled multilingual data, and then fine-tune the visual model on labelled transcriptions. Our experiments show that: (1) multi-lingual models with more data outperform monolingual ones, but, when keeping the amount of data fixed, monolingual models tend to reach better performance; (2) multi-lingual outperforms English-only pre-training; (3) using languages which are more similar yields better results; and (4) fine-tuning on unseen languages is competitive to using the target language in the pre-training set. We hope our study inspires future research on non-English-only speech representation learning.

Submitted 14 March, 2023; originally announced March 2023.

arXiv:2302.13854 [pdf, other]

doi 10.1093/rasti/rzad056

A Deep Neural Network Based Reverse Radio Spectrogram Search Algorithm

Authors: Peter Xiangyuan Ma, Steve Croft, Chris Lintott, Andrew P. V. Siemion

Abstract: Modern radio astronomy instruments generate vast amounts of data, and the increasingly challenging radio frequency interference (RFI) environment necessitates ever-more sophisticated RFI rejection algorithms. The "needle in a haystack" nature of searches for transients and technosignatures requires us to develop methods that can determine whether a signal of interest has unique properties, or is a part of some larger set of pernicious RFI. In the past, this vetting has required onerous manual inspection of very large numbers of signals. In this paper we present a fast and modular deep learning algorithm to search for lookalike signals of interest in radio spectrogram data. First, we trained a B-Variational Autoencoder on signals returned by an energy detection algorithm. We then adapted a positional embedding layer from classical Transformer architecture to embed additional metadata, which we demonstrate using a frequency-based embedding. Next we used the encoder component of the B-Variational Autoencoder to extract features from small (~715 Hz, with a resolution of 2.79 Hz per frequency bin) windows in the radio spectrogram. We used our algorithm to conduct a search for a given query (encoded signal of interest) on a set of signals (encoded features of searched items) to produce the top candidates with similar features. We successfully demonstrate that the algorithm retrieves signals with similar appearance, given only the original radio spectrogram data. This algorithm can be used to improve the efficiency of vetting signals of interest in technosignature searches, but could also be applied to a wider variety of searches for "lookalike" signals in large astronomical datasets.

Submitted 18 January, 2024; v1 submitted 23 February, 2023; originally announced February 2023.

Comments: 8 pages, 8 figures

Journal ref: RAS Techniques and Instruments 2023
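The search step described in the abstract above, ranking encoded items against an encoded query and returning the top candidates, can be sketched as a nearest-neighbour lookup over feature vectors. The Python sketch below assumes cosine similarity as the ranking score; the abstract does not name the metric, and the short vectors here are made-up stand-ins for encoder outputs. The function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query, items, k=3):
    """Return the indices of the k items most similar to the query.

    Items are ranked by descending similarity; ties keep the lower index first.
    """
    scored = [(cosine_similarity(query, vec), idx) for idx, vec in enumerate(items)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


if __name__ == "__main__":
    # Hypothetical 2-D "encoded features"; a real encoder output would be larger.
    encoded_items = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    print(top_k_similar([1.0, 0.0], encoded_items, k=2))
```

For scale, a window of about 715 Hz at 2.79 Hz per frequency bin corresponds to roughly 256 bins, so each searched item is a small spectrogram patch rather than a full observation.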

arXiv:2211.02133 [pdf, other]

Streaming Audio-Visual Speech Recognition with Alignment Regularization

Authors: Pingchuan Ma, Niko Moritz, Stavros Petridis, Christian Fuegen, Maja Pantic

Abstract: In this work, we propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture. The audio and the visual encoder neural networks are both based on the conformer architecture, which is made streamable using chunk-wise self-attention (CSA) and causal convolution. Streaming recognition with a decoder neural network is realized by using the triggered attention technique, which performs time-synchronous decoding with joint CTC/attention scoring. Additionally, we propose a novel alignment regularization technique that promotes synchronization of the audio and visual encoder, which in turn results in better word error rates (WERs) at all SNR levels for streaming and offline AV-ASR models. The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 (LRS3) dataset in an offline and online setup, respectively, which both present state-of-the-art results when no external training data are used.

Submitted 1 July, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

Comments: Accepted to Interspeech 2023

arXiv:2207.14166 [pdf, ps, other]

RHA-Net: An Encoder-Decoder Network with Residual Blocks and Hybrid Attention Mechanisms for Pavement Crack Segmentation

Authors: Guijie Zhu, Zhun Fan, Jiacheng Liu, Duan Yuan, Peili Ma, Meihua Wang, Weihua Sheng, Kelvin C. P. Wang

Abstract: The acquisition and evaluation of pavement surface data play an essential role in pavement condition evaluation. In this paper, an efficient and effective end-to-end network for automatic pavement crack segmentation, called RHA-Net, is proposed to improve the pavement crack segmentation accuracy. The RHA-Net is built by integrating residual blocks (ResBlocks) and hybrid attention blocks into the encoder-decoder architecture. The ResBlocks are used to improve the ability of RHA-Net to extract high-level abstract features. The hybrid attention blocks are designed to fuse both low-level features and high-level features to help the model focus on correct channels and areas of cracks, thereby improving the feature presentation ability of RHA-Net. An image data set containing 789 pavement crack images collected by a self-designed mobile robot is constructed and used for training and evaluating the proposed model. Compared with other state-of-the-art networks, the proposed model achieves better performance and the functionalities of adding residual blocks and hybrid attention mechanisms are validated in a comprehensive ablation study. Additionally, a lightweight version of the model generated by introducing depthwise separable convolution achieves better performance and a much faster processing speed with 1/30 of the number of U-Net parameters. The developed system can segment pavement cracks in real time on a Jetson TX2 embedded device (25 FPS). The video taken in real-time experiments is released at https://youtu.be/3XIogk0fiG4.

Submitted 28 July, 2022; originally announced July 2022.

arXiv:2202.13084 [pdf, other]

doi 10.1038/s42256-022-00550-z

Visual Speech Recognition for Multiple Languages in the Wild

Authors: Pingchuan Ma, Stavros Petridis, Maja Pantic

Abstract: Visual speech recognition (VSR) aims to recognize the content of speech based on lip movements, without relying on the audio stream. Advances in deep learning and the availability of large audio-visual datasets have led to the development of much more accurate and robust VSR models than ever before. However, these advances are usually due to the larger training sets rather than the model design. Here we demonstrate that designing better models is equally as important as using larger training sets. We propose the addition of prediction-based auxiliary tasks to a VSR model, and highlight the importance of hyperparameter optimization and appropriate data augmentations. We show that such a model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin. It even outperforms models that were trained on non-publicly available datasets containing up to 21 times more data. We show, furthermore, that using additional training data, even in other languages or with automatically generated transcriptions, results in further improvement.

Submitted 30 October, 2022; v1 submitted 26 February, 2022; originally announced February 2022.

Comments: Published in Nature Machine Intelligence

arXiv:2202.09020 [pdf, other]

A Comprehensive Survey with Quantitative Comparison of Image Analysis Methods for Microorganism Biovolume Measurements

Authors: Jiawei Zhang, Chen Li, Md Mamunur Rahaman, Yudong Yao, Pingli Ma, Jinghua Zhang, Xin Zhao, Tao Jiang, Marcin Grzegorzek

Abstract: With the acceleration of urbanization and rising living standards, microorganisms play increasingly important roles in industrial production, bio-technique, and food safety testing. Microorganism biovolume measurements are one of the essential parts of microbial analysis. However, traditional manual measurement methods are time-consuming and make it challenging to measure the characteristics precisely. With the development of digital image processing techniques, the characteristics of the microbial population can be detected and quantified. The changing trend can be adjusted in time and provide a basis for improvement. The applications of the microorganism biovolume measurement method have developed since the 1980s. More than 62 articles are reviewed in this study, and the articles are grouped by digital image segmentation methods with periods. This study has high research significance and application value, and can help microbial researchers gain a comprehensive understanding of microorganism biovolume measurements using digital image analysis methods and potential applications.

Submitted 2 May, 2022; v1 submitted 17 February, 2022; originally announced February 2022.

arXiv:2202.07820 [pdf, other]

A Survey of Semen Quality Evaluation in Microscopic Videos Using Computer Assisted Sperm Analysis

Authors: Wenwei Zhao, Pingli Ma, Chen Li, Xiaoning Bu, Shuojia Zou, Tao Jiang, Marcin Grzegorzek

Abstract: Computer Assisted Sperm Analysis (CASA) plays a crucial role in male reproductive health diagnosis and infertility treatment. With the development of the computer industry in recent years, a great number of accurate algorithms have been proposed. With the assistance of those novel algorithms, it is possible for CASA to achieve a faster and higher quality result. Since image processing is the technical basis of CASA, including pre-processing, feature extraction, target detection and tracking, these methods are important technical steps in dealing with CASA. The various works related to Computer Assisted Sperm Analysis methods in the last 30 years (since 1988) are comprehensively introduced and analysed in this survey. To facilitate understanding, the methods involved are analysed in the sequence of general steps in sperm analysis. In other words, the methods related to sperm detection (localization) are first analysed, and then the methods of sperm tracking are analysed. Besides this, we analyse the present situation and future prospects of CASA. Based on our work, the feasibility of applying the methods mentioned in this review to sperm microscopic videos is explained. Moreover, existing challenges of object detection and tracking in microscope video could potentially be solved with inspiration from this survey.

Submitted 17 February, 2022; v1 submitted 15 February, 2022; originally announced February 2022.

arXiv:2106.09171 [pdf, other]

LiRA: Learning Visual Speech Representations from Audio through Self-supervision

Authors: Pingchuan Ma, Rodrigo Mira, Stavros Petridis, Björn W. Schuller, Maja Pantic

Abstract: The large amount of audiovisual content being shared online today has drawn substantial attention to the prospect of audiovisual self-supervised learning. Recent works have focused on each of these modalities separately, while others have attempted to model both simultaneously in a cross-modal fashion. However, comparatively little attention has been given to leveraging one modality as a training objective to learn from the other. In this work, we propose Learning visual speech Representations from Audio via self-supervision (LiRA). Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech. We find that this pre-trained model can be leveraged towards word-level and sentence-level lip-reading through feature extraction and fine-tuning experiments. We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild (LRW) dataset and achieves state-of-the-art performance on Lip Reading Sentences 2 (LRS2) using only a fraction of the total labelled data.

Submitted 16 June, 2021; originally announced June 2021.

Comments: Accepted for publication at Interspeech 2021

arXiv:2104.13332 [pdf, other]

doi 10.1109/TCYB.2022.3162495

End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks

Authors: Rodrigo Mira, Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Björn W. Schuller, Maja Pantic

Abstract: Video-to-speech is the process of reconstructing the audio speech from a video of a spoken utterance. Previous approaches to this task have relied on a two-step process where an intermediate representation is inferred from the video, and is then decoded into waveform audio using a vocoder or a waveform reconstruction algorithm. In this work, we propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs) which translates spoken video to waveform end-to-end without using any intermediate representation or separate waveform synthesis algorithm. Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech, which is then fed to a waveform critic and a power critic. The use of an adversarial loss based on these two critics enables the direct synthesis of raw audio waveform and ensures its realism. In addition, the use of our three comparative losses helps establish direct correspondence between the generated audio and the input video. We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID, and that it is the first end-to-end model to produce intelligible speech for LRW (Lip Reading in the Wild), featuring hundreds of speakers recorded entirely 'in the wild'. We evaluate the generated samples in two different scenarios -- seen and unseen speakers -- using four objective metrics which measure the quality and intelligibility of artificial speech. We demonstrate that the proposed approach outperforms all previous works in most metrics on GRID and LRW.

Submitted 15 August, 2022; v1 submitted 27 April, 2021; originally announced April 2021.

Comments: Published in IEEE Transactions on Cybernetics (April 2022)

arXiv:2103.13625 [pdf, other]

A Comprehensive Review of Image Analysis Methods for Microorganism Counting: From Classical Image Processing to Deep Learning Approaches

Authors: Jiawei Zhang, Chen Li, Md Mamunur Rahaman, Yudong Yao, Pingli Ma, Jinghua Zhang, Xin Zhao, Tao Jiang, Marcin Grzegorzek

Abstract: Microorganisms such as bacteria and fungi play essential roles in many application fields, such as biotechnology, medical technology and industry. Microorganism counting techniques are crucial in microorganism analysis, helping biologists and related researchers quantitatively analyze the microorganisms and calculate their characteristics, such as biomass concentration and biological activity. However, traditional manual microorganism counting methods, such as the plate counting method, hemocytometry and turbidimetry, are time-consuming, subjective and need complex operations, which makes them difficult to apply at large scale. To improve this situation, image analysis has been applied to microorganism counting since the 1980s, and consists of digital image processing, image segmentation, image classification and the like. Image analysis-based microorganism counting methods are efficient compared with traditional plate counting methods. In this article, we have studied the development of microorganism counting methods using digital image analysis. Firstly, the microorganisms are grouped as bacteria and other microorganisms. Then, the related articles are summarized based on image segmentation methods. Each part of the article is reviewed by methodologies. Moreover, commonly used image processing methods for microorganism counting are summarized and analyzed to find common technological points. More than 144 papers are outlined in this article. In conclusion, this paper provides new ideas for the future development trend of microorganism counting, and provides systematic suggestions for implementing integrated microorganism counting systems in the future. Researchers in other fields can refer to the techniques analyzed in this paper.

Submitted 29 September, 2021; v1 submitted 25 March, 2021; originally announced March 2021.

arXiv:2103.03447 [pdf, other]

User-Centric Cooperative MEC Service Offloading

Authors: Ruoyun Chen, Hancheng Lu, Pengfei Ma

Abstract: Mobile edge computing provides users with a cloud environment close to the edge of the wireless network, supporting the computing intensive applications that have low latency requirements. The combination of offloading with the wireless communication brings new challenges. This paper investigates the service caching problem during the long-term service offloading in the user-centric wireless network. To meet the time-varying service demands of a typical user, a cooperative service caching strategy in the unit of the base station (BS) cluster is proposed. We formulate the caching problem as a time-averaged completion delay minimization problem and transform it into time-decoupled instantaneous problems with a virtual caching cost queue at first. Then we propose a distributed algorithm which is based on the consensus-sharing alternating direction method of multipliers to solve each instantaneous problem. The simulations validate that the proposed online distributed service caching algorithm can achieve the optimal time-averaged completion delay of offloading tasks with the smallest caching cost in the unit of a BS cluster.

Submitted 4 March, 2021; originally announced March 2021.

Comments: 6 pages

arXiv:2102.06657 [pdf, other]

End-to-end Audio-visual Speech Recognition with Conformers

Authors: Pingchuan Ma, Stavros Petridis, Maja Pantic

Abstract: In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and Convolution-augmented transformer (Conformer), that can be trained in an end-to-end manner. In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms, respectively, which are then fed to conformers and then fusion takes place via a Multi-Layer Perceptron (MLP). The model learns to recognise characters using a combination of CTC and an attention mechanism. We show that end-to-end training, instead of using pre-computed visual features which is common in the literature, the use of a conformer, instead of a recurrent network, and the use of a transformer-based language model, significantly improve the performance of our model. We present results on the largest publicly available datasets for sentence-level speech recognition, Lip Reading Sentences 2 (LRS2) and Lip Reading Sentences 3 (LRS3), respectively. The results show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.

Submitted 12 February, 2021; originally announced February 2021.

Comments: Accepted to ICASSP 2021

arXiv:2001.08702 [pdf, other]

Lipreading using Temporal Convolutional Networks

Authors: Brais Martinez, Pingchuan Ma, Stavros Petridis, Maja Pantic

Abstract: Lip-reading has attracted a lot of research attention lately thanks to advances in deep learning. The current state-of-the-art model for recognition of isolated words in-the-wild consists of a residual network and Bidirectional Gated Recurrent Unit (BGRU) layers. In this work, we address the limitations of this model and we propose changes which further improve its performance. Firstly, the BGRU layers are replaced with Temporal Convolutional Networks (TCN). Secondly, we greatly simplify the training procedure, which allows us to train the model in one single stage. Thirdly, we show that the current state-of-the-art methodology produces models that do not generalize well to variations on the sequence length, and we address this issue by proposing a variable-length augmentation. We present results on the largest publicly-available datasets for isolated word recognition in English and Mandarin, LRW and LRW1000, respectively. Our proposed model results in an absolute improvement of 1.2% and 3.2%, respectively, in these datasets which is the new state-of-the-art performance.

Submitted 23 January, 2020; originally announced January 2020.

arXiv:2001.04316 [pdf, other]

Visually Guided Self Supervised Learning of Speech Representations

Authors: Abhinav Shukla, Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Maja Pantic

Abstract: Self supervised representation learning has recently attracted a lot of research interest for both the audio and visual modalities. However, most works typically focus on a particular modality or feature alone and there has been very limited work that studies the interaction between the two modalities for learning self supervised representations. We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech. We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment. Through this process, the audio encoder network learns useful speech representations that we evaluate on emotion recognition and speech recognition. We achieve state of the art results for emotion recognition and competitive results for speech recognition. This demonstrates the potential of visual supervision for learning audio representations as a novel way for self-supervised learning which has not been explored in the past. The proposed unsupervised audio features can leverage a virtually unlimited amount of training data of unlabelled audiovisual speech and have a large number of potentially promising applications.

Submitted 20 February, 2020; v1 submitted 13 January, 2020; originally announced January 2020.

Comments: Accepted at ICASSP 2020 v2: Updated to the ICASSP 2020 camera ready version

arXiv:1912.08639 [pdf, other]

Detecting Adversarial Attacks On Audiovisual Speech Recognition

Authors: Pingchuan Ma, Stavros Petridis, Maja Pantic

Abstract: Adversarial attacks pose a threat to deep learning models. However, research on adversarial detection methods, especially in the multi-modal domain, is very limited. In this work, we propose an efficient and straightforward detection method based on the temporal correlation between audio and video streams. The main idea is that the correlation between audio and video in adversarial examples will be lower than in benign examples due to added adversarial noise. We use the synchronisation confidence score as a proxy for audiovisual correlation and based on it we can detect adversarial attacks. To the best of our knowledge, this is the first work on detection of adversarial attacks on audiovisual speech recognition models. We apply recent adversarial attacks on two audiovisual speech recognition models trained on the GRID and LRW datasets. The experimental results demonstrate that the proposed approach is an effective way for detecting such attacks.

Submitted 12 February, 2021; v1 submitted 18 December, 2019; originally announced December 2019.

Comments: Accepted to ICASSP 2021

arXiv:1906.06301 [pdf, other]

Video-Driven Speech Reconstruction using Generative Adversarial Networks

Authors: Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Maja Pantic

Abstract: Speech is a means of communication which relies on both audio and visual information. The absence of one modality can often lead to confusion or misinterpretation of information. In this paper we present an end-to-end temporal model capable of directly synthesising audio from silent video, without needing to transform to-and-from intermediate features. Our proposed approach, based on GANs, is capable of producing natural sounding, intelligible speech which is synchronised with the video. The performance of our model is evaluated on the GRID dataset for both speaker dependent and speaker independent scenarios. To the best of our knowledge this is the first method that maps video directly to raw audio and the first to produce intelligible speech when tested on previously unseen speakers. We evaluate the synthesised audio not only based on the sound quality but also on the accuracy of the spoken words.

Submitted 14 June, 2019; originally announced June 2019.

arXiv:1906.02112 [pdf, other]

Investigating the Lombard Effect Influence on End-to-End Audio-Visual Speech Recognition

Authors: Pingchuan Ma, Stavros Petridis, Maja Pantic

Abstract: Several audio-visual speech recognition models have been recently proposed which aim to improve the robustness over audio-only models in the presence of noise. However, almost all of them ignore the impact of the Lombard effect, i.e., the change in speaking style in noisy environments which aims to make speech more intelligible and affects both the acoustic characteristics of speech and the lip movements. In this paper, we investigate the impact of the Lombard effect in audio-visual speech recognition. To the best of our knowledge, this is the first work which does so using end-to-end deep architectures and presents results on unseen speakers. Our results show that properly modelling Lombard speech is always beneficial. Even if a relatively small amount of Lombard speech is added to the training set then the performance in a real scenario, where noisy Lombard speech is present, can be significantly improved. We also show that the standard approach followed in the literature, where a model is trained and tested on noisy plain speech, provides a correct estimate of the video-only performance and slightly underestimates the audio-visual performance. In case of audio-only approaches, performance is overestimated for SNRs higher than -3dB and underestimated for lower SNRs.

Submitted 9 July, 2019; v1 submitted 5 June, 2019; originally announced June 2019.

Comments: Accepted for publication at Interspeech 2019

arXiv:1903.03474 [pdf, other]

doi 10.1109/JLT.2019.2945017

Demonstration of multivariate photonics: blind dimensionality reduction with analog integrated photonics

Authors: Alexander N. Tait, Philip Y. Ma, Thomas Ferreira de Lima, Eric C. Blow, Matthew P. Chang, Mitchell A. Nahmias, Bhavin J. Shastri, Paul R. Prucnal

Abstract: Multi-antenna radio front-ends generate a multi-dimensional flood of information, most of which is partially redundant. Redundancy is eliminated by dimensionality reduction, but contemporary digital processing techniques face harsh fundamental tradeoffs when implementing this class of functions. These tradeoffs can be broken in the analog domain, in which the performance of optical technologies greatly exceeds that of electronic counterparts. Here, we present concepts, methods, and a first demonstration of multivariate photonics: a combination of integrated photonic hardware, analog dimensionality reduction, and blind algorithmic techniques. We experimentally demonstrate 2-channel, 1.0 GHz principal component analysis in a photonic weight bank using recently proposed algorithms for synthesizing the multivariate properties of signals to which the receiver is blind. Novel methods are introduced for controlling blindness conditions in a laboratory context. This work provides a foundation for further research in multivariate photonic information processing, which is poised to play a role in future generations of wireless technology.

Submitted 10 February, 2019; originally announced March 2019.

Comments: 24 pages, 7 figures

Showing 1–29 of 29 results for author: Mao, P