Search | arXiv e-print repository

Exploiting Consistency-Preserving Loss and Perceptual Contrast Stretching to Boost SSL-based Speech Enhancement

Authors: Muhammad Salman Khan, Moreno La Quatra, Kuo-Hsuan Hung, Szu-Wei Fu, Sabato Marco Siniscalchi, Yu Tsao

Abstract: Self-supervised representation learning (SSL) has attained SOTA results on several downstream speech tasks, but SSL-based speech enhancement (SE) solutions still lag behind. To address this issue, we exploit three main ideas: (i) Transformer-based masking generation, (ii) consistency-preserving loss, and (iii) perceptual contrast stretching (PCS). In detail, conformer layers, leveraging an attenti… ▽ More Self-supervised representation learning (SSL) has attained SOTA results on several downstream speech tasks, but SSL-based speech enhancement (SE) solutions still lag behind. To address this issue, we exploit three main ideas: (i) Transformer-based masking generation, (ii) consistency-preserving loss, and (iii) perceptual contrast stretching (PCS). In detail, conformer layers, leveraging an attention mechanism, are introduced to effectively model frame-level representations and obtain the Ideal Ratio Mask (IRM) for SE. Moreover, we incorporate consistency in the loss function, which processes the input to account for the inconsistency effects of signal reconstruction from the spectrogram. Finally, PCS is employed to improve the contrast of input and target features according to perceptual importance. Evaluated on the VoiceBank-DEMAND task, the proposed solution outperforms previously SSL-based SE solutions when tested on several objective metrics, attaining a SOTA PESQ score of 3.54. △ Less

Submitted 8 August, 2024; originally announced August 2024.

arXiv:2407.07347 [pdf, other]

MNeRV: A Multilayer Neural Representation for Videos

Authors: Qingling Chang, Haohui Yu, Shuxuan Fu, Zhiqiang Zeng, Chuangquan Chen

Abstract: As a novel video representation method, Neural Representations for Videos (NeRV) has shown great potential in the fields of video compression, video restoration, and video interpolation. In the process of representing videos using NeRV, each frame corresponds to an embedding, which is then reconstructed into a video frame sequence after passing through a small number of decoding layers (E-NeRV, HN… ▽ More As a novel video representation method, Neural Representations for Videos (NeRV) has shown great potential in the fields of video compression, video restoration, and video interpolation. In the process of representing videos using NeRV, each frame corresponds to an embedding, which is then reconstructed into a video frame sequence after passing through a small number of decoding layers (E-NeRV, HNeRV, etc.). However, this small number of decoding layers can easily lead to the problem of redundant model parameters due to the large proportion of parameters in a single decoding layer, which greatly restricts the video regression ability of neural network models. In this paper, we propose a multilayer neural representation for videos (MNeRV) and design a new decoder M-Decoder and its matching encoder M-Encoder. MNeRV has more encoding and decoding layers, which effectively alleviates the problem of redundant model parameters caused by too few layers. In addition, we design MNeRV blocks to perform more uniform and effective parameter allocation between decoding layers. In the field of video regression reconstruction, we achieve better reconstruction quality (+4.06 PSNR) with fewer parameters. Finally, we showcase MNeRV performance in downstream tasks such as video restoration and video interpolation. The source code of MNeRV is available at https://github.com/Aaronbtb/MNeRV. △ Less

Submitted 9 July, 2024; originally announced July 2024.

Comments: 14 pages, 12 figures, 8 table

arXiv:2406.18871 [pdf, other]

DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment

Authors: Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, He Huang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee

Abstract: Recent speech language models (SLMs) typically incorporate pre-trained speech models to extend the capabilities from large language models (LLMs). In this paper, we propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities, enabling SLMs to interpret and generate comprehensive natural language descriptions, thereby fa… ▽ More Recent speech language models (SLMs) typically incorporate pre-trained speech models to extend the capabilities from large language models (LLMs). In this paper, we propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities, enabling SLMs to interpret and generate comprehensive natural language descriptions, thereby facilitating the capability to understand both linguistic and non-linguistic features in speech. Enhanced with the proposed approach, our model demonstrates superior performance on the Dynamic-SUPERB benchmark, particularly in generalizing to unseen tasks. Moreover, we discover that the aligned model exhibits a zero-shot instruction-following capability without explicit speech instruction tuning. These findings highlight the potential to reshape instruction-following SLMs by incorporating rich, descriptive speech captions. △ Less

Submitted 26 June, 2024; originally announced June 2024.

Comments: Accepted to Interspeech 2024

arXiv:2405.06573 [pdf, other]

An Investigation of Incorporating Mamba for Speech Enhancement

Authors: Rong Chao, Wen-Huang Cheng, Moreno La Quatra, Sabato Marco Siniscalchi, Chao-Han Huck Yang, Szu-Wei Fu, Yu Tsao

Abstract: This work aims to study a scalable state-space model (SSM), Mamba, for the speech enhancement (SE) task. We exploit a Mamba-based regression model to characterize speech signals and build an SE system upon Mamba, termed SEMamba. We explore the properties of Mamba by integrating it as the core model in both basic and advanced SE systems, along with utilizing signal-level distances as well as metric… ▽ More This work aims to study a scalable state-space model (SSM), Mamba, for the speech enhancement (SE) task. We exploit a Mamba-based regression model to characterize speech signals and build an SE system upon Mamba, termed SEMamba. We explore the properties of Mamba by integrating it as the core model in both basic and advanced SE systems, along with utilizing signal-level distances as well as metric-oriented loss functions. SEMamba demonstrates promising results and attains a PESQ score of 3.55 on the VoiceBank-DEMAND dataset. When combined with the perceptual contrast stretching technique, the proposed SEMamba yields a new state-of-the-art PESQ score of 3.69. △ Less

Submitted 10 May, 2024; originally announced May 2024.

arXiv:2402.16321 [pdf, other]

Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech

Authors: Szu-Wei Fu, Kuo-Hsuan Hung, Yu Tsao, Yu-Chiang Frank Wang

Abstract: Speech quality estimation has recently undergone a paradigm shift from human-hearing expert designs to machine-learning models. However, current models rely mainly on supervised learning, which is time-consuming and expensive for label collection. To solve this problem, we propose VQScore, a self-supervised metric for evaluating speech based on the quantization error of a vector-quantized-variatio… ▽ More Speech quality estimation has recently undergone a paradigm shift from human-hearing expert designs to machine-learning models. However, current models rely mainly on supervised learning, which is time-consuming and expensive for label collection. To solve this problem, we propose VQScore, a self-supervised metric for evaluating speech based on the quantization error of a vector-quantized-variational autoencoder (VQ-VAE). The training of VQ-VAE relies on clean speech; hence, large quantization errors can be expected when the speech is distorted. To further improve correlation with real quality scores, domain knowledge of speech processing is incorporated into the model design. We found that the vector quantization mechanism could also be used for self-supervised speech enhancement (SE) model training. To improve the robustness of the encoder for SE, a novel self-distillation mechanism combined with adversarial training is introduced. In summary, the proposed speech quality estimation method and enhancement models require only clean speech for training without any label requirements. Experimental results show that the proposed VQScore and enhancement model are competitive with supervised baselines. The code will be released after publication. △ Less

Submitted 26 February, 2024; originally announced February 2024.

Comments: Published as a conference paper at ICLR 2024

arXiv:2401.12468 [pdf, ps, other]

Minimum observability of probabilistic Boolean networks

Authors: Jiayi Xu, Shihua Fu, Liyuan Xia, Jianjun Wang

Abstract: This paper studies the minimum observability of probabilistic Boolean networks (PBNs), the main objective of which is to add the fewest measurements to make an unobservable PBN become observable. First of all, the algebraic form of a PBN is established with the help of semi-tensor product (STP) of matrices. By combining the algebraic forms of two identical PBNs into a parallel system, a method to… ▽ More This paper studies the minimum observability of probabilistic Boolean networks (PBNs), the main objective of which is to add the fewest measurements to make an unobservable PBN become observable. First of all, the algebraic form of a PBN is established with the help of semi-tensor product (STP) of matrices. By combining the algebraic forms of two identical PBNs into a parallel system, a method to search the states that need to be H-distinguishable is proposed based on the robust set reachability technique. Secondly, a necessary and sufficient condition is given to find the minimum measurements such that a given set can be H-distinguishable. Moreover, by comparing the numbers of measurements for all the feasible H-distinguishable state sets, the least measurements that make the system observable are gained. Finally, an example is given to verify the validity of the obtained results. △ Less

Submitted 22 January, 2024; originally announced January 2024.

arXiv:2401.01165 [pdf, other]

Reinforcement Learning for SAR View Angle Inversion with Differentiable SAR Renderer

Authors: Yanni Wang, Hecheng Jia, Shilei Fu, Huiping Lin, Feng Xu

Abstract: The electromagnetic inverse problem has long been a research hotspot. This study aims to reverse radar view angles in synthetic aperture radar (SAR) images given a target model. Nonetheless, the scarcity of SAR data, combined with the intricate background interference and imaging mechanisms, limit the applications of existing learning-based approaches. To address these challenges, we propose an in… ▽ More The electromagnetic inverse problem has long been a research hotspot. This study aims to reverse radar view angles in synthetic aperture radar (SAR) images given a target model. Nonetheless, the scarcity of SAR data, combined with the intricate background interference and imaging mechanisms, limit the applications of existing learning-based approaches. To address these challenges, we propose an interactive deep reinforcement learning (DRL) framework, where an electromagnetic simulator named differentiable SAR render (DSR) is embedded to facilitate the interaction between the agent and the environment, simulating a human-like process of angle prediction. Specifically, DSR generates SAR images at arbitrary view angles in real-time. And the differences in sequential and semantic aspects between the view angle-corresponding images are leveraged to construct the state space in DRL, which effectively suppress the complex background interference, enhance the sensitivity to temporal variations, and improve the capability to capture fine-grained information. Additionally, in order to maintain the stability and convergence of our method, a series of reward mechanisms, such as memory difference, smoothing and boundary penalty, are utilized to form the final reward function. Extensive experiments performed on both simulated and real datasets demonstrate the effectiveness and robustness of our proposed method. When utilized in the cross-domain area, the proposed method greatly mitigates inconsistency between simulated and real domains, outperforming reference methods significantly. △ Less

Submitted 2 January, 2024; originally announced January 2024.

arXiv:2311.08878 [pdf, other]

Multi-objective Non-intrusive Hearing-aid Speech Assessment Model

Authors: Hsin-Tien Chiang, Szu-Wei Fu, Hsin-Min Wang, Yu Tsao, John H. L. Hansen

Abstract: Without the need for a clean reference, non-intrusive speech assessment methods have caught great attention for objective evaluations. While deep learning models have been used to develop non-intrusive speech assessment methods with promising results, there is limited research on hearing-impaired subjects. This study proposes a multi-objective non-intrusive hearing-aid speech assessment model, cal… ▽ More Without the need for a clean reference, non-intrusive speech assessment methods have caught great attention for objective evaluations. While deep learning models have been used to develop non-intrusive speech assessment methods with promising results, there is limited research on hearing-impaired subjects. This study proposes a multi-objective non-intrusive hearing-aid speech assessment model, called HASA-Net Large, which predicts speech quality and intelligibility scores based on input speech signals and specified hearing-loss patterns. Our experiments showed the utilization of pre-trained SSL models leads to a significant boost in speech quality and intelligibility predictions compared to using spectrograms as input. Additionally, we examined three distinct fine-tuning approaches that resulted in further performance improvements. Furthermore, we demonstrated that incorporating SSL models resulted in greater transferability to OOD dataset. Finally, this study introduces HASA-Net Large, which is a non-invasive approach for evaluating speech quality and intelligibility. HASA-Net Large utilizes raw waveforms and hearing-loss patterns to accurately predict speech quality and intelligibility levels for individuals with normal and impaired hearing and demonstrates superior prediction performance and transferability. △ Less

Submitted 15 November, 2023; originally announced November 2023.

arXiv:2309.12766 [pdf, other]

A Study on Incorporating Whisper for Robust Speech Assessment

Authors: Ryandhimas E. Zezario, Yu-Wen Chen, Szu-Wei Fu, Yu Tsao, Hsin-Min Wang, Chiou-Shann Fuh

Abstract: This research introduces an enhanced version of the multi-objective speech assessment model--MOSA-Net+, by leveraging the acoustic features from Whisper, a large-scaled weakly supervised model. We first investigate the effectiveness of Whisper in deploying a more robust speech assessment model. After that, we explore combining representations from Whisper and SSL models. The experimental results r… ▽ More This research introduces an enhanced version of the multi-objective speech assessment model--MOSA-Net+, by leveraging the acoustic features from Whisper, a large-scaled weakly supervised model. We first investigate the effectiveness of Whisper in deploying a more robust speech assessment model. After that, we explore combining representations from Whisper and SSL models. The experimental results reveal that Whisper's embedding features can contribute to more accurate prediction performance. Moreover, combining the embedding features from Whisper and SSL models only leads to marginal improvement. As compared to intrusive methods, MOSA-Net, and other SSL-based speech assessment models, MOSA-Net+ yields notable improvements in estimating subjective quality and intelligibility scores across all evaluation metrics in Taiwan Mandarin Hearing In Noise test - Quality & Intelligibility (TMHINT-QI) dataset. To further validate its robustness, MOSA-Net+ was tested in the noisy-and-enhanced track of the VoiceMOS Challenge 2023, where it obtained the top-ranked performance among nine systems. △ Less

Submitted 29 April, 2024; v1 submitted 22 September, 2023; originally announced September 2023.

Comments: Accepted to IEEE ICME 2024

arXiv:2307.04517 [pdf, other]

Study on the Correlation between Objective Evaluations and Subjective Speech Quality and Intelligibility

Authors: Hsin-Tien Chiang, Kuo-Hsuan Hung, Szu-Wei Fu, Heng-Cheng Kuo, Ming-Hsueh Tsai, Yu Tsao

Abstract: Subjective tests are the gold standard for evaluating speech quality and intelligibility; however, they are time-consuming and expensive. Thus, objective measures that align with human perceptions are crucial. This study evaluates the correlation between commonly used objective measures and subjective speech quality and intelligibility using a Chinese speech dataset. Moreover, new objective measur… ▽ More Subjective tests are the gold standard for evaluating speech quality and intelligibility; however, they are time-consuming and expensive. Thus, objective measures that align with human perceptions are crucial. This study evaluates the correlation between commonly used objective measures and subjective speech quality and intelligibility using a Chinese speech dataset. Moreover, new objective measures are proposed that combine current objective measures using deep learning techniques to predict subjective quality and intelligibility. The proposed deep learning model reduces the amount of training data without significantly affecting prediction performance. We analyzed the deep learning model to understand how objective measures reflect subjective quality and intelligibility. We also explored the impact of including subjective speech quality ratings on speech intelligibility prediction. Our findings offer valuable insights into the relationship between objective measures and human perceptions. △ Less

Submitted 10 October, 2023; v1 submitted 10 July, 2023; originally announced July 2023.

arXiv:2304.00658 [pdf, other]

Improving Meeting Inclusiveness using Speech Interruption Analysis

Authors: Szu-Wei Fu, Yaran Fan, Yasaman Hosseinkashi, Jayant Gupchup, Ross Cutler

Abstract: Meetings are a pervasive method of communication within all types of companies and organizations, and using remote collaboration systems to conduct meetings has increased dramatically since the COVID-19 pandemic. However, not all meetings are inclusive, especially in terms of the participation rates among attendees. In a recent large-scale survey conducted at Microsoft, the top suggestion given by… ▽ More Meetings are a pervasive method of communication within all types of companies and organizations, and using remote collaboration systems to conduct meetings has increased dramatically since the COVID-19 pandemic. However, not all meetings are inclusive, especially in terms of the participation rates among attendees. In a recent large-scale survey conducted at Microsoft, the top suggestion given by meeting participants for improving inclusiveness is to improve the ability of remote participants to interrupt and acquire the floor during meetings. We show that the use of the virtual raise hand (VRH) feature can lead to an increase in predicted meeting inclusiveness at Microsoft. One challenge is that VRH is used in less than 1% of all meetings. In order to drive adoption of its usage to improve inclusiveness (and participation), we present a machine learning-based system that predicts when a meeting participant attempts to obtain the floor, but fails to interrupt (termed a `failed interruption'). This prediction can be used to nudge the user to raise their virtual hand within the meeting. We believe this is the first failed speech interruption detector, and the performance on a realistic test set has an area under curve (AUC) of 0.95 with a true positive rate (TPR) of 50% at a false positive rate (FPR) of <1%. To our knowledge, this is also the first dataset of interruption categories (including the failed interruption category) for remote meetings. Finally, we believe this is the first such system designed to improve meeting inclusiveness through speech interruption analysis and active intervention. △ Less

Submitted 4 April, 2023; v1 submitted 2 April, 2023; originally announced April 2023.

arXiv:2303.13567 [pdf]

AI Models Close to your Chest: Robust Federated Learning Strategies for Multi-site CT

Authors: Edward H. Lee, Brendan Kelly, Emre Altinmakas, Hakan Dogan, Maryam Mohammadzadeh, Errol Colak, Steve Fu, Olivia Choudhury, Ujjwal Ratan, Felipe Kitamura, Hernan Chaves, Jimmy Zheng, Mourad Said, Eduardo Reis, Jaekwang Lim, Patricia Yokoo, Courtney Mitchell, Golnaz Houshmand, Marzyeh Ghassemi, Ronan Killeen, Wendy Qiu, Joel Hayden, Farnaz Rafiee, Chad Klochko, Nicholas Bevins , et al. (5 additional authors not shown)

Abstract: While it is well known that population differences from genetics, sex, race, and environmental factors contribute to disease, AI studies in medicine have largely focused on locoregional patient cohorts with less diverse data sources. Such limitation stems from barriers to large-scale data share and ethical concerns over data privacy. Federated learning (FL) is one potential pathway for AI developm… ▽ More While it is well known that population differences from genetics, sex, race, and environmental factors contribute to disease, AI studies in medicine have largely focused on locoregional patient cohorts with less diverse data sources. Such limitation stems from barriers to large-scale data share and ethical concerns over data privacy. Federated learning (FL) is one potential pathway for AI development that enables learning across hospitals without data share. In this study, we show the results of various FL strategies on one of the largest and most diverse COVID-19 chest CT datasets: 21 participating hospitals across five continents that comprise >10,000 patients with >1 million images. We also propose an FL strategy that leverages synthetically generated data to overcome class and size imbalances. We also describe the sources of data heterogeneity in the context of FL, and show how even among the correctly labeled populations, disparities can arise due to these biases. △ Less

Submitted 13 April, 2023; v1 submitted 23 March, 2023; originally announced March 2023.

arXiv:2205.07099 [pdf, other]

doi 10.1109/TIP.2022.3215069

Differentiable SAR Renderer and SAR Target Reconstruction

Authors: Shilei Fu, Feng Xu

Abstract: Forward modeling of wave scattering and radar imaging mechanisms is the key to information extraction from synthetic aperture radar (SAR) images. Like inverse graphics in optical domain, an inherently-integrated forward-inverse approach would be promising for SAR advanced information retrieval and target reconstruction. This paper presents such an attempt to the inverse graphics for SAR imagery. A… ▽ More Forward modeling of wave scattering and radar imaging mechanisms is the key to information extraction from synthetic aperture radar (SAR) images. Like inverse graphics in optical domain, an inherently-integrated forward-inverse approach would be promising for SAR advanced information retrieval and target reconstruction. This paper presents such an attempt to the inverse graphics for SAR imagery. A differentiable SAR renderer (DSR) is developed which reformulates the mapping and projection algorithm of SAR imaging mechanism in the differentiable form of probability maps. First-order gradients of the proposed DSR are then analytically derived which can be back-propagated from rendered image/silhouette to the target geometry and scattering attributes. A 3D inverse target reconstruction algorithm from SAR images is devised. Several simulation and reconstruction experiments are conducted, including targets with and without background, using both synthesized data or real measured inverse SAR (ISAR) data by ground radar. Results demonstrate the efficacy of the proposed DSR and its inverse approach. △ Less

Submitted 14 May, 2022; originally announced May 2022.

arXiv:2204.03339 [pdf, other]

Boosting Self-Supervised Embeddings for Speech Enhancement

Authors: Kuo-Hsuan Hung, Szu-wei Fu, Huan-Hsin Tseng, Hsin-Tien Chiang, Yu Tsao, Chii-Wann Lin

Abstract: Self-supervised learning (SSL) representation for speech has achieved state-of-the-art (SOTA) performance on several downstream tasks. However, there remains room for improvement in speech enhancement (SE) tasks. In this study, we used a cross-domain feature to solve the problem that SSL embeddings may lack fine-grained information to regenerate speech signals. By integrating the SSL representatio… ▽ More Self-supervised learning (SSL) representation for speech has achieved state-of-the-art (SOTA) performance on several downstream tasks. However, there remains room for improvement in speech enhancement (SE) tasks. In this study, we used a cross-domain feature to solve the problem that SSL embeddings may lack fine-grained information to regenerate speech signals. By integrating the SSL representation and spectrogram, the result can be significantly boosted. We further study the relationship between the noise robustness of SSL representation via clean-noisy distance (CN distance) and the layer importance for SE. Consequently, we found that SSL representations with lower noise robustness are more important. Furthermore, our experiments on the VCTK-DEMAND dataset demonstrated that fine-tuning an SSL representation with an SE model can outperform the SOTA SSL-based SE methods in PESQ, CSIG and COVL without invoking complicated network architectures. In later experiments, the CN distance in SSL embeddings was observed to increase after fine-tuning. These results verify our expectations and may help design SE-related SSL training in the future. △ Less

Submitted 5 July, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

Comments: accepted to INTERSPEECH-2022

arXiv:2204.03310 [pdf, other]

MTI-Net: A Multi-Target Speech Intelligibility Prediction Model

Authors: Ryandhimas E. Zezario, Szu-wei Fu, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao

Abstract: Recently, deep learning (DL)-based non-intrusive speech assessment models have attracted great attention. Many studies report that these DL-based models yield satisfactory assessment performance and good flexibility, but their performance in unseen environments remains a challenge. Furthermore, compared to quality scores, fewer studies elaborate deep learning models to estimate intelligibility sco… ▽ More Recently, deep learning (DL)-based non-intrusive speech assessment models have attracted great attention. Many studies report that these DL-based models yield satisfactory assessment performance and good flexibility, but their performance in unseen environments remains a challenge. Furthermore, compared to quality scores, fewer studies elaborate deep learning models to estimate intelligibility scores. This study proposes a multi-task speech intelligibility prediction model, called MTI-Net, for simultaneously predicting human and machine intelligibility measures. Specifically, given a speech utterance, MTI-Net is designed to predict human subjective listening test results and word error rate (WER) scores. We also investigate several methods that can improve the prediction performance of MTI-Net. First, we compare different features (including low-level features and embeddings from self-supervised learning (SSL) models) and prediction targets of MTI-Net. Second, we explore the effect of transfer learning and multi-tasking learning on training MTI-Net. Finally, we examine the potential advantages of fine-tuning SSL embeddings. Experimental results demonstrate the effectiveness of using cross-domain features, multi-task learning, and fine-tuning SSL embeddings. Furthermore, it is confirmed that the intelligibility and WER scores predicted by MTI-Net are highly correlated with the ground-truth scores. △ Less

Submitted 30 August, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

Comments: Accepted to Interspeech 2022

arXiv:2203.17152 [pdf, other]

Perceptual Contrast Stretching on Target Feature for Speech Enhancement

Authors: Rong Chao, Cheng Yu, Szu-Wei Fu, Xugang Lu, Yu Tsao

Abstract: Speech enhancement (SE) performance has improved considerably owing to the use of deep learning models as a base function. Herein, we propose a perceptual contrast stretching (PCS) approach to further improve SE performance. The PCS is derived based on the critical band importance function and is applied to modify the targets of the SE model. Specifically, the contrast of target features is stretc… ▽ More Speech enhancement (SE) performance has improved considerably owing to the use of deep learning models as a base function. Herein, we propose a perceptual contrast stretching (PCS) approach to further improve SE performance. The PCS is derived based on the critical band importance function and is applied to modify the targets of the SE model. Specifically, the contrast of target features is stretched based on perceptual importance, thereby improving the overall SE performance. Compared with post-processing-based implementations, incorporating PCS into the training phase preserves performance and reduces online computation. Notably, PCS can be combined with different SE model architectures and training criteria. Furthermore, PCS does not affect the causality or convergence of SE model training. Experimental results on the VoiceBank-DEMAND dataset show that the proposed method can achieve state-of-the-art performance on both causal (PESQ score = 3.07) and noncausal (PESQ score = 3.35) SE tasks. △ Less

Submitted 15 July, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

Comments: Accepted by Interspeech 2022

arXiv:2203.06306 [pdf, other]

DURRNet: Deep Unfolded Single Image Reflection Removal Network

Authors: Jun-Jie Huang, Tianrui Liu, Zhixiong Yang, Shaojing Fu, Wentao Zhao, Pier Luigi Dragotti

Abstract: Single image reflection removal problem aims to divide a reflection-contaminated image into a transmission image and a reflection image. It is a canonical blind source separation problem and is highly ill-posed. In this paper, we present a novel deep architecture called deep unfolded single image reflection removal network (DURRNet) which makes an attempt to combine the best features from model-ba… ▽ More Single image reflection removal problem aims to divide a reflection-contaminated image into a transmission image and a reflection image. It is a canonical blind source separation problem and is highly ill-posed. In this paper, we present a novel deep architecture called deep unfolded single image reflection removal network (DURRNet) which makes an attempt to combine the best features from model-based and learning-based paradigms and therefore leads to a more interpretable deep architecture. Specifically, we first propose a model-based optimization with transform-based exclusion prior and then design an iterative algorithm with simple closed-form solutions for solving each sub-problems. With the deep unrolling technique, we build the DURRNet with ProxNets to model natural image priors and ProxInvNets which are constructed with invertible networks to impose the exclusion prior. Comprehensive experimental results on commonly used datasets demonstrate that the proposed DURRNet achieves state-of-the-art results both visually and quantitatively. △ Less

Submitted 11 March, 2022; originally announced March 2022.

arXiv:2111.05703 [pdf, other]

OSSEM: one-shot speaker adaptive speech enhancement using meta learning

Authors: Cheng Yu, Szu-Wei Fu, Tsun-An Hsieh, Yu Tsao, Mirco Ravanelli

Abstract: Although deep learning (DL) has achieved notable progress in speech enhancement (SE), further research is still required for a DL-based SE system to adapt effectively and efficiently to particular speakers. In this study, we propose a novel meta-learning-based speaker-adaptive SE approach (called OSSEM) that aims to achieve SE model adaptation in a one-shot manner. OSSEM consists of a modified tra… ▽ More Although deep learning (DL) has achieved notable progress in speech enhancement (SE), further research is still required for a DL-based SE system to adapt effectively and efficiently to particular speakers. In this study, we propose a novel meta-learning-based speaker-adaptive SE approach (called OSSEM) that aims to achieve SE model adaptation in a one-shot manner. OSSEM consists of a modified transformer SE network and a speaker-specific masking (SSM) network. In practice, the SSM network takes an enrolled speaker embedding extracted using ECAPA-TDNN to adjust the input noisy feature through masking. To evaluate OSSEM, we designed a modified Voice Bank-DEMAND dataset, in which one utterance from the testing set was used for model adaptation, and the remaining utterances were used for testing the performance. Moreover, we set restrictions allowing the enhancement process to be conducted in real time, and thus designed OSSEM to be a causal SE system. Experimental results first show that OSSEM can effectively adapt a pretrained SE model to a particular speaker with only one utterance, thus yielding improved SE results. Meanwhile, OSSEM exhibits a competitive performance compared to state-of-the-art causal SE systems. △ Less

Submitted 10 November, 2021; originally announced November 2021.

arXiv:2111.04436 [pdf, other]

SEOFP-NET: Compression and Acceleration of Deep Neural Networks for Speech Enhancement Using Sign-Exponent-Only Floating-Points

Authors: Yu-Chen Lin, Cheng Yu, Yi-Te Hsu, Szu-Wei Fu, Yu Tsao, Tei-Wei Kuo

Abstract: Numerous compression and acceleration strategies have achieved outstanding results on classification tasks in various fields, such as computer vision and speech signal processing. Nevertheless, the same strategies have yielded ungratified performance on regression tasks because the nature between these and classification tasks differs. In this paper, a novel sign-exponent-only floating-point netwo… ▽ More Numerous compression and acceleration strategies have achieved outstanding results on classification tasks in various fields, such as computer vision and speech signal processing. Nevertheless, the same strategies have yielded ungratified performance on regression tasks because the nature between these and classification tasks differs. In this paper, a novel sign-exponent-only floating-point network (SEOFP-NET) technique is proposed to compress the model size and accelerate the inference time for speech enhancement, a regression task of speech signal processing. The proposed method compressed the sizes of deep neural network (DNN)-based speech enhancement models by quantizing the fraction bits of single-precision floating-point parameters during training. Before inference implementation, all parameters in the trained SEOFP-NET model are slightly adjusted to accelerate the inference time by replacing the floating-point multiplier with an integer-adder. For generalization, the SEOFP-NET technique is introduced to different speech enhancement tasks in speech signal processing with different model architectures under various corpora. The experimental results indicate that the size of SEOFP-NET models can be significantly compressed by up to 81.249% without noticeably downgrading their speech enhancement performance, and the inference time can be accelerated to 1.212x compared with the baseline models. The results also verify that the proposed SEOFP-NET can cooperate with other efficiency strategies to achieve a synergy effect for model compression. In addition, the just noticeable difference (JND) was applied to the user study experiment to statistically analyze the effect of speech enhancement on listening. The results indicate that the listeners cannot facilely differentiate between the enhanced speech signals processed by the baseline model and the proposed SEOFP-NET. △ Less

Submitted 8 November, 2021; originally announced November 2021.

arXiv:2111.02363 [pdf, other]

Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features

Authors: Ryandhimas E. Zezario, Szu-Wei Fu, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao

Abstract: In this study, we propose a cross-domain multi-objective speech assessment model called MOSA-Net, which can estimate multiple speech assessment metrics simultaneously. Experimental results show that MOSA-Net can improve the linear correlation coefficient (LCC) by 0.026 (0.990 vs 0.964 in seen noise environments) and 0.012 (0.969 vs 0.957 in unseen noise environments) in PESQ prediction, compared t… ▽ More In this study, we propose a cross-domain multi-objective speech assessment model called MOSA-Net, which can estimate multiple speech assessment metrics simultaneously. Experimental results show that MOSA-Net can improve the linear correlation coefficient (LCC) by 0.026 (0.990 vs 0.964 in seen noise environments) and 0.012 (0.969 vs 0.957 in unseen noise environments) in PESQ prediction, compared to Quality-Net, an existing single-task model for PESQ prediction, and improve LCC by 0.021 (0.985 vs 0.964 in seen noise environments) and 0.047 (0.836 vs 0.789 in unseen noise environments) in STOI prediction, compared to STOI-Net (based on CRNN), an existing single-task model for STOI prediction. Moreover, MOSA-Net, originally trained to assess objective scores, can be used as a pre-trained model to be effectively adapted to an assessment model for predicting subjective quality and intelligibility scores with a limited amount of training data. Experimental results show that MOSA-Net can improve LCC by 0.018 (0.805 vs 0.787) in MOS prediction, compared to MOS-SSL, a strong single-task model for MOS prediction. In light of the confirmed prediction capability, we further adopt the latent representations of MOSA-Net to guide the speech enhancement (SE) process and derive a quality-intelligibility (QI)-aware SE (QIA-SE) approach accordingly. Experimental results show that QIA-SE provides superior enhancement performance compared with the baseline SE system in terms of objective evaluation metrics and qualitative evaluation test. For example, QIA-SE can improve PESQ by 0.301 (2.953 vs 2.652 in seen noise environments) and 0.18 (2.658 vs 2.478 in unseen noise environments) over a CNN-based baseline SE model. △ Less

Submitted 23 June, 2022; v1 submitted 3 November, 2021; originally announced November 2021.

arXiv:2110.05866 [pdf]

MetricGAN-U: Unsupervised speech enhancement/ dereverberation based only on noisy/ reverberated speech

Authors: Szu-Wei Fu, Cheng Yu, Kuo-Hsuan Hung, Mirco Ravanelli, Yu Tsao

Abstract: Most of the deep learning-based speech enhancement models are learned in a supervised manner, which implies that pairs of noisy and clean speech are required during training. Consequently, several noisy speeches recorded in daily life cannot be used to train the model. Although certain unsupervised learning frameworks have also been proposed to solve the pair constraint, they still require clean s… ▽ More Most of the deep learning-based speech enhancement models are learned in a supervised manner, which implies that pairs of noisy and clean speech are required during training. Consequently, several noisy speeches recorded in daily life cannot be used to train the model. Although certain unsupervised learning frameworks have also been proposed to solve the pair constraint, they still require clean speech or noise for training. Therefore, in this paper, we propose MetricGAN-U, which stands for MetricGAN-unsupervised, to further release the constraint from conventional unsupervised learning. In MetricGAN-U, only noisy speech is required to train the model by optimizing non-intrusive speech quality metrics. The experimental results verified that MetricGAN-U outperforms baselines in both objective and subjective metrics. △ Less

Submitted 12 October, 2021; originally announced October 2021.

arXiv:2110.04084 [pdf, other]

doi 10.3390/photonics9120940

DeepGOMIMO: Deep Learning-Aided Generalized Optical MIMO with CSI-Free Blind Detection

Authors: Xin Zhong, Chen Chen, Shu Fu, Zhihong Zeng, Min Liu

Abstract: Generalized optical multiple-input multiple-output (GOMIMO) techniques have been recently shown to be promising for high-speed optical wireless communication (OWC) systems. In this paper, we propose a novel deep learning-aided GOMIMO (DeepGOMIMO) framework for GOMIMO systems, where channel state information (CSI)-free blind detection can be enabled by employing a specially designed deep neural net… ▽ More Generalized optical multiple-input multiple-output (GOMIMO) techniques have been recently shown to be promising for high-speed optical wireless communication (OWC) systems. In this paper, we propose a novel deep learning-aided GOMIMO (DeepGOMIMO) framework for GOMIMO systems, where channel state information (CSI)-free blind detection can be enabled by employing a specially designed deep neural network (DNN)-based MIMO detector. The CSI-free blind DNN detector mainly consists of two modules: one is the pre-processing module which is designed to address both the path loss and channel crosstalk issues caused by MIMO transmission, and the other is the feed-forward DNN module which is used for joint detection of spatial and constellation information by learning the statistics of both the input signal and the additive noise. Our simulation results clearly verify that, in a typical indoor 4 $\times$ 4 MIMO-OWC system using both generalized optical spatial modulation (GOSM) and generalized optical spatial multiplexing (GOSMP) with unipolar non-zero 4-ary pulse amplitude modulation (4-PAM) modulation, the proposed CSI-free blind DNN detector achieves near the same bit error rate (BER) performance as the optimal joint maximum-likelihood (ML) detector, but with much reduced computational complexity. Moreover, since the CSI-free blind DNN detector does not require instantaneous channel estimation to obtain accurate CSI, it enjoys the unique advantages of improved achievable data rate and reduced communication time delay in comparison to the CSI-based zero-forcing DNN (ZF-DNN) detector. △ Less

Submitted 8 October, 2021; originally announced October 2021.

arXiv:2106.12770 [pdf, other]

doi 10.1109/JPHOT.2021.3129541

Deep Learning-Aided OFDM-Based Generalized Optical Quadrature Spatial Modulation

Authors: Chen Chen, Lin Zeng, Xin Zhong, Shu Fu, Min Liu, Pengfei Du

Abstract: In this paper, we propose an orthogonal frequency division multiplexing (OFDM)-based generalized optical quadrature spatial modulation (GOQSM) technique for multiple-input multiple-output optical wireless communication (MIMO-OWC) systems. Considering the error propagation and noise amplification effects when applying maximum likelihood and maximum ratio combining (ML-MRC)-based detection, we furth… ▽ More In this paper, we propose an orthogonal frequency division multiplexing (OFDM)-based generalized optical quadrature spatial modulation (GOQSM) technique for multiple-input multiple-output optical wireless communication (MIMO-OWC) systems. Considering the error propagation and noise amplification effects when applying maximum likelihood and maximum ratio combining (ML-MRC)-based detection, we further propose a deep neural network (DNN)-aided detection for OFDM-based GOQSM systems. The proposed DNN-aided detection scheme performs the GOQSM detection in a joint manner, which can efficiently eliminate the adverse effects of both error propagation and noise amplification. The obtained simulation results successfully verify the superiority of the deep learning-aided OFDM-based GOQSM technique for high-speed MIMO-OWC systems. △ Less

Submitted 24 June, 2021; originally announced June 2021.

Journal ref: IEEE Photonics Journal, 2022

arXiv:2106.04624 [pdf, other]

SpeechBrain: A General-Purpose Speech Toolkit

Authors: Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, Ju-Chieh Chou, Sung-Lin Yeh, Szu-Wei Fu, Chien-Feng Liao, Elena Rastorgueva, François Grondin, William Aris, Hwidong Na, Yan Gao, Renato De Mori, Yoshua Bengio

Abstract: SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing… ▽ More SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing pipelines. SpeechBrain achieves competitive or state-of-the-art performance in a wide range of speech benchmarks. It also provides training recipes, pretrained models, and inference scripts for popular speech datasets, as well as tutorials which allow anyone with basic Python proficiency to familiarize themselves with speech technologies. △ Less

Submitted 8 June, 2021; originally announced June 2021.

Comments: Preprint

arXiv:2105.01259 [pdf, ps, other]

doi 10.1109/TWC.2021.3080578

Collaborative Multi-Resource Allocation in Terrestrial-Satellite Network Towards 6G

Authors: Shu Fu, Jie Gao, Lian Zhao

Abstract: Terrestrial-satellite networks are envisioned to play a significant role in the sixth-generation (6G) wireless networks. In such networks, hot air balloons are useful as they can relay the signals between satellites and ground stations. Most existing works assume that the hot air balloons are deployed at the same height with the same minimum elevation angle to the satellites, which may not be prac… ▽ More Terrestrial-satellite networks are envisioned to play a significant role in the sixth-generation (6G) wireless networks. In such networks, hot air balloons are useful as they can relay the signals between satellites and ground stations. Most existing works assume that the hot air balloons are deployed at the same height with the same minimum elevation angle to the satellites, which may not be practical due to possible route conflict with airplanes and other flight equipment. In this paper, we consider a TSN containing hot air balloons at different heights and with different minimum elevation angles, which creates the challenge of non-uniform available serving time for the communication between the hot air balloons and the satellites. Jointly considering the caching, computing, and communication (3C) resource management for both the ground-balloon-satellite links and inter-satellite laser links, our objective is to maximize the network energy efficiency. Firstly, by proposing a tapped water-filling algorithm, we schedule the traffic to relay among satellites according to the available serving time of satellites. Then, we generate a series of configuration matrices, based on which we formulate the relationship of relay time and the power consumption involved in the relay among satellites. Finally, the integrated system model of TSN is built and solved by geometric programming with Taylor series approximation. Simulation results demonstrate the effectiveness of our proposed scheme. △ Less

Submitted 3 May, 2021; originally announced May 2021.

Journal ref: IEEE Transactions on Wireless Communications, 2021

arXiv:2104.07539 [pdf, other]

Multi-Agent Reinforcement Learning Based Coded Computation for Mobile Ad Hoc Computing

Authors: Baoqian Wang, Junfei Xie, Kejie Lu, Yan Wan, Shengli Fu

Abstract: Mobile ad hoc computing (MAHC), which allows mobile devices to directly share their computing resources, is a promising solution to address the growing demands for computing resources required by mobile devices. However, offloading a computation task from a mobile device to other mobile devices is a challenging task due to frequent topology changes and link failures because of node mobility, unsta… ▽ More Mobile ad hoc computing (MAHC), which allows mobile devices to directly share their computing resources, is a promising solution to address the growing demands for computing resources required by mobile devices. However, offloading a computation task from a mobile device to other mobile devices is a challenging task due to frequent topology changes and link failures because of node mobility, unstable and unknown communication environments, and the heterogeneous nature of these devices. To address these challenges, in this paper, we introduce a novel coded computation scheme based on multi-agent reinforcement learning (MARL), which has many promising features such as adaptability to network changes, high efficiency and robustness to uncertain system disturbances, consideration of node heterogeneity, and decentralized load allocation. Comprehensive simulation studies demonstrate that the proposed approach can outperform state-of-the-art distributed computing schemes. △ Less

Submitted 15 April, 2021; originally announced April 2021.

arXiv:2104.03538 [pdf]

MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement

Authors: Szu-Wei Fu, Cheng Yu, Tsun-An Hsieh, Peter Plantinga, Mirco Ravanelli, Xugang Lu, Yu Tsao

Abstract: The discrepancy between the cost function used for training a speech enhancement model and human auditory perception usually makes the quality of enhanced speech unsatisfactory. Objective evaluation metrics which consider human perception can hence serve as a bridge to reduce the gap. Our previously proposed MetricGAN was designed to optimize objective metrics by connecting the metric with a discr… ▽ More The discrepancy between the cost function used for training a speech enhancement model and human auditory perception usually makes the quality of enhanced speech unsatisfactory. Objective evaluation metrics which consider human perception can hence serve as a bridge to reduce the gap. Our previously proposed MetricGAN was designed to optimize objective metrics by connecting the metric with a discriminator. Because only the scores of the target evaluation functions are needed during training, the metrics can even be non-differentiable. In this study, we propose a MetricGAN+ in which three training techniques incorporating domain-knowledge of speech processing are proposed. With these techniques, experimental results on the VoiceBank-DEMAND dataset show that MetricGAN+ can increase PESQ score by 0.3 compared to the previous MetricGAN and achieve state-of-the-art results (PESQ score = 3.15). △ Less

Submitted 4 June, 2021; v1 submitted 8 April, 2021; originally announced April 2021.

Comments: Accepted by Interspeech 2021

arXiv:2103.12954 [pdf, ps, other]

Convergence Analysis of Nonconvex Distributed Stochastic Zeroth-order Coordinate Method

Authors: Shengjun Zhang, Yunlong Dong, Dong Xie, Lisha Yao, Colleen P. Bailey, Shengli Fu

Abstract: This paper investigates the stochastic distributed nonconvex optimization problem of minimizing a global cost function formed by the summation of $n$ local cost functions. We solve such a problem by involving zeroth-order (ZO) information exchange. In this paper, we propose a ZO distributed primal-dual coordinate method (ZODIAC) to solve the stochastic optimization problem. Agents approximate thei… ▽ More This paper investigates the stochastic distributed nonconvex optimization problem of minimizing a global cost function formed by the summation of $n$ local cost functions. We solve such a problem by involving zeroth-order (ZO) information exchange. In this paper, we propose a ZO distributed primal-dual coordinate method (ZODIAC) to solve the stochastic optimization problem. Agents approximate their own local stochastic ZO oracle along with coordinates with an adaptive smoothing parameter. We show that the proposed algorithm achieves the convergence rate of $\mathcal{O}(\sqrt{p}/\sqrt{T})$ for general nonconvex cost functions. We demonstrate the efficiency of proposed algorithms through a numerical example in comparison with the existing state-of-the-art centralized and distributed ZO algorithms. △ Less

Submitted 13 October, 2021; v1 submitted 23 March, 2021; originally announced March 2021.

arXiv:2011.04292 [pdf]

STOI-Net: A Deep Learning based Non-Intrusive Speech Intelligibility Assessment Model

Authors: Ryandhimas E. Zezario, Szu-Wei Fu, Chiou-Shann Fuh, Yu Tsao, Hsin-Min Wang

Abstract: The calculation of most objective speech intelligibility assessment metrics requires clean speech as a reference. Such a requirement may limit the applicability of these metrics in real-world scenarios. To overcome this limitation, we propose a deep learning-based non-intrusive speech intelligibility assessment model, namely STOI-Net. The input and output of STOI-Net are speech spectral features a… ▽ More The calculation of most objective speech intelligibility assessment metrics requires clean speech as a reference. Such a requirement may limit the applicability of these metrics in real-world scenarios. To overcome this limitation, we propose a deep learning-based non-intrusive speech intelligibility assessment model, namely STOI-Net. The input and output of STOI-Net are speech spectral features and predicted STOI scores, respectively. The model is formed by the combination of a convolutional neural network and bidirectional long short-term memory (CNN-BLSTM) architecture with a multiplicative attention mechanism. Experimental results show that the STOI score estimated by STOI-Net has a good correlation with the actual STOI score when tested with noisy and enhanced speech utterances. The correlation values are 0.97 and 0.83, respectively, for the seen test condition (the test speakers and noise types are involved in the training set) and the unseen test condition (the test speakers and noise types are not involved in the training set). The results confirm the capability of STOI-Net to accurately predict the STOI scores without referring to clean speech. △ Less

Submitted 9 November, 2020; originally announced November 2020.

Comments: Accepted in APSIPA 2020

arXiv:2010.15174 [pdf, other]

Improving Perceptual Quality by Phone-Fortified Perceptual Loss using Wasserstein Distance for Speech Enhancement

Authors: Tsun-An Hsieh, Cheng Yu, Szu-Wei Fu, Xugang Lu, Yu Tsao

Abstract: Speech enhancement (SE) aims to improve speech quality and intelligibility, which are both related to a smooth transition in speech segments that may carry linguistic information, e.g. phones and syllables. In this study, we propose a novel phone-fortified perceptual loss (PFPL) that takes phonetic information into account for training SE models. To effectively incorporate the phonetic information… ▽ More Speech enhancement (SE) aims to improve speech quality and intelligibility, which are both related to a smooth transition in speech segments that may carry linguistic information, e.g. phones and syllables. In this study, we propose a novel phone-fortified perceptual loss (PFPL) that takes phonetic information into account for training SE models. To effectively incorporate the phonetic information, the PFPL is computed based on latent representations of the wav2vec model, a powerful self-supervised encoder that renders rich phonetic information. To more accurately measure the distribution distances of the latent representations, the PFPL adopts the Wasserstein distance as the distance measure. Our experimental results first reveal that the PFPL is more correlated with the perceptual evaluation metrics, as compared to signal-level losses. Moreover, the results showed that the PFPL can enable a deep complex U-Net SE model to achieve highly competitive performance in terms of standardized quality and intelligibility evaluations on the Voice Bank-DEMAND dataset. △ Less

Submitted 27 April, 2021; v1 submitted 28 October, 2020; originally announced October 2020.

arXiv:2009.11975 [pdf, other]

CoFF: Cooperative Spatial Feature Fusion for 3D Object Detection on Autonomous Vehicles

Authors: Jingda Guo, Dominic Carrillo, Sihai Tang, Qi Chen, Qing Yang, Song Fu, Xi Wang, Nannan Wang, Paparao Palacharla

Abstract: To reduce the amount of transmitted data, feature map based fusion is recently proposed as a practical solution to cooperative 3D object detection by autonomous vehicles. The precision of object detection, however, may require significant improvement, especially for objects that are far away or occluded. To address this critical issue for the safety of autonomous vehicles and human beings, we prop… ▽ More To reduce the amount of transmitted data, feature map based fusion is recently proposed as a practical solution to cooperative 3D object detection by autonomous vehicles. The precision of object detection, however, may require significant improvement, especially for objects that are far away or occluded. To address this critical issue for the safety of autonomous vehicles and human beings, we propose a cooperative spatial feature fusion (CoFF) method for autonomous vehicles to effectively fuse feature maps for achieving a higher 3D object detection performance. Specially, CoFF differentiates weights among feature maps for a more guided fusion, based on how much new semantic information is provided by the received feature maps. It also enhances the inconspicuous features corresponding to far/occluded objects to improve their detection precision. Experimental results show that CoFF achieves a significant improvement in terms of both detection precision and effective detection range for autonomous vehicles, compared to previous feature fusion solutions. △ Less

Submitted 24 September, 2020; originally announced September 2020.

arXiv:2008.09264 [pdf, other]

doi 10.1109/ACCESS.2022.3153469

CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application

Authors: Yu-Wen Chen, Kuo-Hsuan Hung, You-Jin Li, Alexander Chao-Fu Kang, Ya-Hsin Lai, Kai-Chun Liu, Szu-Wei Fu, Syu-Siang Wang, Yu Tsao

Abstract: This study presents a deep learning-based speech signal-processing mobile application known as CITISEN. The CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC), allowing CITISEN to be used as a platform for utilizing and evaluating SE models and flexibly extend the models to address various noise environments and users. For SE, a… ▽ More This study presents a deep learning-based speech signal-processing mobile application known as CITISEN. The CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC), allowing CITISEN to be used as a platform for utilizing and evaluating SE models and flexibly extend the models to address various noise environments and users. For SE, a pretrained SE model downloaded from the cloud server is used to effectively reduce noise components from instant or saved recordings provided by users. For encountering unseen noise or speaker environments, the MA function is applied to promote CITISEN. A few audio samples recording on a noisy environment are uploaded and used to adapt the pretrained SE model on the server. Finally, for BNC, CITISEN first removes the background noises through an SE model and then mixes the processed speech with new background noise. The novel BNC function can evaluate SE performance under specific conditions, cover people's tracks, and provide entertainment. The experimental results confirmed the effectiveness of SE, MA, and BNC functions. Compared with the noisy speech signals, the enhanced speech signals achieved about 6\% and 33\% of improvements, respectively, in terms of short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ). With MA, the STOI and PESQ could be further improved by approximately 6\% and 11\%, respectively. Finally, the BNC experiment results indicated that the speech signals converted from noisy and silent backgrounds have a close scene identification accuracy and similar embeddings in an acoustic scene classification model. Therefore, the proposed BNC can effectively convert the background noise of a speech signal and be a data augmentation method when clean speech signals are unavailable. △ Less

Submitted 25 April, 2022; v1 submitted 20 August, 2020; originally announced August 2020.

arXiv:2006.11139 [pdf, other]

Waveform-based Voice Activity Detection Exploiting Fully Convolutional networks with Multi-Branched Encoders

Authors: Cheng Yu, Kuo-Hsuan Hung, I-Fan Lin, Szu-Wei Fu, Yu Tsao, Jeih-weih Hung

Abstract: In this study, we propose an encoder-decoder structured system with fully convolutional networks to implement voice activity detection (VAD) directly on the time-domain waveform. The proposed system processes the input waveform to identify its segments to be either speech or non-speech. This novel waveform-based VAD algorithm, with a short-hand notation "WVAD", has two main particularities. First,… ▽ More In this study, we propose an encoder-decoder structured system with fully convolutional networks to implement voice activity detection (VAD) directly on the time-domain waveform. The proposed system processes the input waveform to identify its segments to be either speech or non-speech. This novel waveform-based VAD algorithm, with a short-hand notation "WVAD", has two main particularities. First, as compared to most conventional VAD systems that use spectral features, raw-waveforms employed in WVAD contain more comprehensive information and thus are supposed to facilitate more accurate speech/non-speech predictions. Second, based on the multi-branched architecture, WVAD can be extended by using an ensemble of encoders, referred to as WEVAD, that incorporate multiple attribute information in utterances, and thus can yield better VAD performance for specified acoustic conditions. We evaluated the presented WVAD and WEVAD for the VAD task in two datasets: First, the experiments conducted on AURORA2 reveal that WVAD outperforms many state-of-the-art VAD algorithms. Next, the TMHINT task confirms that through combining multiple attributes in utterances, WEVAD behaves even better than WVAD. △ Less

Submitted 19 June, 2020; originally announced June 2020.

arXiv:2006.10296 [pdf]

Boosting Objective Scores of a Speech Enhancement Model by MetricGAN Post-processing

Authors: Szu-Wei Fu, Chien-Feng Liao, Tsun-An Hsieh, Kuo-Hsuan Hung, Syu-Siang Wang, Cheng Yu, Heng-Cheng Kuo, Ryandhimas E. Zezario, You-Jin Li, Shang-Yi Chuang, Yen-Ju Lu, Yu Tsao

Abstract: The Transformer architecture has demonstrated a superior ability compared to recurrent neural networks in many different natural language processing applications. Therefore, our study applies a modified Transformer in a speech enhancement task. Specifically, positional encoding in the Transformer may not be necessary for speech enhancement, and hence, it is replaced by convolutional layers. To fur… ▽ More The Transformer architecture has demonstrated a superior ability compared to recurrent neural networks in many different natural language processing applications. Therefore, our study applies a modified Transformer in a speech enhancement task. Specifically, positional encoding in the Transformer may not be necessary for speech enhancement, and hence, it is replaced by convolutional layers. To further improve the perceptual evaluation of the speech quality (PESQ) scores of enhanced speech, the L_1 pre-trained Transformer is fine-tuned using a MetricGAN framework. The proposed MetricGAN can be treated as a general post-processing module to further boost the objective scores of interest. The experiments were conducted using the data sets provided by the organizer of the Deep Noise Suppression (DNS) challenge. Experimental results demonstrated that the proposed system outperformed the challenge baseline, in both subjective and objective evaluations, with a large margin. △ Less

Submitted 3 March, 2021; v1 submitted 18 June, 2020; originally announced June 2020.

Comments: Accepted by APSIPA 2020

arXiv:2005.10104 [pdf, other]

doi 10.1109/TCOMM.2021.3051912

NOMA for Energy-Efficient LiFi-Enabled Bidirectional IoT Communication

Authors: Chen Chen, Shu Fu, Xin Jian, Min Liu, Xiong Deng, Zhiguo Ding

Abstract: In this paper, we consider a light fidelity (LiFi)-enabled bidirectional Internet of Things (IoT) communication system, where visible light and infrared light are used in the downlink and uplink, respectively. In order to improve the energy efficiency (EE) of the bidirectional LiFi-IoT system, non-orthogonal multiple access (NOMA) with a quality-of-service (QoS)-guaranteed optimal power allocation… ▽ More In this paper, we consider a light fidelity (LiFi)-enabled bidirectional Internet of Things (IoT) communication system, where visible light and infrared light are used in the downlink and uplink, respectively. In order to improve the energy efficiency (EE) of the bidirectional LiFi-IoT system, non-orthogonal multiple access (NOMA) with a quality-of-service (QoS)-guaranteed optimal power allocation (OPA) strategy is applied to maximize the EE of the system. We derive a closed-form OPA set based on the identification of the optimal decoding orders in both downlink and uplink channels, which can enable low-complexity power allocation. Moreover, we propose an adaptive channel and QoS-based user pairing approach by jointly considering users' channel gains and QoS requirements. We further analyze the EE of the bidirectional LiFi-IoT system and the user outage probabilities (UOPs) of both downlink and uplink channels of the system. Extensive analytical and simulation results demonstrate the superiority of NOMA with OPA in comparison to orthogonal multiple access (OMA) and NOMA with typical channel-based power allocation strategies. It is also shown that the proposed adaptive channel and QoS-based user pairing approach greatly outperforms individual channel/QoS-based approaches, especially when users have diverse QoS requirements. △ Less

Submitted 24 May, 2020; v1 submitted 20 May, 2020; originally announced May 2020.

Journal ref: IEEE Transactions on Communications, 2021

arXiv:2004.00932 [pdf, other]

iMetricGAN: Intelligibility Enhancement for Speech-in-Noise using Generative Adversarial Network-based Metric Learning

Authors: Haoyu Li, Szu-Wei Fu, Yu Tsao, Junichi Yamagishi

Abstract: The intelligibility of natural speech is seriously degraded when exposed to adverse noisy environments. In this work, we propose a deep learning-based speech modification method to compensate for the intelligibility loss, with the constraint that the root mean square (RMS) level and duration of the speech signal are maintained before and after modifications. Specifically, we utilize an iMetricGAN… ▽ More The intelligibility of natural speech is seriously degraded when exposed to adverse noisy environments. In this work, we propose a deep learning-based speech modification method to compensate for the intelligibility loss, with the constraint that the root mean square (RMS) level and duration of the speech signal are maintained before and after modifications. Specifically, we utilize an iMetricGAN approach to optimize the speech intelligibility metrics with generative adversarial networks (GANs). Experimental results show that the proposed iMetricGAN outperforms conventional state-of-the-art algorithms in terms of objective measures, i.e., speech intelligibility in bits (SIIB) and extended short-time objective intelligibility (ESTOI), under a Cafeteria noise condition. In addition, formal listening tests reveal significant intelligibility gains when both noise and reverberation exist. △ Less

Submitted 7 April, 2020; v1 submitted 2 April, 2020; originally announced April 2020.

Comments: 5 pages, Submitted to INTERSPEECH 2020

arXiv:2003.00451 [pdf]

Weak Texture Information Map Guided Image Super-resolution with Deep Residual Networks

Authors: Bo Fu, Liyan Wang, Yuechu Wu, Yufeng Wu, Shilin Fu, Yonggong Ren

Abstract: Single image super-resolution (SISR) is an image processing task which obtains high-resolution (HR) image from a low-resolution (LR) image. Recently, due to the capability in feature extraction, a series of deep learning methods have brought important crucial improvement for SISR. However, we observe that no matter how deeper the networks are designed, they usually do not have good generalization… ▽ More Single image super-resolution (SISR) is an image processing task which obtains high-resolution (HR) image from a low-resolution (LR) image. Recently, due to the capability in feature extraction, a series of deep learning methods have brought important crucial improvement for SISR. However, we observe that no matter how deeper the networks are designed, they usually do not have good generalization ability, which leads to the fact that almost all of existing SR methods have poor performances on restoration of the weak texture details. To solve these problems, we propose a weak texture information map guided image super-resolution with deep residual networks. It contains three sub-networks, one main network which extracts the main features and fuses weak texture details, another two auxiliary networks extract the weak texture details fallen in the main network. Two part of networks work cooperatively, the auxiliary networks predict and integrates week texture information into the main network, which is conducive to the main network learning more inconspicuous details. Experiments results demonstrate that our method's performs achieve the state-of-the-art quantitatively. Specifically, the image super-resolution results of our method own more weak texture details. △ Less

Submitted 18 March, 2020; v1 submitted 1 March, 2020; originally announced March 2020.

arXiv:2002.01255 [pdf, other]

Revealing Much While Saying Less: Predictive Wireless for Status Update

Authors: Zhiyuan Jiang, Zixu Cao, Siyu Fu, Fei Peng, Shan Cao, Shunqing Zhang, Shugong Xu

Abstract: Wireless communications for status update are becoming increasingly important, especially for machine-type control applications. Existing work has been mainly focused on Age of Information (AoI) optimizations. In this paper, a status-aware predictive wireless interface design, networking and implementation are presented which aim to minimize the status recovery error of a wireless networked system… ▽ More Wireless communications for status update are becoming increasingly important, especially for machine-type control applications. Existing work has been mainly focused on Age of Information (AoI) optimizations. In this paper, a status-aware predictive wireless interface design, networking and implementation are presented which aim to minimize the status recovery error of a wireless networked system by leveraging online status model predictions. Two critical issues of predictive status update are addressed: practicality and usefulness. Link-level experiments on a Software-Defined-Radio (SDR) testbed are conducted and test results show that the proposed design can significantly reduce the number of wireless transmissions while maintaining a low status recovery error. A Status-aware Multi-Agent Reinforcement learning neTworking solution (SMART) is proposed to dynamically and autonomously control the transmit decisions of devices in an ad hoc network based on their individual statuses. System-level simulations of a multi dense platooning scenario are carried out on a road traffic simulator. Results show that the proposed schemes can greatly improve the platooning control performance in terms of the minimum safe distance between successive vehicles, in comparison with the AoI-optimized status-unaware and communication latency-optimized schemes---this demonstrates the usefulness of our proposed status update schemes in a real-world application. △ Less

Submitted 4 February, 2020; originally announced February 2020.

Comments: To appear in IEEE INFOCOM 2020

arXiv:2001.03030 [pdf]

Distributed Brillouin frequency shift extraction via a convolutional neural network

Authors: Yiqing Chang, Hao Wu, Can Zhao, Li Shen, Songnian Fu, Ming Tang

Abstract: Distributed optical fiber Brillouin sensors detect the temperature and strain along a fiber according to the local Brillouin frequency shift, which is usually calculated by the measured Brillouin spectrum using Lorentzian curve fitting. In addition, cross-correlation, principal component analysis, and machine learning methods have been proposed for the more efficient extraction of Brillouin freque… ▽ More Distributed optical fiber Brillouin sensors detect the temperature and strain along a fiber according to the local Brillouin frequency shift, which is usually calculated by the measured Brillouin spectrum using Lorentzian curve fitting. In addition, cross-correlation, principal component analysis, and machine learning methods have been proposed for the more efficient extraction of Brillouin frequency shifts. However, existing methods only process the Brillouin spectrum individually, ignoring the correlation in the time domain, indicating that there is still room for improvement. Here, we propose and experimentally demonstrate a full convolution neural network to extract the distributed Brillouin frequency shift directly from the measured two-dimensional data. Simulated ideal Brillouin spectrum with various parameters are used to train the network. Both the simulation and experimental results show that the extraction accuracy of the network is better than that of the traditional curve fitting algorithm with a much shorter processing time. This network has good universality and robustness and can effectively improve the performances of existing Brillouin sensors. △ Less

Submitted 9 January, 2020; originally announced January 2020.

arXiv:1912.01319 [pdf, other]

AI-Assisted Low Information Latency Wireless Networking

Authors: Zhiyuan Jiang, Siyu Fu, Sheng Zhou, Zhisheng Niu, Shunqing Zhang, Shugong Xu

Abstract: The 5G Phase-2 and beyond wireless systems will focus more on vertical applications such as autonomous driving and industrial Internet-of-things, many of which are categorized as ultra-Reliable Low-Latency Communications (uRLLC). In this article, an alternative view on uRLLC is presented, that information latency, which measures the distortion of information resulted from time lag of its acquisiti… ▽ More The 5G Phase-2 and beyond wireless systems will focus more on vertical applications such as autonomous driving and industrial Internet-of-things, many of which are categorized as ultra-Reliable Low-Latency Communications (uRLLC). In this article, an alternative view on uRLLC is presented, that information latency, which measures the distortion of information resulted from time lag of its acquisition process, is more relevant than conventional communication latency of uRLLC in wireless networked control systems. An AI-assisted Situationally-aware Multi-Agent Reinforcement learning framework for wireless neTworks (SMART) is presented to address the information latency optimization challenge. Case studies of typical applications in Autonomous Driving (AD) are demonstrated, i.e., dense platooning and intersection management, which show that SMART can effectively optimize information latency, and more importantly, information latency-optimized systems outperform conventional uRLLC-oriented systems significantly in terms of AD performance such as traffic efficiency, thus pointing out a new research and system design paradigm. △ Less

Submitted 3 December, 2019; originally announced December 2019.

Comments: To appear in IEEE Wireless Communications

arXiv:1912.01080 [pdf, other]

Low-Latency High-Level Data Sharing for Connected and Autonomous Vehicular Networks

Authors: Qi Chen, Sihai Tang, Jacob Hochstetler, Jingda Guo, Yuan Li, Jinbo Xiong, Qing Yang, Song Fu

Abstract: Autonomous vehicles can combine their own data with that of other vehicles to enhance their perceptive ability, and thus improve detection accuracy and driving safety. Data sharing among autonomous vehicles, however, is a challenging problem due to the sheer volume of data generated by various types of sensors on the vehicles. In this paper, we propose a low-latency, high-level (L3) data sharing p… ▽ More Autonomous vehicles can combine their own data with that of other vehicles to enhance their perceptive ability, and thus improve detection accuracy and driving safety. Data sharing among autonomous vehicles, however, is a challenging problem due to the sheer volume of data generated by various types of sensors on the vehicles. In this paper, we propose a low-latency, high-level (L3) data sharing protocol for connected and autonomous vehicular networks. Based on the L3 protocol, sensing results generated by individual vehicles will be broadcasted simultaneously within a limited sensing zone. The L3 protocol reduces the networking latency by taking advantage of the capture effect of the wireless transmissions occurred among vehicles. With the proposed design principles, we implement and test the L3 protocol in a simulated environment. Simulation results demonstrate that the L3 protocol is able to achieve reliable and fast data sharing among autonomous vehicles. △ Less

Submitted 2 December, 2019; originally announced December 2019.

arXiv:1911.09847 [pdf, ps, other]

doi 10.1109/LSP.2020.3000968

Time-Domain Multi-modal Bone/air Conducted Speech Enhancement

Authors: Cheng Yu, Kuo-Hsuan Hung, Syu-Siang Wang, Szu-Wei Fu, Yu Tsao, Jeih-weih Hung

Abstract: Previous studies have proven that integrating video signals, as a complementary modality, can facilitate improved performance for speech enhancement (SE). However, video clips usually contain large amounts of data and pose a high cost in terms of computational resources and thus may complicate the SE system. As an alternative source, a bone-conducted speech signal has a moderate data size while ma… ▽ More Previous studies have proven that integrating video signals, as a complementary modality, can facilitate improved performance for speech enhancement (SE). However, video clips usually contain large amounts of data and pose a high cost in terms of computational resources and thus may complicate the SE system. As an alternative source, a bone-conducted speech signal has a moderate data size while manifesting speech-phoneme structures, and thus complements its air-conducted counterpart. In this study, we propose a novel multi-modal SE structure in the time domain that leverages bone- and air-conducted signals. In addition, we examine two ensemble-learning-based strategies, early fusion (EF) and late fusion (LF), to integrate the two types of speech signals, and adopt a deep learning-based fully convolutional network to conduct the enhancement. The experiment results on the Mandarin corpus indicate that this newly presented multi-modal (integrating bone- and air-conducted signals) SE structure significantly outperforms the single-source SE counterparts (with a bone- or air-conducted signal only) in various speech evaluation metrics. In addition, the adoption of an LF strategy other than an EF in this novel SE multi-modal structure achieves better results. △ Less

Submitted 17 June, 2020; v1 submitted 21 November, 2019; originally announced November 2019.

Comments: multi-modal, bone/air-conducted signals, speech enhancement, fully convolutional network

Journal ref: IEEE Signal Processing Letters, vol. 27, pp. 1035-1039, 2020

arXiv:1910.06585 [pdf, other]

Hybrid Beamforming/Combining for Millimeter Wave MIMO: A Machine Learning Approach

Authors: Jiyun Tao, Jing Xing, Jienan Chen, Chuan Zhang, Shengli Fu

Abstract: Hybrid beamforming (HB) has emerged as a promising technology to support ultra high transmission capacity and with low complexity for Millimeter Wave (mmWave) multiple-input and multiple-output (MIMO) system. However, the design of digital and analog beamformer is a challenge task with non-convex optimization, especially for the multi-user scenario. Recently, the blooming of deep learning research… ▽ More Hybrid beamforming (HB) has emerged as a promising technology to support ultra high transmission capacity and with low complexity for Millimeter Wave (mmWave) multiple-input and multiple-output (MIMO) system. However, the design of digital and analog beamformer is a challenge task with non-convex optimization, especially for the multi-user scenario. Recently, the blooming of deep learning research provides a new vision for the signal processing of communication system. In this work, we propose a deep neural network based HB for the multi-User mmWave massive MIMO system, referred as DNHB. The HB system is formulated as an autoencoder neural network, which is trained in a style of end-to-end self-supervised learning. With the strong representation capability of deep neural network, the proposed DNHB exhibits superior performance than the traditional linear processing methods. According to the simulation results, DNHB outperforms about 2 dB in terms of bit error rate (BER) performance compared with existing methods. △ Less

Submitted 3 January, 2020; v1 submitted 15 October, 2019; originally announced October 2019.

Comments: 5 pages, 4 figures

arXiv:1909.11919 [pdf, other]

A Study of Joint Effect on Denoising Techniques and Visual Cues to Improve Speech Intelligibility in Cochlear Implant Simulation

Authors: Rung-Yu Tseng, Tao-Wei Wang, Szu-Wei Fu, Chia-Ying Lee, Yu Tsao

Abstract: Speech perception is key to verbal communication. For people with hearing loss, the capability to recognize speech is restricted, particularly in a noisy environment or the situations without visual cues, such as lip-reading unavailable via phone call. This study aimed to understand the improvement of vocoded speech intelligibility in cochlear implant (CI) simulation through two potential methods:… ▽ More Speech perception is key to verbal communication. For people with hearing loss, the capability to recognize speech is restricted, particularly in a noisy environment or the situations without visual cues, such as lip-reading unavailable via phone call. This study aimed to understand the improvement of vocoded speech intelligibility in cochlear implant (CI) simulation through two potential methods: Speech Enhancement (SE) and Audiovisual Integration. A fully convolutional neural network (FCN) using an intelligibility-oriented objective function was recently proposed and proven to effectively facilitate the speech intelligibility as an advanced denoising SE approach. Furthermore, audiovisual integration is reported to supply better speech comprehension compared to audio-only information. An experiment was designed to test speech intelligibility using tone-vocoded speech in CI simulation with a group of normal-hearing listeners. Experimental results confirmed the effectiveness of the FCN-based denoising SE and audiovisual integration on vocoded speech. Also, it positively recommended that these two methods could become a blended feature in a CI processor to improve the speech intelligibility for CI users under noisy conditions. △ Less

Submitted 18 December, 2020; v1 submitted 26 September, 2019; originally announced September 2019.

arXiv:1909.11912 [pdf]

Improving the Intelligibility of Electric and Acoustic Stimulation Speech Using Fully Convolutional Networks Based Speech Enhancement

Authors: Natalie Yu-Hsien Wang, Hsiao-Lan Sharon Wang, Tao-Wei Wang, Szu-Wei Fu, Xugan Lu, Yu Tsao, Hsin-Min Wang

Abstract: The combined electric and acoustic stimulation (EAS) has demonstrated better speech recognition than conventional cochlear implant (CI) and yielded satisfactory performance under quiet conditions. However, when noise signals are involved, both the electric signal and the acoustic signal may be distorted, thereby resulting in poor recognition performance. To suppress noise effects, speech enhanceme… ▽ More The combined electric and acoustic stimulation (EAS) has demonstrated better speech recognition than conventional cochlear implant (CI) and yielded satisfactory performance under quiet conditions. However, when noise signals are involved, both the electric signal and the acoustic signal may be distorted, thereby resulting in poor recognition performance. To suppress noise effects, speech enhancement (SE) is a necessary unit in EAS devices. Recently, a time-domain speech enhancement algorithm based on the fully convolutional neural networks (FCN) with a short-time objective intelligibility (STOI)-based objective function (termed FCN(S) in short) has received increasing attention due to its simple structure and effectiveness of restoring clean speech signals from noisy counterparts. With evidence showing the benefits of FCN(S) for normal speech, this study sets out to assess its ability to improve the intelligibility of EAS simulated speech. Objective evaluations and listening tests were conducted to examine the performance of FCN(S) in improving the speech intelligibility of normal and vocoded speech in noisy environments. The experimental results show that, compared with the traditional minimum-mean square-error SE method and the deep denoising autoencoder SE method, FCN(S) can obtain better gain in the speech intelligibility for normal as well as vocoded speech. This study, being the first to evaluate deep learning SE approaches for EAS, confirms that FCN(S) is an effective SE approach that may potentially be integrated into an EAS processor to benefit users in noisy environments. △ Less

Submitted 26 September, 2019; originally announced September 2019.

arXiv:1909.11909 [pdf, other]

Multichannel Speech Enhancement by Raw Waveform-mapping using Fully Convolutional Networks

Authors: Chang-Le Liu, Sze-Wei Fu, You-Jin Li, Jen-Wei Huang, Hsin-Min Wang, Yu Tsao

Abstract: In recent years, waveform-mapping-based speech enhancement (SE) methods have garnered significant attention. These methods generally use a deep learning model to directly process and reconstruct speech waveforms. Because both the input and output are in waveform format, the waveform-mapping-based SE methods can overcome the distortion caused by imperfect phase estimation, which may be encountered… ▽ More In recent years, waveform-mapping-based speech enhancement (SE) methods have garnered significant attention. These methods generally use a deep learning model to directly process and reconstruct speech waveforms. Because both the input and output are in waveform format, the waveform-mapping-based SE methods can overcome the distortion caused by imperfect phase estimation, which may be encountered in spectral-mapping-based SE systems. So far, most waveform-mapping-based SE methods have focused on single-channel tasks. In this paper, we propose a novel fully convolutional network (FCN) with Sinc and dilated convolutional layers (termed SDFCN) for multichannel SE that operates in the time domain. We also propose an extended version of SDFCN, called the residual SDFCN (termed rSDFCN). The proposed methods are evaluated on two multichannel SE tasks, namely the dual-channel inner-ear microphones SE task and the distributed microphones SE task. The experimental results confirm the outstanding denoising capability of the proposed SE systems on both tasks and the benefits of using the residual architecture on the overall SE performance. △ Less

Submitted 24 February, 2020; v1 submitted 26 September, 2019; originally announced September 2019.

Comments: Accepted to IEEE/ACM Transactions on Audio, Speech and Language Processing

arXiv:1906.01078 [pdf, other]

doi 10.1109/LSP.2019.2951950

Increasing Compactness Of Deep Learning Based Speech Enhancement Models With Parameter Pruning And Quantization Techniques

Authors: Jyun-Yi Wu, Cheng Yu, Szu-Wei Fu, Chih-Ting Liu, Shao-Yi Chien, Yu Tsao

Abstract: Most recent studies on deep learning based speech enhancement (SE) focused on improving denoising performance. However, successful SE applications require striking a desirable balance between denoising performance and computational cost in real scenarios. In this study, we propose a novel parameter pruning (PP) technique, which removes redundant channels in a neural network. In addition, a paramet… ▽ More Most recent studies on deep learning based speech enhancement (SE) focused on improving denoising performance. However, successful SE applications require striking a desirable balance between denoising performance and computational cost in real scenarios. In this study, we propose a novel parameter pruning (PP) technique, which removes redundant channels in a neural network. In addition, a parameter quantization (PQ) technique was applied to reduce the size of a neural network by representing weights with fewer cluster centroids. Because the techniques are derived based on different concepts, the PP and PQ can be integrated to provide even more compact SE models. The experimental results show that the PP and PQ techniques produce a compacted SE model with a size of only 10.03% compared to that of the original model, resulting in minor performance losses of 1.43% (from 0.70 to 0.69) for STOI and 3.24% (from 1.85 to 1.79) for PESQ. The promising results suggest that the PP and PQ techniques can be used in a SE system in devices with limited storage and computation resources. △ Less

Submitted 31 July, 2019; v1 submitted 31 May, 2019; originally announced June 2019.

Comments: 4pages, 6 figures

arXiv:1905.04874 [pdf, other]

MetricGAN: Generative Adversarial Networks based Black-box Metric Scores Optimization for Speech Enhancement

Authors: Szu-Wei Fu, Chien-Feng Liao, Yu Tsao, Shou-De Lin

Abstract: Adversarial loss in a conditional generative adversarial network (GAN) is not designed to directly optimize evaluation metrics of a target task, and thus, may not always guide the generator in a GAN to generate data with improved metric scores. To overcome this issue, we propose a novel MetricGAN approach with an aim to optimize the generator with respect to one or multiple evaluation metrics. Mor… ▽ More Adversarial loss in a conditional generative adversarial network (GAN) is not designed to directly optimize evaluation metrics of a target task, and thus, may not always guide the generator in a GAN to generate data with improved metric scores. To overcome this issue, we propose a novel MetricGAN approach with an aim to optimize the generator with respect to one or multiple evaluation metrics. Moreover, based on MetricGAN, the metric scores of the generated data can also be arbitrarily specified by users. We tested the proposed MetricGAN on a speech enhancement task, which is particularly suitable to verify the proposed approach because there are multiple metrics measuring different aspects of speech signals. Moreover, these metrics are generally complex and could not be fully optimized by Lp or conventional adversarial losses. △ Less

Submitted 13 May, 2019; originally announced May 2019.

Comments: Accepted by Thirty-sixth International Conference on Machine Learning (ICML) 2019

arXiv:1905.01898 [pdf]

doi 10.1109/LSP.2019.2953810

Learning with Learned Loss Function: Speech Enhancement with Quality-Net to Improve Perceptual Evaluation of Speech Quality

Authors: Szu-Wei Fu, Chien-Feng Liao, Yu Tsao

Abstract: Utilizing a human-perception-related objective function to train a speech enhancement model has become a popular topic recently. The main reason is that the conventional mean squared error (MSE) loss cannot represent auditory perception well. One of the typical hu-man-perception-related metrics, which is the perceptual evaluation of speech quality (PESQ), has been proven to provide a high correlat… ▽ More Utilizing a human-perception-related objective function to train a speech enhancement model has become a popular topic recently. The main reason is that the conventional mean squared error (MSE) loss cannot represent auditory perception well. One of the typical hu-man-perception-related metrics, which is the perceptual evaluation of speech quality (PESQ), has been proven to provide a high correlation to the quality scores rated by humans. Owing to its complex and non-differentiable properties, however, the PESQ function may not be used to optimize speech enhancement models directly. In this study, we propose optimizing the enhancement model with an approximated PESQ function, which is differentiable and learned from the training data. The experimental results show that the learned surrogate function can guide the enhancement model to further boost the PESQ score (in-crease of 0.18 points compared to the results trained with MSE loss) and maintain the speech intelligibility. △ Less

Submitted 14 November, 2019; v1 submitted 6 May, 2019; originally announced May 2019.

Comments: Accepted by IEEE Signal Processing Letters (SPL)

arXiv:1904.08352 [pdf, other]

doi 10.21437/Interspeech.2019-2003

MOSNet: Deep Learning based Objective Assessment for Voice Conversion

Authors: Chen-Chou Lo, Szu-Wei Fu, Wen-Chin Huang, Xin Wang, Junichi Yamagishi, Yu Tsao, Hsin-Min Wang

Abstract: Existing objective evaluation metrics for voice conversion (VC) are not always correlated with human perception. Therefore, training VC models with such criteria may not effectively improve naturalness and similarity of converted speech. In this paper, we propose deep learning-based assessment models to predict human ratings of converted speech. We adopt the convolutional and recurrent neural netw… ▽ More Existing objective evaluation metrics for voice conversion (VC) are not always correlated with human perception. Therefore, training VC models with such criteria may not effectively improve naturalness and similarity of converted speech. In this paper, we propose deep learning-based assessment models to predict human ratings of converted speech. We adopt the convolutional and recurrent neural network models to build a mean opinion score (MOS) predictor, termed as MOSNet. The proposed models are tested on large-scale listening test results of the Voice Conversion Challenge (VCC) 2018. Experimental results show that the predicted scores of the proposed MOSNet are highly correlated with human MOS ratings at the system level while being fairly correlated with human MOS ratings at the utterance level. Meanwhile, we have modified MOSNet to predict the similarity scores, and the preliminary results show that the predicted scores are also fairly correlated with human ratings. These results confirm that the proposed models could be used as a computational evaluator to measure the MOS of VC systems to reduce the need for expensive human rating. △ Less

Submitted 14 July, 2021; v1 submitted 17 April, 2019; originally announced April 2019.

Comments: Accepted to Interspeech2019

Showing 1–50 of 58 results for author: Fu, S