Search | arXiv e-print repository

HILCodec: High Fidelity and Lightweight Neural Audio Codec

Authors: Sunghwan Ahn, Beom Jun Woo, Min Hyun Han, Chanyeong Moon, Nam Soo Kim

Abstract: The recent advancement of end-to-end neural audio codecs enables compressing audio at very low bitrates while reconstructing the output audio with high fidelity. Nonetheless, such improvements often come at the cost of increased model complexity. In this paper, we identify and address the problems of existing neural audio codecs. We show that the performance of Wave-U-Net does not increase consistently as the network depth increases. We analyze the root cause of such a phenomenon and suggest a variance-constrained design. Also, we reveal various distortions in previous waveform domain discriminators and propose a novel distortion-free discriminator. The resulting model, HILCodec, is a real-time streaming audio codec that demonstrates state-of-the-art quality across various bitrates and audio types.

Submitted 7 May, 2024; originally announced May 2024.

arXiv:2405.03129 [pdf, other]

Active Sensing for Multiuser Beam Tracking with Reconfigurable Intelligent Surface

Authors: Han Han, Tao Jiang, Wei Yu

Abstract: This paper studies a beam tracking problem in which an access point (AP), in collaboration with a reconfigurable intelligent surface (RIS), dynamically adjusts its downlink beamformers and the reflection pattern at the RIS in order to maintain reliable communications with multiple mobile user equipments (UEs). Specifically, the mobile UEs send uplink pilots to the AP periodically during the channel sensing intervals; the AP then adaptively configures the beamformers and the RIS reflection coefficients for subsequent data transmission based on the received pilots. This is an active sensing problem, because channel sensing involves configuring the RIS coefficients during the pilot stage and the optimal sensing strategy should exploit the trajectory of channel state information (CSI) from previously received pilots. An analytical solution to such an active sensing problem is very challenging. In this paper, we propose a deep learning framework utilizing a recurrent neural network (RNN) to automatically summarize the time-varying CSI obtained from the periodically received pilots into state vectors. These state vectors are then mapped to the AP beamformers and RIS reflection coefficients for subsequent downlink data transmissions, as well as the RIS reflection coefficients for the next round of uplink channel sensing. The mappings from the state vectors to the downlink beamformers and the RIS reflection coefficients for both channel sensing and downlink data transmission are performed using graph neural networks (GNNs) to account for the interference among the UEs. Simulations demonstrate significant and interpretable performance improvement of the proposed approach over the existing data-driven methods with nonadaptive channel sensing schemes.

Submitted 31 May, 2024; v1 submitted 5 May, 2024; originally announced May 2024.

arXiv:2404.17585 [pdf, other]

NeuroNet: A Novel Hybrid Self-Supervised Learning Framework for Sleep Stage Classification Using Single-Channel EEG

Authors: Cheol-Hui Lee, Hakseung Kim, Hyun-jee Han, Min-Kyung Jung, Byung C. Yoon, Dong-Joo Kim

Abstract: The classification of sleep stages is a pivotal aspect of diagnosing sleep disorders and evaluating sleep quality. However, the conventional manual scoring process, conducted by clinicians, is time-consuming and prone to human bias. Recent advancements in deep learning have substantially propelled the automation of sleep stage classification. Nevertheless, challenges persist, including the need for large datasets with labels and the inherent biases in human-generated annotations. This paper introduces NeuroNet, a self-supervised learning (SSL) framework designed to effectively harness unlabeled single-channel sleep electroencephalogram (EEG) signals by integrating contrastive learning tasks and masked prediction tasks. NeuroNet demonstrates superior performance over existing SSL methodologies through extensive experimentation conducted across three polysomnography (PSG) datasets. Additionally, this study proposes a Mamba-based temporal context module to capture the relationships among diverse EEG epochs. Combining NeuroNet with the Mamba-based temporal context module has demonstrated the capability to achieve, or even surpass, the performance of the latest supervised learning methodologies, even with a limited amount of labeled data. This study is expected to establish a new benchmark in sleep stage classification, promising to guide future research and applications in the field of sleep analysis.

Submitted 13 May, 2024; v1 submitted 10 April, 2024; originally announced April 2024.

Comments: 14 pages, 4 figures

arXiv:2403.14402 [pdf, other]

XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception

Authors: HyoJung Han, Mohamed Anwar, Juan Pino, Wei-Ning Hsu, Marine Carpuat, Bowen Shi, Changhan Wang

Abstract: Speech recognition and translation systems perform poorly on noisy inputs, which are frequent in realistic environments. Augmenting these systems with visual signals has the potential to improve robustness to noise. However, audio-visual (AV) data is only available in limited amounts and for fewer languages than audio-only resources. To address this gap, we present XLAVS-R, a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation in over 100 languages. It is designed to maximize the benefits of limited multilingual AV pre-training data, by building on top of audio-only multilingual pre-training and simplifying existing pre-training schemes. Extensive evaluation on the MuAViC benchmark shows the strength of XLAVS-R on downstream audio-visual speech recognition and translation tasks, where it outperforms the previous state of the art by up to 18.5% WER and 4.7 BLEU given noisy AV inputs, and enables strong zero-shot audio-visual ability with audio-only fine-tuning.

Submitted 12 August, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

Comments: ACL2024

arXiv:2402.09797 [pdf, other]

A cross-talk robust multichannel VAD model for multiparty agent interactions trained using synthetic re-recordings

Authors: Hyewon Han, Naveen Kumar

Abstract: In this work, we propose a novel cross-talk rejection framework for a multi-channel multi-talker setup for a live multiparty interactive show. Our far-field audio setup is required to be hands-free during live interaction and comprises four adjacent talkers with directional microphones in the same space. Such setups often introduce heavy cross-talk between channels, resulting in reduced automatic speech recognition (ASR) and natural language understanding (NLU) performance. To address this problem, we propose a voice activity detection (VAD) model for all talkers using multichannel information, which is then used to filter audio for downstream tasks. We adopt a synthetic training data generation approach through playback and re-recording for such scenarios, simulating challenging speech overlap conditions. We train our models on this synthetic data and demonstrate that our approach outperforms single-channel VAD models and an energy-based multi-channel VAD algorithm in various acoustic environments. In addition to VAD results, we also present multiparty ASR evaluation results to highlight the impact of using our VAD model for filtering audio in downstream tasks by significantly reducing the insertion error.

Submitted 15 February, 2024; originally announced February 2024.

Comments: Accepted for presentation at the Hands-free Speech Communication and Microphone Arrays (HSCMA 2024)

arXiv:2402.00744 [pdf, other]

BATON: Aligning Text-to-Audio Model with Human Preference Feedback

Authors: Huan Liao, Haonan Han, Kai Yang, Tianjiao Du, Rui Yang, Zunnan Xu, Qinmei Xu, Jingquan Liu, Jiasheng Lu, Xiu Li

Abstract: With the development of AI-Generated Content (AIGC), text-to-audio models are gaining widespread attention. However, it is challenging for these models to generate audio aligned with human preference due to the inherent information density of natural language and limited model understanding ability. To alleviate this issue, we formulate BATON, a framework designed to enhance the alignment between generated audio and text prompt using human preference feedback. BATON comprises three key stages: Firstly, we curated a dataset containing both prompts and the corresponding generated audio, which was then annotated based on human feedback. Secondly, we introduced a reward model using the constructed dataset, which can mimic human preference by assigning rewards to input text-audio pairs. Finally, we employed the reward model to fine-tune an off-the-shelf text-to-audio model. The experimental results demonstrate that BATON can significantly improve the generation quality of the original text-to-audio models, concerning audio integrity, temporal relationship, and alignment with human preference.

Submitted 1 February, 2024; originally announced February 2024.

arXiv:2401.01685 [pdf]

Modality Exchange Network for Retinogeniculate Visual Pathway Segmentation

Authors: Hua Han, Cheng Li, Lei Xie, Yuanjing Feng, Alou Diakite, Shanshan Wang

Abstract: Accurate segmentation of the retinogeniculate visual pathway (RGVP) aids in the diagnosis and treatment of visual disorders by identifying disruptions or abnormalities within the pathway. However, the complex anatomical structure and connectivity of RGVP make it challenging to achieve accurate segmentation. In this study, we propose a novel Modality Exchange Network (ME-Net) that effectively utilizes multi-modal magnetic resonance (MR) imaging information to enhance RGVP segmentation. Our ME-Net has two main contributions. Firstly, we introduce an effective multi-modal soft-exchange technique. Specifically, we design a channel and spatially mixed attention module to exchange modality information between T1-weighted and fractional anisotropy MR images. Secondly, we propose a cross-fusion module that further enhances the fusion of information between the two modalities. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches in terms of RGVP segmentation performance.

Submitted 3 January, 2024; originally announced January 2024.

arXiv:2401.01654 [pdf, other]

LESEN: Label-Efficient deep learning for Multi-parametric MRI-based Visual Pathway Segmentation

Authors: Alou Diakite, Cheng Li, Lei Xie, Yuanjing Feng, Hua Han, Shanshan Wang

Abstract: Recent research has shown the potential of deep learning in multi-parametric MRI-based visual pathway (VP) segmentation. However, obtaining labeled data for training is laborious and time-consuming. Therefore, it is crucial to develop effective algorithms in situations with limited labeled samples. In this work, we propose a label-efficient deep learning method with self-ensembling (LESEN). LESEN incorporates supervised and unsupervised losses, enabling the student and teacher models to mutually learn from each other, forming a self-ensembling mean teacher framework. Additionally, we introduce a reliable unlabeled sample selection (RUSS) mechanism to further enhance LESEN's effectiveness. Our experiments on the human connectome project (HCP) dataset demonstrate the superior performance of our method when compared to state-of-the-art techniques, advancing multimodal VP segmentation for comprehensive analysis in clinical and research settings. The implementation code will be available at: https://github.com/aldiak/Semi-Supervised-Multimodal-Visual-Pathway-Delineation.

Submitted 3 January, 2024; originally announced January 2024.
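
Note on the "mean teacher" term in the abstract above: in the commonly used mean-teacher scheme, the teacher's weights are an exponential moving average (EMA) of the student's weights. The following is a minimal sketch of that generic update only, not of LESEN itself; the function name `ema_update`, the dictionary representation of weights, and the decay value `alpha` are illustrative assumptions, and the abstract does not state how LESEN's mutual learning or RUSS selection are implemented.

```python
def ema_update(teacher, student, alpha=0.99):
    """Generic mean-teacher step (assumed form, not taken from the paper).

    teacher, student: dicts mapping parameter names to floats.
    alpha: hypothetical EMA decay; closer to 1 means a slower-moving teacher.
    Returns a new dict; the inputs are not modified.
    """
    return {
        name: alpha * teacher[name] + (1.0 - alpha) * student[name]
        for name in teacher
    }


# Toy usage with a single scalar "weight".
teacher = {"w": 1.0}
student = {"w": 0.0}
teacher = ema_update(teacher, student, alpha=0.9)
```

With these toy values the teacher weight moves a tenth of the way toward the student weight on each call, which is the smoothing effect the mean-teacher idea relies on.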

arXiv:2312.10472 [pdf, other]

Analyzing Generalization in Policy Networks: A Case Study with the Double-Integrator System

Authors: Ruining Zhang, Haoran Han, Maolong Lv, Qisong Yang, Jian Cheng

Abstract: Extensive utilization of deep reinforcement learning (DRL) policy networks in diverse continuous control tasks has raised questions regarding performance degradation in expansive state spaces where the input state norm is larger than that in the training environment. This paper aims to uncover the underlying factors contributing to such performance deterioration when dealing with expanded state spaces, using a novel analysis technique known as state division. In contrast to prior approaches that employ state division merely as a post-hoc explanatory tool, our methodology delves into the intrinsic characteristics of DRL policy networks. Specifically, we demonstrate that the expansion of state space induces the activation function $\tanh$ to exhibit saturability, resulting in the transformation of the state division boundary from nonlinear to linear. Our analysis centers on the paradigm of the double-integrator system, revealing that this gradual shift towards linearity imparts a control behavior reminiscent of bang-bang control. However, the inherent linearity of the division boundary prevents the attainment of an ideal bang-bang control, thereby introducing unavoidable overshooting. Our experimental investigations, employing diverse RL algorithms, establish that this performance phenomenon stems from inherent attributes of the DRL policy network, remaining consistent across various optimization algorithms.

Submitted 31 December, 2023; v1 submitted 16 December, 2023; originally announced December 2023.
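
Note on the saturation effect mentioned in the abstract above: $\tanh$ is nearly linear for small inputs and flattens toward ±1 for large ones, so a policy output passed through $\tanh$ behaves almost like a switch once inputs are large. The sketch below illustrates only this general property with a single scalar unit; the function name `tanh_unit` and the `weight` parameter are hypothetical and do not reproduce the paper's networks or its state-division analysis.

```python
import math


def tanh_unit(state, weight=1.0):
    """Single scalar tanh unit (illustrative assumption, not the paper's model)."""
    return math.tanh(weight * state)


# Small inputs: output tracks the input almost linearly.
small = tanh_unit(0.1)

# Large inputs (e.g. states far outside the training range): output saturates
# near +1 or -1, which resembles an on/off (bang-bang-like) control signal.
large_pos = tanh_unit(10.0)
large_neg = tanh_unit(-10.0)
```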

arXiv:2312.09446 [pdf, other]

A Distributed Inference System for Detecting Task-wise Single Trial Event-Related Potential in Stream of Satellite Images

Authors: Sung-Jin Kim, Heon-Gyu Kwak, Hyeon-Taek Han, Dae-Hyeok Lee, Ji-Hoon Jeong, Seong-Whan Lee

Abstract: Brain-computer interfaces (BCIs) have garnered significant attention for their potential in various applications, with event-related potential (ERP) playing a considerable role in BCI systems. This paper introduces a novel Distributed Inference System tailored for detecting task-wise single-trial ERPs in a stream of satellite images. Unlike traditional methodologies that employ a single model for target detection, our system utilizes multiple models, each optimized for specific tasks, ensuring enhanced performance across varying image transition times and target onset times. Our experiments, conducted on four participants, employed two paradigms: the Normal paradigm and an AI paradigm with bounding boxes. Results indicate that our proposed system outperforms the conventional methods in both paradigms, achieving the highest $F_β$ scores. Furthermore, including bounding boxes in the AI paradigm significantly improved target recognition. This study underscores the potential of our Distributed Inference System in advancing the field of ERP detection in satellite image streams.

Submitted 10 November, 2023; originally announced December 2023.
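
Note on the $F_β$ metric reported in the abstract above: the standard definition combines precision and recall, weighting recall β times as heavily as precision. The sketch below computes it from confusion counts; the function name `f_beta` and the example counts are illustrative assumptions, and the abstract does not state which β the authors used.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """Standard F-beta score from true positives, false positives, false negatives.

    Equivalent to (1 + beta^2) * P * R / (beta^2 * P + R) with
    P = tp / (tp + fp) and R = tp / (tp + fn).
    Returns 0.0 when the score is undefined (no positives at all).
    """
    b2 = beta * beta
    denom = (1.0 + b2) * tp + b2 * fn + fp
    if denom == 0:
        return 0.0
    return (1.0 + b2) * tp / denom


# Hypothetical counts: 6 detected targets, 2 false alarms, 4 missed targets.
score_f1 = f_beta(6, 2, 4, beta=1.0)
score_f2 = f_beta(6, 2, 4, beta=2.0)  # beta > 1 penalizes misses more heavily
```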

arXiv:2312.06065 [pdf, other]

EEND-DEMUX: End-to-End Neural Speaker Diarization via Demultiplexed Speaker Embeddings

Authors: Sung Hwan Mun, Min Hyun Han, Canyeong Moon, Nam Soo Kim

Abstract: In recent years, there have been studies to further improve the end-to-end neural speaker diarization (EEND) systems. This letter proposes the EEND-DEMUX model, a novel framework utilizing demultiplexed speaker embeddings. In this work, we focus on disentangling speaker-relevant information in the latent space and then transforming each separated latent variable into its corresponding speech activity. EEND-DEMUX can directly obtain separated speaker embeddings through the demultiplexing operation in the inference phase without an external speaker diarization system, an embedding extractor, or a heuristic decoding technique. Furthermore, we employ a multi-head cross-attention mechanism to capture the correlation between mixture and separated speaker embeddings effectively. We formulate three loss functions based on matching, orthogonality, and sparsity constraints to learn robust demultiplexed speaker embeddings. The experimental results on the LibriMix dataset show consistently improved performance in scenarios with both fixed and flexible numbers of speakers.

Submitted 10 December, 2023; originally announced December 2023.

Comments: Submitted to IEEE Signal Processing Letters

arXiv:2311.14213 [pdf, other]

doi 10.1109/TASLP.2024.3393738

Learning to Solve Inverse Problems for Perceptual Sound Matching

Authors: Han Han, Vincent Lostanlen, Mathieu Lagrange

Abstract: Perceptual sound matching (PSM) aims to find the input parameters to a synthesizer so as to best imitate an audio target. Deep learning for PSM optimizes a neural network to analyze and reconstruct prerecorded samples. In this context, our article addresses the problem of designing a suitable loss function when the training set is generated by a differentiable synthesizer. Our main contribution is perceptual-neural-physical loss (PNP), which aims at addressing a tradeoff between perceptual relevance and computational efficiency. The key idea behind PNP is to linearize the effect of synthesis parameters upon auditory features in the vicinity of each training sample. The linearization procedure is massively parallelizable, can be precomputed, and offers a 100-fold speedup during gradient descent compared to differentiable digital signal processing (DDSP). We demonstrate PNP on two datasets of nonstationary sounds: an AM/FM arpeggiator and a physical model of rectangular membranes. We show that PNP is able to accelerate DDSP with joint time-frequency scattering transform (JTFS) as auditory feature, while preserving its perceptual fidelity. Additionally, we evaluate the impact of other design choices in PSM: parameter rescaling, pretraining, auditory representation, and gradient clipping. We report state-of-the-art results on both datasets and find that PNP-accelerated JTFS has greater influence on PSM performance than any other design choice.

Submitted 6 May, 2024; v1 submitted 23 November, 2023; originally announced November 2023.

arXiv:2307.13821 [pdf, other]

doi 10.1109/waspaa58266.2023.10248131

Fitting Auditory Filterbanks with Multiresolution Neural Networks

Authors: Vincent Lostanlen, Daniel Haider, Han Han, Mathieu Lagrange, Peter Balazs, Martin Ehler

Abstract: Waveform-based deep learning faces a dilemma between nonparametric and parametric approaches. On one hand, convolutional neural networks (convnets) may approximate any linear time-invariant system; yet, in practice, their frequency responses become more irregular as their receptive fields grow. On the other hand, a parametric model such as LEAF is guaranteed to yield Gabor filters, hence an optimal time-frequency localization; yet, this strong inductive bias comes at the detriment of representational capacity. In this paper, we aim to overcome this dilemma by introducing a neural audio model, named multiresolution neural network (MuReNN). The key idea behind MuReNN is to train separate convolutional operators over the octave subbands of a discrete wavelet transform (DWT). Since the scale of DWT atoms grows exponentially between octaves, the receptive fields of the subsequent learnable convolutions in MuReNN are dilated accordingly. For a given real-world dataset, we fit the magnitude response of MuReNN to that of a well-established auditory filterbank: Gammatone for speech, CQT for music, and third-octave for urban sounds, respectively. This is a form of knowledge distillation (KD), in which the filterbank "teacher" is engineered by domain knowledge while the neural network "student" is optimized from data. We compare MuReNN to the state of the art in terms of goodness of fit after KD on a hold-out set and in terms of Heisenberg time-frequency localization. Compared to convnets and Gabor convolutions, we find that MuReNN reaches state-of-the-art performance on all three optimization problems.

Submitted 25 July, 2023; originally announced July 2023.

Comments: 4 pages, 4 figures, 1 table, conference

Journal ref: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2023)

arXiv:2307.13220 [pdf]

One for Multiple: Physics-informed Synthetic Data Boosts Generalizable Deep Learning for Fast MRI Reconstruction

Authors: Zi Wang, Xiaotong Yu, Chengyan Wang, Weibo Chen, Jiazheng Wang, Ying-Hua Chu, Hongwei Sun, Rushuai Li, Peiyong Li, Fan Yang, Haiwei Han, Taishan Kang, Jianzhong Lin, Chen Yang, Shufu Chang, Zhang Shi, Sha Hua, Yan Li, Juan Hu, Liuhong Zhu, Jianjun Zhou, Meijing Lin, Jiefeng Guo, Congbo Cai, Zhong Chen, et al. (3 additional authors not shown)

Abstract: Magnetic resonance imaging (MRI) is a widely used radiological modality renowned for its radiation-free, comprehensive insights into the human body, facilitating medical diagnoses. However, the drawback of prolonged scan times hinders its accessibility. The k-space undersampling offers a solution, yet the resultant artifacts necessitate meticulous removal during image reconstruction. Although Deep Learning (DL) has proven effective for fast MRI image reconstruction, its broader applicability across various imaging scenarios has been constrained. Challenges include the high cost and privacy restrictions associated with acquiring large-scale, diverse training data, coupled with the inherent difficulty of addressing mismatches between training and target data in existing DL methodologies. Here, we present a novel Physics-Informed Synthetic data learning framework for Fast MRI, called PISF. PISF marks a breakthrough by enabling generalized DL for multi-scenario MRI reconstruction through a single trained model. Our approach separates the reconstruction of a 2D image into many 1D basic problems, commencing with 1D data synthesis to facilitate generalization. We demonstrate that training DL models on synthetic data, coupled with enhanced learning techniques, yields in vivo MRI reconstructions comparable to or surpassing those of models trained on matched realistic datasets, reducing the reliance on real-world MRI data by up to 96%. Additionally, PISF exhibits remarkable generalizability across multiple vendors and imaging centers. Its adaptability to diverse patient populations has been validated through evaluations by ten experienced medical professionals. PISF presents a feasible and cost-effective way to significantly boost the widespread adoption of DL in various fast MRI applications.

Submitted 28 February, 2024; v1 submitted 24 July, 2023; originally announced July 2023.

Comments: 38 pages, 19 figures, 5 tables

arXiv:2306.01411 [pdf, other]

HD-DEMUCS: General Speech Restoration with Heterogeneous Decoders

Authors: Doyeon Kim, Soo-Whan Chung, Hyewon Han, Youna Ji, Hong-Goo Kang

Abstract: This paper introduces an end-to-end neural speech restoration model, HD-DEMUCS, demonstrating efficacy across multiple distortion environments. Unlike conventional approaches that employ cascading frameworks to remove undesirable noise first and then restore missing signal components, our model performs these tasks in parallel using two heterogeneous decoder networks. Based on the U-Net style encoder-decoder framework, we attach an additional decoder so that each decoder network performs noise suppression or restoration separately. We carefully design each decoder architecture to operate appropriately depending on its objectives. Additionally, we improve performance by leveraging a learnable weighting factor, aggregating the two decoder output waveforms. Experimental results with objective metrics across various environments clearly demonstrate the effectiveness of our approach over a single decoder or multi-stage systems for the general speech restoration task.

Submitted 2 June, 2023; originally announced June 2023.

Comments: Accepted by INTERSPEECH 2023

arXiv:2305.19051 [pdf, other]

Towards single integrated spoofing-aware speaker verification embeddings

Authors: Sung Hwan Mun, Hye-jin Shim, Hemlata Tak, Xin Wang, Xuechen Liu, Md Sahidullah, Myeonghun Jeong, Min Hyun Han, Massimiliano Todisco, Kong Aik Lee, Junichi Yamagishi, Nicholas Evans, Tomi Kinnunen, Nam Soo Kim, Jee-weon Jung

Abstract: This study aims to develop single integrated spoofing-aware speaker verification (SASV) embeddings that satisfy two aspects. First, rejecting non-target speakers' input as well as target speakers' spoofed inputs should be addressed. Second, competitive performance should be demonstrated compared to the fusion of automatic speaker verification (ASV) and countermeasure (CM) embeddings, which outperformed single embedding solutions by a large margin in the SASV2022 challenge. Our analysis attributes the inferior performance of single SASV embeddings to an insufficient amount of training data and the distinct nature of ASV and CM tasks. To this end, we propose a novel framework that includes multi-stage training and a combination of loss functions. Copy synthesis, combined with several vocoders, is also exploited to address the lack of spoofed data. Experimental results show dramatic improvements, achieving a SASV-EER of 1.06% on the evaluation protocol of the SASV2022 challenge.

Submitted 1 June, 2023; v1 submitted 30 May, 2023; originally announced May 2023.

Comments: Accepted by INTERSPEECH 2023. Code and models are available in https://github.com/sasv-challenge/ASVSpoof5-SASVBaseline

arXiv:2301.10183 [pdf, other]

Mesostructures: Beyond Spectrogram Loss in Differentiable Time-Frequency Analysis

Authors: Cyrus Vahidi, Han Han, Changhong Wang, Mathieu Lagrange, György Fazekas, Vincent Lostanlen

Abstract: Computer musicians refer to mesostructures as the intermediate levels of articulation between the microstructure of waveshapes and the macrostructure of musical forms. Examples of mesostructures include melody, arpeggios, syncopation, polyphonic grouping, and textural contrast. Despite their central role in musical expression, they have received limited attention in deep learning. Currently, autoencoders and neural audio synthesizers are only trained and evaluated at the scale of microstructure: i.e., local amplitude variations up to 100 milliseconds or so. In this paper, we formulate and address the problem of mesostructural audio modeling via a composition of a differentiable arpeggiator and time-frequency scattering. We empirically demonstrate that time-frequency scattering serves as a differentiable model of similarity between synthesis parameters that govern mesostructure. By exposing the sensitivity of short-time spectral distances to time alignment, we motivate the need for a time-invariant and multiscale differentiable time-frequency model of similarity at the level of both local spectra and spectrotemporal modulations.

Submitted 24 January, 2023; originally announced January 2023.

arXiv:2301.02886 [pdf, other]

Perceptual-Neural-Physical Sound Matching

Authors: Han Han, Vincent Lostanlen, Mathieu Lagrange

Abstract: Sound matching algorithms seek to approximate a target waveform by parametric audio synthesis. Deep neural networks have achieved promising results in matching sustained harmonic tones. However, the task is more challenging when targets are nonstationary and inharmonic, e.g., percussion. We attribute this problem to the inadequacy of loss function. On one hand, mean square error in the parametric domain, known as "P-loss", is simple and fast but fails to accommodate the differing perceptual significance of each parameter. On the other hand, mean square error in the spectrotemporal domain, known as "spectral loss", is perceptually motivated and serves in differentiable digital signal processing (DDSP). Yet, spectral loss is a poor predictor of pitch intervals and its gradient may be computationally expensive; hence a slow convergence. Against this conundrum, we present Perceptual-Neural-Physical loss (PNP). PNP is the optimal quadratic approximation of spectral loss while being as fast as P-loss during training. We instantiate PNP with physical modeling synthesis as decoder and joint time-frequency scattering transform (JTFS) as spectral representation. We demonstrate its potential on matching synthetic drum sounds in comparison with other loss functions.

Submitted 13 March, 2023; v1 submitted 7 January, 2023; originally announced January 2023.

arXiv:2212.13544 [pdf, other]

Enhancing Federated Learning with spectrum allocation optimization and device selection

Authors: Tinghao Zhang, Kwok-Yan Lam, Jun Zhao, Feng Li, Huimei Han, Norziana Jamil

Abstract: Machine learning (ML) is a widely accepted means for supporting customized services for mobile devices and applications. Federated Learning (FL), which is a promising approach to implement machine learning while addressing data privacy concerns, typically involves a large number of wireless mobile devices to collect model training data. Under such circumstances, FL is expected to meet stringent training latency requirements in the face of limited resources such as demand for wireless bandwidth, power consumption, and computation constraints of participating devices. Due to practical considerations, FL selects a portion of devices to participate in the model training process at each iteration. Therefore, the tasks of efficient resource management and device selection will have a significant impact on the practical uses of FL. In this paper, we propose a spectrum allocation optimization mechanism for enhancing FL over a wireless mobile network. Specifically, the proposed spectrum allocation optimization mechanism minimizes the time delay of FL while considering the energy consumption of individual participating devices; thus ensuring that all the participating devices have sufficient resources to train their local models. In this connection, to ensure fast convergence of FL, a robust device selection is also proposed to help FL reach convergence swiftly, especially when the local datasets of the devices are not independent and identically distributed (non-iid). Experimental results show that (1) the proposed spectrum allocation optimization method optimizes time delay while satisfying the individual energy constraints; (2) the proposed device selection method enables FL to achieve the fastest convergence on non-iid datasets.

Submitted 27 December, 2022; originally announced December 2022.

Comments: This paper is accepted by IEEE/ACM Transactions on Networking

arXiv:2211.08783 [pdf]

Uncertainty-Aware Multi-Parametric Magnetic Resonance Image Information Fusion for 3D Object Segmentation

Authors: Cheng Li, Yousuf Babiker M. Osman, Weijian Huang, Zhenzhen Xue, Hua Han, Hairong Zheng, Shanshan Wang

Abstract: Multi-parametric magnetic resonance (MR) imaging is an indispensable tool in the clinic. Consequently, automatic volume-of-interest segmentation based on multi-parametric MR imaging is crucial for computer-aided disease diagnosis, treatment planning, and prognosis monitoring. Despite the extensive studies conducted in deep learning-based medical image analysis, further investigations are still required to effectively exploit the information provided by different imaging parameters. How to fuse the information is a key question in this field. Here, we propose an uncertainty-aware multi-parametric MR image feature fusion method to fully exploit the information for enhanced 3D image segmentation. Uncertainties in the independent predictions of individual modalities are utilized to guide the fusion of multi-modal image features. Extensive experiments on two datasets, one for brain tissue segmentation and the other for abdominal multi-organ segmentation, have been conducted, and our proposed method achieves better segmentation performance when compared to existing models.

Submitted 16 November, 2022; originally announced November 2022.

arXiv:2210.02732 [pdf, other]

Fully Unsupervised Training of Few-shot Keyword Spotting

Authors: Dongjune Lee, Minchan Kim, Sung Hwan Mun, Min Hyun Han, Nam Soo Kim

Abstract: For training a few-shot keyword spotting (FS-KWS) model, a large labeled dataset containing massive target keywords has been known to be essential to generalize to arbitrary target keywords with only a few enrollment samples. To alleviate the expensive data collection with labeling, in this paper, we propose a novel FS-KWS system trained only on synthetic data. The proposed system is based on metric learning enabling target keywords to be detected using distance metrics. Exploiting the speech synthesis model that generates speech with pseudo phonemes instead of texts, we easily obtain a large collection of multi-view samples with the same semantics. These samples are sufficient for training, considering metric learning does not intrinsically necessitate labeled data. All of the components in our framework do not require any supervision, making our method unsupervised. Experimental results on real datasets show our proposed method is competitive even without any labeled and real datasets.

Submitted 6 October, 2022; v1 submitted 6 October, 2022; originally announced October 2022.

Comments: Accepted by IEEE SLT 2022

arXiv:2209.14900 [pdf, other]

Joint Optimization of Energy Consumption and Completion Time in Federated Learning

Authors: Xinyu Zhou, Jun Zhao, Huimei Han, Claude Guet

Abstract: Federated Learning (FL) is an intriguing distributed machine learning approach due to its privacy-preserving characteristics. To balance the trade-off between energy and execution latency, and thus accommodate different demands and application scenarios, we formulate an optimization problem to minimize a weighted sum of total energy consumption and completion time through two weight parameters. The optimization variables include bandwidth, transmission power and CPU frequency of each device in the FL system, where all devices are linked to a base station and train a global model collaboratively. Through decomposing the non-convex optimization problem into two subproblems, we devise a resource allocation algorithm to determine the bandwidth allocation, transmission power, and CPU frequency for each participating device. We further present the convergence analysis and computational complexity of the proposed algorithm. Numerical results show that our proposed algorithm not only has better performance at different weight parameters (i.e., different demands) but also outperforms the state of the art.

Submitted 10 March, 2023; v1 submitted 29 September, 2022; originally announced September 2022.

Comments: This paper appears in the Proceedings of IEEE International Conference on Distributed Computing Systems (ICDCS) 2022. Please feel free to contact us for questions or remarks

arXiv:2209.13871 [pdf, ps, other]

Resource Allocation and Resolution Control in the Metaverse with Mobile Augmented Reality

Authors: Peiyuan Si, Jun Zhao, Huimei Han, Kwok-Yan Lam, Yang Liu

Abstract: With the development of blockchain and communication techniques, the Metaverse is considered as a promising next-generation Internet paradigm, which enables the connection between reality and the virtual world. The key to rendering a virtual world is to provide users with immersive experiences and virtual avatars, which is based on virtual reality (VR) technology and high data transmission rate. However, current VR devices require intensive computation and communication, and users suffer from high delay while using wireless VR devices. To build the connection between reality and the virtual world with current technologies, mobile augmented reality (MAR) is a feasible alternative solution due to its cheaper communication and computation cost. This paper proposes an MAR-based connection model for the Metaverse, and proposes a communication resources allocation algorithm based on outer approximation (OA) to achieve the best utility. Simulation results show that our proposed algorithm is able to provide users with basic MAR services for the Metaverse, and outperforms the benchmark greedy algorithm.

Submitted 28 September, 2022; originally announced September 2022.

Comments: A full paper published in IEEE Global Communications Conference (GLOBECOM) 2022

arXiv:2208.08012 [pdf, other]

Disentangled Speaker Representation Learning via Mutual Information Minimization

Authors: Sung Hwan Mun, Min Hyun Han, Minchan Kim, Dongjune Lee, Nam Soo Kim

Abstract: Domain mismatch problem caused by speaker-unrelated feature has been a major topic in speaker recognition. In this paper, we propose an explicit disentanglement framework to unravel speaker-relevant features from speaker-unrelated features via mutual information (MI) minimization. To achieve our goal of minimizing MI between speaker-related and speaker-unrelated features, we adopt a contrastive log-ratio upper bound (CLUB), which exploits the upper bound of MI. Our framework is constructed in a 3-stage structure. First, in the front-end encoder, input speech is encoded into shared initial embedding. Next, in the decoupling block, shared initial embedding is split into separate speaker-related and speaker-unrelated embeddings. Finally, disentanglement is conducted by MI minimization in the last stage. Experiments on Far-Field Speaker Verification Challenge 2022 (FFSVC2022) demonstrate that our proposed framework is effective for disentanglement. Also, to utilize domain-unknown datasets containing numerous speakers, we pre-trained the front-end encoder with VoxCeleb datasets. We then fine-tuned the speaker embedding model in the disentanglement framework with FFSVC 2022 dataset. The experimental results show that fine-tuning with a disentanglement framework on an existing pre-trained model is valid and can further improve performance.

Submitted 12 October, 2022; v1 submitted 16 August, 2022; originally announced August 2022.

Comments: Accepted by APSIPA ASC 2022. Camera-ready. 8 pages, 4 figures, and 1 table

arXiv:2206.15400 [pdf, other]

Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting

Authors: Hyeon-Kyeong Shin, Hyewon Han, Doyeon Kim, Soo-Whan Chung, Hong-Goo Kang

Abstract: In this paper, we propose a novel end-to-end user-defined keyword spotting method that utilizes linguistically corresponding patterns between speech and text sequences. Unlike previous approaches requiring speech keyword enrollment, our method compares input queries with an enrolled text keyword sequence. To place the audio and text representations within a common latent space, we adopt an attention-based cross-modal matching approach that is trained in an end-to-end manner with monotonic matching loss and keyword classification loss. We also utilize a de-noising loss for the acoustic embedding network to improve robustness in noisy environments. Additionally, we introduce the LibriPhrase dataset, a new short-phrase dataset based on LibriSpeech for efficiently training keyword spotting models. Our proposed method achieves competitive results on various evaluation sets compared to other single-modal and cross-modal baselines.

Submitted 1 July, 2022; v1 submitted 30 June, 2022; originally announced June 2022.

Comments: Accepted to Interspeech 2022

arXiv:2204.08269 [pdf, other]

Differentiable Time-Frequency Scattering on GPU

Authors: John Muradeli, Cyrus Vahidi, Changhong Wang, Han Han, Vincent Lostanlen, Mathieu Lagrange, George Fazekas

Abstract: Joint time-frequency scattering (JTFS) is a convolutional operator in the time-frequency domain which extracts spectrotemporal modulations at various rates and scales. It offers an idealized model of spectrotemporal receptive fields (STRF) in the primary auditory cortex, and thus may serve as a biologically plausible surrogate for human perceptual judgments at the scale of isolated audio events. Yet, prior implementations of JTFS and STRF have remained outside of the standard toolkit of perceptual similarity measures and evaluation methods for audio generation. We trace this issue down to three limitations: differentiability, speed, and flexibility. In this paper, we present an implementation of time-frequency scattering in Python. Unlike prior implementations, ours accommodates NumPy, PyTorch, and TensorFlow as backends and is thus portable on both CPU and GPU. We demonstrate the usefulness of JTFS via three applications: unsupervised manifold learning of spectrotemporal modulations, supervised classification of musical instruments, and texture resynthesis of bioacoustic sounds.

Submitted 19 July, 2022; v1 submitted 18 April, 2022; originally announced April 2022.

Comments: 8 pages, 6 figures. Submitted to the International Conference on Digital Audio Effects (DAFX) 2022

arXiv:2204.01005 [pdf, other]

Frequency and Multi-Scale Selective Kernel Attention for Speaker Verification

Authors: Sung Hwan Mun, Jee-weon Jung, Min Hyun Han, Nam Soo Kim

Abstract: The majority of recent state-of-the-art speaker verification architectures adopt multi-scale processing and frequency-channel attention mechanisms. Convolutional layers of these models typically have a fixed kernel size, e.g., 3 or 5. In this study, we further contribute to this line of research utilising a selective kernel attention (SKA) mechanism. The SKA mechanism allows each convolutional layer to adaptively select the kernel size in a data-driven fashion. It is based on an attention mechanism which exploits both frequency and channel domain. We first apply existing SKA module to our baseline. Then we propose two SKA variants where the first variant is applied in front of the ECAPA-TDNN model and the other is combined with the Res2net backbone block. Through extensive experiments, we demonstrate that our two proposed SKA variants consistently improve the performance and are complementary when tested on three different evaluation protocols.

Submitted 12 October, 2022; v1 submitted 3 April, 2022; originally announced April 2022.

Comments: Accepted by IEEE SLT 2022. 7 pages, 4 figures, 1 table. Code is available at https://github.com/msh9184/ska-tdnn.git

arXiv:2203.07373 [pdf, other]

SATr: Slice Attention with Transformer for Universal Lesion Detection

Authors: Han Li, Long Chen, Hu Han, S. Kevin Zhou

Abstract: Universal Lesion Detection (ULD) in computed tomography plays an essential role in computer-aided diagnosis. Promising ULD results have been reported by multi-slice-input detection approaches which model 3D context from multiple adjacent CT slices, but such methods still experience difficulty in obtaining a global representation among different slices and within each individual slice since they only use convolution-based fusion operations. In this paper, we propose a novel Slice Attention Transformer (SATr) block which can be easily plugged into convolution-based ULD backbones to form hybrid network structures. Such newly formed hybrid backbones can better model long-distance feature dependency via the cascaded self-attention modules in the Transformer block while still holding a strong power of modeling local features with the convolutional operations in the original backbone. Experiments with five state-of-the-art methods show that the proposed SATr block can provide an almost free boost to lesion detection accuracy without extra hyperparameters or special network designs.

Submitted 12 March, 2022; originally announced March 2022.

Comments: 11 pages, 3 figures

arXiv:2203.06967 [pdf, other]

Blind2Unblind: Self-Supervised Image Denoising with Visible Blind Spots

Authors: Zejin Wang, Jiazheng Liu, Guoqing Li, Hua Han

Abstract: Real noisy-clean pairs on a large scale are costly and difficult to obtain. Meanwhile, supervised denoisers trained on synthetic data perform poorly in practice. Self-supervised denoisers, which learn only from single noisy images, solve the data collection problem. However, self-supervised denoising methods, especially blindspot-driven ones, suffer sizable information loss during input or network design. The absence of valuable information dramatically reduces the upper bound of denoising performance. In this paper, we propose a simple yet efficient approach called Blind2Unblind to overcome the information loss in blindspot-driven denoising methods. First, we introduce a global-aware mask mapper that enables global perception and accelerates training. The mask mapper samples all pixels at blind spots on denoised volumes and maps them to the same channel, allowing the loss function to optimize all blind spots at once. Second, we propose a re-visible loss to train the denoising network and make blind spots visible. The denoiser can learn directly from raw noise images without losing information or being trapped in identity mapping. We also theoretically analyze the convergence of the re-visible loss. Extensive experiments on synthetic and real-world datasets demonstrate the superior performance of our approach compared to previous work. Code is available at https://github.com/demonsjin/Blind2Unblind.

Submitted 7 May, 2023; v1 submitted 14 March, 2022; originally announced March 2022.

Comments: Accepted to CVPR2022

arXiv:2202.11918 [pdf, other]

Phase Continuity: Learning Derivatives of Phase Spectrum for Speech Enhancement

Authors: Doyeon Kim, Hyewon Han, Hyeon-Kyeong Shin, Soo-Whan Chung, Hong-Goo Kang

Abstract: Modern neural speech enhancement models usually include various forms of phase information in their training loss terms, either explicitly or implicitly. However, these loss terms are typically designed to reduce the distortion of phase spectrum values at specific frequencies, which ensures they do not significantly affect the quality of the enhanced speech. In this paper, we propose an effective phase reconstruction strategy for neural speech enhancement that can operate in noisy environments. Specifically, we introduce a phase continuity loss that considers relative phase variations across the time and frequency axes. By including this phase continuity loss in a state-of-the-art neural speech enhancement system trained with reconstruction loss and a number of magnitude spectral losses, we show that our proposed method further improves the quality of enhanced speech signals over the baseline, especially when training is done jointly with a magnitude spectrum loss.

Submitted 24 February, 2022; originally announced February 2022.

Comments: Accepted by ICASSP 2022

arXiv:2112.08929 [pdf, other]

doi 10.1109/ACCESS.2021.3137190

Bootstrap Equilibrium and Probabilistic Speaker Representation Learning for Self-supervised Speaker Verification

Authors: Sung Hwan Mun, Min Hyun Han, Dongjune Lee, Jihwan Kim, Nam Soo Kim

Abstract: In this paper, we propose self-supervised speaker representation learning strategies, which consist of a bootstrap equilibrium speaker representation learning in the front-end and an uncertainty-aware probabilistic speaker embedding training in the back-end. In the front-end stage, we learn the speaker representations via the bootstrap training scheme with the uniformity regularization term. In the back-end stage, the probabilistic speaker embeddings are estimated by maximizing the mutual likelihood score between the speech samples belonging to the same speaker, which provide not only speaker representations but also data uncertainty. Experimental results show that the proposed bootstrap equilibrium training strategy can effectively help learn the speaker representations and outperforms the conventional methods based on contrastive learning. Also, we demonstrate that the integrated two-stage framework further improves the speaker verification performance on the VoxCeleb1 test set in terms of EER and MinDCF.

Submitted 24 December, 2021; v1 submitted 16 December, 2021; originally announced December 2021.

Comments: Accepted by IEEE Access

arXiv:2111.00428 [pdf, other]

Reconfigurable Intelligent Surface-induced Randomness for mmWave Key Generation

Authors: Shubo Yang, Han Han, Yihong Liu, Weisi Guo, Zhibo Pang, Lei Zhang

Abstract: Secret key generation in physical layer security exploits the unpredictable random nature of wireless channels. The millimeter-wave (mmWave) channels have limited multipath and channel randomness in static environments. In this paper, for mmWave secret key generation of physical layer security, we use a reconfigurable intelligent surface (RIS) to induce randomness directly in wireless environments, without adding complexity to transceivers. We consider RIS to have continuous individual phase shifts (CIPS) and derive the RIS-assisted reflection channel distribution with its parameters. Then, we propose continuous group phase shifts (CGPS) to increase the randomness specifically at legal parties. Since the continuous phase shifts are expensive to implement, we analyze discrete individual phase shifts (DIPS) and derive the corresponding channel distribution, which is dependent on the quantization bit. We then derive the secret key rate (SKR) to evaluate the randomness performance. With the simulation results verifying the analytical results, this work explains the mathematical principles and lays a foundation for future mmWave evaluation and optimization of artificial channel randomness.

Submitted 8 August, 2022; v1 submitted 31 October, 2021; originally announced November 2021.

Comments: Add contents, including continuous group phase shifts and secret key rate analysis

arXiv:2106.15345 [pdf, other]

Where is the disease? Semi-supervised pseudo-normality synthesis from an abnormal image

Authors: Yuanqi Du, Quan Quan, Hu Han, S. Kevin Zhou

Abstract: Pseudo-normality synthesis, which computationally generates a pseudo-normal image from an abnormal one (e.g., with lesions), is critical in many perspectives, from lesion detection, data augmentation to clinical surgery suggestion. However, it is challenging to generate high-quality pseudo-normal images in the absence of the lesion information. Thus, expensive lesion segmentation data have been introduced to provide lesion information for the generative models and improve the quality of the synthetic images. In this paper, we aim to alleviate the need of a large amount of lesion segmentation data when generating pseudo-normal images. We propose a Semi-supervised Medical Image generative LEarning network (SMILE) which not only utilizes limited medical images with segmentation masks, but also leverages massive medical images without segmentation masks to generate realistic pseudo-normal images. Extensive experiments show that our model outperforms the best state-of-the-art model by up to 6% for data augmentation task and 3% in generating high-quality images. Moreover, the proposed semi-supervised learning achieves comparable medical image synthesis quality with supervised learning model, using only 50 of segmentation data.

Submitted 24 June, 2021; originally announced June 2021.

arXiv:2106.06455 [pdf, ps, other]

Certifying the LTL Formula p Until q in Hybrid Systems

Authors: Hyejin Han, Mohamed Maghenem, Ricardo G. Sanfelice

Abstract: In this paper, we propose sufficient conditions to guarantee that a linear temporal logic (LTL) formula of the form p Until q, denoted by $p \mathcal{U} q$, is satisfied for a hybrid system. Roughly speaking, the formula $p \mathcal{U} q$ is satisfied means that the solutions, initially satisfying proposition p, keep satisfying this proposition until proposition q is satisfied. To certify such a formula, connections to invariance notions such as conditional invariance (CI) and eventual conditional invariance (ECI), as well as finite-time attractivity (FTA) are established. As a result, sufficient conditions involving the data of the hybrid system and an appropriate choice of Lyapunov-like functions, such as barrier functions, are derived. The considered hybrid system is given in terms of differential and difference inclusions, which capture the continuous and the discrete dynamics present in the same system, respectively. Examples illustrate the results throughout the paper.

Submitted 17 August, 2022; v1 submitted 11 June, 2021; originally announced June 2021.

Comments: 21 pages. The technical report accompanying "Certifying the LTL Formula p Until q in Hybrid Systems" submitted to IEEE Transactions on Automatic Control, 2021

arXiv:2103.00829 [pdf, ps, other]

6G Downlink Transmission via Rate Splitting Space Division Multiple Access Based on Grouped Code Index Modulation

Authors: Wenchao Zhai, Yishan Wu, Jun Zhao, Huimei Han

Abstract: A novel rate splitting space division multiple access (SDMA) scheme based on grouped code index modulation (GrCIM) is proposed for the sixth generation (6G) downlink transmission. The proposed RSMA-GrCIM scheme transmits information to multiple user equipments (UEs) through the space division multiple access (SDMA) technique, and exploits code index modulation for rate splitting. Since the CIM scheme conveys information bits via the index of the selected Walsh code and binary phase shift keying (BPSK) signal, our RSMA scheme transmits the private messages of each user through the indices, and the common messages via the BPSK signal. Moreover, the Walsh code set is grouped into several orthogonal subsets to eliminate the interference from other users. A maximum likelihood (ML) detector is used to recover the source bits, and a mathematical analysis is provided for the upper bound bit error ratio (BER) of each user. Comparisons are also made between our proposed scheme and the traditional SDMA scheme in spectrum utilization, number of available UEs, etc. Numerical results are given to verify the effectiveness of the proposed SDMA-GrCIM scheme.

Submitted 1 March, 2021; originally announced March 2021.

arXiv:2101.06421 [pdf, ps, other]

Smart City Enabled by 5G/6G Networks: An Intelligent Hybrid Random Access Scheme

Authors: Huimei Han, Wenchao Zhai, Jun Zhao

Abstract: The Internet of Things (IoT) is the enabler for smart city to achieve the vision of the "Internet of Everything" by intelligently connecting devices without human interventions. The explosive growth of IoT devices makes the amount of business data generated by machine-type communications (MTC) account for a great proportion in all communication services. The fifth-generation (5G) specification for cellular networks defines two types of application scenarios for MTC: One is massive machine type communications (mMTC) requiring massive connections, while the other is ultra-reliable low latency communications (URLLC) requiring high reliability and low latency communications. 6G, as the next generation beyond 5G, will have even stronger scales of mMTC and URLLC. mMTC and URLLC will co-exist in MTC networks for 5G/6G-enabled smart city. To enable massive and reliable LLC access to such heterogeneous MTC networks where mMTC and URLLC co-exist, in this article, we introduce the network architecture of heterogeneous MTC networks, and propose an intelligent hybrid random access scheme for 5G/6G-enabled smart city. Numerical results show that, compared to the benchmark schemes, the proposed scheme significantly improves the successful access probability, and satisfies the diverse quality of services requirements of URLLC and mMTC devices.

Submitted 5 May, 2022; v1 submitted 16 January, 2021; originally announced January 2021.

Comments: arXiv admin note: substantial text overlap with arXiv:2012.13537

arXiv:2012.13539 [pdf, ps, other]

A GCICA Grant-Free Random Access Scheme for M2M Communications in Crowded Massive MIMO Systems

Authors: Huimei Han, Lushun Fang, Weidang Lu, Wenchao Zhai, Ying Li, Jun Zhao

Abstract: A high success rate of grant-free random access scheme is proposed to support massive access for machine-to-machine communications in massive multiple-input multiple-output systems. This scheme allows active user equipments (UEs) to transmit their modulated uplink messages along with super pilots consisting of multiple sub-pilots to a base station (BS). Then, the BS performs channel state information (CSI) estimation and uplink message decoding by utilizing a proposed graph combined clustering independent component analysis (GCICA) decoding algorithm, and then employs the estimated CSIs to detect active UEs by utilizing the characteristic of asymptotic favorable propagation of massive MIMO channel. We call this proposed scheme the GCICA-based random access (GCICA-RA) scheme. We analyze the successful access probability, missed detection probability, and uplink throughput of the GCICA-RA scheme. Numerical results show that the GCICA-RA scheme significantly improves the successful access probability and uplink throughput, decreases missed detection probability, and provides low CSI estimation error at the same time.

Submitted 25 December, 2020; originally announced December 2020.

arXiv:2012.13537 [pdf, ps, other]

An LSTM-Aided Hybrid Random Access Scheme for 6G Machine Type Communication Networks

Authors: Wenchao Zhai, Huimei Han, Lei Liu, Jun Zhao

Abstract: In this paper, an LSTM-aided hybrid random access scheme (LSTMH-RA) is proposed to support diverse quality of service (QoS) requirements in 6G machine-type communication (MTC) networks, where massive MTC (mMTC) devices and ultra-reliable low latency communications (URLLC) devices coexist. In the proposed LSTMH-RA scheme, mMTC devices access the network via a timing advance (TA)-aided four-step procedure to meet massive access requirement, while the access procedure of the URLLC devices is completed in two steps coupled with the mMTC devices' access procedure to reduce latency. Furthermore, we propose an attention-based LSTM prediction model to predict the number of active URLLC devices, thereby determining the parameters of the multi-user detection algorithm to guarantee the latency and reliability access requirements of URLLC devices. We analyze the successful access probability of the LSTMH-RA scheme. Numerical results show that, compared with the benchmark schemes, the proposed LSTMH-RA scheme can significantly improve the successful access probability, and thus satisfy the diverse QoS requirements of URLLC and mMTC devices.

Submitted 29 July, 2022; v1 submitted 25 December, 2020; originally announced December 2020.

arXiv:2010.11433 [pdf, other]

Unsupervised Representation Learning for Speaker Recognition via Contrastive Equilibrium Learning

Authors: Sung Hwan Mun, Woo Hyun Kang, Min Hyun Han, Nam Soo Kim

Abstract: In this paper, we propose a simple but powerful unsupervised learning method for speaker recognition, namely Contrastive Equilibrium Learning (CEL), which increases the uncertainty on nuisance factors latent in the embeddings by employing the uniformity loss. Also, to preserve speaker discriminability, a contrastive similarity loss function is used together. Experimental results showed that the proposed CEL significantly outperforms the state-of-the-art unsupervised speaker verification systems and the best performing model achieved 8.01% and 4.01% EER on VoxCeleb1 and VOiCES evaluation sets, respectively. On top of that, the performance of the supervised speaker embedding networks trained with initial parameters pre-trained via CEL showed better performance than those trained with randomly initialized parameters.

Submitted 22 October, 2020; originally announced October 2020.

Comments: 5 pages, 1 figure, 4 tables

arXiv:2010.11408 [pdf, ps, other]

Robust Text-Dependent Speaker Verification via Character-Level Information Preservation for the SdSV Challenge 2020

Authors: Sung Hwan Mun, Woo Hyun Kang, Min Hyun Han, Nam Soo Kim

Abstract: This paper describes our submission to Task 1 of the Short-duration Speaker Verification (SdSV) challenge 2020. Task 1 is a text-dependent speaker verification task, where both the speaker and phrase are required to be verified. The submitted systems were composed of TDNN-based and ResNet-based front-end architectures, in which the frame-level features were aggregated with various pooling methods (e.g., statistical, self-attentive, ghostVLAD pooling). Although the conventional pooling methods provide embeddings with a sufficient amount of speaker-dependent information, our experiments show that these embeddings often lack phrase-dependent information. To mitigate this problem, we propose new pooling and score compensation methods that leverage a CTC-based automatic speech recognition (ASR) model for taking the lexical content into account. Both methods showed improvement over the conventional techniques, and the best performance was achieved by fusing all the experimented systems, which showed 0.0785% MinDCF and 2.23% EER on the challenge's evaluation subset.

Submitted 21 October, 2020; originally announced October 2020.

Comments: Accepted in INTERSPEECH 2020

arXiv:2008.03024 [pdf, other]

doi 10.1109/ACCESS.2020.3012893

Disentangled speaker and nuisance attribute embedding for robust speaker verification

Authors: Woo Hyun Kang, Sung Hwan Mun, Min Hyun Han, Nam Soo Kim

Abstract: Over the recent years, various deep learning-based embedding methods have been proposed and have shown impressive performance in speaker verification. However, as in most of the classical embedding techniques, the deep learning-based methods are known to suffer from severe performance degradation when dealing with speech samples with different conditions (e.g., recording devices, emotional states). In this paper, we propose a novel fully supervised training method for extracting a speaker embedding vector disentangled from the variability caused by the nuisance attributes. The proposed framework was compared with the conventional deep learning-based embedding methods using the RSR2015 and VoxCeleb1 datasets. Experimental results show that the proposed approach can extract speaker embeddings robust to channel and emotional variability.

Submitted 7 August, 2020; originally announced August 2020.

Comments: Accepted in IEEE Access

arXiv:2008.01698 [pdf, other]

MIRNet: Learning multiple identities representations in overlapped speech

Authors: Hyewon Han, Soo-Whan Chung, Hong-Goo Kang

Abstract: Many approaches can derive information about a single speaker's identity from the speech by learning to recognize consistent characteristics of acoustic parameters. However, it is challenging to determine identity information when there are multiple concurrent speakers in a given signal. In this paper, we propose a novel deep speaker representation strategy that can reliably extract multiple speaker identities from an overlapped speech. We design a network that can extract a high-level embedding that contains information about each speaker's identity from a given mixture. Unlike conventional approaches that need reference acoustic features for training, our proposed algorithm only requires the speaker identity labels of the overlapped speech segments. We demonstrate the effectiveness and usefulness of our algorithm in a speaker verification task and a speech separation system conditioned on the target speaker embeddings obtained through the proposed method.

Submitted 6 August, 2020; v1 submitted 4 August, 2020; originally announced August 2020.

Comments: Accepted in Interspeech 2020

arXiv:2007.10299 [pdf, other]

wav2shape: Hearing the Shape of a Drum Machine

Authors: Han Han, Vincent Lostanlen

Abstract: Disentangling and recovering physical attributes, such as shape and material, from a few waveform examples is a challenging inverse problem in audio signal processing, with numerous applications in musical acoustics as well as structural engineering. We propose to address this problem via a combination of time-frequency analysis and supervised machine learning. We start by synthesizing a dataset of sounds using the functional transformation method. Then, we represent each percussive sound in terms of its time-invariant scattering transform coefficients and formulate the parametric estimation of the resonator as multidimensional regression with a deep convolutional neural network. We interpolate scattering coefficients over the surface of the drum as a surrogate for potentially missing data, and study the response of the neural network to interpolated samples. Lastly, we resynthesize drum sounds from scattering coefficients, therefore paving the way towards a deep generative model of drum sounds whose latent variables are physically interpretable.

Submitted 20 July, 2020; originally announced July 2020.

Comments: 11 pages, 7 figures. To appear in the Proceedings of Forum Acusticum, Lyon (France), December 2020

arXiv:2007.09383 [pdf, other]

Bounding Maps for Universal Lesion Detection

Authors: Han Li, Hu Han, S. Kevin Zhou

Abstract: Universal Lesion Detection (ULD) in computed tomography plays an essential role in computer-aided diagnosis systems. Many detection approaches achieve excellent results for ULD using possible bounding boxes (or anchors) as proposals. However, empirical evidence shows that using anchor-based proposals leads to a high false-positive (FP) rate. In this paper, we propose a box-to-map method to represent a bounding box with three soft continuous maps with bounds in x-, y- and xy- directions. The bounding maps (BMs) are used in two-stage anchor-based ULD frameworks to reduce the FP rate. In the 1st stage of the region proposal network, we replace the sharp binary ground-truth label of anchors with the corresponding xy-direction BM, hence the positive anchors are now graded. In the 2nd stage, we add a branch that takes our continuous BMs in x- and y- directions for extra supervision of detailed locations. Our method, when embedded into three state-of-the-art two-stage anchor-based detection methods, brings a free detection accuracy improvement (e.g., a 1.68% to 3.85% boost of sensitivity at 4 FPs) without extra inference time.

Submitted 18 July, 2020; originally announced July 2020.

Comments: 11 pages, 4 figures

arXiv:2007.06370 [pdf, other]

A novel random access scheme for M2M communication in crowded asynchronous massive MIMO systems

Authors: Huimei Han, Wenchao Zhai, Zhefu Wu, Ying Li, Jun Zhao, Mingda Chen

Abstract: A new random access scheme is proposed to solve the intra-cell pilot collision for M2M communication in crowded asynchronous massive multiple-input multiple-output (MIMO) systems. The proposed scheme utilizes the proposed estimation of signal parameters via rotational invariance technique enhanced (ESPRIT-E) method to estimate the effective timing offsets, and then active UEs obtain their timing errors from the effective timing offsets for uplink message transmission. We analyze the mean squared error of the estimated effective timing offsets of UEs, and the uplink throughput. Simulation results show that, compared to the existing random access scheme for the crowded asynchronous massive MIMO systems, the proposed scheme can improve the uplink throughput and estimate the effective timing offsets accurately at the same time.

Submitted 13 July, 2020; originally announced July 2020.

arXiv:2004.14774 [pdf, other]

IROS 2019 Lifelong Robotic Vision Challenge -- Lifelong Object Recognition Report

Authors: Qi She, Fan Feng, Qi Liu, Rosa H. M. Chan, Xinyue Hao, Chuanlin Lan, Qihan Yang, Vincenzo Lomonaco, German I. Parisi, Heechul Bae, Eoin Brophy, Baoquan Chen, Gabriele Graffieti, Vidit Goel, Hyonyoung Han, Sathursan Kanagarajah, Somesh Kumar, Siew-Kei Lam, Tin Lun Lam, Liang Ma, Davide Maltoni, Lorenzo Pellegrini, Duvindu Piyasena, Shiliang Pu, Debdoot Sheet, et al. (11 additional authors not shown)

Abstract: This report summarizes IROS 2019-Lifelong Robotic Vision Competition (Lifelong Object Recognition Challenge) with methods and results from the top $8$ finalists (out of over $150$ teams). The competition dataset (L)ifel(O)ng (R)obotic V(IS)ion (OpenLORIS) - Object Recognition (OpenLORIS-object) is designed for driving lifelong/continual learning research and application in robotic vision domain, with everyday objects in home, office, campus, and mall scenarios. The dataset explicitly quantifies the variants of illumination, object occlusion, object size, camera-object distance/angles, and clutter information. Rules are designed to quantify the learning capability of the robotic vision system when faced with the objects appearing in the dynamic environments in the contest. Individual reports, dataset information, rules, and released source code can be found at the project homepage: "https://lifelong-robotic-vision.github.io/competition/".

Submitted 26 April, 2020; originally announced April 2020.

Comments: 9 pages, 11 figures, 3 tables, accepted into IEEE Robotics and Automation Magazine. arXiv admin note: text overlap with arXiv:1911.06487

arXiv:2001.02813 [pdf, ps, other]

doi 10.1109/LCOMM.2019.2952375

Design of QAM-FBMC Waveforms Considering MMSE Receiver

Authors: Hyungsik Han, Namshik Kim, Hyuncheol Park

Abstract: Due to its high spectral confinement characteristics and spectral efficiency, QAM-FBMC is considered a candidate waveform to replace CP-OFDM. QAM-FBMC has inevitable non-orthogonality both in time and frequency, and the system and filter must be well-designed to minimize the interferences. However, existing QAM-FBMC studies utilize a matched filter as the receiver filter, which is not suitable for a non-orthogonal system. Therefore, in this paper, we design the prototype filters considering the MMSE criterion, and propose a system providing the highest SINR in QAM-FBMC which cannot avoid non-orthogonality. In addition, we confirm that the proposed filters show better performance at the target SNR than the reference filters.

Submitted 8 January, 2020; originally announced January 2020.

Comments: 5 pages, 10 figures, Accepted to IEEE Communications Letters

arXiv:1912.03468 [pdf, ps, other]

Reconfigurable Intelligent Surface Aided Power Control for Physical-Layer Broadcasting

Authors: Huimei Han, Jun Zhao, Zehui Xiong, Dusit Niyato, Wenchao Zhai, Marco Di Renzo, Quoc-Viet Pham, Weidang Lu

Abstract: Reconfigurable intelligent surface (RIS), a recently introduced technology for future wireless communication systems, enhances the spectral and energy efficiency by intelligently adjusting the propagation conditions between a base station (BS) and mobile equipments (MEs). An RIS consists of many low-cost passive reflecting elements to improve the quality of the received signal. In this paper, we study the problem of power control at the BS for the RIS aided physical-layer broadcasting. Our goal is to minimize the transmit power at the BS by jointly designing the transmit beamforming at the BS and the phase shifts of the passive elements at the RIS. Furthermore, to help validate the proposed optimization methods, we derive lower bounds to quantify the average transmit power at the BS as a function of the number of MEs, the number of RIS elements, and the number of antennas at the BS. The simulation results demonstrated that the average transmit power at the BS is close to the lower bound in an RIS aided system, and is significantly lower than the average transmit power in conventional schemes without the RIS.

Submitted 5 March, 2022; v1 submitted 7 December, 2019; originally announced December 2019.

arXiv:1911.04283 [pdf, other]

Data Efficient Direct Speech-to-Text Translation with Modality Agnostic Meta-Learning

Authors: Sathish Indurthi, Houjeung Han, Nikhil Kumar Lakumarapu, Beomseok Lee, Insoo Chung, Sangha Kim, Chanwoo Kim

Abstract: End-to-end Speech Translation (ST) models have several advantages such as lower latency, smaller model size, and less error compounding over conventional pipelines that combine Automatic Speech Recognition (ASR) and text Machine Translation (MT) models. However, collecting large amounts of parallel data for ST task is more difficult compared to the ASR and MT tasks. Previous studies have proposed the use of transfer learning approaches to overcome the above difficulty. These approaches benefit from weakly supervised training data, such as ASR speech-to-transcript or MT text-to-text translation pairs. However, the parameters in these models are updated independently of each task, which may lead to sub-optimal solutions. In this work, we adopt a meta-learning algorithm to train a modality agnostic multi-task model that transfers knowledge from source tasks=ASR+MT to target task=ST where ST task severely lacks data. In the meta-learning phase, the parameters of the model are exposed to vast amounts of speech transcripts (e.g., English ASR) and text translations (e.g., English-German MT). During this phase, parameters are updated in such a way to understand speech, text representations, the relation between them, as well as act as a good initialization point for the target ST task. We evaluate the proposed meta-learning approach for ST tasks on English-German (En-De) and English-French (En-Fr) language pairs from the Multilingual Speech Translation Corpus (MuST-C). Our method outperforms the previous transfer learning approaches and sets new state-of-the-art results for En-De and En-Fr ST tasks by obtaining 9.18 and 11.76 BLEU point improvements, respectively.

Submitted 27 April, 2020; v1 submitted 11 November, 2019; originally announced November 2019.

Comments: ICASSP 2020

arXiv:1910.14383 [pdf, ps, other]

Intelligent Reflecting Surface Aided Network: Power Control for Physical-Layer Broadcasting

Authors: Huimei Han, Jun Zhao, Dusit Niyato, Marco Di Renzo, Quoc-Viet Pham

Abstract: As a recently proposed idea for future wireless systems, intelligent reflecting surface (IRS) can assist communications between entities which do not have high-quality direct channels in between. Specifically, an IRS comprises many low-cost passive elements, each of which reflects the incident signal by incurring a phase change so that the reflected signals add coherently at the receiver. In this paper, for an IRS-aided wireless network, we study the problem of power control at the base station (BS) for physical-layer broadcasting under quality of service (QoS) constraints at mobile users, by jointly designing the transmit beamforming at the BS and the phase shifts of the IRS units. Furthermore, we derive a lower bound of the minimum transmit power at the BS to present the performance bound for optimization methods. Simulation results show that the transmit power at the BS approaches the lower bound with the increase of the number of IRS units, and is much lower than that of the communication system without IRS.

Submitted 27 January, 2020; v1 submitted 31 October, 2019; originally announced October 2019.

Comments: This paper appears in the Proceedings of IEEE International Conference on Communications (ICC) 2020

Showing 1–50 of 61 results for author: Han, H