Search | arXiv e-print repository

wav2graph: A Framework for Supervised Learning Knowledge Graph from Speech

Authors: Khai Le-Duc, Quy-Anh Dang, Tan-Hanh Pham, Truong-Son Hy

Abstract: Knowledge graphs (KGs) enhance the performance of large language models (LLMs) and search engines by providing structured, interconnected data that improves reasoning and context-awareness. However, KGs only focus on text data, thereby neglecting other modalities such as speech. In this work, we introduce wav2graph, the first framework for supervised learning of knowledge graphs from speech data. Our pipeline is straightforward: (1) constructing a KG based on transcribed spoken utterances and a named entity database, (2) converting the KG into embedding vectors, and (3) training graph neural networks (GNNs) for node classification and link prediction tasks. Through extensive experiments conducted in inductive and transductive learning contexts using state-of-the-art GNN models, we provide baseline results and error analysis for node classification and link prediction tasks on human transcripts and automatic speech recognition (ASR) transcripts, including evaluations using both encoder-based and decoder-based node embeddings, as well as monolingual and multilingual acoustic pre-trained models. All related code, data, and models are published online.

Submitted 7 August, 2024; originally announced August 2024.

Comments: Preprint, 32 pages

arXiv:2408.02990 [pdf, ps, other]

Joint Design of Probabilistic Constellation Shaping and Precoding for Multi-user VLC Systems

Authors: Thang K. Nguyen, Thanh V. Pham, Hoang D. Le, Chuyen T. Nguyen, Anh T. Pham

Abstract: This paper proposes a joint design of probabilistic constellation shaping (PCS) and precoding to enhance the sum-rate performance of multi-user visible light communications (VLC) broadcast channels subject to a signal amplitude constraint. In the proposed design, the transmission probabilities of bipolar $M$-pulse amplitude modulation ($M$-PAM) symbols for each user and the transmit precoding matrix are jointly optimized to improve the sum-rate performance. The joint design problem is shown to be a complex non-convex problem due to the non-convexity of the objective function. To tackle the problem, the firefly algorithm (FA), a nature-inspired heuristic optimization approach, is employed to find a local optimum of the original non-convex optimization problem. The FA-based approach, however, suffers from high computational complexity. Therefore, we propose a low-complexity design based on zero-forcing (ZF) precoding, which is solved using an alternating optimization (AO) approach. Simulation results reveal that the proposed joint design with PCS significantly improves the sum-rate performance compared to the conventional design with uniform signaling. Some insights into the optimal symbol distributions of the two joint design approaches are also provided.

Submitted 6 August, 2024; originally announced August 2024.

arXiv:2408.02982 [pdf, ps, other]

Practical Design of Probabilistic Constellation Shaping for Physical Layer Security in Visible Light Communications

Authors: Thanh V. Pham, Susumu Ishihara

Abstract: This paper studies a practical design of probabilistic constellation shaping (PCS) for physical layer security in visible light communications (VLC). In particular, we consider a wiretap VLC channel employing a probabilistically shaped $M$-ary pulse amplitude modulation (PAM) constellation. Considering the requirements for reliability of the legitimate user's channel, flickering-free transmission, and symmetric constellation distribution, the optimal constellation distributions to maximize the modulation-constrained secrecy capacity or the bit error rate (BER) of the eavesdropper's channel are investigated for both scenarios of known and unknown eavesdropper's channel state information (CSI). To formulate the constraint on the channel reliability, tractable closed-form expressions for the upper bound and approximate BER of $M$-ary PAM under an arbitrary symbol probability are derived. The design problem is shown to be non-convex due to the non-convex BER constraint. By proving that the upper bound BER is a concave function of the constellation distribution, a suboptimal solution based on the convex-concave procedure (CCCP) is presented. Our findings reveal that while uniform signaling can only satisfy the BER constraint when the optical power is beyond a certain value, the proposed PCS design works in the entire region of the optical power. Some insights into the optimal constellation distribution with respect to the emitted optical power are also discussed.

Submitted 6 August, 2024; originally announced August 2024.

arXiv:2407.12064 [pdf, other]

LiteGPT: Large Vision-Language Model for Joint Chest X-ray Localization and Classification Task

Authors: Khai Le-Duc, Ryan Zhang, Ngoc Son Nguyen, Tan-Hanh Pham, Anh Dao, Ba Hung Ngo, Anh Totti Nguyen, Truong-Son Hy

Abstract: Vision-language models have been extensively explored across a wide range of tasks, achieving satisfactory performance; however, their application in medical imaging remains underexplored. In this work, we propose a unified framework - LiteGPT - for medical imaging. We leverage multiple pre-trained visual encoders to enrich information and enhance the performance of vision-language models. To the best of our knowledge, this is the first study to utilize vision-language models for the novel task of joint localization and classification in medical images. In addition, we are the first to provide baselines for disease localization in chest X-rays. Finally, we set new state-of-the-art performance in the image classification task on the well-benchmarked VinDr-CXR dataset. All code and models are publicly available online: https://github.com/leduckhai/LiteGPT

Submitted 15 July, 2024; originally announced July 2024.

Comments: Preprint, 19 pages

arXiv:2407.01963 [pdf, other]

Towards Unsupervised Speaker Diarization System for Multilingual Telephone Calls Using Pre-trained Whisper Model and Mixture of Sparse Autoencoders

Authors: Phat Lam, Lam Pham, Tin Nguyen, Hieu Tang, Thinh Pham, Loi Khanh Nguyen, Alexander Schindler

Abstract: Existing speaker diarization systems heavily rely on large amounts of manually annotated data, which is labor-intensive and challenging to collect in real-world scenarios. Additionally, the language-specific constraint in speaker diarization systems significantly hinders their applicability and scalability in multilingual settings. In this paper, we therefore propose a cluster-based speaker diarization system for multilingual telephone call applications. The proposed system supports multiple languages and does not require large-scale annotated data for training, because it leverages the multilingual Whisper model to extract speaker embeddings and uses a novel Mixture of Sparse Autoencoders (Mix-SAE) network architecture for unsupervised speaker clustering. Experimental results on the evaluation dataset derived from two-speaker subsets of the CALLHOME and CALLFRIEND telephonic speech corpora demonstrate the superior efficiency of the proposed Mix-SAE network over other autoencoder-based clustering methods. The overall performance of our proposed system also indicates the promising potential of our approach for developing unsupervised multilingual speaker diarization applications with limited annotated data and for integration into comprehensive multi-task speech analysis systems (i.e., multiple tasks of speech-to-text, language detection, and speaker diarization integrated in a low-complexity system).

Submitted 7 July, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

Comments: 8 pages, 7 figures

arXiv:2402.15677 [pdf, other]

Consensus seeking in diffusive multidimensional networks with a repeated interaction pattern and time-delays

Authors: Hoang Huy Vu, Quyen Ngoc Nguyen, Chuong Van Nguyen, Tuynh Van Pham, Minh Hoang Trinh

Abstract: This paper studies a consensus problem in multidimensional networks having the same agent-to-agent interaction pattern under both intra- and cross-layer time delays. Several conditions for the agents to globally asymptotically achieve a consensus are derived, which involve the overall network's structure, the local interacting pattern, and the values of the time delays. The validity of these conditions is proved by direct eigenvalue evaluation and supported by numerical simulations.

Submitted 23 February, 2024; originally announced February 2024.

Comments: 6 pages, 7 figures, submitted to a journal

arXiv:2402.13554 [pdf, ps, other]

Secrecy Performance Analysis of Space-to-Ground Optical Satellite Communications

Authors: Thang V. Nguyen, Thanh V. Pham, Anh T. Pham, Dang T. Ngoc

Abstract: Free-space optics (FSO)-based satellite communication systems have recently received considerable attention due to their enhanced capacity compared to their radio frequency (RF) counterparts. This paper analyzes the performance of physical layer security of space-to-ground intensity modulation/direct detection FSO satellite links under the effect of atmospheric loss, misalignment, cloud attenuation, and atmospheric turbulence-induced fading. Specifically, a wiretap channel consisting of a legitimate transmitter Alice (i.e., the satellite), a legitimate user Bob, and an eavesdropper Eve over turbulence channels modeled by the Fisher-Snedecor $\mathcal{F}$ distribution is considered. The secrecy performance in terms of the average secrecy capacity, secrecy outage probability, and strictly positive secrecy capacity is derived in closed form. Simulation results reveal significant impacts of satellite altitude, zenith angle, and turbulence strength on the secrecy performance.

Submitted 21 February, 2024; originally announced February 2024.

arXiv:2402.13549 [pdf, ps, other]

Q-learning-based Joint Design of Adaptive Modulation and Precoding for Physical Layer Security in Visible Light Communications

Authors: Duc M. T. Hoang, Thanh V. Pham, Anh T. Pham, Chuyen T Nguyen

Abstract: There has been an increasing interest in physical layer security (PLS), which, compared with conventional cryptography, offers a unique approach to guaranteeing information confidentiality against eavesdroppers. In this paper, we study a joint design of adaptive $M$-ary pulse amplitude modulation (PAM) and precoding, which aims to optimize wiretap visible-light channels' secrecy capacity and bit error rate (BER) performances. The proposed design is motivated by higher-order modulation, which results in better secrecy capacity at the expense of a higher BER. On the other hand, a proper precoding design, which can manipulate the received signal quality at the legitimate user and the eavesdropper, can also enhance secrecy performance and influence the BER. A reward function that considers the secrecy capacity and the BERs of the legitimate user's (Bob) and the eavesdropper's (Eve) channels is introduced and maximized. Due to the non-linearity and complexity of the reward function, it is challenging to solve the optical design using classical optimization techniques. Therefore, reinforcement learning-based designs using Q-learning and Deep Q-learning are proposed to maximize the reward function. Simulation results verify that compared with the baseline designs, the proposed joint designs achieve better reward values while maintaining the BER of Bob's channel (Eve's channel) well below (above) the pre-FEC (forward error correction) BER threshold.

Submitted 21 February, 2024; originally announced February 2024.

arXiv:2402.06226 [pdf]

N-1 Reduced Optimal Power Flow Using Augmented Hierarchical Graph Neural Network

Authors: Thuan Pham, Xingpeng Li

Abstract: Optimal power flow (OPF) is used to perform generation redispatch in power system real-time operations. N-1 OPF can ensure safe grid operations under diverse contingency scenarios. For large and intricate power networks with numerous variables and constraints, achieving an optimal solution for real-time N-1 OPF necessitates substantial computational resources. To mitigate this challenge, machine learning (ML) is introduced as an additional tool for predicting congested or heavily loaded lines dynamically. In this paper, an advanced ML model known as the augmented hierarchical graph neural network (AHGNN) is proposed to predict critical congested lines and create an N-1 reduced OPF (N-1 ROPF). The proposed AHGNN-enabled N-1 ROPF can result in a remarkable reduction in computing time while retaining the solution quality. Several variations of GNN-based ML models are also implemented as benchmarks to demonstrate the effectiveness of the proposed AHGNN approach. Case studies show that the proposed AHGNN and the associated N-1 ROPF are highly effective in reducing computation time while preserving solution quality, highlighting the promising potential of ML, particularly GNNs, in enhancing power system operations.

Submitted 9 February, 2024; originally announced February 2024.

arXiv:2401.05915 [pdf, other]

Neural Implicit Surface Reconstruction of Freehand 3D Ultrasound Volume with Geometric Constraints

Authors: Hongbo Chen, Logiraj Kumaralingam, Shuhang Zhang, Sheng Song, Fayi Zhang, Haibin Zhang, Thanh-Tu Pham, Edmond H. M. Lou, Kumaradevan Punithakumar, Yuyao Zhang, Lawrence H. Le, Rui Zheng

Abstract: Three-dimensional (3D) freehand ultrasound (US) is a widely used imaging modality that allows non-invasive imaging of medical anatomy without radiation exposure. Surface reconstruction of US volume is vital to acquire the accurate anatomical structures needed for modeling, registration, and visualization. However, traditional methods cannot produce a high-quality surface due to image noise. Despite improvements in smoothness, continuity, and resolution from deep learning approaches, research on surface reconstruction in freehand 3D US is still limited. This study introduces FUNSR, a self-supervised neural implicit surface reconstruction method to learn signed distance functions (SDFs) from US volumes. In particular, FUNSR iteratively learns the SDFs by moving the 3D queries sampled around volumetric point clouds to approximate the surface, guided by two novel geometric constraints: a sign consistency constraint and an on-surface constraint with adversarial learning. Our approach has been thoroughly evaluated across four datasets to demonstrate its adaptability to various anatomical structures, including a hip phantom dataset, two vascular datasets, and one publicly available prostate dataset. We also show that smooth and continuous representations greatly enhance the visual appearance of US data. Furthermore, we highlight the potential of our method to improve segmentation performance, and its robustness to noise distribution and motion perturbation.

Submitted 11 July, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

Comments: Preprint

arXiv:2312.12587 [pdf, other]

Real-Time Diagnostic Integrity Meets Efficiency: A Novel Platform-Agnostic Architecture for Physiological Signal Compression

Authors: Neel R Vora, Amir Hajighasemi, Cody T. Reynolds, Amirmohammad Radmehr, Mohamed Mohamed, Jillur Rahman Saurav, Abdul Aziz, Jai Prakash Veerla, Mohammad S Nasr, Hayden Lotspeich, Partha Sai Guttikonda, Thuong Pham, Aarti Darji, Parisa Boodaghi Malidarreh, Helen H Shang, Jay Harvey, Kan Ding, Phuc Nguyen, Jacob M Luber

Abstract: Head-based signals such as EEG, EMG, EOG, and ECG collected by wearable systems will play a pivotal role in clinical diagnosis, monitoring, and treatment of important brain disorders. However, the real-time transmission of the significant corpus of physiological signals over extended periods consumes substantial power and time, limiting the viability of battery-dependent physiological monitoring wearables. This paper presents a novel deep-learning framework employing a variational autoencoder (VAE) for physiological signal compression to reduce wearables' computational complexity and energy consumption. Our approach achieves an impressive compression ratio of 1:293 specifically for spectrogram data, surpassing state-of-the-art compression techniques such as JPEG2000, H.264, Discrete Cosine Transform (DCT), and Huffman Encoding, which do not excel in handling physiological signals. We validate the efficacy of the compression algorithms using physiological signals collected from real patients in the hospital and deploy the solution on commonly used embedded AI chips (i.e., ARM Cortex V8 and Jetson Nano). The proposed framework achieves a 91% seizure detection accuracy using XGBoost, confirming the approach's reliability, practicality, and scalability.

Submitted 4 January, 2024; v1 submitted 19 December, 2023; originally announced December 2023.

arXiv:2312.09422 [pdf, other]

Joint Alignment of Multivariate Quasi-Periodic Functional Data Using Deep Learning

Authors: Vi Thanh Pham, Jonas Bille Nielsen, Klaus Fuglsang Kofoed, Jørgen Tobias Kühl, Andreas Kryger Jensen

Abstract: The joint alignment of multivariate functional data plays an important role in various fields such as signal processing, neuroscience and medicine, including the statistical analysis of data from wearable devices. Traditional methods often ignore the phase variability and instead focus on the variability in the observed amplitude. We present a novel method for joint alignment of multivariate quasi-periodic functions using deep neural networks, decomposing, but retaining all the information in the data by preserving both phase and amplitude variability. Our proposed neural network uses a special activation of the output that builds on the unit simplex transformation, and we utilize a loss function based on the Fisher-Rao metric to train our model. Furthermore, our method is unsupervised and can provide an optimal common template function as well as subject-specific templates. We demonstrate our method on two simulated datasets and one real example, comprising data from 12-lead 10s electrocardiogram recordings.

Submitted 14 November, 2023; originally announced December 2023.

Comments: 28 pages, 6 figures

arXiv:2312.03196 [pdf, other]

Domain Invariant Representation Learning and Sleep Dynamics Modeling for Automatic Sleep Staging

Authors: Seungyeon Lee, Thai-Hoang Pham, Zhao Cheng, Ping Zhang

Abstract: Sleep staging has become a critical task in diagnosing and treating sleep disorders to prevent sleep-related diseases. With growing large-scale sleep databases, significant progress has been made toward automatic sleep staging. However, previous studies face critical problems: the heterogeneity of subjects' physiological signals, the inability to extract meaningful information from unlabeled data to improve predictive performance, the difficulty in modeling correlations between sleep stages, and the lack of an effective mechanism to quantify predictive uncertainty. In this study, we propose a neural network-based sleep staging model, DREAM, to learn domain-generalized representations from physiological signals and to model sleep dynamics. DREAM learns sleep-related and subject-invariant representations from diverse subjects' sleep signals and models sleep dynamics by capturing interactions between sequential signal segments and between sleep stages. We conducted a comprehensive empirical study to demonstrate the superiority of DREAM, including sleep stage prediction experiments, a case study, the usage of unlabeled data, and uncertainty. Notably, the case study validates DREAM's ability to learn a generalized decision function for new subjects, especially when there are differences between testing and training subjects. Uncertainty quantification shows that DREAM provides prediction uncertainty, making the model reliable and helping sleep experts in real-world applications.

Submitted 9 December, 2023; v1 submitted 5 December, 2023; originally announced December 2023.

arXiv:2311.18508 [pdf, other]

DifAugGAN: A Practical Diffusion-style Data Augmentation for GAN-based Single Image Super-resolution

Authors: Axi Niu, Kang Zhang, Joshua Tian Jin Tee, Trung X. Pham, Jinqiu Sun, Chang D. Yoo, In So Kweon, Yanning Zhang

Abstract: It is well known that the adversarial optimization of GAN-based image super-resolution (SR) methods makes the preceding SR model generate unpleasant and undesirable artifacts, leading to large distortion. We attribute the cause of such distortions to the poor calibration of the discriminator, which hampers its ability to provide meaningful feedback to the generator for learning high-quality images. To address this problem, we propose a simple but non-trivial diffusion-style data augmentation scheme for current GAN-based SR methods, known as DifAugGAN. It involves adapting the diffusion process in generative diffusion models to improve the calibration of the discriminator during training, motivated by the successes of data augmentation schemes in the field in achieving good calibration. Our DifAugGAN can be a Plug-and-Play strategy for current GAN-based SISR methods to improve the calibration of the discriminator and thus improve SR performance. Extensive experimental evaluations demonstrate the superiority of DifAugGAN over state-of-the-art GAN-based SISR methods across both synthetic and real-world datasets, showcasing notable advancements in both qualitative and quantitative results.

Submitted 30 November, 2023; originally announced November 2023.

arXiv:2311.11096 [pdf, other]

On the Out of Distribution Robustness of Foundation Models in Medical Image Segmentation

Authors: Duy Minh Ho Nguyen, Tan Ngoc Pham, Nghiem Tuong Diep, Nghi Quoc Phan, Quang Pham, Vinh Tong, Binh T. Nguyen, Ngan Hoang Le, Nhat Ho, Pengtao Xie, Daniel Sonntag, Mathias Niepert

Abstract: Constructing a robust model that can effectively generalize to test samples under distribution shifts remains a significant challenge in the field of medical imaging. The foundational models for vision and language, pre-trained on extensive sets of natural image and text data, have emerged as a promising approach. They showcase impressive learning abilities across different tasks with the need for only a limited amount of annotated samples. While numerous techniques have focused on developing better fine-tuning strategies to adapt these models for specific domains, we instead examine their robustness to domain shifts in the medical image segmentation task. To this end, we compare the generalization performance to unseen domains of various pre-trained models after being fine-tuned on the same in-distribution dataset and show that foundation-based models enjoy better robustness than other architectures. From here, we further developed a new Bayesian uncertainty estimation for frozen models and used them as an indicator to characterize the model's performance on out-of-distribution (OOD) data, proving particularly beneficial for real-world applications. Our experiments not only reveal the limitations of current indicators like accuracy on the line or agreement on the line commonly used in natural image applications but also emphasize the promise of the introduced Bayesian uncertainty. Specifically, lower uncertainty predictions usually tend to correspond to higher OOD performance.

Submitted 18 November, 2023; originally announced November 2023.

Comments: Advances in Neural Information Processing Systems (NeurIPS) 2023, Workshop on robustness of zero/few-shot learning in foundation models

arXiv:2310.09998 [pdf, other]

SeUNet-Trans: A Simple yet Effective UNet-Transformer Model for Medical Image Segmentation

Authors: Tan-Hanh Pham, Xianqi Li, Kim-Doang Nguyen

Abstract: Automated medical image segmentation is becoming increasingly crucial to modern clinical practice, driven by the growing demand for precise diagnosis, the push towards personalized treatment plans, and the advancements in machine learning algorithms, especially the incorporation of deep learning methods. While convolutional neural networks (CNN) have been prevalent among these methods, the remarkable potential of Transformer-based models for computer vision tasks is gaining more acknowledgment. To harness the advantages of both CNN-based and Transformer-based models, we propose a simple yet effective UNet-Transformer (seUNet-Trans) model for medical image segmentation. In our approach, the UNet model is designed as a feature extractor to generate multiple feature maps from the input images; the maps are then propagated into a bridge layer, which is introduced to sequentially connect the UNet and the Transformer. In this stage, we apply the pixel-level embedding technique without position embedding vectors, aiming to make the model more efficient. Moreover, we apply spatial-reduction attention in the Transformer to reduce the computational/memory overhead. By leveraging the UNet architecture and the self-attention mechanism, our model not only preserves both local and global context information but also is capable of capturing long-range dependencies between input elements. The proposed model is extensively evaluated on seven medical image segmentation datasets, including polyp segmentation, to demonstrate its efficacy. Comparison with several state-of-the-art segmentation models on these datasets shows the superior performance of our proposed seUNet-Trans network.

Submitted 10 November, 2023; v1 submitted 15 October, 2023; originally announced October 2023.

arXiv:2309.15483 [pdf, ps, other]

Energy-Efficient Precoding Designs for Multi-User Visible Light Communication Systems with Confidential Messages

Authors: Son T. Duong, Thanh V. Pham, Chuyen T. Nguyen, Anh T. Pham

Abstract: This paper studies energy-efficient precoding designs for multi-user visible light communication (VLC) systems from the perspective of physical layer security where users' messages must be kept mutually confidential. For such systems, we first derive a lower bound on the achievable secrecy rate of each user. Next, the total power consumption for illumination and data transmission is thoroughly analyzed. We then tackle the problem of maximizing energy efficiency, given that each user's secrecy rate satisfies a certain threshold. The design problem is shown to be a non-convex fractional program, which renders finding the optimal solution computationally prohibitive. Our aim in this paper is, therefore, to find sub-optimal yet low-complexity solutions. For this purpose, the traditional Dinkelbach algorithm is first employed to reformulate the original problem as a non-fractional parameterized one. Two different approaches based on the convex-concave procedure (CCCP) and Semidefinite Relaxation (SDR) are utilized to solve the non-convex parameterized problem. In addition, to further reduce the complexity, we investigate a design using the zero-forcing (ZF) technique. Numerical results are presented to show the feasibility, convergence, and performance of the proposed algorithms depending on different parameters of the system.

Submitted 27 September, 2023; originally announced September 2023.

arXiv:2309.14636 [pdf, ps, other]

Design of Energy-Efficient Artificial Noise for Physical Layer Security in Visible Light Communications

Authors: Thanh V. Pham, Anh T. Pham, Susumu Ishihara

Abstract: This paper studies the design of energy-efficient artificial noise (AN) schemes in the context of physical layer security in visible light communications (VLC). Two different transmission schemes termed $\textit{selective AN-aided single-input single-output (SISO)}$ and $\textit{AN-aided multiple-input single-output (MISO)}$ are examined and compared in terms of secrecy energy efficiency (SEE). In the former, the closest LED luminaire to the legitimate user (Bob) is the information-bearing signal's transmitter. At the same time, the rest of the luminaires act as jammers transmitting AN to degrade the channels of eavesdroppers (Eves). In the latter, the information-bearing signal and AN are combined and transmitted by all luminaires. When Eves' CSI is unknown, an indirect design to improve the SEE is formulated by maximizing Bob's channel's energy efficiency. A low-complexity design based on the zero-forcing criterion is also proposed. In the case of known Eves' CSI, we study the design that maximizes the minimum SEE among those corresponding to all eavesdroppers. At their respective optimal SEEs, simulation results reveal that when Eves' CSI is unknown, the selective AN-aided SISO transmission can achieve twice the SEE of the AN-aided MISO. In contrast, when Eves' CSI is known, the AN-aided MISO outperforms by 30%.

Submitted 25 September, 2023; originally announced September 2023.

arXiv:2307.09261 [pdf, other]

Optical Diffraction Tomography Meets Fluorescence Localization Microscopy

Authors: Thanh-An Pham, Emmanuel Soubies, Ferréol Soulez, Michael Unser

Abstract: We show that structural information can be extracted from single molecule localization microscopy (SMLM) data. More precisely, we reinterpret SMLM data as the measures of a phaseless optical diffraction tomography system for which the illumination sources are fluorophores within the sample. Building upon this model, we propose a joint optimization framework to estimate both the refractive index map and the position of fluorescent molecules from the sole SMLM frames.

Submitted 18 July, 2023; originally announced July 2023.

Comments: Presented in ISCS23

Report number: ISCS-11

arXiv:2307.02043 [pdf, other]

A Mini-Batch Quasi-Newton Proximal Method for Constrained Total-Variation Nonlinear Image Reconstruction

Authors: Tao Hong, Thanh-an Pham, Irad Yavneh, Michael Unser

Abstract: Over the years, computational imaging with accurate nonlinear physical models has drawn considerable interest due to its ability to achieve high-quality reconstructions. However, such nonlinear models are computationally demanding. A popular choice for solving the corresponding inverse problems is accelerated stochastic proximal methods (ASPMs), with the caveat that each iteration is expensive. To overcome this issue, we propose a mini-batch quasi-Newton proximal method (BQNPM) tailored to image-reconstruction problems with total-variation regularization. It involves an efficient approach that computes a weighted proximal mapping at a cost similar to that of the proximal mapping in ASPMs. However, BQNPM requires fewer iterations than ASPMs to converge. We assess the performance of BQNPM on three-dimensional inverse-scattering problems with linear and nonlinear physical models. Our results on simulated and real data show the effectiveness and efficiency of BQNPM.

Submitted 16 August, 2024; v1 submitted 5 July, 2023; originally announced July 2023.

Comments: 12 pages, 12 figures, 2 tables

arXiv:2305.19709 [pdf, other]

XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech

Authors: Linh The Nguyen, Thinh Pham, Dat Quoc Nguyen

Abstract: We present XPhoneBERT, the first multilingual model pre-trained to learn phoneme representations for the downstream text-to-speech (TTS) task. Our XPhoneBERT has the same model architecture as BERT-base, trained using the RoBERTa pre-training approach on 330M phoneme-level sentences from nearly 100 languages and locales. Experimental results show that employing XPhoneBERT as an input phoneme encoder significantly boosts the performance of a strong neural TTS model in terms of naturalness and prosody and also helps produce fairly high-quality speech with limited training data. We publicly release our pre-trained XPhoneBERT with the hope that it would facilitate future research and downstream TTS applications for multiple languages. Our XPhoneBERT model is available at https://github.com/VinAIResearch/XPhoneBERT

Submitted 31 May, 2023; originally announced May 2023.

Comments: In Proceedings of INTERSPEECH 2023 (to appear)

arXiv:2305.19353 [pdf, other]

Bearing-Constrained Leader-Follower Formation of Single-Integrators with Disturbance Rejection: Adaptive Variable-Structure Approaches

Authors: Thanh Truong Nguyen, Dung Van Vu, Tuynh Van Pham, Minh Hoang Trinh

Abstract: This paper studies the problem of stabilizing a leader-follower formation specified by a set of bearing constraints and being disturbed by some unknown uniformly bounded disturbances. A set of leaders are positioned at their desired positions, while each follower is modeled by a single integrator with an additive time-varying disturbance. Adaptive variable-structure control laws using displacements or only bearing vectors are provided to stabilize the desired formation. Thanks to the adaptive mechanisms, the proposed control laws require neither information of the bearing Laplacian nor the disturbances' directions and upper bounds. It is further proved that when the leaders are moving with the same bounded uniformly continuous velocity, the moving target formation can still be achieved under the proposed control laws. Simulation results are also given to support the stability analysis.

Submitted 5 June, 2024; v1 submitted 30 May, 2023; originally announced May 2023.

Comments: 19 pages, 6 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2303.12337 [pdf, other]

Music-Driven Group Choreography

Authors: Nhat Le, Thang Pham, Tuong Do, Erman Tjiputra, Quang D. Tran, Anh Nguyen

Abstract: Music-driven choreography is a challenging problem with a wide variety of industrial applications. Recently, many methods have been proposed to synthesize dance motions from music for a single dancer. However, generating dance motion for a group remains an open problem. In this paper, we present $\rm AIOZ-GDANCE$, a new large-scale dataset for music-driven group dance generation. Unlike existing datasets that only support single dance, our new dataset contains group dance videos, hence supporting the study of group choreography. We propose a semi-autonomous labeling method with humans in the loop to obtain the 3D ground truth for our dataset. The proposed dataset consists of 16.7 hours of paired music and 3D motion from in-the-wild videos, covering 7 dance styles and 16 music genres. We show that naively applying a single-dance generation technique to create group dance motion may lead to unsatisfactory results, such as inconsistent movements and collisions between dancers. Based on our new dataset, we propose a new method that takes an input music sequence and a set of 3D positions of dancers to efficiently produce multiple group-coherent choreographies. We propose new evaluation metrics for measuring group dance quality and perform intensive experiments to demonstrate the effectiveness of our method. Our project facilitates future research on group dance generation and is available at: https://aioz-ai.github.io/AIOZ-GDANCE/

Submitted 26 March, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

Comments: accepted in CVPR 2023

arXiv:2302.12831 [pdf, other]

CDPMSR: Conditional Diffusion Probabilistic Models for Single Image Super-Resolution

Authors: Axi Niu, Kang Zhang, Trung X. Pham, Jinqiu Sun, Yu Zhu, In So Kweon, Yanning Zhang

Abstract: Diffusion probabilistic models (DPM) have been widely adopted in image-to-image translation to generate high-quality images. Prior attempts at applying the DPM to image super-resolution (SR) have shown that iteratively refining pure Gaussian noise with a conditional image, using a U-Net trained on denoising at various noise levels, can help obtain a satisfactory high-resolution image for the low-resolution one. To further improve the performance and simplify current DPM-based super-resolution methods, we propose a simple but non-trivial DPM-based super-resolution post-process framework, i.e., cDPMSR. After applying a pre-trained SR model on the to-be-tested LR image to provide the conditional input, we adapt the standard DPM to conduct conditional image generation and perform super-resolution through a deterministic iterative denoising process. Our method surpasses prior attempts on both qualitative and quantitative results and can generate more photo-realistic counterparts for the low-resolution images with various benchmark datasets including Set5, Set14, Urban100, BSD100, and Manga109. Code will be published upon acceptance.

Submitted 14 February, 2023; originally announced February 2023.

Comments: 4 pages, 4 figures

arXiv:2302.11125 [pdf, ps, other]

On the Design of Artificial Noise for Physical Layer Security in Visible Light Communication Channels with Clipping

Authors: Thanh V. Pham, Steve Hranilovic, Susumu Ishihara

Abstract: Though visible light communication (VLC) systems are contained to a given room, improving their security is an important criterion in any practical deployment. In this paper, the design of artificial noise (AN) to enhance physical layer security in VLC systems is studied in the context of input signals with no explicit amplitude constraint (e.g., multicarrier systems). In such systems, clipping is needed to constrain the input signals within the limited linear ranges of the LEDs. However, this clipping process gives rise to non-linear clipping distortion, which must be incorporated into the AN design. To facilitate the solution of this problem, a sub-optimal design approach is presented using the Charnes-Cooper transformation and the convex-concave procedure (CCP). Then, a novel AN transmission scheme is proposed to reduce the impact of clipping distortion, thus improving the secrecy performance. The proposed scheme exploits the common structure of LED luminaires, which are often composed of several light-emitting chips. Capitalizing on this property, LED chips in each luminaire are divided into two groups driven by separate driver circuits. One group is used to transmit the information-bearing signal, while the other group transmits the AN. Numerical results show that the clipping distortion significantly reduces the secrecy level, and using AN is advantageous over the no-AN scheme in improving the secrecy performance. Moreover, the proposed AN transmission scheme is shown to achieve considerable secrecy improvements compared with the traditional transmission approaches (e.g., about 1 bit/s/Hz improvement in the achievable secrecy rate when the standard deviation of the LEDs' modulating current is 0.25 A and the signal-to-interference-plus-noise ratio of the eavesdropper's received signal is limited to $0$ dB).

Submitted 21 February, 2023; originally announced February 2023.

Comments: arXiv admin note: text overlap with arXiv:2210.00438

arXiv:2210.15876 [pdf, ps, other]

Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition

Authors: Yist Y. Lin, Tao Han, Haihua Xu, Van Tung Pham, Yerbolat Khassanov, Tze Yuang Chong, Yi He, Lu Lu, Zejun Ma

Abstract: One limitation of the end-to-end automatic speech recognition (ASR) framework is that its performance is compromised if train-test utterance lengths are mismatched. In this paper, we propose an on-the-fly random utterance concatenation (RUC) based data augmentation method to alleviate the train-test utterance length mismatch issue for the short-video ASR task. Specifically, we are motivated by observations that our human-transcribed training utterances tend to be much shorter for short-video spontaneous speech (~3 seconds on average), while our test utterances generated from the voice activity detection front-end are much longer (~10 seconds on average). Such a mismatch can lead to suboptimal performance. Empirically, we observe that the proposed RUC method significantly improves long-utterance recognition without a performance drop on short ones. Overall, it achieves a 5.72% word error rate reduction on average for 15 languages and improved robustness to various utterance lengths.

Submitted 25 May, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

Comments: 5 pages, 3 figures, 4 tables

arXiv:2206.13591 [pdf]

Reduced Optimal Power Flow Using Graph Neural Network

Authors: Thuan Pham, Xingpeng Li

Abstract: OPF problems are formulated and solved for power system operations, especially for determining generation dispatch points in real-time. For large and complex power system networks with large numbers of variables and constraints, finding the optimal solution for real-time OPF in a timely manner requires a massive amount of computing power. This paper presents a new method to reduce the number of constraints in the original OPF problem using a graph neural network (GNN). GNN is an innovative machine learning model that utilizes features from nodes, edges, and network topology to maximize its performance. In this paper, we propose a GNN model to predict which lines would be heavily loaded or congested with given load profiles and generation capacities. Only these critical lines will be monitored in an OPF problem, creating a reduced OPF (ROPF) problem. Significant savings in computing time are expected from the proposed ROPF model. A comprehensive analysis of predictions from the GNN model was also made. It is concluded that the application of GNN for ROPF is able to reduce computing time while retaining solution quality.

Submitted 27 June, 2022; originally announced June 2022.

Comments: 6 pages, 16 figures, 3 tables, Submitted (under review) to 54th North American Power Symposium (NAPS 2022)

arXiv:2205.03122 [pdf]

doi 10.1364/BOE.463057

Ultrathin, high-speed, all-optical photoacoustic endomicroscopy probe for guiding minimally invasive surgery

Authors: Tianrui Zhao, Truc Thuy Pham, Christian Baker, Michelle T. Ma, Sebastien Ourselin, Tom Vercauteren, Edward Zhang, Paul C. Beard, Wenfeng Xia

Abstract: Photoacoustic (PA) endoscopy has shown significant potential for clinical diagnosis and surgical guidance. Multimode fibres (MMFs) are becoming increasingly attractive for the development of miniature endoscopy probes owing to ultrathin size, low cost and diffraction-limited spatial resolution enabled by wavefront shaping. However, current MMF-based PA endomicroscopy probes are limited by either a bulky ultrasound detector or a low imaging speed, which hinders their usability. In this work, we report the development of a highly miniaturised and high-speed PA endomicroscopy probe that is integrated within the cannula of a 20 gauge medical needle. This probe comprises an MMF for delivering the PA excitation light and a single-mode optical fibre with a plano-concave microresonator for ultrasound detection. Wavefront shaping with a digital micromirror device enabled rapid raster-scanning of a focused light spot at the distal end of the MMF for tissue interrogation. High-resolution PA imaging of mouse red blood cells covering an area 100 microns in diameter was achieved with the needle probe at ~3 frames per second. Mosaicing imaging was performed after fibre characterisation by translating the needle probe to enlarge the field-of-view in real-time. The developed ultrathin PA endomicroscopy probe is promising for guiding minimally invasive surgery by providing functional, molecular and microstructural information of tissue in real-time.

Submitted 6 May, 2022; originally announced May 2022.

arXiv:2203.10078 [pdf, other]

Bayesian Inversion for Nonlinear Imaging Models using Deep Generative Priors

Authors: Pakshal Bohra, Thanh-an Pham, Jonathan Dong, Michael Unser

Abstract: Most modern imaging systems incorporate a computational pipeline to infer the image of interest from acquired measurements. The Bayesian approach to solve such ill-posed inverse problems involves the characterization of the posterior distribution of the image. It depends on the model of the imaging system and on prior knowledge on the image of interest. In this work, we present a Bayesian reconstruction framework for nonlinear imaging models where we specify the prior knowledge on the image through a deep generative model. We develop a tractable posterior-sampling scheme based on the Metropolis-adjusted Langevin algorithm for the class of nonlinear inverse problems where the forward model has a neural-network-like structure. This class includes most practical imaging modalities. We introduce the notion of augmented deep generative priors in order to suitably handle the recovery of quantitative images. We illustrate the advantages of our framework by applying it to two nonlinear imaging modalities: phase retrieval and optical diffraction tomography.

Submitted 25 May, 2023; v1 submitted 18 March, 2022; originally announced March 2022.

arXiv:2112.08418 [pdf]

doi 10.1109/GreenTech52845.2022.9772026

Neural Network-based Power Flow Model

Authors: Thuan Pham, Xingpeng Li

Abstract: Power flow analysis is used to evaluate the flow of electricity in the power system network. Power flow calculation is used to determine the steady-state variables of the system, such as the voltage magnitude/phase angle of each bus and the active/reactive power flow on each branch. The DC power flow model is a popular linear power flow model that is widely used in the power industry. Although it is fast and robust, it may lead to inaccurate line flow results for some transmission lines. Since renewable energy sources such as solar farms or offshore wind farms are usually located far away from the main grid, accurate line flow results on these critical lines are essential for power flow analysis due to the unpredictable nature of renewable energy. Data-driven methods can be used to partially address these inaccuracies by taking advantage of historical grid profiles. In this paper, a neural network (NN) model is trained to predict power flow results using historical power system data. Although the training process may take time, once trained, it is very fast to estimate line flows. A comprehensive performance analysis between the proposed NN-based power flow model and the traditional DC power flow model is conducted. It can be concluded that the proposed NN-based power flow model can find solutions quickly and more accurately than the DC power flow model.

Submitted 12 March, 2022; v1 submitted 15 December, 2021; originally announced December 2021.

Journal ref: IEEE Green Technologies Conference 2022

arXiv:2109.09026 [pdf, other]

Hybrid Data Augmentation and Deep Attention-based Dilated Convolutional-Recurrent Neural Networks for Speech Emotion Recognition

Authors: Nhat Truong Pham, Duc Ngoc Minh Dang, Sy Dzung Nguyen

Abstract: Speech emotion recognition (SER) has been one of the significant tasks in Human-Computer Interaction (HCI) applications. However, it is hard to choose the optimal features and deal with imbalanced labeled data. In this article, we investigate hybrid data augmentation (HDA) methods to generate and balance data based on traditional and generative adversarial networks (GAN) methods. To evaluate the effectiveness of HDA methods, a deep learning framework, namely ADCRNN, is designed by integrating deep dilated convolutional-recurrent neural networks with an attention mechanism. Besides, we choose 3D log Mel-spectrogram (MelSpec) features as the inputs for the deep learning framework. Furthermore, we reconfigure a loss function by combining a softmax loss and a center loss to classify the emotions. For validating our proposed methods, we use the EmoDB dataset that consists of several emotions with imbalanced samples. Experimental results show that the proposed methods achieve better accuracy than the state-of-the-art methods on the EmoDB with 87.12% and 88.47% for the traditional and GAN-based methods, respectively.

Submitted 18 September, 2021; originally announced September 2021.

Comments: 12 pages, 16 figures, 6 tables

arXiv:2109.03219 [pdf, other]

Fruit-CoV: An Efficient Vision-based Framework for Speedy Detection and Diagnosis of SARS-CoV-2 Infections Through Recorded Cough Sounds

Authors: Long H. Nguyen, Nhat Truong Pham, Van Huong Do, Liu Tai Nguyen, Thanh Tin Nguyen, Van Dung Do, Hai Nguyen, Ngoc Duy Nguyen

Abstract: SARS-CoV-2 is colloquially known as COVID-19 that had an initial outbreak in December 2019. The deadly virus has spread across the world, taking part in the global pandemic disease since March 2020. In addition, a recent variant of SARS-CoV-2 named Delta is intractably contagious and responsible for more than four million deaths over the world. Therefore, it is vital to possess a self-testing service of SARS-CoV-2 at home. In this study, we introduce Fruit-CoV, a two-stage vision framework, which is capable of detecting SARS-CoV-2 infections through recorded cough sounds. Specifically, we convert sounds into Log-Mel Spectrograms and use the EfficientNet-V2 network to extract their visual features in the first stage. In the second stage, we use 14 convolutional layers extracted from the large-scale Pretrained Audio Neural Networks for audio pattern recognition (PANNs) and the Wavegram-Log-Mel-CNN to aggregate feature representations of the Log-Mel Spectrograms. Finally, we use the combined features to train a binary classifier. In this study, we use a dataset provided by the AICovidVN 115M Challenge, which includes a total of 7371 recorded cough sounds collected throughout Vietnam, India, and Switzerland. Experimental results show that our proposed model achieves an AUC score of 92.8% and ranks first on the leaderboard of the AICovidVN Challenge. More importantly, our proposed framework can be integrated into a call center or a VoIP system to speed up detecting SARS-CoV-2 infections through online/recorded cough sounds.

Submitted 6 September, 2021; originally announced September 2021.

Comments: 4 pages

arXiv:2108.11089 [pdf, other]

Detecting Drill Failure in the Small Short-sound Drill Dataset

Authors: Thanh Tran, Nhat Truong Pham, Jan Lundgren

Abstract: Monitoring the conditions of machines is vital in the manufacturing industry. Early detection of faulty components in machines for stopping and repairing the failed components can minimize the downtime of the machine. This article presents an approach to detect the failure occurring in drill machines based on drill sounds from Valmet AB. The drill dataset includes three classes: anomalous sounds, normal sounds, and irrelevant sounds, which are also labeled as "Broken", "Normal", and "Other", respectively. Detecting drill failure effectively remains a challenge due to the following reasons. The waveform of drill sound is complex and short for detection. Additionally, in realistic soundscapes, there are sounds and noise in the context at the same time. Moreover, the balanced dataset is too small to apply state-of-the-art deep learning techniques. To overcome these difficulties, we augmented sounds to increase the number of sounds in the dataset. We then proposed a convolutional neural network (CNN) combined with a long short-term memory (LSTM) to extract features from log-Mel spectrograms and learn global high-level feature representation for the classification of three classes. A leaky rectified linear unit (Leaky ReLU) was utilized as the activation function for our proposed CNN instead of the rectified linear unit (ReLU). Moreover, we deployed an attention mechanism at the frame level after the LSTM layer to learn long-term global feature representations. As a result, the proposed method reached an overall accuracy of 92.35% for the drill failure detection system.

Submitted 9 November, 2021; v1 submitted 25 August, 2021; originally announced August 2021.

Comments: 8 pages, 10 figures, journal

arXiv:2108.00475 [pdf, other]

Self-supervised Learning with Local Attention-Aware Feature

Authors: Trung X. Pham, Rusty John Lloyd Mina, Dias Issa, Chang D. Yoo

Abstract: In this work, we propose a novel methodology for self-supervised learning for generating global and local attention-aware visual features. Our approach is based on training a model to differentiate between specific image transformations of an input sample and the patched images. Utilizing this approach, the proposed method is able to outperform the previous best competitor by 1.03% on the Tiny-ImageNet dataset and by 2.32% on the STL-10 dataset. Furthermore, our approach outperforms the fully-supervised learning method on the STL-10 dataset. Experimental results and visualizations show the capability of successfully learning global and local attention-aware visual representations.

Submitted 1 August, 2021; originally announced August 2021.

Comments: 5 pages, 4 figures

arXiv:2107.10701 [pdf, other]

Multitask-Based Joint Learning Approach To Robust ASR For Radio Communication Speech

Authors: Duo Ma, Nana Hou, Van Tung Pham, Haihua Xu, Eng Siong Chng

Abstract: To realize robust end-to-end Automatic Speech Recognition (E2E ASR) under radio communication conditions, we propose a multitask-based method to jointly train a Speech Enhancement (SE) module as the front-end and an E2E ASR model as the back-end. One advantage of the proposed method is that the entire system can be trained from scratch. Unlike in prior works, neither component here needs to undergo separate pre-training and fine-tuning processes. Through analysis, we found that the success of the proposed method lies in the following aspects. Firstly, multitask learning is essential; that is, the SE network not only learns to produce more intelligible speech, but also aims to generate speech that is beneficial to recognition. Secondly, we found that the speech phase preserved from noisy speech is critical for improving ASR performance. Thirdly, we propose a dual-channel data augmentation training method to obtain further improvement. Specifically, we combine the clean and enhanced speech to train the whole system. We evaluate the proposed method on the RATS English data set, achieving a relative WER reduction of 4.6% with the joint training method, and up to a relative WER reduction of 11.2% with the proposed data augmentation method.

Submitted 22 July, 2021; originally announced July 2021.

Comments: 7 pages, 3 figures, submitted to APSIPA 2021

arXiv:2107.03679 [pdf, other]

Diffraction Tomography with Helmholtz Equation: Efficient and Robust Multigrid-Based Solver

Authors: Tao Hong, Thanh-an Pham, Eran Treister, Michael Unser

Abstract: Diffraction tomography is a noninvasive technique that estimates the refractive indices of unknown objects and involves an inverse-scattering problem governed by the wave equation. Recent works have shown the benefit of nonlinear models of wave propagation that account for multiple scattering and reflections. In particular, the Lippmann-Schwinger (LiS) model defines an inverse problem to simulate the wave propagation. Although accurate, this model is hard to solve when the samples are highly contrasted or have a large physical size. In this work, we introduce instead a Helmholtz-based nonlinear model for inverse scattering. To solve the corresponding inverse problem, we propose a robust and efficient multigrid-based solver. Moreover, we show that our method is a suitable alternative to the LiS model, especially for strongly scattering objects. Numerical experiments on simulated and real data demonstrate the effectiveness of the Helmholtz model, as well as the efficiency of the proposed multigrid method.

Submitted 8 July, 2021; originally announced July 2021.

Comments: 12 pages, 13 figures, 2 tables

arXiv:2106.02865 [pdf, ps, other]

Consensus Analysis over Clustered Networks of Multi-Agent Systems under External Disturbances

Authors: Thiem V. Pham, Quynh T. T. Nguyen

Abstract: This paper studies a consensus problem for multi-agent systems subject to external disturbances over a clustered network. The agents are divided into several clusters, each of which has a directed spanning tree, and the clusters are isolated from one another almost all the time. The goal of the agents is to reach a common value. To support interaction between clusters with a minimum exchange of information, we consider that each cluster has an agent that can exchange information with any agent outside its cluster at some discrete instants of time. Our main contribution is a consensus protocol that takes into account the continuous-time communication among agents inside the clusters and the discrete-time communication across clusters. Accordingly, the consensus and the robust $\mathcal{H}_{\infty}$ consensus over the clustered network are respectively analyzed. Using results from matrix theory and algebraic graph theory, we show that the proposed control protocols solve the problems mentioned above. Finally, a numerical example is given to show the effectiveness of the proposed theoretical results.

Submitted 5 June, 2021; originally announced June 2021.

arXiv:2106.02822 [pdf, other]

$\mathcal{H}_2/\mathcal{H}_{-}$ Distributed Fault Detection and Isolation for Heterogeneous Multi-Agent Systems

Authors: Thiem V. Pham, Quynh T. T. Nguyen

Abstract: The paper deals with the problem of distributed fault detection and isolation (FDI) for a group of heterogeneous multi-agent systems. The FDI scheme is developed as a distributed observer design methodology, where the interaction between an agent and its neighbors is described as a vector of distributed relative output measurements. Based on the two performance indexes $\mathcal{H}_2$ and $\mathcal{H}_{-}$, sufficient conditions are given to ensure that the residual signals are robust to disturbances and sensitive to fault signals. In addition, we show that, with the proposed approach, each agent is able to estimate both its own states and the states of its nearest neighbors in the presence of disturbances and faults. Finally, numerical simulations are provided to demonstrate the effectiveness of the theoretical results.

Submitted 5 June, 2021; originally announced June 2021.

arXiv:2103.08233 [pdf, other]

doi 10.1109/ICASSP39728.2021.9413446

Robust MAML: Prioritization task buffer with adaptive learning process for model-agnostic meta-learning

Authors: Thanh Nguyen, Tung Luu, Trung Pham, Sanzhar Rakhimkul, Chang D. Yoo

Abstract: Model agnostic meta-learning (MAML) is a popular state-of-the-art meta-learning algorithm that provides a good weight initialization of a model given a variety of learning tasks. The model initialized with the provided weights can be fine-tuned to an unseen task using only a small number of samples and within a few adaptation steps. MAML is simple and versatile but requires costly learning rate tuning and careful design of the task distribution, which affects its scalability and generalization. This paper proposes a more robust MAML based on an adaptive learning scheme and a prioritization task buffer (PTB), referred to as Robust MAML (RMAML), for improving the scalability of the training process and alleviating the problem of distribution mismatch. RMAML uses gradient-based hyper-parameter optimization to automatically find the optimal learning rate and uses the PTB to gradually adjust the training task distribution toward the testing task distribution over the course of training. Experimental results on meta reinforcement learning environments demonstrate a substantial performance gain, as well as lower sensitivity to hyper-parameter choice and robustness to distribution mismatch.

Submitted 10 June, 2021; v1 submitted 15 March, 2021; originally announced March 2021.

Journal ref: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

arXiv:2102.10822 [pdf, ps, other]

Energy-Efficient Precoding for Multi-User Visible Light Communication with Confidential Messages

Authors: Son T. Duong, Thanh V. Pham, Chuyen T. Nguyen, Anh T. Pham

Abstract: In this paper, an energy-efficient precoding scheme is designed for multi-user visible light communication (VLC) systems in the context of physical layer security, where users' messages are kept mutually confidential. The design problem is shown to be a non-convex fractional program; therefore, the Dinkelbach algorithm and the convex-concave procedure (CCCP) based on the first-order Taylor approximation are utilized to tackle the problem. Numerical results are presented to show the convergence behavior and the performance of the proposed solution for different parameter settings.

Submitted 22 February, 2021; originally announced February 2021.

arXiv:2012.03422 [pdf, ps, other]

A General Conditional BER Expression of Rectangular QAM in the Presence of Phase Noise

Authors: Thanh V. Pham, Thang V. Nguyen, Anh T. Pham

Abstract: In this paper, we present a new closed-form bit-error rate (BER) expression for $M$-ary pulse-amplitude modulation ($M$-PAM) over additive white Gaussian noise (AWGN) channels by analytically characterizing the bit decision regions and positions. The obtained expression is then used to derive the conditional BER of a rectangular quadrature amplitude modulation (QAM) for a given value of phase noise. Numerical results show that the impact of phase noise on the conditional BER performance is proportional to the constellation size. Moreover, it is observed that, for a given constellation size, the square QAM achieves the lowest phase noise-induced performance loss compared to other rectangular constellations.

Submitted 4 January, 2021; v1 submitted 6 December, 2020; originally announced December 2020.

arXiv:2010.13423 [pdf, ps, other]

Optimal-transport-based metric for SMLM

Authors: Quentin Denoyelle, Thanh-an Pham, Pol del Aguila Pla, Daniel Sage, Michael Unser

Abstract: We propose the use of Flat Metric to assess the performance of reconstruction methods for single-molecule localization microscopy (SMLM) in scenarios where the ground-truth is available. Flat Metric is intimately related to the concept of optimal transport between measures of different mass, providing solid mathematical foundations for SMLM evaluation and integrating both localization and detection performance. In this paper, we provide the foundations of Flat Metric and validate this measure by applying it to controlled synthetic examples and to data from the SMLM 2016 Challenge.

Submitted 6 February, 2021; v1 submitted 26 October, 2020; originally announced October 2020.

Comments: Accepted to the 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI 2021). 5 pages, 4 figures

MSC Class: 90C08; 49Q22; 92C55; 94A12

ACM Class: I.4; J.3; J.2; I.m

arXiv:2010.12143 [pdf, other]

Enriching Under-Represented Named-Entities To Improve Speech Recognition Performance

Authors: Tingzhi Mao, Yerbolat Khassanov, Van Tung Pham, Haihua Xu, Hao Huang, Aishan Wumaier, Eng Siong Chng

Abstract: Automatic speech recognition (ASR) for under-represented named entities (UR-NEs) is challenging because such named entities (NEs) have insufficient instances and poor contextual coverage in the training data to learn reliable estimates and representations. In this paper, we propose approaches to enriching UR-NEs to improve speech recognition performance. Specifically, our first priority is to ensure that those UR-NEs appear in the word lattice if there are any. To this end, we make exemplar utterances for those UR-NEs according to their categories (e.g. location, person, organization, etc.), ending up with an improved language model (LM) that boosts the UR-NE occurrence in the word lattice. With more UR-NEs appearing in the lattice, we then boost the recognition performance through lattice rescoring methods. We first enrich the representations of UR-NEs in a pre-trained recurrent neural network LM (RNNLM) by borrowing the embedding representations of the rich-represented NEs (RR-NEs), yielding lattices that statistically favor the UR-NEs. Finally, we directly boost the likelihood scores of the utterances containing UR-NEs and gain further performance improvement.

Submitted 22 October, 2020; originally announced October 2020.

arXiv:2010.00355 [pdf, other]

doi 10.23919/ACC50511.2021.9483255

Distributed two-time-scale methods over clustered networks

Authors: Thiem V. Pham, Thinh T. Doan, Dinh Hoa Nguyen

Abstract: In this paper, we consider consensus problems over a network of nodes, where the network is divided into a number of clusters. We are interested in the case where the communication topology within each cluster is dense as compared to the sparse communication across the clusters. Moreover, each cluster has one leader which can communicate with other leaders in different clusters. The goal of the nodes is to agree on some common value in the presence of communication delays across the clusters. Our main contribution is to propose a novel distributed two-time-scale consensus algorithm, which pertains to the separation in network topology of clustered networks. In particular, one scale models the dynamics of the agents in each cluster, which is much faster (due to the dense communication) than the scale describing the slowly aggregated evolution between the clusters (due to the sparse communication). We prove the convergence of the proposed method in the presence of uniform, but possibly arbitrarily large, communication delays between the leaders. In addition, we provide an explicit formula for the convergence rate of the algorithm, which characterizes the impact of delays and the network topology. Our results show that, after a transient time characterized by the topology of each cluster, the convergence of the two-time-scale consensus method depends only on the connectivity of the leaders. Finally, we validate our theoretical results with a number of numerical simulations on different clustered networks.

Submitted 1 October, 2020; originally announced October 2020.

arXiv:2009.11554 [pdf, other]

doi 10.1109/TIP.2021.3099956

Robust Phase Unwrapping via Deep Image Prior for Quantitative Phase Imaging

Authors: Fangshu Yang, Thanh-an Pham, Nathalie Brandenberg, Matthias P. Lutolf, Jianwei Ma, Michael Unser

Abstract: Quantitative phase imaging (QPI) is an emerging label-free technique that produces images containing morphological and dynamical information without contrast agents. Unfortunately, the phase is wrapped in most imaging systems. Phase unwrapping is the computational process that recovers a more informative image. It is particularly challenging with thick and complex samples such as organoids. Recent works that rely on supervised training show that deep learning is a powerful method to unwrap the phase; however, supervised approaches require large and representative datasets which are difficult to obtain for complex biological samples. Inspired by the concept of deep image priors, we propose a deep-learning-based method that does not need any training set. Our framework relies on an untrained convolutional neural network to accurately unwrap the phase while ensuring the consistency of the measurements. We experimentally demonstrate that the proposed method faithfully recovers the phase of complex samples on both real and simulated data. Our work paves the way to reliable phase imaging of thick and complex samples with QPI.

Submitted 24 September, 2020; originally announced September 2020.

arXiv:2006.07094 [pdf, other]

Monolingual Data Selection Analysis for English-Mandarin Hybrid Code-switching Speech Recognition

Authors: Haobo Zhang, Haihua Xu, Van Tung Pham, Hao Huang, Eng Siong Chng

Abstract: In this paper, we conduct a data selection analysis in building an English-Mandarin code-switching (CS) speech recognition (CSSR) system, which is aimed at a real CSSR contest in China. The overall training set has three subsets: a code-switching data set, an English (LibriSpeech) data set, and a Mandarin data set. The code-switching data are Mandarin-dominated. First of all, we find that using the overall data yields worse results, and hence a data selection study is necessary. Then, to exploit monolingual data, we find that data matching is crucial. Mandarin data is closely matched with the Mandarin part of the code-switching data, while English data is not. However, Mandarin data only helps on those utterances that are significantly Mandarin-dominated. Besides, there is a balance point beyond which more monolingual data will divert the CSSR system, degrading results. Finally, we analyze the effectiveness of combining monolingual data to train a CSSR system with the HMM-DNN hybrid framework. The CSSR system can perform within-utterance code-switch recognition, but it still lags behind the one trained on code-switching data.

Submitted 13 September, 2020; v1 submitted 12 June, 2020; originally announced June 2020.

Comments: 5 pages, conference, accepted by Interspeech 2020

arXiv:2005.10407 [pdf, other]

Leveraging Text Data Using Hybrid Transformer-LSTM Based End-to-End ASR in Transfer Learning

Authors: Zhiping Zeng, Van Tung Pham, Haihua Xu, Yerbolat Khassanov, Eng Siong Chng, Chongjia Ni, Bin Ma

Abstract: In this work, we study leveraging extra text data to improve low-resource end-to-end ASR under a cross-lingual transfer learning setting. To this end, we extend our prior work [1] and propose a hybrid Transformer-LSTM based architecture. This architecture not only takes advantage of the highly effective encoding capacity of the Transformer network but also benefits from extra text data due to the LSTM-based independent language model network. We conduct experiments on our in-house Malay corpus, which contains limited labeled data and a large amount of extra text. Results show that the proposed architecture outperforms the previous LSTM-based architecture [1] by 24.2% relative word error rate (WER) when both are trained using limited labeled data. Starting from this, we obtain a further 25.4% relative WER reduction by transfer learning from another resource-rich language. Moreover, we obtain an additional 13.6% relative WER reduction by boosting the LSTM decoder of the transferred model with the extra text data. Overall, our best model outperforms the vanilla Transformer ASR by 11.9% relative WER. Last but not least, the proposed hybrid architecture offers much faster inference compared to both LSTM and Transformer architectures.

Submitted 28 May, 2020; v1 submitted 20 May, 2020; originally announced May 2020.

arXiv:2005.08742 [pdf, ps, other]

Approaches to Improving Recognition of Underrepresented Named Entities in Hybrid ASR Systems

Authors: Tingzhi Mao, Yerbolat Khassanov, Van Tung Pham, Haihua Xu, Hao Huang, Eng Siong Chng

Abstract: In this paper, we present a series of complementary approaches to improve the recognition of underrepresented named entities (NEs) in hybrid ASR systems without compromising overall word error rate performance. The underrepresented words correspond to rare or out-of-vocabulary (OOV) words in the training data, and thereby can't be modeled reliably. We begin with a graphemic lexicon, which removes the need for phonetic models in hybrid ASR. We study it under different settings and demonstrate its effectiveness in dealing with underrepresented NEs. Next, we study the impact of a neural language model (LM) with letter-based features derived to handle infrequent words. After that, we attempt to enrich the representations of underrepresented NEs in a pretrained neural LM by borrowing the embedding representations of rich-represented words. This lets us gain a significant performance improvement on underrepresented NE recognition. Finally, we boost the likelihood scores of utterances containing NEs in the word lattices rescored by neural LMs and gain further performance improvement. The combination of the aforementioned approaches improves NE recognition by up to 42% relative.

Submitted 18 May, 2020; originally announced May 2020.

arXiv:2005.07068 [pdf]

Recognition of 26 Degrees of Freedom of Hands Using Model-based approach and Depth-Color Images

Authors: Cong Hoang Quach, Minh Trien Pham, Anh Viet Dang, Dinh Tuan Pham, Thuan Hoang Tran, Manh Duong Phung

Abstract: In this study, we present a model-based approach to recognize the full 26 degrees of freedom of a human hand. Input data include RGB-D images acquired from a Kinect camera and a 3D model of the hand constructed from its anatomy and graphical matrices. A cost function is then defined so that its minimum value is achieved when the model and observation images are matched. To solve the optimization problem in 26-dimensional space, the particle swarm optimization algorithm with improvements is used. In addition, parallel computation on graphics processing units (GPUs) is utilized to handle computationally expensive tasks. Simulation and experimental results show that the system can recognize 26 degrees of freedom of hands with a processing time of 0.8 seconds per frame. The algorithm is robust to noise, and the hardware requirement is simple, with a single camera.

Submitted 13 May, 2020; originally announced May 2020.

Comments: In Proceedings of the 2014 National Conference on Electronics, Communications and Information Technology (REV-ECIT). In Vietnamese

arXiv:2003.02597

AI outperformed every dermatologist: Improved dermoscopic melanoma diagnosis through customizing batch logic and loss function in an optimized Deep CNN architecture

Authors: Cong Tri Pham, Mai Chi Luong, Dung Van Hoang, Antoine Doucet

Abstract: Melanoma, one of the most dangerous types of skin cancer, results in a very high mortality rate. Early detection and resection are two key points for a successful cure. Recent research has used artificial intelligence to classify melanoma and nevus and to compare the assessment of these algorithms to that of dermatologists. However, an imbalance of sensitivity and specificity measures affected the performance of existing models. This study proposes a method using deep convolutional neural networks that aims to detect melanoma as a binary classification problem. It involves 3 key features, namely customized batch logic, a customized loss function, and reformed fully connected layers. The training dataset is kept up to date, including 17,302 images of melanoma and nevus; this is the largest dataset by far. The model performance is compared to that of 157 dermatologists from 12 university hospitals in Germany based on the MClass-D dataset. The model outperformed all 157 dermatologists and achieved state-of-the-art performance, with an AUC of 94.4%, sensitivity of 85.0%, and specificity of 95.0% using a prediction threshold of 0.5 on the MClass-D dataset of 100 dermoscopic images. Moreover, a threshold of 0.40858 showed the most balanced measure compared to other studies, and is promising for application to medical diagnosis, with sensitivity of 90.0% and specificity of 93.8%.

Submitted 28 August, 2020; v1 submitted 5 March, 2020; originally announced March 2020.

Comments: We are submitting the article in the journal and waiting for the review result, so we want to temporarily delete the article. When the article is officially accepted, it will be resubmitted

Showing 1–50 of 64 results for author: Pham, T