Search | arXiv e-print repository

SDI-Net: Toward Sufficient Dual-View Interaction for Low-light Stereo Image Enhancement

Authors: Linlin Hu, Ao Sun, Shijie Hao, Richang Hong, Meng Wang

Abstract: Currently, most low-light image enhancement methods only consider information from a single view, neglecting the correlation between cross-view information. Therefore, the enhancement results produced by these methods are often unsatisfactory. In this context, there have been efforts to develop methods specifically for low-light stereo image enhancement. These methods take into account the cross-view disparities and enable interaction between the left and right views, leading to improved performance. However, these methods still do not fully exploit the interaction between left and right view information. To address this issue, we propose a model called Toward Sufficient Dual-View Interaction for Low-light Stereo Image Enhancement (SDI-Net). The backbone structure of SDI-Net is two encoder-decoder pairs, which are used to learn the mapping function from low-light images to normal-light images. Among the encoders and the decoders, we design a module named Cross-View Sufficient Interaction Module (CSIM), aiming to fully exploit the correlations between the binocular views via the attention mechanism. The quantitative and visual results on public datasets validate the superiority of our method over other related methods. Ablation studies also demonstrate the effectiveness of the key elements in our model.

Submitted 20 August, 2024; originally announced August 2024.

arXiv:2406.09869 [pdf, ps, other]

MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model

Authors: Jiatong Shi, Xutai Ma, Hirofumi Inaguma, Anna Sun, Shinji Watanabe

Abstract: Speech discrete representation has proven effective in various downstream applications due to its superior compression rate of the waveform, fast convergence during training, and compatibility with other modalities. Discrete units extracted from self-supervised learning (SSL) models have emerged as a prominent approach for obtaining speech discrete representation. However, while discrete units have shown effectiveness compared to spectral features, they still lag behind continuous SSL representations. In this work, we propose MMM, a multi-layer multi-residual multi-stream discrete units extraction method from SSL. Specifically, we introduce iterative residual vector quantization with K-means for different layers in an SSL model to extract multi-stream speech discrete representation. Through extensive experiments in speech recognition, speech resynthesis, and text-to-speech, we demonstrate the proposed MMM can surpass or be on par with neural codec's performance under various conditions.

Submitted 14 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech2024

arXiv:2401.07342 [pdf, other]

Who Said What? An Automated Approach to Analyzing Speech in Preschool Classrooms

Authors: Anchen Sun, Juan J Londono, Batya Elbaum, Luis Estrada, Roberto Jose Lazo, Laura Vitale, Hugo Gonzalez Villasanti, Riccardo Fusaroli, Lynn K Perry, Daniel S Messinger

Abstract: Young children spend substantial portions of their waking hours in noisy preschool classrooms. In these environments, children's vocal interactions with teachers are critical contributors to their language outcomes, but manually transcribing these interactions is prohibitive. Using audio from child- and teacher-worn recorders, we propose an automated framework that uses open source software both to classify speakers (ALICE) and to transcribe their utterances (Whisper). We compare results from our framework to those from a human expert for 110 minutes of classroom recordings, including 85 minutes from child-worn microphones (n=4 children) and 25 minutes from teacher-worn microphones (n=2 teachers). The overall proportion of agreement, that is, the proportion of correctly classified teacher and child utterances, was .76, with an error-corrected kappa of .50 and a weighted F1 of .76. The word error rate for both teacher and child transcriptions was .15, meaning that 15% of words would need to be deleted, added, or changed to equate the Whisper and expert transcriptions. Moreover, speech features such as the mean length of utterances in words, the proportion of teacher and child utterances that were questions, and the proportion of utterances that were responded to within 2.5 seconds were similar when calculated separately from expert and automated transcriptions. The results suggest substantial progress in analyzing classroom speech that may support children's language development. Future research using natural language processing is under way to improve speaker classification and to analyze results from the application of the automated framework to a larger dataset containing classroom recordings from 13 children and 3 teachers observed on 17 occasions over one year.
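The word error rate quoted in this abstract can be illustrated with a short sketch. The abstract does not say how the authors computed it; the code below assumes the standard definition (word-level edit distance divided by the number of reference words) and plain whitespace tokenization with no text normalization. The function name `word_error_rate` and the sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: words are split on whitespace and compared exactly;
    real evaluations usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    expert = "can you pass me the red block"
    automated = "can you pass the red blocks"
    print(f"WER = {word_error_rate(expert, automated):.2f}")
```

Under this definition, a value of .15 means that roughly 15 edits are needed per 100 reference words, which matches the interpretation given in the abstract.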

Submitted 10 April, 2024; v1 submitted 14 January, 2024; originally announced January 2024.

Comments: 8 pages, 4 figures, 3 tables, The paper has been accepted to 2024 IEEE International Conference on Development and Learning (ICDL) as a full oral presentation and will appear in the IEEE ICDL proceedings

arXiv:2312.05187 [pdf, other]

Seamless: Multilingual Expressive and Streaming Speech Translation

Authors: Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek, et al. (40 additional authors not shown)

Abstract: Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model, SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. SeamlessM4T v2 provides the foundation on which our next two models are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one's voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. The contributions to this work are publicly released and accessible at https://github.com/facebookresearch/seamless_communication

Submitted 8 December, 2023; originally announced December 2023.

arXiv:2310.02720 [pdf, other]

Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction

Authors: Jiatong Shi, Hirofumi Inaguma, Xutai Ma, Ilia Kulikov, Anna Sun

Abstract: Existing Self-Supervised Learning (SSL) models for speech typically process speech signals at a fixed resolution of 20 milliseconds. This approach overlooks the varying informational content present at different resolutions in speech signals. In contrast, this paper aims to incorporate multi-resolution information into speech self-supervised representation learning. We introduce an SSL model that leverages a hierarchical Transformer architecture, complemented by HuBERT-style masked prediction objectives, to process speech at multiple resolutions. Experimental results indicate that the proposed model not only achieves more efficient inference but also exhibits superior or comparable performance to the original HuBERT model over various tasks. Specifically, significant performance improvements over the original HuBERT have been observed in fine-tuning experiments on the LibriSpeech speech recognition benchmark as well as in evaluations using the Speech Universal PERformance Benchmark (SUPERB) and Multilingual SUPERB (ML-SUPERB).

Submitted 30 January, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

Comments: Accepted at ICLR2024 as spotlight

arXiv:2309.08837 [pdf, other]

FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework

Authors: Jianzong Wang, Xulong Zhang, Aolan Sun, Ning Cheng, Jing Xiao

Abstract: This paper integrates graph-to-sequence into an end-to-end text-to-speech framework for syntax-aware modelling with syntactic information of input text. Specifically, the input text is parsed by a dependency parsing module to form a syntactic graph. The syntactic graph is then encoded by a graph encoder to extract the syntactic hidden information, which is concatenated with phoneme embedding and input to the alignment and flow-based decoding modules to generate the raw audio waveform. The model is evaluated on two languages, English and Mandarin, using single-speaker, few samples of target speakers, and multi-speaker datasets, respectively. Experimental results show better prosodic consistency between input text and generated audio, higher scores in the subjective prosodic evaluation, and the ability to perform voice conversion. Besides, the efficiency of the model is largely boosted through the design of the AI chip operator with 5x acceleration.

Submitted 15 September, 2023; originally announced September 2023.

Comments: Accepted by The 35th IEEE International Conference on Tools with Artificial Intelligence. (ICTAI 2023)

arXiv:2308.03240 [pdf, other]

Carbon-Aware Optimal Power Flow

Authors: Xin Chen, Andy Sun, Wenbo Shi, Na Li

Abstract: To facilitate effective decarbonization of the electric power sector, this paper introduces the generic Carbon-aware Optimal Power Flow (C-OPF) method for power system decision-making that considers demand-side carbon accounting and emission management. Built upon the classic optimal power flow (OPF) model, the C-OPF method incorporates carbon emission flow equations and constraints, as well as carbon-related objectives, to jointly optimize power flow and carbon flow. In particular, this paper establishes the feasibility and solution uniqueness of the carbon emission flow equations, and proposes modeling and linearization techniques to address the issues of undetermined power flow directions and bilinear terms in the C-OPF model. Additionally, two novel carbon emission models, together with the carbon accounting schemes, for energy storage systems are developed and integrated into the C-OPF model. Numerical simulations demonstrate the characteristics and effectiveness of the C-OPF method, in comparison with OPF solutions.

Submitted 17 July, 2024; v1 submitted 6 August, 2023; originally announced August 2023.

arXiv:2305.03101 [pdf, other]

Hybrid Transducer and Attention based Encoder-Decoder Modeling for Speech-to-Text Tasks

Authors: Yun Tang, Anna Y. Sun, Hirofumi Inaguma, Xinyue Chen, Ning Dong, Xutai Ma, Paden D. Tomasello, Juan Pino

Abstract: Transducer and Attention based Encoder-Decoder (AED) are two widely used frameworks for speech-to-text tasks. They are designed for different purposes and each has its own benefits and drawbacks for speech-to-text tasks. In order to leverage strengths of both modeling methods, we propose a solution by combining Transducer and Attention based Encoder-Decoder (TAED) for speech-to-text tasks. The new method leverages AED's strength in non-monotonic sequence to sequence learning while retaining Transducer's streaming property. In the proposed framework, Transducer and AED share the same speech encoder. The predictor in Transducer is replaced by the decoder in the AED model, and the outputs of the decoder are conditioned on the speech inputs instead of outputs from an unconditioned language model. The proposed solution ensures that the model is optimized by covering all possible read/write scenarios and creates a matched environment for streaming applications. We evaluate the proposed approach on the MuST-C dataset and the findings demonstrate that TAED performs significantly better than Transducer for offline automatic speech recognition (ASR) and speech-to-text translation (ST) tasks. In the streaming case, TAED outperforms Transducer in the ASR task and one ST direction while comparable results are achieved in another translation direction.

Submitted 4 May, 2023; originally announced May 2023.

Comments: ACL 2023 main conference

arXiv:2304.11783 [pdf, other]

Rip Current Detection in Nearshore Areas through UAV Video Analysis with Almost Local-Isometric Embedding Techniques on Sphere

Authors: Anchen Sun, Kaiqi Yang

Abstract: Rip currents pose a significant danger to those who visit beaches, as they can swiftly pull swimmers away from shore. Detecting these currents currently relies on costly equipment and is challenging to implement on a larger scale. The advent of unmanned aerial vehicles (UAVs) and camera technology, however, has made monitoring near-shore regions more accessible and scalable. This paper proposes a new framework for detecting rip currents using video-based methods that leverage optical flow estimation, offshore direction calculation, earth camera projection with almost local-isometric embedding on the sphere, and temporal data fusion techniques. Through the analysis of videos from multiple beaches, including Palm Beach, Haulover, Ocean Reef Park, and South Beach, as well as YouTube footage, we demonstrate the efficacy of our approach, which aligns with human experts' annotations.

Submitted 20 February, 2024; v1 submitted 23 April, 2023; originally announced April 2023.

Comments: 10 pages, 9 figures, 3 tables

arXiv:2304.11547 [pdf, other]

SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model

Authors: Jianzong Wang, Xulong Zhang, Haobin Tang, Aolan Sun, Ning Cheng, Jing Xiao

Abstract: In recent Text-to-Speech (TTS) systems, a neural vocoder often generates speech samples by solely conditioning on acoustic features predicted from an acoustic model. However, there are always distortions existing in the predicted acoustic features, compared to those of the ground truth, especially in the common case of poor acoustic modeling due to low-quality training data. To overcome such limits, we propose a Self-supervised learning framework to learn an Anti-distortion acoustic Representation (SAR) to replace human-crafted acoustic features by introducing distortion prior to an auto-encoder pre-training process. The learned acoustic representation from the proposed framework is shown to be anti-distortion compared to the most commonly used mel-spectrogram through both objective and subjective evaluation.

Submitted 23 April, 2023; originally announced April 2023.

Comments: Accepted by IJCNN2023. 2023 International Joint Conference on Neural Networks (IJCNN2023)

arXiv:2210.00493 [pdf, other]

doi 10.1088/1361-6560/acc9a1

Accelerated partial separable model using dimension-reduced optimization technique for ultra-fast cardiac MRI

Authors: Zhongsen Li, Aiqi Sun, Chuyu Liu, Haining Wei, Shuai Wang, Mingzhu Fu, Rui Li

Abstract: Objective. Imaging dynamic object with high temporal resolution is challenging in magnetic resonance imaging (MRI). Partial separable (PS) model was proposed to improve the imaging quality by reducing the degrees of freedom of the inverse problem. However, PS model still suffers from long acquisition time and even longer reconstruction time. The main objective of this study is to accelerate the PS model, shorten the time required for acquisition and reconstruction, and maintain good image quality simultaneously. Approach. We proposed to fully exploit the dimension reduction property of the PS model, which means implementing the optimization algorithm in subspace. We optimized the data consistency term, and used a Tikhonov regularization term based on the Frobenius norm of temporal difference. The proposed dimension-reduced optimization technique was validated in free-running cardiac MRI. We have performed both retrospective experiments on public dataset and prospective experiments on in-vivo data. The proposed method was compared with four competing algorithms based on PS model, and two non-PS model methods. Main results. The proposed method has robust performance against shortened acquisition time or suboptimal hyper-parameter settings, and achieves superior image quality over all other competing algorithms. The proposed method is 20-fold faster than the widely accepted PS+Sparse method, enabling image reconstruction to be finished in just a few seconds. Significance. Accelerated PS model has the potential to save much time for clinical dynamic MRI examination, and is promising for real-time MRI applications.

Submitted 1 April, 2023; v1 submitted 2 October, 2022; originally announced October 2022.

Comments: 23 pages, 11 figures. Accepted as manuscript on Physics in Medicine & Biology

arXiv:2208.08757 [pdf, other]

Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion

Authors: SiCheng Yang, Methawee Tantrawenith, Haolin Zhuang, Zhiyong Wu, Aolan Sun, Jianzong Wang, Ning Cheng, Huaizhen Tang, Xintao Zhao, Jie Wang, Helen Meng

Abstract: One-shot voice conversion (VC) with only a single target speaker's speech for reference has become a hot research topic. Existing works generally disentangle timbre, while information about pitch, rhythm and content is still mixed together. To perform one-shot VC effectively with further disentangling these speech components, we employ random resampling for pitch and content encoder and use the variational contrastive log-ratio upper bound of mutual information and gradient reversal layer based adversarial mutual information learning to ensure the different parts of the latent space containing only the desired disentangled representation during training. Experiments on the VCTK dataset show the model achieves state-of-the-art performance for one-shot VC in terms of naturalness and intelligibility. In addition, we can transfer characteristics of one-shot VC on timbre, pitch and rhythm separately by speech representation disentanglement. Our code, pre-trained models and demo are available at https://im1eon.github.io/IS2022-SRDVC/.

Submitted 18 August, 2022; originally announced August 2022.

Comments: 5 pages, 5 figures, INTERSPEECH 2022

arXiv:2207.11309 [pdf, other]

Impacts of Dynamic Line Ratings on the ERCOT Transmission System

Authors: Thomas Lee, Vineet Jagadeesan Nair, Andy Sun

Abstract: Grid regulators and participants are paying increasing attention to Dynamic Line Ratings (DLR) as a new approach to address transmission system bottlenecks. In this paper, a thorough comparison of DLR, Ambient Adjusted Ratings (AAR), and the traditional Static Line Ratings (SLR) is conducted on a synthetic ERCOT grid. Estimates of DLR and AAR are calculated using an equation based on heat balance physics, along with high-resolution weather data of temperature and wind velocities. A constraint generation method for contingency screening is developed for solving security-constrained optimal power flow. Numerical results suggest that employing DLR could double the benefits compared to those of AAR relative to SLR, in terms of system costs, renewable curtailment, and emissions.

Submitted 22 July, 2022; originally announced July 2022.

Comments: 6 pages, 8 figures

arXiv:2103.15333 [pdf, other]

doi 10.24251/HICSS.2021.386

A Distributed Scheme for Stability Assessment in Large-Scale Structure-Preserving Models via Singular Perturbation

Authors: Amin Gholami, Xu Andy Sun

Abstract: Assessing small-signal stability of power systems composed of thousands of interacting generators is a computationally challenging task. To reduce the computational burden, this paper introduces a novel condition to assess and certify small-signal stability. Using this certificate, we can see the impact of network topology and system parameters (generators' damping and inertia) on the eigenvalues of the system. The proposed certificate is derived from rigorous analysis of the classical structure-preserving swing equation model and has a physically insightful interpretation related to the generators' parameters and reactive power. To develop the certificate, we use singular perturbation techniques, and in the process, we establish the relationship between the structure-preserving model and its singular perturbation counterpart. As the proposed method is fully distributed and uses only local measurements, its computational cost does not increase with the size of the system. The effectiveness of the scheme is numerically illustrated on the WSCC system.

Submitted 29 March, 2021; originally announced March 2021.

Comments: https://hdl.handle.net/10125/71001

Journal ref: Proceedings of the 54th Hawaii International Conference on System Sciences, 2021

arXiv:2103.15308 [pdf, other]

Stability of Multi-Microgrids: New Certificates, Distributed Control, and Braess's Paradox

Authors: Amin Gholami, Xu Andy Sun

Abstract: This paper investigates the theory of resilience and stability in multi-microgrid networks. We derive new sufficient conditions to guarantee small-signal stability of multi-microgrids in both lossless and lossy networks. The new stability certificate for lossy networks only requires local information, thus leads to a fully distributed control scheme. Moreover, we study the impact of network topology, interface parameters (virtual inertia and damping), and local measurements (voltage magnitude and reactive power) on the stability of the system. The proposed stability certificate suggests the existence of Braess's Paradox in the stability of multi-microgrids, i.e. adding more connections between microgrids could worsen the multi-microgrid system stability as a whole. We also extend the presented analysis to structure-preserving network models, and provide a stability certificate as a function of original network parameters, instead of the Kron reduced network parameters. We provide a detailed numerical study of the proposed certificate, the distributed control scheme, and a coordinated control approach with line switching. The simulation shows the effectiveness of the proposed stability conditions and control schemes in a four-microgrid network, IEEE 33-bus system, and several large-scale synthetic grids.

Submitted 28 March, 2021; originally announced March 2021.

arXiv:2012.02626 [pdf, other]

GraphPB: Graphical Representations of Prosody Boundary in Speech Synthesis

Authors: Aolan Sun, Jianzong Wang, Ning Cheng, Huayi Peng, Zhen Zeng, Lingwei Kong, Jing Xiao

Abstract: This paper introduces a graphical representation approach of prosody boundary (GraphPB) in the task of Chinese speech synthesis, intending to parse the semantic and syntactic relationship of input sequences in a graphical domain for improving the prosody performance. The nodes of the graph embedding are formed by prosodic words, and the edges are formed by the other prosodic boundaries, namely prosodic phrase boundary (PPH) and intonation phrase boundary (IPH). Different Graph Neural Networks (GNN) like Gated Graph Neural Network (GGNN) and Graph Long Short-term Memory (G-LSTM) are utilised as graph encoders to exploit the graphical prosody boundary information. Graph-to-sequence model is proposed and formed by a graph encoder and an attentional decoder. Two techniques are proposed to embed sequential information into the graph-to-sequence text-to-speech model. The experimental results show that this proposed approach can encode the phonetic and prosody rhythm of an utterance. The mean opinion score (MOS) of these GNN models shows comparative results with the state-of-the-art sequence-to-sequence models with better performance in the aspect of prosody. This provides an alternative approach for prosody modelling in end-to-end speech synthesis.

Submitted 2 December, 2020; originally announced December 2020.

Comments: Accepted to SLT 2021

arXiv:2010.06662 [pdf, other]

doi 10.1137/20M1370392

The Impact of Damping in Second-Order Dynamical Systems with Applications to Power Grid Stability

Authors: Amin Gholami, X. Andy Sun

Abstract: We consider a broad class of second-order dynamical systems and study the impact of damping as a system parameter on the stability, hyperbolicity, and bifurcation in such systems. We prove a monotonic effect of damping on the hyperbolicity of the equilibrium points of the corresponding first-order system. This provides a rigorous formulation and theoretical justification for the intuitive notion that damping increases stability. To establish this result, we prove a matrix perturbation result for complex symmetric matrices with positive semidefinite perturbations to their imaginary parts, which may be of independent interest. Furthermore, we establish necessary and sufficient conditions for the breakdown of hyperbolicity of the first-order system under damping variations in terms of observability of a pair of matrices relating damping, inertia, and Jacobian matrices, and propose sufficient conditions for Hopf bifurcation resulting from such hyperbolicity breakdown. The developed theory has significant applications in the stability of electric power systems, which are one of the most complex and important engineering systems. In particular, we characterize the impact of damping on the hyperbolicity of the swing equation model which is the fundamental dynamical model of power systems, and demonstrate Hopf bifurcations resulting from damping variations.

Submitted 19 July, 2021; v1 submitted 13 October, 2020; originally announced October 2020.

Journal ref: SIAM Journal on Applied Dynamical Systems 21 (2022) 405-437

arXiv:2008.02263 [pdf, other]

doi 10.1109/CDC42340.2020.9304077

A Fast Certificate for Power System Small-Signal Stability

Authors: Amin Gholami, Xu Andy Sun

Abstract: Swing equations are an integral part of a large class of power system dynamical models used in rotor angle stability assessment. Despite intensive studies, some fundamental properties of lossy swing equations are still not fully understood. In this paper, we develop a sufficient condition for certifying the stability of equilibrium points (EPs) of these equations, and illustrate the effects of damping, inertia, and network topology on the stability properties of such EPs. The proposed certificate is suitable for real-time monitoring and fast stability assessment, as it is purely algebraic and can be evaluated in a parallel manner. Moreover, we provide a novel approach to quantitatively measure the degree of stability in power grids using the proposed certificate. Extensive computational experiments are conducted, demonstrating the practicality and effectiveness of the proposal.

Submitted 5 August, 2020; originally announced August 2020.

Journal ref: 2020 59th IEEE Conference on Decision and Control (CDC)

arXiv:2003.01924 [pdf, other]

GraphTTS: graph-to-sequence modelling in neural text-to-speech

Authors: Aolan Sun, Jianzong Wang, Ning Cheng, Huayi Peng, Zhen Zeng, Jing Xiao

Abstract: This paper leverages the graph-to-sequence method in neural text-to-speech (GraphTTS), which maps the graph embedding of the input sequence to spectrograms. The graphical inputs consist of node and edge representations constructed from input texts. The encoding of these graphical inputs incorporates syntax information by a GNN encoder module. Besides, applying the encoder of GraphTTS as a graph auxiliary encoder (GAE) can analyse prosody information from the semantic structure of texts. This can remove the manual process of selecting reference audios and make prosody modelling an end-to-end procedure. Experimental analysis shows that GraphTTS outperforms the state-of-the-art sequence-to-sequence models by 0.24 in Mean Opinion Score (MOS). GAE can adjust the pause, ventilation and tones of synthesised audios automatically. This experimental conclusion may give some inspiration to researchers working on improving speech synthesis prosody.

Submitted 4 March, 2020; originally announced March 2020.

Comments: Accepted to ICASSP 2020

arXiv:1904.08855 [pdf, ps, other]

Solvability of Power Flow Equations Through Existence and Uniqueness of Complex Fixed Point

Authors: Bai Cui, Xu Andy Sun

Abstract: Variations of loading level and changes in system topological property may cause the operating point of an electric power system to move gradually towards the verge of its transmission capability, which can lead to catastrophic outcomes such as voltage collapse blackout. From a modeling perspective, voltage collapse is closely related to the solvability of power flow equations. Determining conditions for existence and uniqueness of solutions to power flow equations is one of the fundamental problems in power systems that has great theoretical and practical significance. In this paper, we provide a strong sufficient condition certifying the existence and uniqueness of power flow solutions in a subset of state (voltage) space. The novel analytical approach heavily exploits the contractive properties of the fixed-point form in the complex domain, which leads to much sharper analytical conditions than previous ones based primarily on analysis in the real domain. Extensive computational experiments are performed which validate the correctness and demonstrate the effectiveness of the proposed condition.

Submitted 18 April, 2019; originally announced April 2019.

Showing 1–20 of 20 results for author: Sun, A