Showing 1–50 of 211 results for author: Shi, J

Searching in archive eess.
  1. arXiv:2408.16132  [pdf, other]

    eess.AS cs.MM cs.SD

    SVDD 2024: The Inaugural Singing Voice Deepfake Detection Challenge

    Authors: You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Tomoki Toda, Zhiyao Duan

    Abstract: With the advancements in singing voice generation and the growing presence of AI singers on media platforms, the inaugural Singing Voice Deepfake Detection (SVDD) Challenge aims to advance research in identifying AI-generated singing voices from authentic singers. This challenge features two tracks: a controlled setting track (CtrSVDD) and an in-the-wild scenario track (WildSVDD). The CtrSVDD trac…

    Submitted 28 August, 2024; originally announced August 2024.

  2. arXiv:2408.14262  [pdf]

    cs.CL cs.SD eess.AS

    Self-supervised Speech Representations Still Struggle with African American Vernacular English

    Authors: Kalvin Chang, Yi-Hui Chou, Jiatong Shi, Hsuan-Ming Chen, Nicole Holliday, Odette Scharenborg, David R. Mortensen

    Abstract: Underperformance of ASR systems for speakers of African American Vernacular English (AAVE) and other marginalized language varieties is a well-documented phenomenon, and one that reinforces the stigmatization of these varieties. We investigate whether or not the recent wave of Self-Supervised Learning (SSL) speech models can close the gap in ASR performance between AAVE and Mainstream American Eng…

    Submitted 26 August, 2024; originally announced August 2024.

    Comments: INTERSPEECH 2024

  3. arXiv:2408.13495  [pdf]

    eess.IV cs.CV

    Topological GCN for Improving Detection of Hip Landmarks from B-Mode Ultrasound Images

    Authors: Tianxiang Huang, Jing Shi, Ge Jin, Juncheng Li, Jun Wang, Jun Du, Jun Shi

    Abstract: B-mode ultrasound-based computer-aided diagnosis (CAD) has demonstrated its effectiveness for the diagnosis of Developmental Dysplasia of the Hip (DDH) in infants. However, due to the effect of speckle noise in ultrasound images, it is still a challenging task to accurately detect hip landmarks. In this work, we propose a novel hip landmark detection model by integrating the Topological GCN (TGCN) with…

    Submitted 24 August, 2024; originally announced August 2024.

  4. arXiv:2408.01127  [pdf, other]

    eess.SY

    Relax, Estimate, and Track: a Simple Battery State-of-charge and State-of-health Estimation Method

    Authors: Shida Jiang, Junzhe Shi, Scott Moura

    Abstract: Battery management stands as a critical component of ubiquitous battery-powered energy systems, in which battery state-of-charge (SOC) and state-of-health (SOH) estimation are of crucial importance. Conventional SOC and SOH estimation methods, especially model-based methods, often lack accurate modeling of the open circuit voltage (OCV), have high computational complexity, and lack theoretical ana…

    Submitted 2 August, 2024; originally announced August 2024.

  5. arXiv:2407.21395  [pdf, other]

    eess.IV

    HINER: Neural Representation for Hyperspectral Image

    Authors: Junqi Shi, Mingyi Jiang, Ming Lu, Tong Chen, Xun Cao, Zhan Ma

    Abstract: This paper introduces HINER, a novel neural representation for compressing HSI and ensuring high-quality downstream tasks on compressed HSI. HINER fully exploits inter-spectral correlations by explicitly encoding spectral wavelengths and achieves a compact representation of the input HSI sample through joint optimization with a learnable decoder. By additionally incorporating the Content Angl…

    Submitted 31 July, 2024; originally announced July 2024.

    Comments: ACM MM24

  6. arXiv:2407.14140  [pdf, other]

    eess.SP

    A Secure and Efficient Distributed Semantic Communication System for Heterogeneous Internet of Things Devices

    Authors: Weihao Zeng, Xinyu Xu, Qianyun Zhang, Jiting Shi, Zhijin Qin, Zhenyu Guan

    Abstract: Semantic communications have emerged as a promising solution to address the challenge of efficient communication in rapidly evolving and increasingly complex Internet of Things (IoT) networks. However, protecting the security of semantic communication systems within distributed and heterogeneous IoT networks is a critical issue that needs to be addressed. We develop a secure and efficient distri…

    Submitted 19 July, 2024; originally announced July 2024.

  7. arXiv:2407.05717  [pdf]

    eess.SY cs.RO eess.SP

    A New Framework for Nonlinear Kalman Filters

    Authors: Shida Jiang, Junzhe Shi, Scott Moura

    Abstract: The Kalman filter (KF) is a state estimation algorithm that optimally combines system knowledge and measurements to minimize the mean squared error of the estimated states. While KF was initially designed for linear systems, numerous extensions of it, such as extended Kalman filter (EKF), unscented Kalman filter (UKF), cubature Kalman filter (CKF), etc., have been proposed for nonlinear systems. A…

    Submitted 27 July, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

    Comments: An equation was missing in Fig. S3 and S4 previously. This typo is now fixed

  8. arXiv:2407.04051  [pdf, other]

    cs.SD cs.AI eess.AS

    FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

    Authors: Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang, et al. (8 additional authors not shown)

    Abstract: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, sp…

    Submitted 10 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

    Comments: Work in progress. Authors are listed in alphabetical order by family name

  9. arXiv:2407.03886  [pdf, other]

    cs.CV eess.IV

    DSMix: Distortion-Induced Sensitivity Map Based Pre-training for No-Reference Image Quality Assessment

    Authors: Jinsong Shi, Pan Gao, Xiaojiang Peng, Jie Qin

    Abstract: Image quality assessment (IQA) has long been a fundamental challenge in image understanding. In recent years, deep learning-based IQA methods have shown promising performance. However, the lack of large amounts of labeled data in the IQA field has hindered further advancements in these methods. This paper introduces DSMix, a novel data augmentation technique specifically designed for IQA tasks, ai…

    Submitted 4 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  10. arXiv:2407.00837  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Towards Robust Speech Representation Learning for Thousands of Languages

    Authors: William Chen, Wangyou Zhang, Yifan Peng, Xinjian Li, Jinchuan Tian, Jiatong Shi, Xuankai Chang, Soumi Maiti, Karen Livescu, Shinji Watanabe

    Abstract: Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data. However, models are still far from supporting the world's 7000+ languages. We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages, extending the language coverage of SSL models 4-fold. We combine 1 millio…

    Submitted 2 July, 2024; v1 submitted 30 June, 2024; originally announced July 2024.

    Comments: Updated affiliations; 20 pages

  11. arXiv:2406.10911  [pdf, other]

    cs.SD eess.AS

    SingMOS: An extensive Open-Source Singing Voice Dataset for MOS Prediction

    Authors: Yuxun Tang, Jiatong Shi, Yuning Wu, Qin Jin

    Abstract: In speech generation tasks, human subjective ratings, usually referred to as the opinion score, are considered the "gold standard" for speech quality evaluation, with the mean opinion score (MOS) serving as the primary evaluation metric. Due to the high cost of human annotation, several MOS prediction systems have emerged in the speech domain, demonstrating good performance. These MOS prediction m…

    Submitted 20 June, 2024; v1 submitted 16 June, 2024; originally announced June 2024.

  12. arXiv:2406.09869  [pdf, ps, other]

    cs.SD eess.AS

    MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model

    Authors: Jiatong Shi, Xutai Ma, Hirofumi Inaguma, Anna Sun, Shinji Watanabe

    Abstract: Speech discrete representation has proven effective in various downstream applications due to its superior compression rate of the waveform, fast convergence during training, and compatibility with other modalities. Discrete units extracted from self-supervised learning (SSL) models have emerged as a prominent approach for obtaining speech discrete representation. However, while discrete units hav…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech2024

  13. arXiv:2406.08905  [pdf, other]

    cs.SD eess.AS

    SingOMD: Singing Oriented Multi-resolution Discrete Representation Construction from Speech Models

    Authors: Yuxun Tang, Yuning Wu, Jiatong Shi, Qin Jin

    Abstract: Discrete representation has shown advantages in speech generation tasks, wherein discrete tokens are derived by discretizing hidden features from self-supervised learning (SSL) pre-trained models. However, the direct application of speech SSL models to singing generation encounters domain gaps between speech and singing. Furthermore, singing generation necessitates a more refined representation th…

    Submitted 20 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  14. arXiv:2406.08761  [pdf, other]

    cs.SD eess.AS

    VISinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation

    Authors: Yifeng Yu, Jiatong Shi, Yuning Wu, Shinji Watanabe

    Abstract: Singing Voice Synthesis (SVS) has witnessed significant advancements with the advent of deep learning techniques. However, a significant challenge in SVS is the scarcity of labeled singing voice data, which limits the effectiveness of supervised learning methods. In response to this challenge, this paper introduces a novel approach to enhance the quality of SVS by leveraging unlabeled data from pr…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: 4 pages, 2 figures

  15. arXiv:2406.08641  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets

    Authors: Jiatong Shi, Shih-Heng Wang, William Chen, Martijn Bartelds, Vanya Bannihatti Kumar, Jinchuan Tian, Xuankai Chang, Dan Jurafsky, Karen Livescu, Hung-yi Lee, Shinji Watanabe

    Abstract: ML-SUPERB evaluates self-supervised learning (SSL) models on the tasks of language identification and automatic speech recognition (ASR). This benchmark treats the models as feature extractors and uses a single shallow downstream model, which can be fine-tuned for a downstream task. However, real-world use cases may require different configurations. This paper presents ML-SUPERB 2.0, which is a ne…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  16. arXiv:2406.08416  [pdf, other]

    cs.SD eess.AS

    TokSing: Singing Voice Synthesis based on Discrete Tokens

    Authors: Yuning Wu, Chunlei Zhang, Jiatong Shi, Yuxun Tang, Shan Yang, Qin Jin

    Abstract: Recent advancements in speech synthesis witness significant benefits by leveraging discrete tokens extracted from self-supervised learning (SSL) models. Discrete tokens offer higher storage efficiency and greater operability in intermediate representations compared to traditional continuous Mel spectrograms. However, when it comes to singing voice synthesis (SVS), achieving higher levels of melody…

    Submitted 20 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  17. arXiv:2406.07725  [pdf, ps, other]

    cs.SD eess.AS

    The Interspeech 2024 Challenge on Speech Processing Using Discrete Units

    Authors: Xuankai Chang, Jiatong Shi, Jinchuan Tian, Yuning Wu, Yuxun Tang, Yihan Wu, Shinji Watanabe, Yossi Adi, Xie Chen, Qin Jin

    Abstract: Representing speech and audio signals in discrete units has become a compelling alternative to traditional high-dimensional feature vectors. Numerous studies have highlighted the efficacy of discrete units in various applications such as speech compression and restoration, speech recognition, and speech generation. To foster exploration in this domain, we introduce the Interspeech 2024 Challenge,…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: This manuscript has been accepted by Interspeech2024

  18. arXiv:2406.02950  [pdf, other]

    eess.AS cs.CL cs.SD

    4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders

    Authors: Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Brian Yan, Jiatong Shi, Yifan Peng, Shinji Watanabe

    Abstract: End-to-end automatic speech recognition (E2E-ASR) can be classified into several network architectures, such as connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention-based encoder-decoder, and mask-predict models. Each network architecture has advantages and disadvantages, leading practitioners to switch between these different models depending on appl…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: submitted to IEEE/ACM Transactions on Audio Speech and Language Processing

  19. arXiv:2406.02438  [pdf, other]

    eess.AS cs.MM cs.SD

    CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

    Authors: Yongyi Zang, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu, Wenxiao Zhao, Jing Guo, Tomoki Toda, Zhiyao Duan

    Abstract: Recent singing voice synthesis and conversion advancements necessitate robust singing voice deepfake detection (SVDD) models. Current SVDD datasets face challenges due to limited controllability, diversity in deepfake methods, and licensing restrictions. Addressing these gaps, we introduce CtrSVDD, a large-scale, diverse collection of bonafide and deepfake singing vocals. These vocals are synthesi…

    Submitted 18 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  20. Hydrogen and Battery Based Energy Storage System (ESS) for Future DC Microgrids

    Authors: Massiagbe Diabate, Timothy Vriend, Harish S Krishnamoorthy, Jian Shi

    Abstract: In this paper, a hydrogen-based energy storage system (ESS) is proposed for DC microgrids, which can potentially be integrated with battery ESS to meet the needs of future grids with high renewable penetration. Hydrogen-based ESS can provide a stable energy supply for a long time but has a slower response than battery ESSs. However, a combination of battery and hydrogen storage provides stable ene…

    Submitted 1 June, 2024; originally announced June 2024.

    Comments: A 5-page digest paper summarizing in detail the work done

  21. arXiv:2405.05244  [pdf, other]

    eess.AS cs.AI cs.MM cs.SD

    SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

    Authors: You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Tomoki Toda, Zhiyao Duan

    Abstract: The rapid advancement of AI-generated singing voices, which now closely mimic natural human singing and align seamlessly with musical scores, has led to heightened concerns for artists and the music industry. Unlike spoken voice, singing voice presents unique challenges due to its musical nature and the presence of strong background music, making singing voice deepfake detection (SVDD) a specializ…

    Submitted 8 May, 2024; originally announced May 2024.

    Comments: Evaluation plan of the SVDD Challenge @ SLT 2024

  22. arXiv:2405.03482  [pdf]

    eess.SY

    Managing Renewable Energy Resources Using Equity-Market Risk Tools - the Efficient Frontiers

    Authors: Haim Grebel, Divya Vikas, Jim Shi

    Abstract: The energy market, and specifically the renewable sector, carries volatility and risks similar to those of the financial market. Here, we leverage a well-established return-risk approach, commonly used by equity portfolio managers, and apply it to energy resources. We visualize the relationship between the resources' costs and their risks in terms of efficient frontiers. We apply this analysis to public…

    Submitted 6 May, 2024; originally announced May 2024.

    Comments: 9 pages, 3 figures, 10 ref

  23. arXiv:2404.09385  [pdf, other]

    eess.AS cs.CL eess.SP

    A Large-Scale Evaluation of Speech Foundation Models

    Authors: Shu-wen Yang, Heng-Jui Chang, Zili Huang, Andy T. Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y. Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li, Abdelrahman Mohamed, Shinji Watanabe, Hung-yi Lee

    Abstract: The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific modeling and data annotation. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. In this work,…

    Submitted 29 May, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

    Comments: The extended journal version for SUPERB and SUPERB-SG. Published in IEEE/ACM TASLP. The arXiv version is preferred

  24. arXiv:2403.07938  [pdf, other]

    cs.SD cs.AI cs.CV cs.LG cs.MM eess.AS

    Text-to-Audio Generation Synchronized with Videos

    Authors: Shentong Mo, Jing Shi, Yapeng Tian

    Abstract: In recent times, the focus on text-to-audio (TTA) generation has intensified, as researchers strive to synthesize audio from textual descriptions. However, most existing methods, though leveraging latent diffusion models to learn the correlation between audio and text embeddings, fall short when it comes to maintaining a seamless synchronization between the produced audio and its video. This often…

    Submitted 8 March, 2024; originally announced March 2024.

    Comments: arXiv admin note: text overlap with arXiv:2305.12903

  25. arXiv:2402.10505  [pdf, other]

    eess.SY math.OC

    A Survey of Resilient Coordination for Cyber-Physical Systems Against Malicious Attacks

    Authors: Zirui Liao, Jian Shi, Yuwei Zhang, Shaoping Wang, Zhiyong Sun

    Abstract: Cyber-physical systems (CPSs) facilitate the integration of physical entities and cyber infrastructures through the utilization of pervasive computational resources and communication units, leading to improved efficiency, automation, and practical viability in both academia and industry. Due to their openness and distributed characteristics, a critical issue prevalent in CPSs is to guarantee resilie…

    Submitted 16 February, 2024; originally announced February 2024.

    Comments: 35 pages, 7 figures, 5 tables

  26. arXiv:2402.02724  [pdf, other]

    eess.IV cs.CV cs.LG

    FDNet: Frequency Domain Denoising Network For Cell Segmentation in Astrocytes Derived From Induced Pluripotent Stem Cells

    Authors: Haoran Li, Jiahua Shi, Huaming Chen, Bo Du, Simon Maksour, Gabrielle Phillips, Mirella Dottori, Jun Shen

    Abstract: Artificially generated induced pluripotent stem cells (iPSCs) from somatic cells play an important role in disease modeling and drug screening of neurodegenerative diseases. Astrocytes differentiated from iPSCs are important targets to investigate neuronal metabolism. The astrocyte differentiation progress can be monitored through the variations of morphology observed from microscopy images at di…

    Submitted 4 February, 2024; originally announced February 2024.

    Comments: Accepted by The IEEE International Symposium on Biomedical Imaging (ISBI) 2024

  27. arXiv:2401.17619  [pdf, ps, other]

    cs.SD eess.AS

    Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and ACE-KiSing

    Authors: Jiatong Shi, Yueqian Lin, Xinyi Bai, Keyi Zhang, Yuning Wu, Yuxun Tang, Yifeng Yu, Qin Jin, Shinji Watanabe

    Abstract: In singing voice synthesis (SVS), generating singing voices from musical scores faces challenges due to limited data availability. This study proposes a unique strategy to address the data scarcity in SVS. We employ an existing singing voice synthesizer for data augmentation, complemented by detailed manual tuning, an approach not previously explored in data curation, to reduce instances of unnatu…

    Submitted 12 June, 2024; v1 submitted 31 January, 2024; originally announced January 2024.

    Comments: Accepted by Interspeech2024

  28. arXiv:2401.17230  [pdf, other]

    cs.SD cs.AI eess.AS

    ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models

    Authors: Jee-weon Jung, Wangyou Zhang, Jiatong Shi, Zakaria Aldeneh, Takuya Higuchi, Barry-John Theobald, Ahmed Hussen Abdelaziz, Shinji Watanabe

    Abstract: This paper introduces ESPnet-SPK, a toolkit designed with several objectives for training speaker embedding extractors. First, we provide an open-source platform for researchers in the speaker recognition community to effortlessly build models. We provide several models, ranging from x-vector to recent SKA-TDNN. Through the modularized architecture design, variants can be developed easily. We also…

    Submitted 13 June, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

    Comments: 5 pages, 3 figures, 7 tables, Interspeech 2024

  29. arXiv:2401.16658  [pdf, ps, other]

    cs.CL eess.AS

    OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

    Authors: Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, Shinji Watanabe

    Abstract: Recent studies have highlighted the importance of fully open foundation models. The Open Whisper-style Speech Model (OWSM) is an initial step towards reproducing OpenAI Whisper using public data and open-source toolkits. However, previous versions of OWSM (v1 to v3) are still based on standard Transformer, which might lead to inferior performance compared to state-of-the-art speech encoder archite…

    Submitted 26 August, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

    Comments: Accepted at INTERSPEECH 2024. Webpage: https://www.wavlab.org/activities/2024/owsm/

  30. arXiv:2312.16998  [pdf, other]

    eess.IV cs.CV

    Deep Unfolding Network with Spatial Alignment for multi-modal MRI reconstruction

    Authors: Hao Zhang, Qi Wang, Jun Shi, Shihui Ying, Zhijie Wen

    Abstract: Multi-modal Magnetic Resonance Imaging (MRI) offers complementary diagnostic information, but some modalities are limited by the long scanning time. To accelerate the whole acquisition process, MRI reconstruction of one modality from highly undersampled k-space data with another fully-sampled reference modality is an efficient solution. However, the misalignment between modalities, which is common…

    Submitted 28 December, 2023; originally announced December 2023.

  31. arXiv:2312.15424  [pdf, other]

    eess.SY

    Integrating Renewable Energy Sources as Reserve Providers: Modeling, Pricing, and Properties

    Authors: Wenli Wu, Ye Guo, Jiantao Shi

    Abstract: In pursuit of carbon neutrality, many countries have adopted renewable portfolio standards to facilitate the integration of renewable energy. However, increasing penetration of renewable energy resources will also pose higher requirements on system flexibility. Allowing renewables themselves to participate in the reserve market could be a viable solution. To this end, this paper proposes an optimal…

    Submitted 24 December, 2023; originally announced December 2023.

    Comments: 13 pages, 5 figures

  32. arXiv:2312.06995  [pdf, other]

    cs.CV eess.IV

    Transformer-based No-Reference Image Quality Assessment via Supervised Contrastive Learning

    Authors: Jinsong Shi, Pan Gao, Jie Qin

    Abstract: Image Quality Assessment (IQA) has long been a research hotspot in the field of image processing, especially No-Reference Image Quality Assessment (NR-IQA). Owing to their powerful feature extraction ability, existing Convolutional Neural Network (CNN) and Transformer-based NR-IQA methods have achieved considerable progress. However, they still exhibit limited capability when facing unknown authentic d…

    Submitted 12 December, 2023; originally announced December 2023.

    Comments: Accepted by AAAI24

  33. arXiv:2312.06668  [pdf]

    cs.CL cs.SD eess.AS

    Evaluating Self-supervised Speech Models on a Taiwanese Hokkien Corpus

    Authors: Yi-Hui Chou, Kalvin Chang, Meng-Ju Wu, Winston Ou, Alice Wen-Hsin Bi, Carol Yang, Bryan Y. Chen, Rong-Wei Pai, Po-Yen Yeh, Jo-Peng Chiang, Iu-Tshian Phoann, Winnie Chang, Chenxuan Cui, Noel Chen, Jiatong Shi

    Abstract: Taiwanese Hokkien is declining in use and status due to a language shift towards Mandarin in Taiwan. This is partly why it is a low resource language in NLP and speech research today. To ensure that the state of the art in speech processing does not leave Taiwanese Hokkien behind, we contribute a 1.5-hour dataset of Taiwanese Hokkien to ML-SUPERB's hidden set. Evaluating ML-SUPERB's suite of self-…

    Submitted 5 December, 2023; originally announced December 2023.

    Comments: Accepted to ASRU 2023

  34. arXiv:2312.06466  [pdf, other]

    cs.SD eess.AS

    Towards Domain-Specific Cross-Corpus Speech Emotion Recognition Approach

    Authors: Yan Zhao, Yuan Zong, Hailun Lian, Cheng Lu, Jingang Shi, Wenming Zheng

    Abstract: Cross-corpus speech emotion recognition (SER) poses a challenge due to feature distribution mismatch, potentially degrading the performance of established SER methods. In this paper, we tackle this challenge by proposing a novel transfer subspace learning method called acoustic knowledge-guided transfer linear regression (AKTLR). Unlike existing approaches, which often overlook domain-specific know…

    Submitted 11 December, 2023; originally announced December 2023.

  35. arXiv:2312.03376  [pdf, other]

    eess.SY

    Beacon-enabled TDMA Ultraviolet Communication Network System Design and Realization

    Authors: Yuchen Pan, Fei Long, Ping Li, Haotian Shi, Jiazhao Shi, Hanlin Xiao, Chen Gong, Zhengyuan Xu

    Abstract: Non-line-of-sight (NLOS) ultraviolet (UV) scattering communication can serve as a good candidate for outdoor optical wireless communication (OWC) in the cases of non-perfect transmitter-receiver alignment and radio silence. We design and demonstrate an NLOS UV scattering communication network system in this paper, where a beacon-enabled time division multiple access (TDMA) scheme is adopted. In our…

    Submitted 15 April, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

  36. arXiv:2310.13208  [pdf]

    eess.SY

    Online energy management system for a fuel cell/battery hybrid system with multiple fuel cell stacks

    Authors: Junzhe Shi, Ulf Jakob Flø Aarsnes, Dagfinn Nærheim, Scott Moura

    Abstract: In recent years, fuel cell/battery hybrid systems have attracted substantial attention due to their high energy density and low emissions. The online energy management system (EMS) is essential for these hybrid systems, tasked with controlling the energy flow and ensuring optimal system performance, encompassing fuel efficiency and mitigating fuel cell and battery degradation. This research propos…

    Submitted 19 October, 2023; originally announced October 2023.

  37. arXiv:2310.05513  [pdf, other]

    cs.SD cs.CL eess.AS

    Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond

    Authors: Jiatong Shi, William Chen, Dan Berrebbi, Hsiu-Hsuan Wang, Wei-Ping Huang, En-Pei Hu, Ho-Lam Chuang, Xuankai Chang, Yuxun Tang, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, Shinji Watanabe

    Abstract: The 2023 Multilingual Speech Universal Performance Benchmark (ML-SUPERB) Challenge expands upon the acclaimed SUPERB framework, emphasizing self-supervised models in multilingual speech recognition and language identification. The challenge comprises a research track focused on applying ML-SUPERB to specific multilingual subjects, a Challenge Track for model submissions, and a New Language Track w…

    Submitted 9 October, 2023; originally announced October 2023.

    Comments: Accepted by ASRU

  38. arXiv:2310.05369  [pdf, other]

    cs.SD eess.AS

    AdvSV: An Over-the-Air Adversarial Attack Dataset for Speaker Verification

    Authors: Li Wang, Jiaqi Li, Yuhao Luo, Jiahao Zheng, Lei Wang, Hao Li, Ke Xu, Chengfang Fang, Jie Shi, Zhizheng Wu

    Abstract: It is known that deep neural networks are vulnerable to adversarial attacks. Although Automatic Speaker Verification (ASV) built on top of deep neural networks exhibits robust performance in controlled scenarios, many studies confirm that ASV is vulnerable to adversarial attacks. The lack of a standard dataset is a bottleneck for further research, especially reproducible research. In this study, w…

    Submitted 16 January, 2024; v1 submitted 8 October, 2023; originally announced October 2023.

    Comments: Accepted by ICASSP2024

  39. arXiv:2310.03938  [pdf, other]

    cs.SD eess.AS

    EFFUSE: Efficient Self-Supervised Feature Fusion for E2E ASR in Low Resource and Multilingual Scenarios

    Authors: Tejes Srivastava, Jiatong Shi, William Chen, Shinji Watanabe

    Abstract: Self-Supervised Learning (SSL) models have demonstrated exceptional performance in various speech tasks, particularly in low-resource and multilingual domains. Recent works show that fusing diverse SSL models could achieve superior performance compared to using one SSL model. However, fusing models increases the overall parameter size, leading to higher computational costs. We propose EFFUSE, a no…

    Submitted 5 June, 2024; v1 submitted 5 October, 2023; originally announced October 2023.

    Comments: 5 pages, 2 figures, 3 tables

  40. arXiv:2310.02720  [pdf, other]

    cs.SD eess.AS

    Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction

    Authors: Jiatong Shi, Hirofumi Inaguma, Xutai Ma, Ilia Kulikov, Anna Sun

    Abstract: Existing Self-Supervised Learning (SSL) models for speech typically process speech signals at a fixed resolution of 20 milliseconds. This approach overlooks the varying informational content present at different resolutions in speech signals. In contrast, this paper aims to incorporate multi-resolution information into speech self-supervised representation learning. We introduce an SSL model that l…

    Submitted 30 January, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

    Comments: Accepted at ICLR2024 as spotlight

  41. arXiv:2310.00704  [pdf, other]

    cs.SD eess.AS

    UniAudio: An Audio Foundation Model Toward Universal Audio Generation

    Authors: Dongchao Yang, Jinchuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian, Xixin Wu, Zhou Zhao, Shinji Watanabe, Helen Meng

    Abstract: Large Language models (LLM) have demonstrated the capability to handle a variety of generative tasks. This paper presents the UniAudio system, which, unlike prior task-specific approaches, leverages LLM techniques to generate multiple types of audio (including speech, sounds, music, and singing) with given input conditions. UniAudio 1) first tokenizes all types of target audio along with other con…

    Submitted 11 December, 2023; v1 submitted 1 October, 2023; originally announced October 2023.

  42. arXiv:2309.15800  [pdf, other]

    cs.CL cs.SD eess.AS

    Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study

    Authors: Xuankai Chang, Brian Yan, Kwanghee Choi, Jeeweon Jung, Yichen Lu, Soumi Maiti, Roshan Sharma, Jiatong Shi, Jinchuan Tian, Shinji Watanabe, Yuya Fujita, Takashi Maekaku, Pengcheng Guo, Yao-Fei Cheng, Pavel Denisov, Kohei Saijo, Hsiu-Hsuan Wang

    Abstract: Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies, evoking inefficiencies in sequence modeling. High-dimensional speech features such as spectrograms are often used as the input for the subsequent model. However, they can still be redundant. Recent investigations proposed the use of discrete speech units derived from self-supervised learning repre…

    Submitted 27 September, 2023; originally announced September 2023.

    Comments: Submitted to IEEE ICASSP 2024

  43. arXiv:2309.15317  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning

    Authors: William Chen, Jiatong Shi, Brian Yan, Dan Berrebbi, Wangyou Zhang, Yifan Peng, Xuankai Chang, Soumi Maiti, Shinji Watanabe

    Abstract: Multilingual self-supervised learning (SSL) has often lagged behind state-of-the-art (SOTA) methods due to the expenses and complexity required to handle many languages. This further harms the reproducibility of SSL, which is already limited to few research groups due to its resource usage. We show that more powerful techniques can actually lead to more efficient pre-training, opening SSL to more…

    Submitted 27 September, 2023; v1 submitted 26 September, 2023; originally announced September 2023.

    Comments: Accepted to ASRU 2023

  44. arXiv:2309.13876  [pdf, other]

    cs.CL cs.SD eess.AS

    Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data

    Authors: Yifan Peng, Jinchuan Tian, Brian Yan, Dan Berrebbi, Xuankai Chang, Xinjian Li, Jiatong Shi, Siddhant Arora, William Chen, Roshan Sharma, Wangyou Zhang, Yui Sudo, Muhammad Shakeel, Jee-weon Jung, Soumi Maiti, Shinji Watanabe

    Abstract: Pre-training speech models on large volumes of data has achieved remarkable success. OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised speech data. It generalizes well to various speech recognition and translation benchmarks even in a zero-shot setup. However, the full pipeline for developing such models (from data collection to training) is not publicly accessib…

    Submitted 24 October, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

    Comments: Accepted at ASRU 2023

  45. arXiv:2309.13755  [pdf, other]

    eess.SY

    Efficient Recursive Data-enabled Predictive Control (Extended Version)

    Authors: Jicheng Shi, Yingzhao Lian, Colin N. Jones

    Abstract: In the field of model predictive control, Data-enabled Predictive Control (DeePC) offers direct predictive control, bypassing traditional modeling. However, challenges emerge with increased computational demand due to recursive data updates. This paper introduces a novel recursive updating algorithm for DeePC. It emphasizes the use of Singular Value Decomposition (SVD) for efficient low-dimensiona…

    Submitted 24 March, 2024; v1 submitted 24 September, 2023; originally announced September 2023.

  46. arXiv:2309.09776  [pdf, other]

    eess.IV

    MAD: Meta Adversarial Defense Benchmark

    Authors: X. Peng, D. Zhou, G. Sun, J. Shi, L. Wu

    Abstract: Adversarial training (AT) is a prominent technique employed by deep learning models to defend against adversarial attacks, and to some extent, enhance model robustness. However, there are three main drawbacks of the existing AT-based defense methods: expensive computational cost, low generalization ability, and the dilemma between the original model and the defense model. To this end, we propose a…

    Submitted 18 September, 2023; originally announced September 2023.

    Comments: 12 pages, 11 figures, IEEE Transactions on Neural Networks and Learning Systems

  47. arXiv:2309.09510  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech

    Authors: Chien-yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, Roshan Sharma, Shinji Watanabe, Bhiksha Ramakrishnan, Shady Shehata, Hung-yi Lee

    Abstract: Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions. However, existing studies in speech processing primarily focus on limited or specific tasks. Moreover, the lack of standardized benchmarks hinders a fair comparison across different approaches. Thus, we present Dynamic-SUPERB, a benchmark designed for bui…

    Submitted 22 March, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

    Comments: To appear in the proceedings of ICASSP 2024

  48. arXiv:2309.00494  [pdf, other]

    eess.IV cs.CV cs.LG

    Multi-stage Deep Learning Artifact Reduction for Computed Tomography

    Authors: Jiayang Shi, Daniel M. Pelt, K. Joost Batenburg

    Abstract: In Computed Tomography (CT), an image of the interior structure of an object is computed from a set of acquired projection images. The quality of these reconstructed images is essential for accurate analysis, but this quality can be degraded by a variety of imaging artifacts. To improve reconstruction quality, the acquired projection images are often processed by a pipeline consisting of multiple…

    Submitted 1 September, 2023; originally announced September 2023.

  49. arXiv:2308.04112  [pdf, other]

    eess.SY

    Multi-Interval Rolling-Window Joint Dispatch and Pricing of Energy and Reserve under Uncertainty

    Authors: Jiantao Shi, Ye Guo, Wenchuan Wu, Hongbin Sun

    Abstract: In this paper, the intra-day multi-interval rolling-window joint dispatch and pricing of energy and reserve is studied under increasingly volatile and uncertain renewable generation. A look-ahead energy-reserve co-optimization model is proposed for the rolling-window dispatch, where possible contingencies and load/renewable forecast errors over the look-ahead window are modeled as several scenario…

    Submitted 8 August, 2023; originally announced August 2023.

  50. arXiv:2308.02867  [pdf, other]

    cs.SD eess.AS

    A Systematic Exploration of Joint-training for Singing Voice Synthesis

    Authors: Yuning Wu, Yifeng Yu, Jiatong Shi, Tao Qian, Qin Jin

    Abstract: There has been a growing interest in using end-to-end acoustic models for singing voice synthesis (SVS). Typically, these models require an additional vocoder to transform the generated acoustic features into the final waveform. However, since the acoustic model and the vocoder are not jointly optimized, a gap can exist between the two models, leading to suboptimal performance. Although a similar…

    Submitted 5 August, 2023; originally announced August 2023.