
Showing 1–6 of 6 results for author: Kuan, C

Searching in archive eess.
  1. arXiv:2407.09886  [pdf, other]

    eess.AS cs.CL cs.SD

    Speech-Copilot: Leveraging Large Language Models for Speech Processing via Task Decomposition, Modularization, and Program Generation

    Authors: Chun-Yi Kuan, Chih-Kai Yang, Wei-Ping Huang, Ke-Han Lu, Hung-yi Lee

    Abstract: In this work, we introduce Speech-Copilot, a modular framework for instruction-oriented speech-processing tasks that minimizes human effort in toolset construction. Unlike end-to-end methods using large audio-language models, Speech-Copilot builds speech processing-specific toolsets by analyzing pre-collected task instructions and breaking tasks into manageable sub-tasks. It features a flexible ag…

    Submitted 13 July, 2024; originally announced July 2024.

    Comments: 8 pages, 2 figures

  2. arXiv:2407.06957  [pdf, other]

    eess.AS cs.CL cs.CY

    Listen and Speak Fairly: A Study on Semantic Gender Bias in Speech Integrated Large Language Models

    Authors: Yi-Cheng Lin, Tzu-Quan Lin, Chih-Kai Yang, Ke-Han Lu, Wei-Chih Chen, Chun-Yi Kuan, Hung-yi Lee

    Abstract: Speech Integrated Large Language Models (SILLMs) combine large language models with speech perception to perform diverse tasks, such as emotion recognition to speaker verification, demonstrating universal audio understanding capability. However, these models may amplify biases present in training data, potentially leading to biased access to information for marginalized groups. This work introduce…

    Submitted 9 July, 2024; originally announced July 2024.

  3. arXiv:2406.08402  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models

    Authors: Chun-Yi Kuan, Wei-Ping Huang, Hung-yi Lee

    Abstract: Large audio-language models (LALMs) enhance traditional large language models by integrating audio perception capabilities, allowing them to tackle audio-related tasks. Previous research has primarily focused on assessing the performance of LALMs across various tasks, yet overlooking their reliability, particularly concerning issues like object hallucination. In our study, we introduce methods to…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  4. arXiv:2401.00273  [pdf, ps, other]

    eess.AS cs.CL

    Investigating Zero-Shot Generalizability on Mandarin-English Code-Switched ASR and Speech-to-text Translation of Recent Foundation Models with Self-Supervision and Weak Supervision

    Authors: Chih-Kai Yang, Kuan-Po Huang, Ke-Han Lu, Chun-Yi Kuan, Chi-Yuan Hsiao, Hung-yi Lee

    Abstract: This work evaluated several cutting-edge large-scale foundation models based on self-supervision or weak supervision, including SeamlessM4T, SeamlessM4T v2, and Whisper-large-v3, on three code-switched corpora. We found that self-supervised models can achieve performances close to the supervised model, indicating the effectiveness of multilingual self-supervised pre-training. We also observed that…

    Submitted 30 December, 2023; originally announced January 2024.

    Comments: Submitted to ICASSP 2024 Self-supervision in Audio, Speech and Beyond workshop

  5. arXiv:2309.14324  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Towards General-Purpose Text-Instruction-Guided Voice Conversion

    Authors: Chun-Yi Kuan, Chen An Li, Tsu-Yuan Hsu, Tse-Yang Lin, Ho-Lam Chung, Kai-Wei Chang, Shuo-yiin Chang, Hung-yi Lee

    Abstract: This paper introduces a novel voice conversion (VC) model, guided by text instructions such as "articulate slowly with a deep tone" or "speak in a cheerful boyish voice". Unlike traditional methods that rely on reference utterances to determine the attributes of the converted speech, our model adds versatility and specificity to voice conversion. The proposed VC model is a neural codec language mo…

    Submitted 16 January, 2024; v1 submitted 25 September, 2023; originally announced September 2023.

    Comments: Accepted to ASRU 2023

  6. arXiv:2309.09510  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech

    Authors: Chien-yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, Roshan Sharma, Shinji Watanabe, Bhiksha Ramakrishnan, Shady Shehata, Hung-yi Lee

    Abstract: Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions. However, existing studies in speech processing primarily focus on limited or specific tasks. Moreover, the lack of standardized benchmarks hinders a fair comparison across different approaches. Thus, we present Dynamic-SUPERB, a benchmark designed for bui…

    Submitted 22 March, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

    Comments: To appear in the proceedings of ICASSP 2024