Showing 1–16 of 16 results for author: An, K

Searching in archive eess.
  1. arXiv:2407.04051  [pdf, other]

    cs.SD cs.AI eess.AS

    FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

    Authors: Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang, et al. (8 additional authors not shown)

    Abstract: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, sp…

    Submitted 10 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

    Comments: Work in progress. Authors are listed in alphabetical order by family name

  2. arXiv:2312.14860  [pdf, other]

    cs.SD eess.AS

    Advancing VAD Systems Based on Multi-Task Learning with Improved Model Structures

    Authors: Lingyun Zuo, Keyu An, Shiliang Zhang, Zhijie Yan

    Abstract: In a speech recognition system, voice activity detection (VAD) is a crucial frontend module. Addressing the issues of poor noise robustness in traditional binary VAD systems based on DFSMN, the paper further proposes semantic VAD based on multi-task learning with improved models for real-time and offline systems, to meet specific application requirements. Evaluations on internal datasets show that…

    Submitted 19 December, 2023; originally announced December 2023.

  3. arXiv:2309.14758  [pdf, other]

    eess.AS cs.SD

    Exploring RWKV for Memory Efficient and Low Latency Streaming ASR

    Authors: Keyu An, Shiliang Zhang

    Abstract: Recently, self-attention-based transformers and conformers have been introduced as alternatives to RNNs for ASR acoustic modeling. Nevertheless, the full-sequence attention mechanism is non-streamable and computationally expensive, thus requiring modifications, such as chunking and caching, for efficient streaming ASR. In this paper, we propose to apply RWKV, a variant of linear attention transfor…

    Submitted 26 September, 2023; originally announced September 2023.

    Comments: submitted to ICASSP 2024

  4. arXiv:2305.11571  [pdf, other]

    eess.AS

    BAT: Boundary aware transducer for memory-efficient and low-latency ASR

    Authors: Keyu An, Xian Shi, Shiliang Zhang

    Abstract: Recently, the recurrent neural network transducer (RNN-T) has gained increasing popularity due to its natural streaming capability as well as superior performance. Nevertheless, RNN-T training requires substantial time and computation resources, as RNN-T loss calculation is slow and consumes a lot of memory. Another limitation of RNN-T is that it tends to access more contexts for better performance, thus leading…

    Submitted 19 May, 2023; originally announced May 2023.

    Comments: Accepted into INTERSPEECH 2023

  5. arXiv:2203.16776  [pdf, ps, other]

    eess.AS cs.CL cs.LG

    An Empirical Study of Language Model Integration for Transducer based Speech Recognition

    Authors: Huahuan Zheng, Keyu An, Zhijian Ou, Chen Huang, Ke Ding, Guanglu Wan

    Abstract: Utilizing text-only data with an external language model (ELM) in end-to-end RNN-Transducer (RNN-T) for speech recognition is challenging. Recently, a class of methods such as density ratio (DR) and internal language model estimation (ILME) have been developed, outperforming the classic shallow fusion (SF) method. The basic idea behind these methods is that RNN-T posterior should first subtract th…

    Submitted 3 August, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

    Comments: Accepted into INTERSPEECH 2022

  6. arXiv:2203.16758  [pdf, other]

    eess.AS cs.CL

    CUSIDE: Chunking, Simulating Future Context and Decoding for Streaming ASR

    Authors: Keyu An, Huahuan Zheng, Zhijian Ou, Hongyu Xiang, Ke Ding, Guanglu Wan

    Abstract: History and future contextual information are known to be important for accurate acoustic modeling. However, acquiring future context brings latency for streaming ASR. In this paper, we propose a new framework - Chunking, Simulating Future Context and Decoding (CUSIDE) for streaming speech recognition. A new simulation module is introduced to recursively simulate the future contextual frames, with…

    Submitted 2 August, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

    Comments: Accepted into INTERSPEECH 2022

  7. arXiv:2203.16757  [pdf, ps, other]

    eess.AS cs.CL

    Exploiting Single-Channel Speech for Multi-Channel End-to-End Speech Recognition: A Comparative Study

    Authors: Keyu An, Ji Xiao, Zhijian Ou

    Abstract: Recently, the end-to-end training approach for multi-channel ASR has shown its effectiveness, which usually consists of a beamforming front-end and a recognition back-end. However, the end-to-end training becomes more difficult due to the integration of multiple modules, particularly considering that multi-channel speech data recorded in real environments are limited in size. This raises the deman…

    Submitted 8 October, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

    Comments: Accepted by ISCSLP 2022. arXiv admin note: substantial text overlap with arXiv:2107.02670

  8. arXiv:2107.05038  [pdf, other]

    cs.CL cs.SD eess.AS

    Multilingual and crosslingual speech recognition using phonological-vector based phone embeddings

    Authors: Chengrui Zhu, Keyu An, Huahuan Zheng, Zhijian Ou

    Abstract: The use of phonological features (PFs) potentially allows language-specific phones to remain linked in training, which is highly desirable for information sharing for multilingual and crosslingual speech recognition methods for low-resourced languages. A drawback suffered by previous methods in using phonological features is that the acoustic-to-PF extraction in a bottom-up way is itself difficult…

    Submitted 30 October, 2021; v1 submitted 11 July, 2021; originally announced July 2021.

    Comments: ASRU 2021

  9. arXiv:2107.02670  [pdf, other]

    eess.AS

    Exploiting Single-Channel Speech For Multi-channel End-to-end Speech Recognition

    Authors: Keyu An, Zhijian Ou

    Abstract: Recently, the end-to-end training approach for neural beamformer-supported multi-channel ASR has shown its effectiveness in multi-channel speech recognition. However, the integration of multiple modules makes it more difficult to perform end-to-end training, particularly given that the multi-channel speech corpus recorded in real environments with a sizeable data scale is relatively limited. This…

    Submitted 6 July, 2021; originally announced July 2021.

    Comments: submitted to ASRU 2021

  10. arXiv:2104.14791  [pdf, other]

    eess.AS cs.CL cs.SD

    Deformable TDNN with adaptive receptive fields for speech recognition

    Authors: Keyu An, Yi Zhang, Zhijian Ou

    Abstract: Time Delay Neural Networks (TDNNs) are widely used in both DNN-HMM based hybrid speech recognition systems and recent end-to-end systems. Nevertheless, the receptive fields of TDNNs are limited and fixed, which is not desirable for tasks like speech recognition, where the temporal dynamics of speech are varied and affected by many factors. This paper proposes to use deformable TDNNs for adaptive t…

    Submitted 30 April, 2021; originally announced April 2021.

    Comments: 5 pages. submitted to Interspeech 2021

  11. arXiv:2011.06724  [pdf, other]

    cs.SD eess.AS

    The SLT 2021 children speech recognition challenge: Open datasets, rules and baselines

    Authors: Fan Yu, Zhuoyuan Yao, Xiong Wang, Keyu An, Lei Xie, Zhijian Ou, Bo Liu, Xiulin Li, Guanqiong Miao

    Abstract: Automatic speech recognition (ASR) has been significantly advanced with the use of deep learning and big data. However, improving robustness, including achieving equally good performance on diverse speakers and accents, is still a challenging problem. In particular, the performance of children speech recognition (CSR) still lags behind due to 1) the speech and language characteristics of children's…

    Submitted 16 November, 2020; v1 submitted 12 November, 2020; originally announced November 2020.

    Comments: 7 pages, 3 figures, 3 tables

  12. arXiv:2011.05649  [pdf, other]

    eess.AS cs.LG

    Efficient Neural Architecture Search for End-to-end Speech Recognition via Straight-Through Gradients

    Authors: Huahuan Zheng, Keyu An, Zhijian Ou

    Abstract: Neural Architecture Search (NAS), the process of automating architecture engineering, is an appealing next step to advancing end-to-end Automatic Speech Recognition (ASR), replacing expert-designed networks with learned, task-specific architectures. In contrast to early computationally demanding NAS methods, recent gradient-based NAS methods, e.g., DARTS (Differentiable ARchiTecture Search), SNAS (S…

    Submitted 11 November, 2020; originally announced November 2020.

    Comments: Accepted by IEEE SLT 2021

  13. arXiv:2005.13326  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    CAT: A CTC-CRF based ASR Toolkit Bridging the Hybrid and the End-to-end Approaches towards Data Efficiency and Low Latency

    Authors: Keyu An, Hongyu Xiang, Zhijian Ou

    Abstract: In this paper, we present a new open source toolkit for speech recognition, named CAT (CTC-CRF based ASR Toolkit). CAT inherits the data-efficiency of the hybrid approach and the simplicity of the E2E approach, providing a full-fledged implementation of CTC-CRFs and complete training and testing scripts for a number of English and Chinese benchmarks. Experiments show CAT obtains state-of-the-art r…

    Submitted 4 August, 2020; v1 submitted 27 May, 2020; originally announced May 2020.

    Comments: Accepted into INTERSPEECH 2020. arXiv admin note: text overlap with arXiv:1911.08747

  14. arXiv:2002.00529  [pdf, ps, other]

    eess.SP

    Genetic Algorithm Optimized Support Vector Machine in NOMA-Based Satellite Networks with Imperfect CSI

    Authors: Xiaojuan Yan, Kang An, Cheng-Xiang Wang, Wei-Ping Zhu, Yusheng Li, Zhiqiang Feng

    Abstract: With the help of a power-domain non-orthogonal multiple access (NOMA) scheme, satellite networks can simultaneously serve multiple users within a limited time/spectrum resource block. However, the existence of channel estimation errors inevitably degrades the judgment on users' channel state information (CSI) accuracy, thus affecting the user pairing processing and suppressing the superiority of the…

    Submitted 2 February, 2020; originally announced February 2020.

  15. arXiv:1911.08747  [pdf, other]

    cs.LG cs.SD eess.AS stat.ML

    CAT: CRF-based ASR Toolkit

    Authors: Keyu An, Hongyu Xiang, Zhijian Ou

    Abstract: In this paper, we present a new open source toolkit for automatic speech recognition (ASR), named CAT (CRF-based ASR Toolkit). A key feature of CAT is discriminative training in the framework of conditional random field (CRF), particularly with connectionist temporal classification (CTC) inspired state topology. CAT contains a full-fledged implementation of CTC-CRF and provides a complete workflow…

    Submitted 20 November, 2019; originally announced November 2019.

    Comments: Code released at: https://github.com/thu-spmi/cat

  16. arXiv:1906.00891  [pdf, other]

    cs.CV eess.SY

    Automated Steel Bar Counting and Center Localization with Convolutional Neural Networks

    Authors: Zhun Fan, Jiewei Lu, Benzhang Qiu, Tao Jiang, Kang An, Alex Noel Josephraj, Chuliang Wei

    Abstract: Automated steel bar counting and center localization play an important role in the factory automation of steel bars. Traditional methods only focus on steel bar counting and their performances are often limited by complex industrial environments. Convolutional neural network (CNN), which has great capability to deal with complex tasks in challenging environments, is applied in this work. A framew…

    Submitted 19 June, 2019; v1 submitted 3 June, 2019; originally announced June 2019.

    Comments: Ready to submit to IEEE Transactions on Industrial Informatics