Showing 1–50 of 96 results for author: Cui, M

Searching in archive cs.
  1. arXiv:2407.02990  [pdf, other]

    cs.CV

    Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation

    Authors: Mengmeng Cui, Kunbo Zhang, Zhenan Sun

    Abstract: In recent years, 2D-to-3D pose uplifting in monocular 3D Human Pose Estimation (HPE) has attracted widespread research interest. GNN-based methods and Transformer-based methods have become mainstream architectures due to their advanced spatial and temporal feature learning capacities. However, existing approaches typically construct joint-wise and frame-wise attention alignments in spatial and tem…

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  2. arXiv:2406.17800  [pdf, other]

    q-bio.QM cs.SD eess.AS

    Fish Tracking, Counting, and Behaviour Analysis in Digital Aquaculture: A Comprehensive Review

    Authors: Meng Cui, Xubo Liu, Haohe Liu, Jinzheng Zhao, Daoliang Li, Wenwu Wang

    Abstract: Digital aquaculture leverages advanced technologies and data-driven methods, providing substantial benefits over traditional aquaculture practices. Fish tracking, counting, and behaviour analysis are crucial components of digital aquaculture, which are essential for optimizing production efficiency, enhancing fish welfare, and improving resource management. Previous reviews have focused on single…

    Submitted 20 June, 2024; originally announced June 2024.

  3. arXiv:2406.11546  [pdf, other]

    eess.AS cs.CL cs.SD

    GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement

    Authors: Yifan Yang, Zheshu Song, Jianheng Zhuo, Mingyu Cui, Jinpeng Li, Bo Yang, Yexing Du, Ziyang Ma, Xunying Liu, Ziyuan Wang, Ke Li, Shuai Fan, Kai Yu, Wei-Qiang Zhang, Guoguo Chen, Xie Chen

    Abstract: The evolution of speech technology has been spurred by the rapid increase in dataset sizes. Traditional speech models generally depend on a large amount of labeled training data, which is scarce for low-resource languages. This paper presents GigaSpeech 2, a large-scale, multi-domain, multilingual speech recognition corpus. It is designed for low-resource languages and does not rely on paired spee…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: Under review

  4. arXiv:2406.10160  [pdf, other]

    cs.SD cs.AI eess.AS

    One-pass Multiple Conformer and Foundation Speech Systems Compression and Quantization Using An All-in-one Neural Model

    Authors: Zhaoqing Li, Haoning Xu, Tianzi Wang, Shoukang Hu, Zengrui Jin, Shujie Hu, Jiajun Deng, Mingyu Cui, Mengzhe Geng, Xunying Liu

    Abstract: We propose a novel one-pass multiple ASR systems joint compression and quantization approach using an all-in-one neural model. A single compression cycle allows multiple nested systems with varying Encoder depths, widths, and quantization precision settings to be simultaneously constructed without the need to train and store individual target systems separately. Experiments consistently demonstrat…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  5. arXiv:2406.10034  [pdf, other]

    cs.SD cs.AI eess.AS

    Towards Effective and Efficient Non-autoregressive Decoding Using Block-based Attention Mask

    Authors: Tianzi Wang, Xurong Xie, Zhaoqing Li, Shoukang Hu, Zengrui Jing, Jiajun Deng, Mingyu Cui, Shujie Hu, Mengzhe Geng, Guinan Li, Helen Meng, Xunying Liu

    Abstract: This paper proposes a novel non-autoregressive (NAR) block-based Attention Mask Decoder (AMD) that flexibly balances performance-efficiency trade-offs for Conformer ASR systems. AMD performs parallel NAR inference within contiguous blocks of output labels that are concealed using attention masks, while conducting left-to-right AR prediction and history context amalgamation between blocks. A beam s…

    Submitted 16 July, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

    Comments: 5 pages, 2 figures, 2 tables, Interspeech24 conference

  6. arXiv:2406.07989  [pdf, other]

    cs.IT eess.SP

    Near-Field Wideband Beam Training Based on Distance-Dependent Beam Split

    Authors: Tianyue Zheng, Mingyao Cui, Zidong Wu, Linglong Dai

    Abstract: Near-field beam training is essential for acquiring channel state information in 6G extremely large-scale multiple input multiple output (XL-MIMO) systems. To achieve low-overhead beam training, existing method has been proposed to leverage the near-field beam split effect, which deploys true-time-delay arrays to simultaneously search multiple angles of the entire angular range in a distance ring…

    Submitted 12 June, 2024; originally announced June 2024.

  7. arXiv:2406.07081  [pdf, other]

    cs.CL

    Efficiently Exploring Large Language Models for Document-Level Machine Translation with In-context Learning

    Authors: Menglong Cui, Jiangcun Du, Shaolin Zhu, Deyi Xiong

    Abstract: Large language models (LLMs) exhibit outstanding performance in machine translation via in-context learning. In contrast to sentence-level translation, document-level translation (DOCMT) by LLMs based on in-context learning faces two major challenges: firstly, document translations generated by LLMs are often incoherent; secondly, the length of demonstration for in-context learning is usually limi…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted to ACL2024 long paper (Findings)

  8. arXiv:2406.02230  [pdf, other]

    cs.CV

    I4VGen: Image as Stepping Stone for Text-to-Video Generation

    Authors: Xiefan Guo, Jinlin Liu, Miaomiao Cui, Di Huang

    Abstract: Text-to-video generation has lagged behind text-to-image synthesis in quality and diversity due to the complexity of spatio-temporal modeling and limited video-text datasets. This paper presents I4VGen, a training-free and plug-and-play video diffusion inference framework, which enhances text-to-video generation by leveraging robust image techniques. Specifically, following text-to-image-to-video,…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: Project page: https://xiefan-guo.github.io/i4vgen

  9. arXiv:2405.16393  [pdf, other]

    cs.CV cs.AI

    Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation

    Authors: Jinlin Liu, Kai Yu, Mengyang Feng, Xiefan Guo, Miaomiao Cui

    Abstract: Recent advancements in human video synthesis have enabled the generation of high-quality videos through the application of stable diffusion models. However, existing methods predominantly concentrate on animating solely the human element (the foreground) guided by pose information, while leaving the background entirely static. Contrary to this, in authentic, high-quality videos, backgrounds often…

    Submitted 28 May, 2024; v1 submitted 25 May, 2024; originally announced May 2024.

  10. arXiv:2405.12850  [pdf, other]

    cs.CV

    Weakly supervised alignment and registration of MR-CT for cervical cancer radiotherapy

    Authors: Jjahao Zhang, Yin Gu, Deyu Sun, Yuhua Gao, Ming Gao, Ming Cui, Teng Zhang, He Ma

    Abstract: Cervical cancer is one of the leading causes of death in women, and brachytherapy is currently the primary treatment method. However, it is important to precisely define the extent of paracervical tissue invasion to improve cancer diagnosis and treatment options. The fusion of the information characteristics of both computed tomography (CT) and magnetic resonance imaging (MRI) modalities may be use…

    Submitted 21 May, 2024; originally announced May 2024.

  11. arXiv:2405.12601  [pdf, other]

    cs.CV

    FFAM: Feature Factorization Activation Map for Explanation of 3D Detectors

    Authors: Shuai Liu, Boyang Li, Zhiyu Fang, Mingyue Cui, Kai Huang

    Abstract: LiDAR-based 3D object detection has made impressive progress recently, yet most existing models are black-box, lacking interpretability. Previous explanation approaches primarily focus on analyzing image-based models and are not readily applicable to LiDAR-based 3D detectors. In this paper, we propose a feature factorization activation map (FFAM) to generate high-quality visual explanations for 3D…

    Submitted 21 May, 2024; originally announced May 2024.

  12. arXiv:2404.06806  [pdf, other]

    cs.IT eess.SP

    Near-Optimal Channel Estimation for Dense Array Systems

    Authors: Mingyao Cui, Zijian Zhang, Linglong Dai, Kaibin Huang

    Abstract: By deploying a large number of antennas with sub-half-wavelength spacing in a compact space, dense array systems (DASs) can fully unleash the multiplexing-and-diversity gains of limited apertures. To acquire these gains, accurate channel state information acquisition is necessary but challenging due to the large antenna numbers. To overcome this obstacle, this paper reveals that exploiting the high…

    Submitted 10 April, 2024; originally announced April 2024.

    Comments: 19 pages, 10 figures

  13. arXiv:2404.04864  [pdf, other]

    cs.IT

    Towards Atomic MIMO Receivers

    Authors: Mingyao Cui, Qunsong Zeng, Kaibin Huang

    Abstract: The advancement of Rydberg atoms in quantum sensing is driving a paradigm shift from classical receivers to atomic receivers. Capitalizing on the extreme sensitivity of Rydberg atoms to external disturbance, atomic receivers can measure radio-waves more precisely than classical receivers to support high-performance wireless communication and sensing. Although the atomic receiver is developing rapi…

    Submitted 7 April, 2024; originally announced April 2024.

    Comments: 13 pages, 8 figures. Submitted to IEEE for possible publication

  14. arXiv:2404.04650  [pdf, other]

    cs.CV

    InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization

    Authors: Xiefan Guo, Jinlin Liu, Miaomiao Cui, Jiankai Li, Hongyu Yang, Di Huang

    Abstract: Recent strides in the development of diffusion models, exemplified by advancements such as Stable Diffusion, have underscored their remarkable prowess in generating visually compelling images. However, the imperative of achieving a seamless alignment between the generated image and the provided prompt persists as a formidable challenge. This paper traces the root of these difficulties to invalid i…

    Submitted 6 April, 2024; originally announced April 2024.

    Comments: Accepted by CVPR 2024

  15. arXiv:2404.00471  [pdf, other]

    physics.med-ph cs.CV cs.LG eess.IV

    Score-Based Diffusion Models for Photoacoustic Tomography Image Reconstruction

    Authors: Sreemanti Dey, Snigdha Saha, Berthy T. Feng, Manxiu Cui, Laure Delisle, Oscar Leong, Lihong V. Wang, Katherine L. Bouman

    Abstract: Photoacoustic tomography (PAT) is a rapidly-evolving medical imaging modality that combines optical absorption contrast with ultrasound imaging depth. One challenge in PAT is image reconstruction with inadequate acoustic signals due to limited sensor coverage or due to the density of the transducer array. Such cases call for solving an ill-posed inverse reconstruction problem. In this work, we use…

    Submitted 30 March, 2024; originally announced April 2024.

    Comments: 5 pages

    Journal ref: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Korea, Republic of, 2024, pp. 2470-2474

  16. arXiv:2403.09167  [pdf, other]

    cs.CL

    Dial-insight: Fine-tuning Large Language Models with High-Quality Domain-Specific Data Preventing Capability Collapse

    Authors: Jianwei Sun, Chaoyang Mei, Linlin Wei, Kaiyu Zheng, Na Liu, Ming Cui, Tianyi Li

    Abstract: The efficacy of large language models (LLMs) is heavily dependent on the quality of the underlying data, particularly within specialized domains. A common challenge when fine-tuning LLMs for domain-specific applications is the potential degradation of the model's generalization capabilities. To address these issues, we propose a two-stage approach for the construction of production prompts designe…

    Submitted 14 March, 2024; originally announced March 2024.

  17. arXiv:2402.17292  [pdf, other]

    cs.CV

    DivAvatar: Diverse 3D Avatar Generation with a Single Prompt

    Authors: Weijing Tao, Biwen Lei, Kunhao Liu, Shijian Lu, Miaomiao Cui, Xuansong Xie, Chunyan Miao

    Abstract: Text-to-Avatar generation has recently made significant strides due to advancements in diffusion models. However, most existing work remains constrained by limited diversity, producing avatars with subtle differences in appearance for a given text prompt. We design DivAvatar, a novel framework that generates diverse avatars, empowering 3D creatives with a multitude of distinct and richly varied 3D…

    Submitted 27 February, 2024; originally announced February 2024.

  18. arXiv:2401.14051  [pdf, other]

    cs.GR cs.CV

    A real-time rendering method for high albedo anisotropic materials with multiple scattering

    Authors: Shun Fang, Xing Feng, Ming Cui

    Abstract: We propose a neural network-based real-time volume rendering method for realistic and efficient rendering of volumetric media. The traditional volume rendering method uses path tracing to solve the radiation transfer equation, which requires a huge amount of calculation and cannot achieve real-time rendering. Therefore, this paper uses neural networks to simulate the iterative integration process…

    Submitted 25 January, 2024; originally announced January 2024.

  19. arXiv:2401.12456  [pdf, ps, other]

    cs.CV cs.AI cs.GR

    Exploration and Improvement of Nerf-based 3D Scene Editing Techniques

    Authors: Shun Fang, Ming Cui, Xing Feng, Yanan Zhang

    Abstract: NeRF's high-quality scene synthesis capability was quickly accepted by scholars in the years after it was proposed, and significant progress has been made in 3D scene representation and synthesis. However, the high computational cost limits intuitive and efficient editing of scenes, making NeRF's development in the scene editing field facing many challenges. This paper reviews the preliminary expl…

    Submitted 22 January, 2024; originally announced January 2024.

  20. Methods and strategies for improving the novel view synthesis quality of neural radiation field

    Authors: Shun Fang, Ming Cui, Xing Feng, Yanna Lv

    Abstract: Neural Radiation Field (NeRF) technology can learn a 3D implicit model of a scene from 2D images and synthesize realistic novel view images. This technology has received widespread attention from the industry and has good application prospects. In response to the problem that the rendering quality of NeRF images needs to be improved, many researchers have proposed various methods to improve the re…

    Submitted 17 April, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

    ACM Class: I.2; I.4; I.6

    Journal ref: IEEE ACCESS 12 (2024) 50548-50555

  21. arXiv:2401.04152  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    Cross-Speaker Encoding Network for Multi-Talker Speech Recognition

    Authors: Jiawen Kang, Lingwei Meng, Mingyu Cui, Haohan Guo, Xixin Wu, Xunying Liu, Helen Meng

    Abstract: End-to-end multi-talker speech recognition has garnered great interest as an effective approach to directly transcribe overlapped speech from multiple speakers. Current methods typically adopt either 1) single-input multiple-output (SIMO) models with a branched encoder, or 2) single-input single-output (SISO) models based on attention-based encoder-decoder architecture with serialized output train…

    Submitted 8 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP2024

  22. arXiv:2401.02777  [pdf, other]

    cs.CL cs.AI

    From LLM to Conversational Agent: A Memory Enhanced Architecture with Fine-Tuning of Large Language Models

    Authors: Na Liu, Liangyu Chen, Xiaoyu Tian, Wei Zou, Kaijiang Chen, Ming Cui

    Abstract: This paper introduces RAISE (Reasoning and Acting through Scratchpad and Examples), an advanced architecture enhancing the integration of Large Language Models (LLMs) like GPT-4 into conversational agents. RAISE, an enhancement of the ReAct framework, incorporates a dual-component memory system, mirroring human short-term and long-term memory, to maintain context and continuity in conversations. I…

    Submitted 30 January, 2024; v1 submitted 5 January, 2024; originally announced January 2024.

  23. arXiv:2401.01173  [pdf, other]

    cs.CV

    En3D: An Enhanced Generative Model for Sculpting 3D Humans from 2D Synthetic Data

    Authors: Yifang Men, Biwen Lei, Yuan Yao, Miaomiao Cui, Zhouhui Lian, Xuansong Xie

    Abstract: We present En3D, an enhanced generative scheme for sculpting high-quality 3D human avatars. Unlike previous works that rely on scarce 3D datasets or limited 2D collections with imbalanced viewing angles and imprecise pose priors, our approach aims to develop a zero-shot 3D generative scheme capable of producing visually realistic, geometrically accurate and content-wise diverse 3D humans without r…

    Submitted 2 January, 2024; originally announced January 2024.

    Comments: Project Page: https://menyifang.github.io/projects/En3D/index.html

  24. arXiv:2312.16837  [pdf, other]

    cs.CV

    DiffusionGAN3D: Boosting Text-guided 3D Generation and Domain Adaptation by Combining 3D GANs and Diffusion Priors

    Authors: Biwen Lei, Kai Yu, Mengyang Feng, Miaomiao Cui, Xuansong Xie

    Abstract: Text-guided domain adaptation and generation of 3D-aware portraits find many applications in various fields. However, due to the lack of training data and the challenges in handling the high variety of geometry and appearance, the existing methods for these tasks suffer from issues like inflexibility, instability, and low fidelity. In this paper, we propose a novel framework DiffusionGAN3D, which…

    Submitted 12 April, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

    Comments: Accepted by CVPR2024

  25. arXiv:2312.05107  [pdf, other]

    cs.CV

    DreaMoving: A Human Video Generation Framework based on Diffusion Models

    Authors: Mengyang Feng, Jinlin Liu, Kai Yu, Yuan Yao, Zheng Hui, Xiefan Guo, Xianhui Lin, Haolan Xue, Chen Shi, Xiaowen Li, Aojie Li, Xiaoyang Kang, Biwen Lei, Miaomiao Cui, Peiran Ren, Xuansong Xie

    Abstract: In this paper, we present DreaMoving, a diffusion-based controllable video generation framework to produce high-quality customized human videos. Specifically, given target identity and posture sequences, DreaMoving can generate a video of the target identity moving or dancing anywhere driven by the posture sequences. To this end, we propose a Video ControlNet for motion-controlling and a Content G…

    Submitted 11 December, 2023; v1 submitted 8 December, 2023; originally announced December 2023.

    Comments: 5 pages, 5 figures, Tech. Report

  26. arXiv:2311.13617  [pdf, other]

    cs.CV

    Boosting3D: High-Fidelity Image-to-3D by Boosting 2D Diffusion Prior to 3D Prior with Progressive Learning

    Authors: Kai Yu, Jinlin Liu, Mengyang Feng, Miaomiao Cui, Xuansong Xie

    Abstract: We present Boosting3D, a multi-stage single image-to-3D generation method that can robustly generate reasonable 3D objects in different data domains. The point of this work is to solve the view consistency problem in single image-guided 3D generation by modeling a reasonable geometric structure. For this purpose, we propose to utilize better 3D prior to training the NeRF. More specifically, we tra…

    Submitted 22 November, 2023; originally announced November 2023.

    Comments: 8 pages, 7 figures, 1 table

  27. arXiv:2311.13141  [pdf, other]

    cs.CV

    Diffusion360: Seamless 360 Degree Panoramic Image Generation based on Diffusion Models

    Authors: Mengyang Feng, Jinlin Liu, Miaomiao Cui, Xuansong Xie

    Abstract: This is a technical report on the 360-degree panoramic image generation task based on diffusion models. Unlike ordinary 2D images, 360-degree panoramic images capture the entire $360^\circ\times 180^\circ$ field of view. So the rightmost and the leftmost sides of the 360 panoramic image should be continued, which is the main challenge in this field. However, the current diffusion pipeline is not a…

    Submitted 21 November, 2023; originally announced November 2023.

    Comments: 2 pages, 8 figures, Tech. Report

  28. arXiv:2311.11952  [pdf, other]

    quant-ph cs.CV cs.ET

    Quantum Image Segmentation Based on Grayscale Morphology

    Authors: Wenjie Liu, Lu Wang, Mengmeng Cui

    Abstract: The classical image segmentation algorithm based on grayscale morphology can effectively segment images with uneven illumination, but with the increase of the image data, the real-time problem will emerge. In order to solve this problem, a quantum image segmentation algorithm is proposed in this paper, which can use quantum mechanism to simultaneously perform morphological operations on all pixels…

    Submitted 2 October, 2023; originally announced November 2023.

    Comments: 20 pages, 12 figures

    Journal ref: IEEE Transactions on Quantum Engineering, 2022.3: p.3103012

  29. arXiv:2311.05267  [pdf, other]

    cs.DC

    Analysis and Characterization of Performance Variability for OpenMP Runtime

    Authors: Minyu Cui, Nikela Papadopoulou, Miquel Pericàs

    Abstract: In the high performance computing (HPC) domain, performance variability is a major scalability issue for parallel computing applications with heavy synchronization and communication. In this paper, we present an experimental performance analysis of OpenMP benchmarks regarding the variation of execution time, and determine the potential factors causing performance variability. Our work offers some…

    Submitted 9 November, 2023; originally announced November 2023.

    Comments: To appear at ROSS 2023 (International Workshop on Runtime and Operating Systems for Supercomputers), held in conjunction with SC23

  30. arXiv:2310.18075  [pdf, other]

    cs.CL cs.AI

    DUMA: a Dual-Mind Conversational Agent with Fast and Slow Thinking

    Authors: Xiaoyu Tian, Liangyu Chen, Na Liu, Yaxuan Liu, Wei Zou, Kaijiang Chen, Ming Cui

    Abstract: Inspired by the dual-process theory of human cognition, we introduce DUMA, a novel conversational agent framework that embodies a dual-mind mechanism through the utilization of two generative Large Language Models (LLMs) dedicated to fast and slow thinking respectively. The fast thinking model serves as the primary interface for external interactions and initial response generation, evaluating the…

    Submitted 24 November, 2023; v1 submitted 27 October, 2023; originally announced October 2023.

  31. arXiv:2310.14778  [pdf, other]

    cs.MM cs.SD eess.AS

    Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions

    Authors: Jinzheng Zhao, Yong Xu, Xinyuan Qian, Davide Berghi, Peipei Wu, Meng Cui, Jianyuan Sun, Philip J. B. Jackson, Wenwu Wang

    Abstract: Audio-visual speaker tracking has drawn increasing attention over the past few years due to its academic values and wide application. Audio and visual modalities can provide complementary information for localization and tracking. With audio and visual information, the Bayesian-based filter can solve the problem of data association, audio-visual fusion and track management. In this paper, we condu…

    Submitted 17 December, 2023; v1 submitted 23 October, 2023; originally announced October 2023.

  32. arXiv:2309.05058  [pdf, other]

    cs.SD cs.MM eess.AS

    Multimodal Fish Feeding Intensity Assessment in Aquaculture

    Authors: Meng Cui, Xubo Liu, Haohe Liu, Zhuangzhuang Du, Tao Chen, Guoping Lian, Daoliang Li, Wenwu Wang

    Abstract: Fish feeding intensity assessment (FFIA) aims to evaluate fish appetite changes during feeding, which is crucial in industrial aquaculture applications. Existing FFIA methods are limited by their robustness to noise, computational complexity, and the lack of public datasets for developing the models. To address these issues, we first introduce AV-FFIA, a new dataset containing 27,000 labeled audio…

    Submitted 19 May, 2024; v1 submitted 10 September, 2023; originally announced September 2023.

  33. arXiv:2309.04608  [pdf, other]

    cs.CV cs.MM

    Style Generation: Image Synthesis based on Coarsely Matched Texts

    Authors: Mengyao Cui, Zhe Zhu, Shao-Ping Lu, Yulu Yang

    Abstract: Previous text-to-image synthesis algorithms typically use explicit textual instructions to generate/manipulate images accurately, but they have difficulty adapting to guidance in the form of coarsely matched texts. In this work, we attempt to stylize an input image using such coarsely matched text as guidance. To tackle this new problem, we introduce a novel task called text-based style generation…

    Submitted 8 September, 2023; originally announced September 2023.

  34. arXiv:2308.04787  [pdf, other]

    cs.SE

    rCanary: Detecting Memory Leaks Across Semi-automated Memory Management Boundary in Rust

    Authors: Mohan Cui, Suran Sun, Hui Xu, Yangfan Zhou

    Abstract: Rust is an effective system programming language that guarantees memory safety via compile-time verifications. It employs a novel ownership-based resource management model to facilitate automated resource deallocation. It is anticipated that this model will eliminate memory leaks. However, we observed that user intervention driving semi-automated management is prone to introducing leaks. In contra…

    Submitted 9 August, 2023; originally announced August 2023.

  35. arXiv:2308.04785  [pdf, other]

    cs.SE

    Is unsafe an Achilles' Heel? A Comprehensive Study of Safety Requirements in Unsafe Rust Programming

    Authors: Mohan Cui, Suran Sun, Hui Xu, Yangfan Zhou

    Abstract: Rust is an emerging, strongly-typed programming language focusing on efficiency and memory safety. With increasing projects adopting Rust, knowing how to use Unsafe Rust is crucial for Rust security. We observed that the description of safety requirements needs to be unified in Unsafe Rust programming. Current unsafe API documents in the standard library exhibited variations, including inconsisten…

    Submitted 9 August, 2023; originally announced August 2023.

  36. arXiv:2307.16518  [pdf, other]

    cs.IT eess.SP

    Continuous-Time Channel Prediction Based on Tensor Neural Ordinary Differential Equation

    Authors: Mingyao Cui, Hao Jiang, Yuhao Chen, Yang Du, Linglong Dai

    Abstract: Channel prediction is critical to address the channel aging issue in mobile scenarios. Existing channel prediction techniques are mainly designed for discrete channel prediction, which can only predict the future channel in a fixed time slot per frame, while the other intra-frame channels are usually recovered by interpolation. However, these approaches suffer from a serious interpolation loss, es…

    Submitted 31 July, 2023; originally announced July 2023.

    Comments: A tensor neural ODE based method is proposed to predict continuous-time wireless channels

  37. arXiv:2307.14335  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS

    WavJourney: Compositional Audio Creation with Large Language Models

    Authors: Xubo Liu, Zhongkai Zhu, Haohe Liu, Yi Yuan, Meng Cui, Qiushi Huang, Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang

    Abstract: Despite breakthroughs in audio generation models, their capabilities are often confined to domain-specific conditions such as speech transcriptions and audio captions. However, real-world audio creation aims to generate harmonious audio containing various elements such as speech, music, and sound effects with controllable conditions, which is challenging to address using existing audio generation…

    Submitted 26 November, 2023; v1 submitted 26 July, 2023; originally announced July 2023.

    Comments: GitHub: https://github.com/Audio-AGI/WavJourney

  38. arXiv:2307.02909  [pdf, other]

    eess.AS cs.AI cs.SD

    Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

    Authors: Guinan Li, Jiajun Deng, Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Mingyu Cui, Helen Meng, Xunying Liu

    Abstract: Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberation remains a highly challenging task to date. Motivated by the invariance of visual modality to acoustic signal corruption, an audio-visual multi-channel speech separation, dereverberation and recognition approach featuring a full incorporation of visual information into all system components is pro…

    Submitted 6 July, 2023; originally announced July 2023.

    Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing

  39. arXiv:2306.14608  [pdf, other]

    eess.AS cs.CL

    Factorised Speaker-environment Adaptive Training of Conformer Speech Recognition Systems

    Authors: Jiajun Deng, Guinan Li, Xurong Xie, Zengrui Jin, Mingyu Cui, Tianzi Wang, Shujie Hu, Mengzhe Geng, Xunying Liu

    Abstract: Rich sources of variability in natural speech present significant challenges to current data intensive speech recognition technologies. To model both speaker and environment level diversity, this paper proposes a novel Bayesian factorised speaker-environment adaptive training and test time adaptation approach for Conformer ASR models. Speaker and environment level characteristics are separately mo…

    Submitted 26 June, 2023; originally announced June 2023.

    Comments: Accepted by INTERSPEECH 2023

  40. arXiv:2306.13307  [pdf, other]

    eess.AS cs.CL

    Towards Effective and Compact Contextual Representation for Conformer Transducer Speech Recognition Systems

    Authors: Mingyu Cui, Jiawen Kang, Jiajun Deng, Xi Yin, Yutao Xie, Xie Chen, Xunying Liu

    Abstract: Current ASR systems are mainly trained and evaluated at the utterance level. Long range cross utterance context can be incorporated. A key task is to derive a suitable compact representation of the most relevant history contexts. In contrast to previous researches based on either LSTM-RNN encoded histories that attenuate the information from longer range contexts, or frame level concatenation of t…

    Submitted 25 June, 2023; v1 submitted 23 June, 2023; originally announced June 2023.

    Comments: Accepted by INTERSPEECH 2023

  41. arXiv:2306.08865  [pdf, other]

    cs.CV cs.LG

    One-Shot Learning of Visual Path Navigation for Autonomous Vehicles

    Authors: Zhongying CuiZhu, Francois Charette, Amin Ghafourian, Debo Shi, Matthew Cui, Anjali Krishnamachar, Iman Soltani

    Abstract: Autonomous driving presents many challenges due to the large number of scenarios the autonomous vehicle (AV) may encounter. End-to-end deep learning models are comparatively simplistic models that can handle a broad set of scenarios. However, end-to-end models require large amounts of diverse data to perform well. This paper presents a novel deep neural network that performs image-to-steering path…

    Submitted 15 June, 2023; originally announced June 2023.

    Comments: Machine Learning for Autonomous Driving Workshop at the 35th Conference on Neural Information Processing Systems (NeurIPS 20222), New Orleans, USA

  42. arXiv:2305.16263  [pdf, other]

    cs.SD cs.AI cs.CL cs.LG eess.AS

    Unified Modeling of Multi-Talker Overlapped Speech Recognition and Diarization with a Sidecar Separator

    Authors: Lingwei Meng, Jiawen Kang, Mingyu Cui, Haibin Wu, Xixin Wu, Helen Meng

    Abstract: Multi-talker overlapped speech poses a significant challenge for speech recognition and diarization. Recent research indicated that these two tasks are inter-dependent and complementary, motivating us to explore a unified modeling method to address them in the context of overlapped speech. A recent study proposed a cost-effective method to convert a single-talker automatic speech recognition (ASR)…

    Submitted 25 May, 2023; originally announced May 2023.

    Comments: Accepted to INTERSPEECH 2023

  43. arXiv:2305.10659  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD

    Use of Speech Impairment Severity for Dysarthric Speech Recognition

    Authors: Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Jiajun Deng, Mingyu Cui, Guinan Li, Jianwei Yu, Xurong Xie, Xunying Liu

    Abstract: A key challenge in dysarthric speech recognition is the speaker-level diversity attributed to both speaker-identity associated factors such as gender, and speech impairment severity. Most prior research addressing this issue focused on using speaker-identity only. To this end, this paper proposes a novel set of techniques to use both severity and speaker-identity in dysarthric speech recognit…

    Submitted 17 May, 2023; originally announced May 2023.

    Comments: Accepted to INTERSPEECH 2023

  44. arXiv:2304.03898  [pdf, other]

    cs.CL cs.AI

    The Short Text Matching Model Enhanced with Knowledge via Contrastive Learning

    Authors: Ruiqiang Liu, Qiqiang Zhong, Mengmeng Cui, Hanjie Mai, Qiang Zhang, Shaohua Xu, Xiangzheng Liu, Yanlong Du

    Abstract: In recent years, short text matching tasks have been widely applied in the fields of advertising search and recommendation. The difficulty lies in the lack of semantic information and word ambiguity caused by the short length of the text. Previous works have introduced complement sentences or knowledge bases to provide additional feature information. However, these methods have not fully interacted…

    Submitted 19 December, 2023; v1 submitted 7 April, 2023; originally announced April 2023.

    Comments: 11 pages, 2 figures

  45. arXiv:2302.14564  [pdf, other]

    cs.SD cs.AI eess.AS

    Exploring Self-supervised Pre-trained ASR Models For Dysarthric and Elderly Speech Recognition

    Authors: Shujie Hu, Xurong Xie, Zengrui Jin, Mengzhe Geng, Yi Wang, Mingyu Cui, Jiajun Deng, Xunying Liu, Helen Meng

    Abstract: Automatic recognition of disordered and elderly speech remains a highly challenging task to date due to the difficulty in collecting such data in large quantities. This paper explores a series of approaches to integrate domain adapted SSL pre-trained models into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition: a) input feature fusion between standard acoustic frontends…

    Submitted 22 June, 2023; v1 submitted 28 February, 2023; originally announced February 2023.

    Comments: Accepted by ICASSP 2023

  46. arXiv:2302.14434  [pdf, other]

    cs.CV

    A Hierarchical Representation Network for Accurate and Detailed Face Reconstruction from In-The-Wild Images

    Authors: Biwen Lei, Jianqiang Ren, Mengyang Feng, Miaomiao Cui, Xuansong Xie

    Abstract: Limited by the nature of the low-dimensional representational capacity of 3DMM, most of the 3DMM-based face reconstruction (FR) methods fail to recover high-frequency facial details, such as wrinkles, dimples, etc. Some attempt to solve the problem by introducing detail maps or non-linear operations; however, the results are still not vivid. To this end, in this paper we present a novel hierarchic…

    Submitted 28 March, 2023; v1 submitted 28 February, 2023; originally announced February 2023.

    Comments: Accepted by CVPR 2023

  47. arXiv:2302.09908  [pdf, other]

    cs.SD cs.AI cs.CL cs.LG eess.AS

    A Sidecar Separator Can Convert a Single-Talker Speech Recognition System to a Multi-Talker One

    Authors: Lingwei Meng, Jiawen Kang, Mingyu Cui, Yuejiao Wang, Xixin Wu, Helen Meng

    Abstract: Although automatic speech recognition (ASR) can perform well in common non-overlapping environments, sustaining performance in multi-talker overlapping speech recognition remains challenging. Recent research revealed that an ASR model's encoder captures different levels of information with different layers -- the lower layers tend to have more acoustic information, and the upper layers more linguisti…

    Submitted 5 March, 2023; v1 submitted 20 February, 2023; originally announced February 2023.

    Comments: Accepted by IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023

  48. arXiv:2302.07521  [pdf, other]

    eess.AS cs.SD

    Confidence Score Based Speaker Adaptation of Conformer Speech Recognition Systems

    Authors: Jiajun Deng, Xurong Xie, Tianzi Wang, Mingyu Cui, Boyang Xue, Zengrui Jin, Guinan Li, Shujie Hu, Xunying Liu

    Abstract: Speaker adaptation techniques provide a powerful solution to customise automatic speech recognition (ASR) systems for individual users. Practical application of unsupervised model-based speaker adaptation techniques to data intensive end-to-end ASR systems is hindered by the scarcity of speaker-level data and performance sensitivity to transcription errors. To address these issues, a set of compac…

    Submitted 15 February, 2023; originally announced February 2023.

    Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing

  49. arXiv:2212.14654  [pdf, other]

    cs.IT eess.SP

    Enabling More Users to Benefit from Near-Field Communications: From Linear to Circular Array

    Authors: Zidong Wu, Mingyao Cui, Linglong Dai

    Abstract: Massive multiple-input multiple-output (MIMO) for 5G is evolving into the extremely large-scale antenna array (ELAA) to increase the spectrum efficiency by orders of magnitude for 6G communications. ELAA introduces spherical-wave-based near-field communications, where channel capacity can be significantly improved for single-user and multi-user scenarios. Unfortunately, the near-field region at la…

    Submitted 30 October, 2023; v1 submitted 30 December, 2022; originally announced December 2022.

    Comments: Accepted by IEEE TWC. In this paper, the rotational symmetry of UCA is leveraged to provide uniform and enlarged near-field regions, enabling more users to benefit from near-field communications. Simulation codes will be provided to reproduce the results in this paper: http://oa.ee.tsinghua.edu.cn/dailinglong/publications/publications.html

  50. arXiv:2212.08401  [pdf, other]

    cs.IT eess.SP

    Near-Field Wideband Channel Estimation for Extremely Large-Scale MIMO

    Authors: Mingyao Cui, Linglong Dai

    Abstract: Extremely large-scale multiple-input-multiple-output (XL-MIMO) at millimeter-wave (mmWave) and terahertz (THz) bands plays an important role in supporting extremely high beamforming gain as well as ultra-wideband spectrum resources. Unfortunately, accurate wideband XL-MIMO channel estimation suffers from a new challenge called the near-field beam split effect. Prior works either neglect the acc…

    Submitted 16 December, 2022; originally announced December 2022.

    Comments: This paper has been accepted by Science China Information Sciences. Simulation codes will be provided to reproduce the results in this paper: http://oa.ee.tsinghua.edu.cn/dailinglong/publications/publications.html