Showing 1–8 of 8 results for author: Bataev, V

Searching in archive cs.
  1. arXiv:2406.07096  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter

    Authors: Andrei Andrusenko, Aleksandr Laptev, Vladimir Bataev, Vitaly Lavrukhin, Boris Ginsburg

    Abstract: Accurate recognition of rare and new words remains a pressing problem for contextualized Automatic Speech Recognition (ASR) systems. Most context-biasing methods involve modification of the ASR model or the beam-search decoding algorithm, complicating model reuse and slowing down inference. This work presents a new approach to fast context-biasing with CTC-based Word Spotter (CTC-WS) for CTC and T…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  2. arXiv:2406.06220  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    Label-Looping: Highly Efficient Decoding for Transducers

    Authors: Vladimir Bataev, Hainan Xu, Daniel Galvez, Vitaly Lavrukhin, Boris Ginsburg

    Abstract: This paper introduces a highly efficient greedy decoding algorithm for Transducer inference. We propose a novel data structure using CUDA tensors to represent partial hypotheses in a batch that supports parallelized hypothesis manipulations. During decoding, our algorithm maximizes GPU parallelism by adopting a nested-loop design, where the inner loop consumes all blank predictions, while non-blan…

    Submitted 10 June, 2024; originally announced June 2024.

  3. arXiv:2406.03791  [pdf, other]

    cs.LG

    Speed of Light Exact Greedy Decoding for RNN-T Speech Recognition Models on GPU

    Authors: Daniel Galvez, Vladimir Bataev, Hainan Xu, Tim Kaldewey

    Abstract: The vast majority of inference time for RNN Transducer (RNN-T) models today is spent on decoding. Current state-of-the-art RNN-T decoding implementations leave the GPU idle ~80% of the time. Leveraging a new CUDA 12.4 feature, CUDA graph conditional nodes, we present an exact GPU-based implementation of greedy decoding for RNN-T models that eliminates this idle time. Our optimizations speed up a 1…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Interspeech 2024 Proceedings

  4. arXiv:2303.10384  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    Powerful and Extensible WFST Framework for RNN-Transducer Losses

    Authors: Aleksandr Laptev, Vladimir Bataev, Igor Gitman, Boris Ginsburg

    Abstract: This paper presents a framework based on Weighted Finite-State Transducers (WFST) to simplify the development of modifications for RNN-Transducer (RNN-T) loss. Existing implementations of RNN-T use CUDA-related code, which is hard to extend and debug. WFSTs are easy to construct and extend, and allow debugging through visualization. We introduce two WFST-powered RNN-T implementations: (1) "Compose…

    Submitted 18 March, 2023; originally announced March 2023.

    Comments: To appear in Proc. ICASSP 2023, June 04-10, 2023, Rhodes Island, Greece. 5 pages, 5 figures, 3 tables

  5. arXiv:2302.14036  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator

    Authors: Vladimir Bataev, Roman Korostik, Evgeny Shabalin, Vitaly Lavrukhin, Boris Ginsburg

    Abstract: We propose an end-to-end Automatic Speech Recognition (ASR) system that can be trained on transcribed speech data, text-only data, or a mixture of both. The proposed model uses an integrated auxiliary block for text-based training. This block combines a non-autoregressive multi-speaker text-to-mel-spectrogram generator with a GAN-based enhancer to improve the spectrogram quality. The proposed syst…

    Submitted 16 August, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

    Comments: Accepted to INTERSPEECH 2023

  6. arXiv:2103.09354  [pdf, other]

    cs.CV cs.AI cs.LG

    Digital Peter: Dataset, Competition and Handwriting Recognition Methods

    Authors: Mark Potanin, Denis Dimitrov, Alex Shonenkov, Vladimir Bataev, Denis Karachev, Maxim Novopoltsev

    Abstract: This paper presents a new dataset of Peter the Great's manuscripts and describes a segmentation procedure that converts initial images of documents into lines. The new dataset may be useful for researchers to train handwriting text recognition models and as a benchmark for comparing different models. It consists of 9 694 images and text files corresponding to lines in historical documents. The ope…

    Submitted 27 August, 2021; v1 submitted 16 March, 2021; originally announced March 2021.

    Comments: 17 pages, 7 figures, submitted to ICDAR 2021

    ACM Class: I.7.5; I.4.6

  7. arXiv:2003.09024  [pdf, ps, other]

    cs.CL cs.LG

    Techniques for Vocabulary Expansion in Hybrid Speech Recognition Systems

    Authors: Nikolay Malkovsky, Vladimir Bataev, Dmitrii Sviridkin, Natalia Kizhaeva, Aleksandr Laptev, Ildar Valiev, Oleg Petrov

    Abstract: The problem of out-of-vocabulary (OOV) words is typical for any speech recognition system: hybrid systems are usually constructed to recognize a fixed set of words and can rarely include all the words that will be encountered during use of the system. One popular approach to covering OOVs is to use subword units rather than words. Such a system can potentially recognize any previously u…

    Submitted 19 March, 2020; originally announced March 2020.

    Comments: Submitted to Interspeech 2020

  8. arXiv:1807.00868  [pdf, other]

    cs.SD cs.CL eess.AS

    Exploring End-to-End Techniques for Low-Resource Speech Recognition

    Authors: Vladimir Bataev, Maxim Korenevsky, Ivan Medennikov, Alexander Zatvornitskiy

    Abstract: In this work we present a simple grapheme-based system for low-resource speech recognition using Babel data for Turkish spontaneous speech (80 hours). We investigated the performance of different neural network architectures, including fully convolutional, recurrent, and ResNet with GRU. Different features and normalization techniques are compared as well. We also proposed a CTC-loss modification using s…

    Submitted 2 July, 2018; originally announced July 2018.

    Comments: Accepted for Specom 2018, 20th International Conference on Speech and Computer