Showing 1–12 of 12 results for author: Laptev, A

  1. arXiv:2406.07096  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter

    Authors: Andrei Andrusenko, Aleksandr Laptev, Vladimir Bataev, Vitaly Lavrukhin, Boris Ginsburg

    Abstract: Accurate recognition of rare and new words remains a pressing problem for contextualized Automatic Speech Recognition (ASR) systems. Most context-biasing methods involve modification of the ASR model or the beam-search decoding algorithm, complicating model reuse and slowing down inference. This work presents a new approach to fast context-biasing with CTC-based Word Spotter (CTC-WS) for CTC and T…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  2. arXiv:2310.12378  [pdf, other]

    eess.AS cs.SD

    The CHiME-7 Challenge: System Description and Performance of NeMo Team's DASR System

    Authors: Tae Jin Park, He Huang, Ante Jukic, Kunal Dhawan, Krishna C. Puvvada, Nithin Koluguri, Nikolay Karpov, Aleksandr Laptev, Jagadeesh Balam, Boris Ginsburg

    Abstract: We present the NVIDIA NeMo team's multi-channel speech recognition system for the 7th CHiME Challenge Distant Automatic Speech Recognition (DASR) Task, focusing on the development of a multi-channel, multi-speaker speech recognition system tailored to transcribe speech from distributed microphones and microphone arrays. The system predominantly comprises of the following integral modules: the Spea…

    Submitted 18 October, 2023; originally announced October 2023.

    Journal ref: CHiME-7 Workshop 2023

  3. Confidence-based Ensembles of End-to-End Speech Recognition Models

    Authors: Igor Gitman, Vitaly Lavrukhin, Aleksandr Laptev, Boris Ginsburg

    Abstract: The number of end-to-end speech recognition models grows every year. These models are often adapted to new domains or languages resulting in a proliferation of expert systems that achieve great results on target data, while generally showing inferior performance outside of their domain of expertise. We explore combination of such experts via confidence-based ensembles: ensembles of models where on…

    Submitted 27 June, 2023; originally announced June 2023.

    Comments: To appear in Proc. INTERSPEECH 2023, August 20-24, 2023, Dublin, Ireland

  4. arXiv:2303.10384  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    Powerful and Extensible WFST Framework for RNN-Transducer Losses

    Authors: Aleksandr Laptev, Vladimir Bataev, Igor Gitman, Boris Ginsburg

    Abstract: This paper presents a framework based on Weighted Finite-State Transducers (WFST) to simplify the development of modifications for RNN-Transducer (RNN-T) loss. Existing implementations of RNN-T use CUDA-related code, which is hard to extend and debug. WFSTs are easy to construct and extend, and allow debugging through visualization. We introduce two WFST-powered RNN-T implementations: (1) "Compose…

    Submitted 18 March, 2023; originally announced March 2023.

    Comments: To appear in Proc. ICASSP 2023, June 04-10, 2023, Rhodes island, Greece. 5 pages, 5 figures, 3 tables

  5. arXiv:2212.08703  [pdf, other]

    eess.AS cs.CL cs.IT cs.LG

    Fast Entropy-Based Methods of Word-Level Confidence Estimation for End-To-End Automatic Speech Recognition

    Authors: Aleksandr Laptev, Boris Ginsburg

    Abstract: This paper presents a class of new fast non-trainable entropy-based confidence estimation methods for automatic speech recognition. We show how per-frame entropy values can be normalized and aggregated to obtain a confidence measure per unit and per word for Connectionist Temporal Classification (CTC) and Recurrent Neural Network Transducer (RNN-T) models. Proposed methods have similar computation…

    Submitted 16 December, 2022; originally announced December 2022.

    Comments: To appear in Proc. SLT 2022, Jan 09-12, 2023, Doha, Qatar. 8 pages, 4 figures, 4 tables

  6. CTC Variations Through New WFST Topologies

    Authors: Aleksandr Laptev, Somshubra Majumdar, Boris Ginsburg

    Abstract: This paper presents novel Weighted Finite-State Transducer (WFST) topologies to implement Connectionist Temporal Classification (CTC)-like algorithms for automatic speech recognition. Three new CTC variants are proposed: (1) the "compact-CTC", in which direct transitions between units are replaced with <epsilon> back-off transitions; (2) the "minimal-CTC", that only adds <blank> self-loops when us…

    Submitted 26 June, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

    Comments: Accepted to Interspeech 2022, 5 pages, 2 figures, 7 tables

  7. arXiv:2104.02526  [pdf, ps, other]

    eess.AS cs.CL cs.LG

    LT-LM: a novel non-autoregressive language model for single-shot lattice rescoring

    Authors: Anton Mitrofanov, Mariya Korenevskaya, Ivan Podluzhny, Yuri Khokhlov, Aleksandr Laptev, Andrei Andrusenko, Aleksei Ilin, Maxim Korenevsky, Ivan Medennikov, Aleksei Romanenko

    Abstract: Neural network-based language models are commonly used in rescoring approaches to improve the quality of modern automatic speech recognition (ASR) systems. Most of the existing methods are computationally expensive since they use autoregressive language models. We propose a novel rescoring approach, which processes the entire lattice in a single call to the model. The key feature of our rescoring…

    Submitted 6 April, 2021; originally announced April 2021.

    Comments: Submitted to InterSpeech 2021

  8. arXiv:2103.07186  [pdf, ps, other]

    eess.AS cs.CL cs.LG cs.SD

    Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition

    Authors: Aleksandr Laptev, Andrei Andrusenko, Ivan Podluzhny, Anton Mitrofanov, Ivan Medennikov, Yuri Matveev

    Abstract: With the rapid development of speech assistants, adapting server-intended automatic speech recognition (ASR) solutions to a direct device has become crucial. Researchers and industry prefer to use end-to-end ASR systems for on-device speech recognition tasks. This is because end-to-end systems can be made resource-efficient while maintaining a higher quality compared to hybrid systems. However, bu…

    Submitted 12 March, 2021; originally announced March 2021.

    Comments: 16 pages, 7 figures

  9. arXiv:2006.08274  [pdf, ps, other]

    eess.AS cs.CL cs.LG cs.SD

    Exploration of End-to-End ASR for OpenSTT -- Russian Open Speech-to-Text Dataset

    Authors: Andrei Andrusenko, Aleksandr Laptev, Ivan Medennikov

    Abstract: This paper presents an exploration of end-to-end automatic speech recognition systems (ASR) for the largest open-source Russian language data set -- OpenSTT. We evaluate different existing end-to-end approaches such as joint CTC/Attention, RNN-Transducer, and Transformer. All of them are compared with the strong hybrid ASR system based on LF-MMI TDNN-F acoustic model. For the three available valid…

    Submitted 26 July, 2020; v1 submitted 15 June, 2020; originally announced June 2020.

    Comments: Accepted by SPECOM 2020

  10. Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario

    Authors: Ivan Medennikov, Maxim Korenevsky, Tatiana Prisyach, Yuri Khokhlov, Mariya Korenevskaya, Ivan Sorokin, Tatiana Timofeeva, Anton Mitrofanov, Andrei Andrusenko, Ivan Podluzhny, Aleksandr Laptev, Aleksei Romanenko

    Abstract: Speaker diarization for real-life scenarios is an extremely challenging problem. Widely used clustering-based diarization approaches perform rather poorly in such conditions, mainly due to the limited ability to handle overlapping speech. We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach, which directly predicts an activity of each speaker on each time frame. TS-VAD mode…

    Submitted 27 July, 2020; v1 submitted 14 May, 2020; originally announced May 2020.

    Comments: Accepted to Interspeech 2020

  11. arXiv:2005.07157  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation

    Authors: Aleksandr Laptev, Roman Korostik, Aleksey Svischev, Andrei Andrusenko, Ivan Medennikov, Sergey Rybin

    Abstract: Data augmentation is one of the most effective ways to make end-to-end automatic speech recognition (ASR) perform close to the conventional hybrid approach, especially when dealing with low-resource tasks. Using recent advances in speech synthesis (text-to-speech, or TTS), we build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition mo…

    Submitted 30 July, 2020; v1 submitted 14 May, 2020; originally announced May 2020.

  12. arXiv:2004.10799  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription

    Authors: Andrei Andrusenko, Aleksandr Laptev, Ivan Medennikov

    Abstract: While end-to-end ASR systems have proven competitive with the conventional hybrid approach, they are prone to accuracy degradation when it comes to noisy and low-resource conditions. In this paper, we argue that, even in such difficult cases, some end-to-end approaches show performance close to the hybrid baseline. To demonstrate this, we use the CHiME-6 Challenge data as an example of challenging…

    Submitted 7 August, 2020; v1 submitted 22 April, 2020; originally announced April 2020.

    Comments: Accepted by Interspeech 2020