Partitioned Gradient Matching-based Data Subset Selection for Compute-Efficient Robust ASR Training

Mittal, Ashish; Sivasubramanian, Durga; Iyer, Rishabh; Jyothi, Preethi; Ramakrishnan, Ganesh

arXiv:2210.16892 (cs)

[Submitted on 30 Oct 2022]

Abstract: Training state-of-the-art ASR systems such as RNN-T often carries a high financial and environmental cost. Training on a subset of the training data could mitigate this problem if the selected subset achieves performance on par with training on the entire dataset. Although many data subset selection (DSS) algorithms exist, applying them directly to RNN-T is difficult, especially for adaptive DSS algorithms that use learning dynamics such as gradients, because RNN-T models tend to have gradients with a significantly larger memory footprint. In this paper, we propose Partitioned Gradient Matching (PGM), a novel distributable DSS algorithm suitable for massive datasets like those used to train RNN-T. Through extensive experiments on Librispeech 100H and Librispeech 960H, we show that PGM achieves a 3x to 6x speedup with only a very small accuracy degradation (under 1% absolute WER difference). In addition, we demonstrate similar results for PGM even in settings where the training data is corrupted with noise.
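
The abstract does not spell out the algorithm, so the following is only an illustrative sketch of how a partitioned gradient-matching selection might be organised, not the authors' implementation. It assumes that per-example gradients are already available as rows of a NumPy array, that the data is split into random partitions of roughly equal size, and that each partition independently picks examples whose weighted gradient sum approximates that partition's full gradient sum via a greedy orthogonal matching pursuit. The function names `omp_select` and `partitioned_select` are hypothetical, and the plain least-squares weights (with no non-negativity constraint) are a simplification.

```python
import numpy as np


def omp_select(grads, budget):
    """Greedily pick `budget` rows of `grads` whose weighted sum
    approximates the sum of all rows (orthogonal matching pursuit)."""
    target = grads.sum(axis=0)
    residual = target.copy()
    taken = np.zeros(len(grads), dtype=bool)
    chosen = []
    weights = np.zeros(0)
    for _ in range(min(budget, len(grads))):
        # Score each row by its alignment with the current residual.
        scores = grads @ residual
        scores[taken] = -np.inf  # never pick the same row twice
        best = int(np.argmax(scores))
        taken[best] = True
        chosen.append(best)
        # Refit the weights of all chosen rows by least squares.
        weights, *_ = np.linalg.lstsq(grads[chosen].T, target, rcond=None)
        residual = target - grads[chosen].T @ weights
    return chosen, weights


def partitioned_select(grads, num_partitions, budget, seed=0):
    """Split the examples into random partitions, run the matching step
    on each partition separately, and return the union of the picks."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(grads)), num_partitions)
    per_part = budget // num_partitions
    selected = []
    for part in parts:
        local_idx, _ = omp_select(grads[part], per_part)
        # Map partition-local positions back to global example indices.
        selected.extend(int(part[i]) for i in local_idx)
    return sorted(selected)


if __name__ == "__main__":
    # Toy stand-in for per-example gradients: 40 examples, 8 dimensions.
    toy_grads = np.random.default_rng(1).normal(size=(40, 8))
    print(partitioned_select(toy_grads, num_partitions=4, budget=12))
```

Because each partition only ever touches its own slice of the gradient matrix, the loop over partitions could in principle be run on separate workers, which is one reading of the "distributable" property mentioned in the abstract.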

Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2210.16892 [cs.LG]
	(or arXiv:2210.16892v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2210.16892

Submission history

From: Durga S
[v1] Sun, 30 Oct 2022 17:22:57 UTC (917 KB)
