Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation

Zhu, Jiaxu; Tong, Weinan; Xu, Yaoxun; Song, Changhe; Wu, Zhiyong; You, Zhao; Su, Dan; Yu, Dong; Meng, Helen

doi:10.21437/Interspeech.2023-1378


arXiv:2309.02459 (cs)

[Submitted on 4 Sep 2023 (v1), last revised 7 Oct 2023 (this version, v2)]


Abstract: Mapping two modalities, speech and text, into a shared representation space is a research topic aimed at using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains. However, speech representations and text representations differ in length. Although a previous method up-samples the text representation to align with the acoustic modality, the up-sampled length may not match the actual duration. In this paper, we propose a novel representation-matching strategy that down-samples the acoustic representation to align with the text modality. By introducing a continuous integrate-and-fire (CIF) module that generates acoustic representations consistent with the token length, our ASR model can better learn unified representations from both modalities, allowing for domain adaptation using text-only data from the target domain. Experimental results on new-domain data demonstrate the effectiveness of the proposed method.
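To illustrate how a CIF-style module can shorten a frame-level sequence to token length, here is a minimal sketch of the generic integrate-and-fire accumulation. It is not the authors' implementation: the function name `cif_downsample`, the firing threshold of 1.0, and the hand-picked weights are assumptions made for this example. In a trained model the per-frame weights would be predicted by the network rather than supplied by hand.

```python
from typing import List, Sequence


def cif_downsample(
    frames: Sequence[Sequence[float]],
    alphas: Sequence[float],
    threshold: float = 1.0,  # assumed firing threshold for this sketch
) -> List[List[float]]:
    """Accumulate weighted frames and emit one vector each time the
    accumulated weight reaches the threshold (hypothetical helper)."""
    if len(frames) != len(alphas):
        raise ValueError("frames and alphas must have the same length")
    dim = len(frames[0]) if frames else 0
    eps = 1e-9  # tolerance for floating-point round-off at the boundary
    outputs: List[List[float]] = []
    acc_weight = 0.0
    acc_vec = [0.0] * dim
    for frame, alpha in zip(frames, alphas):
        remaining = alpha
        # A frame that crosses the threshold is split: part of its weight
        # completes the current token, the rest starts the next one.
        while acc_weight + remaining >= threshold - eps:
            used = threshold - acc_weight
            outputs.append([a + used * x for a, x in zip(acc_vec, frame)])
            remaining = max(0.0, remaining - used)
            acc_weight = 0.0
            acc_vec = [0.0] * dim
        acc_weight += remaining
        acc_vec = [a + remaining * x for a, x in zip(acc_vec, frame)]
    # Leftover weight below the threshold is discarded in this sketch.
    return outputs


# Three 1-dimensional frames whose weights sum to 2.0 yield two outputs.
tokens = cif_downsample([[1.0], [2.0], [3.0]], [0.6, 0.6, 0.8])
print(len(tokens))
```

Because the number of emitted vectors is governed by the total accumulated weight rather than by the number of frames, the output length can be made to follow the token count, which is the property the abstract relies on for matching the text modality.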

Comments:	Proceedings of Interspeech. arXiv admin note: text overlap with arXiv:2309.01437
Subjects:	Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2309.02459 [cs.SD]
	(or arXiv:2309.02459v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2309.02459
Related DOI:	https://doi.org/10.21437/Interspeech.2023-1378

Submission history

From: Jiaxu Zhu
[v1] Mon, 4 Sep 2023 08:52:59 UTC (955 KB)
[v2] Sat, 7 Oct 2023 04:23:48 UTC (955 KB)
