Visual Information Matters for ASR Error Correction

Kumar, Vanya Bannihatti; Cheng, Shanbo; Peng, Ningxin; Zhang, Yuchen

arXiv:2303.10160 (eess)

[Submitted on 16 Mar 2023 (v1), last revised 26 May 2023 (this version, v2)]

Abstract: Aiming to improve Automatic Speech Recognition (ASR) outputs with a post-processing step, ASR error correction (EC) techniques have been widely developed because they make efficient use of parallel text data. Previous work mainly focuses on text and/or speech data, which limits the performance gain when other modalities, such as visual information, are also critical for EC. The challenges are twofold: first, previous work has not emphasized visual information, so it remains little explored; second, the community lacks a high-quality benchmark in which visual information matters for EC models. Therefore, this paper provides 1) simple yet effective methods, namely gated fusion and image captions as prompts, to incorporate visual information into EC; and 2) a large-scale benchmark dataset, Visual-ASR-EC, in which each training item consists of visual, speech, and text information, and the test data are carefully selected by human annotators to ensure that even humans could make mistakes when visual information is missing. Experimental results show that using captions as prompts makes effective use of the visual information and surpasses state-of-the-art methods by up to 1.2% in Word Error Rate (WER), which also indicates that visual information is critical in the proposed Visual-ASR-EC dataset.

Comments:	Accepted at ICASSP 2023
Subjects:	Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Cite as:	arXiv:2303.10160 [eess.AS]
	(or arXiv:2303.10160v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2303.10160

Submission history

From: Vanya Bannihatti Kumar Ms
[v1] Thu, 16 Mar 2023 06:33:53 UTC (1,538 KB)
[v2] Fri, 26 May 2023 08:37:06 UTC (1,399 KB)
