Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation

Qiushi Zhu; Jie Zhang; Yu Gu; Yuchen Hu; Lirong Dai

doi:10.1609/aaai.v38i17.29951

Authors

Qiushi Zhu, University of Science and Technology of China
Jie Zhang, University of Science and Technology of China (USTC)
Yu Gu, Tencent AI Lab
Yuchen Hu, Nanyang Technological University
Lirong Dai, University of Science and Technology of China

DOI:

https://doi.org/10.1609/aaai.v38i17.29951

Keywords:

NLP: Speech, ML: Multimodal Learning

Abstract

Self-supervised speech pre-training methods have developed rapidly in recent years and have proven very effective for many near-field single-channel speech tasks. However, far-field multichannel speech processing suffers from the scarcity of labeled multichannel data and from complex ambient noise. The efficacy of self-supervised learning for far-field multichannel and multi-modal speech processing has not been well explored. Considering that visual information helps to improve speech recognition performance in noisy scenes, in this work we propose a multichannel multi-modal speech self-supervised learning framework, AV-wav2vec2, which takes video and multichannel audio data as inputs. First, we propose a multi-path structure to process the multichannel audio streams and a visual stream in parallel, with intra- and inter-channel contrastive losses as training targets to fully exploit the rich information in multichannel speech data. Second, based on contrastive learning, we use additional single-channel audio data, which is trained jointly to improve the performance of the multichannel multi-modal representation. Finally, we use a Chinese multichannel multi-modal dataset collected in real scenarios to validate the effectiveness of the proposed method on audio-visual speech recognition (AVSR), automatic speech recognition (ASR), visual speech recognition (VSR) and audio-visual speaker diarization (AVSD) tasks.
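
The abstract names intra- and inter-channel contrastive training targets but does not give their form. The following is a minimal illustrative sketch, not the authors' implementation. It assumes a wav2vec 2.0-style InfoNCE objective in which each time step's context vector should match the target vector at the same time step, with the other time steps serving as negatives. It further assumes that "intra-channel" pairs contexts and targets from the same audio channel, while "inter-channel" pairs contexts from one channel with targets from a different channel. The function names, tensor shapes and temperature value are hypothetical.

```python
import numpy as np


def info_nce(context, targets, temperature=0.1):
    """InfoNCE over time steps (assumed form, not taken from the paper).

    context, targets: arrays of shape (T, D). Row t of `context` is treated
    as the query whose positive is row t of `targets`; all other rows of
    `targets` act as negatives.
    """
    c = context / np.linalg.norm(context, axis=1, keepdims=True)
    q = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = c @ q.T / temperature                    # (T, T) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # subtract row max for stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))        # positives sit on the diagonal


def multichannel_contrastive_losses(contexts, targets, temperature=0.1):
    """Return (intra, inter) losses for arrays of shape (C, T, D).

    Assumption: intra-channel uses targets from the same channel, and
    inter-channel uses targets from every other channel.
    """
    num_channels = contexts.shape[0]
    intra_terms = [
        info_nce(contexts[i], targets[i], temperature)
        for i in range(num_channels)
    ]
    inter_terms = [
        info_nce(contexts[i], targets[j], temperature)
        for i in range(num_channels)
        for j in range(num_channels)
        if i != j
    ]
    intra = float(np.mean(intra_terms))
    # With a single channel there are no cross-channel pairs.
    inter = float(np.mean(inter_terms)) if inter_terms else 0.0
    return intra, inter


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ctx = rng.normal(size=(2, 50, 16))               # 2 channels, 50 frames, 16 dims
    tgt = ctx + 0.1 * rng.normal(size=ctx.shape)     # noisy copies stand in for targets
    print(multichannel_contrastive_losses(ctx, tgt))
```

In a real training setup the two terms would typically be combined with weighting coefficients and with the visual stream's objective; those details are not stated in the abstract and are left out here.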
