Andrew Rouditchenko

PhD Student at MIT CSAIL
Cited by 1330

The sound of pixels

H Zhao, C Gan, A Rouditchenko… - Proceedings of the …, 2018 - openaccess.thecvf.com
We introduce PixelPlayer, a system that, by leveraging large amounts of unlabeled videos,
learns to locate image regions which produce sounds and separate the input sounds into a …

Everything at once - multi-modal fusion transformer for video retrieval

…, B Chen, A Rouditchenko… - Proceedings of the …, 2022 - openaccess.thecvf.com
Multi-modal learning from video data has seen increased attention recently as it allows
training of semantically meaningful embeddings without human annotation, enabling tasks like …

AVLnet: Learning audio-visual language representations from instructional videos

A Rouditchenko, A Boggust, D Harwath, B Chen… - arXiv preprint arXiv …, 2020 - arxiv.org
Current methods for learning visually grounded language from videos often rely on text
annotation, such as human generated captions or machine generated automatic speech …

Contrastive audio-visual masked autoencoder

Y Gong, A Rouditchenko, AH Liu, D Harwath… - arXiv preprint arXiv …, 2022 - arxiv.org
In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single
modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio-…

Self-supervised audio-visual co-segmentation

A Rouditchenko, H Zhao, C Gan… - ICASSP 2019-2019 …, 2019 - ieeexplore.ieee.org
Segmenting objects in images and separating sound sources in audio are challenging tasks,
in part because traditional approaches require large amounts of labeled data. In this paper …

Multimodal clustering networks for self-supervised learning from unlabeled videos

B Chen, A Rouditchenko, K Duarte… - Proceedings of the …, 2021 - openaccess.thecvf.com
Multimodal self-supervised learning is getting more and more attention as it allows not only
to train large networks without human supervision but also to search and retrieve data across …

CMKD: CNN/Transformer-based cross-model knowledge distillation for audio classification

Y Gong, S Khurana, A Rouditchenko… - arXiv preprint arXiv …, 2022 - arxiv.org
Audio classification is an active research area with a wide range of applications. Over the
past decade, convolutional neural networks (CNNs) have been the de-facto standard building …

Cross-modal discrete representation learning

AH Liu, SY Jin, CIJ Lai, A Rouditchenko, A Oliva… - arXiv preprint arXiv …, 2021 - arxiv.org
Recent advances in representation learning have demonstrated an ability to represent
information from different modalities such as video, text, and audio in a single high-level …

Comparison of multilingual self-supervised and weakly-supervised speech pre-training for adaptation to unseen languages

A Rouditchenko, S Khurana, S Thomas, R Feris… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent models such as XLS-R and Whisper have made multilingual speech technologies
more accessible by pre-training on audio from around 100 spoken languages each. However, …

AV-CPL: Continuous pseudo-labeling for audio-visual speech recognition

A Rouditchenko, R Collobert… - arXiv preprint arXiv …, 2023 - arxiv.org
Audio-visual speech contains synchronized audio and visual information that provides cross-modal
supervision to learn representations for both automatic speech recognition (ASR) …