Google Scholar

User profiles for Andrew Rouditchenko

Andrew Rouditchenko

PhD Student at MIT CSAIL

Verified email at mit.edu

Cited by 1330

The sound of pixels

H Zhao, C Gan, A Rouditchenko… - Proceedings of the …, 2018 - openaccess.thecvf.com

We introduce PixelPlayer, a system that, by leveraging large amounts of unlabeled videos,
learns to locate image regions which produce sounds and separate the input sounds into a …

Cited by 585

Everything at once - multi-modal fusion transformer for video retrieval

…, B Chen, A Rouditchenko… - Proceedings of the …, 2022 - openaccess.thecvf.com

Multi-modal learning from video data has seen increased attention recently as it allows
training of semantically meaningful embeddings without human annotation, enabling tasks like …

Cited by 144

AVLnet: Learning audio-visual language representations from instructional videos

A Rouditchenko, A Boggust, D Harwath, B Chen… - arXiv preprint arXiv …, 2020 - arxiv.org

Current methods for learning visually grounded language from videos often rely on text
annotation, such as human generated captions or machine generated automatic speech …

Cited by 145

Contrastive audio-visual masked autoencoder

Y Gong, A Rouditchenko, AH Liu, D Harwath… - arXiv preprint arXiv …, 2022 - arxiv.org

In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single
modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio-…

Cited by 106

Self-supervised audio-visual co-segmentation

A Rouditchenko, H Zhao, C Gan… - ICASSP 2019-2019 …, 2019 - ieeexplore.ieee.org

Segmenting objects in images and separating sound sources in audio are challenging tasks,
in part because traditional approaches require large amounts of labeled data. In this paper …

Cited by 125

Multimodal clustering networks for self-supervised learning from unlabeled videos

B Chen, A Rouditchenko, K Duarte… - Proceedings of the …, 2021 - openaccess.thecvf.com

Multimodal self-supervised learning is getting more and more attention as it allows not only
to train large networks without human supervision but also to search and retrieve data across …

Cited by 88

CMKD: CNN/Transformer-based cross-model knowledge distillation for audio classification

Y Gong, S Khurana, A Rouditchenko… - arXiv preprint arXiv …, 2022 - arxiv.org

Audio classification is an active research area with a wide range of applications. Over the
past decade, convolutional neural networks (CNNs) have been the de-facto standard building …

Cited by 32

Cross-modal discrete representation learning

AH Liu, SY Jin, CIJ Lai, A Rouditchenko, A Oliva… - arXiv preprint arXiv …, 2021 - arxiv.org

Recent advances in representation learning have demonstrated an ability to represent
information from different modalities such as video, text, and audio in a single high-level …

Cited by 42

Comparison of multilingual self-supervised and weakly-supervised speech pre-training for adaptation to unseen languages

A Rouditchenko, S Khurana, S Thomas, R Feris… - arXiv preprint arXiv …, 2023 - arxiv.org

Recent models such as XLS-R and Whisper have made multilingual speech technologies
more accessible by pre-training on audio from around 100 spoken languages each. However, …

Cited by 11

AV-CPL: Continuous pseudo-labeling for audio-visual speech recognition

A Rouditchenko, R Collobert… - arXiv preprint arXiv …, 2023 - arxiv.org

Audio-visual speech contains synchronized audio and visual information that provides cross-modal
supervision to learn representations for both automatic speech recognition (ASR) …

Cited by 1
