User profiles for Andrew Rouditchenko
Andrew Rouditchenko, PhD Student at MIT CSAIL. Verified email at mit.edu. Cited by 1330
The sound of pixels
We introduce PixelPlayer, a system that, by leveraging large amounts of unlabeled videos,
learns to locate image regions which produce sounds and separate the input sounds into a …
Everything at once - multi-modal fusion transformer for video retrieval
…, B Chen, A Rouditchenko… - Proceedings of the …, 2022 - openaccess.thecvf.com
Multi-modal learning from video data has seen increased attention recently as it allows
training of semantically meaningful embeddings without human annotation, enabling tasks like …
AVLnet: Learning audio-visual language representations from instructional videos
Current methods for learning visually grounded language from videos often rely on text
annotation, such as human generated captions or machine generated automatic speech …
Contrastive audio-visual masked autoencoder
In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single
modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio-…
Self-supervised audio-visual co-segmentation
Segmenting objects in images and separating sound sources in audio are challenging tasks,
in part because traditional approaches require large amounts of labeled data. In this paper …
Multimodal clustering networks for self-supervised learning from unlabeled videos
Multimodal self-supervised learning is getting more and more attention as it allows not only
to train large networks without human supervision but also to search and retrieve data across …
CMKD: CNN/Transformer-based cross-model knowledge distillation for audio classification
Audio classification is an active research area with a wide range of applications. Over the
past decade, convolutional neural networks (CNNs) have been the de-facto standard building …
Cross-modal discrete representation learning
Recent advances in representation learning have demonstrated an ability to represent
information from different modalities such as video, text, and audio in a single high-level …
Comparison of multilingual self-supervised and weakly-supervised speech pre-training for adaptation to unseen languages
Recent models such as XLS-R and Whisper have made multilingual speech technologies
more accessible by pre-training on audio from around 100 spoken languages each. However, …
AV-CPL: Continuous pseudo-labeling for audio-visual speech recognition
A Rouditchenko, R Collobert… - arXiv preprint arXiv …, 2023 - arxiv.org
Audio-visual speech contains synchronized audio and visual information that provides cross-modal
supervision to learn representations for both automatic speech recognition (ASR) …