Cross-modal discrete representation learning

AH Liu, SY Jin, CIJ Lai, A Rouditchenko, A Oliva… - arXiv preprint arXiv …, 2021 - arxiv.org
Recent advances in representation learning have demonstrated an ability to represent
information from different modalities such as video, text, and audio in a single high-level
embedding vector. In this work, we present a self-supervised learning framework that learns representations capturing finer levels of granularity across different modalities, such as concepts or events represented by visual objects or spoken words. Our framework
relies on a discretized embedding space created via vector quantization that is shared across different modalities. Beyond the shared embedding space, we propose a Cross-Modal Code …
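The core idea — snapping continuous, modality-specific embeddings onto a single discrete codebook shared across modalities — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `quantize`, the codebook size, and the toy embeddings are all assumptions for the example.

```python
import numpy as np

def quantize(embeddings, codebook):
    """Map each continuous embedding to its nearest codebook vector.

    embeddings: (n, d) array of modality-specific encoder outputs
    codebook:   (k, d) array of discrete code vectors, shared across modalities
    Returns (codes, quantized): integer code indices and the snapped vectors.
    """
    # Squared Euclidean distance from every embedding to every code vector
    dists = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1)  # nearest-neighbor assignment
    return codes, codebook[codes]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))    # k=8 codes of dimension d=4 (illustrative)
video_emb = rng.normal(size=(5, 4))   # e.g. frame-level visual features
audio_emb = rng.normal(size=(5, 4))   # e.g. spoken-word audio features

# Both modalities are quantized against the SAME codebook, so a concept that
# appears visually and is also spoken can land on the same discrete code.
v_codes, v_quant = quantize(video_emb, codebook)
a_codes, a_quant = quantize(audio_emb, codebook)
```

In a trainable version (as in standard VQ approaches), the argmin is non-differentiable, so gradients are typically passed through with a straight-through estimator and the codebook is updated with a commitment-style loss; the sketch above shows only the shared discrete assignment step.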
