Showing 1–50 of 187 results for author: Schmid, C

  1. arXiv:2407.13579  [pdf, other]

    cs.CL

    Towards Zero-Shot Multimodal Machine Translation

    Authors: Matthieu Futeral, Cordelia Schmid, Benoît Sagot, Rachel Bawden

    Abstract: Current multimodal machine translation (MMT) systems rely on fully supervised data (i.e., models are trained on sentences with their translations and accompanying images). However, this type of data is costly to collect, limiting the extension of MMT to other language pairs for which such data does not exist. In this work, we propose a method to bypass the need for fully supervised data to train MMT…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: Preprint. Under review

  2. arXiv:2407.10910  [pdf, other]

    cs.CV cs.LG

    DataDream: Few-shot Guided Dataset Generation

    Authors: Jae Myung Kim, Jessica Bader, Stephan Alaniz, Cordelia Schmid, Zeynep Akata

    Abstract: While text-to-image diffusion models have been shown to achieve state-of-the-art results in image synthesis, they have yet to prove their effectiveness in downstream applications. Previous work has proposed to generate data for image classifier training given limited real data access. However, these methods struggle to generate in-distribution images or depict fine-grained features, thereby hinder…

    Submitted 16 July, 2024; v1 submitted 15 July, 2024; originally announced July 2024.

    Comments: Accepted to ECCV 2024

  3. arXiv:2406.08707  [pdf, other]

    cs.CL cs.CV

    mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus

    Authors: Matthieu Futeral, Armel Zebaze, Pedro Ortiz Suarez, Julien Abadji, Rémi Lacroix, Cordelia Schmid, Rachel Bawden, Benoît Sagot

    Abstract: Multimodal Large Language Models (mLLMs) are trained on a large amount of text-image data. While most mLLMs are trained on caption-like data only, Alayrac et al. [2022] showed that additionally training them on interleaved sequences of text and images can lead to the emergence of in-context learning capabilities. However, the dataset they used, M3W, is not public and is only in English. There have…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Preprint. Under review

  4. arXiv:2405.17151  [pdf, other]

    cs.LG

    Smoke and Mirrors in Causal Downstream Tasks

    Authors: Riccardo Cadei, Lukas Lindorfer, Sylvia Cremer, Cordelia Schmid, Francesco Locatello

    Abstract: Machine Learning and AI have the potential to transform data-driven scientific discovery, enabling accurate predictions for several scientific phenomena. As many scientific questions are inherently causal, this paper looks at the causal inference task of treatment effect estimation, where we assume binary effects that are recorded as high-dimensional images in a Randomized Controlled Trial (RCT).…

    Submitted 27 May, 2024; originally announced May 2024.

  5. arXiv:2404.17498  [pdf, other]

    cs.CV

    Learning text-to-video retrieval from image captioning

    Authors: Lucas Ventura, Cordelia Schmid, Gül Varol

    Abstract: We describe a protocol to study text-to-video retrieval training with unlabeled videos, where we assume (i) no access to labels for any videos, i.e., no access to the set of ground-truth captions, but (ii) access to labeled images in the form of text. Using image expert models is a realistic scenario given that annotating images is cheaper and therefore scalable, in contrast to expensive video labelin…

    Submitted 26 April, 2024; originally announced April 2024.

    Comments: A short version of this work appeared at CVPR 2023 Workshops. Project page: https://imagine.enpc.fr/~ventural/multicaps/

  6. arXiv:2404.15709  [pdf, other]

    cs.CV cs.LG cs.RO

    ViViDex: Learning Vision-based Dexterous Manipulation from Human Videos

    Authors: Zerui Chen, Shizhe Chen, Cordelia Schmid, Ivan Laptev

    Abstract: In this work, we aim to learn a unified vision-based policy for a multi-fingered robot hand to manipulate different objects in diverse poses. Though prior work has demonstrated that human videos can benefit policy learning, performance improvement has been limited by physically implausible trajectories extracted from videos. Moreover, reliance on privileged object information such as ground-truth…

    Submitted 24 April, 2024; originally announced April 2024.

    Comments: Project Page: https://zerchen.github.io/projects/vividex.html

  7. arXiv:2404.06511  [pdf, other]

    cs.CV cs.AI cs.LG

    MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

    Authors: Juhong Min, Shyamal Buch, Arsha Nagrani, Minsu Cho, Cordelia Schmid

    Abstract: This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However, through a simple and effective baseline, we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus, unlike tradit…

    Submitted 9 April, 2024; originally announced April 2024.

    Comments: CVPR 2024

  8. arXiv:2404.03924  [pdf, other]

    cs.CV

    Learning Correlation Structures for Vision Transformers

    Authors: Manjin Kim, Paul Hongsuck Seo, Cordelia Schmid, Minsu Cho

    Abstract: We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention. StructSA generates attention maps by recognizing space-time structures of key-query correlations via convolution and uses them to dynamically aggregate local contexts of value features. This effectively leverages ri…

    Submitted 5 April, 2024; originally announced April 2024.

    Comments: Accepted to CVPR 2024

  9. arXiv:2404.01491  [pdf, other]

    cs.CV

    SUGAR: Pre-training 3D Visual Representations for Robotics

    Authors: Shizhe Chen, Ricardo Garcia, Ivan Laptev, Cordelia Schmid

    Abstract: Learning generalizable visual representations from Internet data has yielded promising results for robotics. Yet, prevailing approaches focus on pre-training 2D representations, being sub-optimal to deal with occlusions and accurately localize objects in complex 3D scenes. Meanwhile, 3D representation learning has been limited to single-object understanding. To address these limitations, we introd…

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: Accepted to CVPR 2024. Project webpage: https://cshizhe.github.io/projects/robot_sugar.html

  10. arXiv:2404.01297  [pdf, other]

    cs.CV

    Streaming Dense Video Captioning

    Authors: Xingyi Zhou, Anurag Arnab, Shyamal Buch, Shen Yan, Austin Myers, Xuehan Xiong, Arsha Nagrani, Cordelia Schmid

    Abstract: An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should be able to handle long input videos, predict rich, detailed textual descriptions, and be able to produce outputs before processing the entire video. Current state-of-the-art models, however, process a fixed number of downsampled frames, and make a single full prediction after seeing the whole…

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: CVPR 2024. Code is available at https://github.com/google-research/scenic/tree/main/scenic/projects/streaming_dvc

  11. arXiv:2403.02041  [pdf, other]

    cs.CV

    A Generative Approach for Wikipedia-Scale Visual Entity Recognition

    Authors: Mathilde Caron, Ahmet Iscen, Alireza Fathi, Cordelia Schmid

    Abstract: In this paper, we address web-scale visual entity recognition, specifically the task of mapping a given query image to one of the 6 million existing entities in Wikipedia. One way of approaching a problem of such scale is using dual-encoder models (e.g., CLIP), where all the entity names and query images are embedded into a unified space, paving the way for an approximate k-NN search. Alternatively,…

    Submitted 21 March, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

    Comments: CVPR 2024

  12. arXiv:2403.01248  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    SceneCraft: An LLM Agent for Synthesizing 3D Scene as Blender Code

    Authors: Ziniu Hu, Ahmet Iscen, Aashi Jain, Thomas Kipf, Yisong Yue, David A. Ross, Cordelia Schmid, Alireza Fathi

    Abstract: This paper introduces SceneCraft, a Large Language Model (LLM) Agent converting text descriptions into Blender-executable Python scripts which render complex scenes with up to a hundred 3D assets. This process requires complex spatial planning and arrangement. We tackle these challenges through a combination of advanced abstraction, strategic planning, and library learning. SceneCraft first models…

    Submitted 2 March, 2024; originally announced March 2024.

  13. arXiv:2402.02887  [pdf, other]

    cs.CV cs.LG

    Time-, Memory- and Parameter-Efficient Visual Adaptation

    Authors: Otniel-Bogdan Mercea, Alexey Gritsenko, Cordelia Schmid, Anurag Arnab

    Abstract: As foundation models become more popular, there is a growing need to efficiently finetune them for downstream tasks. Although numerous adaptation methods have been proposed, they are designed to be efficient only in terms of how many parameters are trained. They, however, typically still require backpropagating gradients throughout the model, meaning that their training-time and -memory cost does…

    Submitted 5 February, 2024; originally announced February 2024.

  14. arXiv:2401.06035  [pdf, other]

    cs.CV cs.LG

    RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks

    Authors: Partha Ghosh, Soubhik Sanyal, Cordelia Schmid, Bernhard Schölkopf

    Abstract: We present a novel unconditional video generative model designed to address long-term spatial and temporal dependencies. To capture these dependencies, our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks developed for three-dimensional object representation and employs a singular latent code to model an entire video sequence. Ind…

    Submitted 11 January, 2024; originally announced January 2024.

  15. arXiv:2312.09237  [pdf, other]

    cs.CV

    Pixel Aligned Language Models

    Authors: Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun, Xiaolong Wang, Cordelia Schmid

    Abstract: Large language models have achieved great success in recent years, as have their variants in vision. Existing vision-language models can describe images in natural languages, answer visual-related questions, or perform complex reasoning about the image. However, it is yet unclear how localization tasks, such as word grounding or referring localization, can be performed using large language models. I…

    Submitted 14 December, 2023; originally announced December 2023.

    Comments: Project page: https://jerryxu.net/PixelLLM

  16. arXiv:2312.00786  [pdf, other]

    cs.CV

    Dense Optical Tracking: Connecting the Dots

    Authors: Guillaume Le Moing, Jean Ponce, Cordelia Schmid

    Abstract: Recent approaches to point tracking are able to recover the trajectory of any scene point through a large portion of a video despite the presence of occlusions. They are, however, too slow in practice to track every point observed in a single frame in a reasonable amount of time. This paper introduces DOT, a novel, simple and efficient method for solving this problem. It first extracts a small set…

    Submitted 4 March, 2024; v1 submitted 1 December, 2023; originally announced December 2023.

    Comments: Accepted to CVPR 2024

  17. arXiv:2309.15596  [pdf, other]

    cs.RO cs.CV

    PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation

    Authors: Shizhe Chen, Ricardo Garcia, Cordelia Schmid, Ivan Laptev

    Abstract: The ability for robots to comprehend and execute manipulation tasks based on natural language instructions is a long-term goal in robotics. The dominant approaches for language-guided manipulation use 2D image representations, which face difficulties in combining multi-view cameras and inferring precise 3D positions and relationships. To address these limitations, we propose a 3D point cloud based…

    Submitted 27 September, 2023; originally announced September 2023.

    Comments: Accepted to CoRL 2023. Project website: https://www.di.ens.fr/willow/research/polarnet/

  18. arXiv:2309.13952  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    VidChapters-7M: Video Chapters at Scale

    Authors: Antoine Yang, Arsha Nagrani, Ivan Laptev, Josef Sivic, Cordelia Schmid

    Abstract: Segmenting long videos into chapters enables users to quickly navigate to the information of their interest. This important topic has been understudied due to the lack of publicly released datasets. To address this issue, we present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total. VidChapters-7M is automatically created from videos online in a scalable manner…

    Submitted 25 September, 2023; originally announced September 2023.

    Comments: Accepted at NeurIPS 2023 Track on Datasets and Benchmarks; Project Webpage: https://antoyang.github.io/vidchapters.html ; 31 pages; 8 figures

  19. CoVR: Learning Composed Video Retrieval from Web Video Captions

    Authors: Lucas Ventura, Antoine Yang, Cordelia Schmid, Gül Varol

    Abstract: Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expens…

    Submitted 30 May, 2024; v1 submitted 28 August, 2023; originally announced August 2023.

    Comments: AAAI 2024. Updated the results on CIRR with the correct evaluation. Project page: https://imagine.enpc.fr/~ventural/covr/

  20. arXiv:2308.12965  [pdf, other]

    cs.CV

    POCO: 3D Pose and Shape Estimation with Confidence

    Authors: Sai Kumar Dwivedi, Cordelia Schmid, Hongwei Yi, Michael J. Black, Dimitrios Tzionas

    Abstract: The regression of 3D Human Pose and Shape (HPS) from an image is becoming increasingly accurate. This makes the results useful for downstream tasks like human action recognition or 3D graphics. Yet, no regressor is perfect, and accuracy can be affected by ambiguous image evidence or by poses and appearance that are unseen during training. Most current HPS regressors, however, do not report the con…

    Submitted 24 August, 2023; originally announced August 2023.

  21. arXiv:2308.11062  [pdf, other]

    cs.CV cs.LG

    UnLoc: A Unified Framework for Video Localization Tasks

    Authors: Shen Yan, Xuehan Xiong, Arsha Nagrani, Anurag Arnab, Zhonghao Wang, Weina Ge, David Ross, Cordelia Schmid

    Abstract: While large-scale image-text pretrained models such as CLIP have been used for multiple video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos is still a relatively unexplored task. We design a new approach for this called UnLoc, which uses pretrained image and text towers, and feeds tokens to a video-text fusion model. The outputs of the fusion module are then…

    Submitted 21 August, 2023; originally announced August 2023.

    Comments: ICCV 2023

  22. arXiv:2308.05602  [pdf, other]

    cs.CV cs.RO

    Object Goal Navigation with Recursive Implicit Maps

    Authors: Shizhe Chen, Thomas Chabal, Ivan Laptev, Cordelia Schmid

    Abstract: Object goal navigation aims to navigate an agent to locations of a given object category in unseen environments. Classical methods explicitly build maps of environments and require extensive engineering while lacking semantic information for object-oriented exploration. On the other hand, end-to-end learning methods alleviate manual map design and predict actions using implicit representations. Su…

    Submitted 10 August, 2023; originally announced August 2023.

    Comments: Accepted to IROS 2023

  23. arXiv:2307.15320  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    Robust Visual Sim-to-Real Transfer for Robotic Manipulation

    Authors: Ricardo Garcia, Robin Strudel, Shizhe Chen, Etienne Arlaud, Ivan Laptev, Cordelia Schmid

    Abstract: Learning visuomotor policies in simulation is much safer and cheaper than in the real world. However, due to discrepancies between the simulated and real data, simulator-trained policies often fail when transferred to real robots. One common approach to bridge the visual sim-to-real domain gap is domain randomization (DR). While previous work mainly evaluates DR for disembodied tasks, such as pose…

    Submitted 28 July, 2023; originally announced July 2023.

  24. arXiv:2307.08506  [pdf, other]

    cs.CV cs.AI cs.LG

    Does Visual Pretraining Help End-to-End Reasoning?

    Authors: Chen Sun, Calvin Luo, Xingyi Zhou, Anurag Arnab, Cordelia Schmid

    Abstract: We aim to investigate whether end-to-end learning of visual reasoning can be achieved with general-purpose neural networks, with the help of visual pretraining. A positive result would refute the common belief that explicit visual abstraction (e.g. object detection) is essential for compositional generalization on visual reasoning, and confirm the feasibility of a neural network "generalist" to so…

    Submitted 15 December, 2023; v1 submitted 17 July, 2023; originally announced July 2023.

    Comments: NeurIPS 2023

  25. arXiv:2306.11729  [pdf, other]

    cs.CV

    Dense Video Object Captioning from Disjoint Supervision

    Authors: Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid

    Abstract: We propose a new task and model for dense video object captioning -- detecting, tracking and captioning trajectories of objects in a video. This task unifies spatial and temporal localization in video, whilst also requiring fine-grained visual understanding that is best described by natural language. We propose a unified model, and demonstrate how our end-to-end approach is more accurate and tempo…

    Submitted 9 April, 2024; v1 submitted 20 June, 2023; originally announced June 2023.

    Comments: Code is available at https://github.com/google-research/scenic/tree/main/scenic/projects/densevoc

  26. arXiv:2306.11726  [pdf, other]

    cs.CV

    How can objects help action recognition?

    Authors: Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid

    Abstract: Current state-of-the-art video models process a video clip as a long sequence of spatio-temporal tokens. However, they do not explicitly model objects, their interactions across the video, and instead process all the tokens in the video. In this paper, we investigate how we can use knowledge of objects to design better video models, namely to process fewer tokens and to improve recognition accurac…

    Submitted 20 June, 2023; originally announced June 2023.

    Comments: CVPR 2023

  27. arXiv:2306.08129  [pdf, other]

    cs.CV cs.AI cs.CL

    AVIS: Autonomous Visual Information Seeking with Large Language Model Agent

    Authors: Ziniu Hu, Ahmet Iscen, Chen Sun, Kai-Wei Chang, Yizhou Sun, David A Ross, Cordelia Schmid, Alireza Fathi

    Abstract: In this paper, we propose an autonomous information seeking visual question answering framework, AVIS. Our method leverages a Large Language Model (LLM) to dynamically strategize the utilization of external tools and to investigate their outputs, thereby acquiring the indispensable knowledge needed to provide answers to the posed questions. Responding to visual questions that necessitate external…

    Submitted 2 November, 2023; v1 submitted 13 June, 2023; originally announced June 2023.

    Comments: Published at NeurIPS 2023

  28. arXiv:2306.07282  [pdf, other]

    cs.CV cs.LG

    Waffling around for Performance: Visual Classification with Random Words and Broad Concepts

    Authors: Karsten Roth, Jae Myung Kim, A. Sophia Koepke, Oriol Vinyals, Cordelia Schmid, Zeynep Akata

    Abstract: The visual classification performance of vision-language models such as CLIP has been shown to benefit from additional semantic knowledge from large language models (LLMs) such as GPT-3. In particular, averaging over LLM-generated class descriptors, e.g. "waffle, which has a round shape", can notably improve generalization performance. In this work, we critically study this behavior and propose Wa…

    Submitted 16 August, 2023; v1 submitted 12 June, 2023; originally announced June 2023.

    Comments: Accepted to ICCV 2023. Main paper with 9 pages

  29. arXiv:2306.07196  [pdf, other]

    cs.CV

    Retrieval-Enhanced Contrastive Vision-Text Models

    Authors: Ahmet Iscen, Mathilde Caron, Alireza Fathi, Cordelia Schmid

    Abstract: Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art systems. While they excel at recognizing common generic concepts, they still struggle on fine-grained entities which are rare, or even absent from the pre-training dataset. Hence, a key ingredient to their success has been the use of large-scale curated pre-training data aiming at expanding the set of conc…

    Submitted 21 February, 2024; v1 submitted 12 June, 2023; originally announced June 2023.

  30. arXiv:2306.05392  [pdf, other]

    cs.CL

    Modular Visual Question Answering via Code Generation

    Authors: Sanjay Subramanian, Medhini Narasimhan, Kushal Khangaonkar, Kevin Yang, Arsha Nagrani, Cordelia Schmid, Andy Zeng, Trevor Darrell, Dan Klein

    Abstract: We present a framework that formulates visual question answering as modular code generation. In contrast to prior work on modular approaches to VQA, our approach requires no additional training and relies on pre-trained language models (LMs), visual models pre-trained on image-caption pairs, and fifty VQA examples used for in-context learning. The generated Python programs invoke and compose the o…

    Submitted 8 June, 2023; originally announced June 2023.

    Comments: ACL 2023

  31. arXiv:2305.06289  [pdf, other]

    cs.RO cs.CV cs.LG

    Learning Video-Conditioned Policies for Unseen Manipulation Tasks

    Authors: Elliot Chane-Sane, Cordelia Schmid, Ivan Laptev

    Abstract: The ability to specify robot commands by a non-expert user is critical for building generalist agents capable of solving a large variety of tasks. One convenient way to specify the intended robot goal is by a video of a person demonstrating the target task. While prior work typically aims to imitate human demonstrations performed in robot environments, here we focus on a more realistic and challen…

    Submitted 10 May, 2023; originally announced May 2023.

    Comments: ICRA 2023. See the project webpage at https://www.di.ens.fr/willow/research/vip/

  32. arXiv:2304.12160  [pdf, other]

    cs.CV

    End-to-End Spatio-Temporal Action Localisation with Video Transformers

    Authors: Alexey Gritsenko, Xuehan Xiong, Josip Djolonga, Mostafa Dehghani, Chen Sun, Mario Lučić, Cordelia Schmid, Anurag Arnab

    Abstract: The most performant spatio-temporal action localisation models use external person proposals and complex external memory banks. We propose a fully end-to-end, purely-transformer based model that directly ingests an input video, and outputs tubelets -- a sequence of bounding boxes and the action classes at each frame. Our flexible model can be trained with either sparse bounding-box supervision on…

    Submitted 24 April, 2023; originally announced April 2023.

  33. arXiv:2304.11970  [pdf, other]

    cs.CV

    gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction

    Authors: Zerui Chen, Shizhe Chen, Cordelia Schmid, Ivan Laptev

    Abstract: Signed distance functions (SDFs) are an attractive framework that has recently shown promising results for 3D shape reconstruction from images. SDFs seamlessly generalize to different shape resolutions and topologies but lack explicit modelling of the underlying 3D geometry. In this work, we exploit the hand structure and use it as guidance for SDF-based shape reconstruction. In particular, we addr…

    Submitted 24 April, 2023; originally announced April 2023.

    Comments: Accepted by CVPR 2023. Project Page: https://zerchen.github.io/projects/gsdf.html

  34. arXiv:2304.06708  [pdf, other]

    cs.CV cs.AI cs.CL

    Verbs in Action: Improving verb understanding in video-language models

    Authors: Liliane Momeni, Mathilde Caron, Arsha Nagrani, Andrew Zisserman, Cordelia Schmid

    Abstract: Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time. Recently, state-of-the-art video-language models based on CLIP have been shown to have limited verb understanding and to rely extensively on nouns, restricting their performance in real-world video applications that require action and temporal understanding. In th…

    Submitted 13 April, 2023; originally announced April 2023.

  35. arXiv:2304.06372  [pdf, other]

    cs.RO

    Contact Models in Robotics: a Comparative Analysis

    Authors: Quentin Le Lidec, Wilson Jallet, Louis Montaut, Ivan Laptev, Cordelia Schmid, Justin Carpentier

    Abstract: Physics simulation is ubiquitous in robotics. Whether in model-based approaches (e.g., trajectory optimization), or model-free algorithms (e.g., reinforcement learning), physics simulators are a central component of modern control pipelines in robotics. Over the past decades, several robotic simulators have been developed, each with dedicated contact modeling assumptions and algorithmic solutions.…

    Submitted 21 June, 2024; v1 submitted 13 April, 2023; originally announced April 2023.

  36. arXiv:2304.05173  [pdf, other]

    cs.CV cs.LG

    Improving Image Recognition by Retrieving from Web-Scale Image-Text Data

    Authors: Ahmet Iscen, Alireza Fathi, Cordelia Schmid

    Abstract: Retrieval augmented models are becoming increasingly popular for computer vision tasks after their recent success in NLP problems. The goal is to enhance the recognition capabilities of the model by retrieving similar examples for the visual input from an external memory set. In this work, we introduce an attention-based memory module, which learns the importance of each retrieved example from the…

    Submitted 11 April, 2023; originally announced April 2023.

    Comments: Accepted to CVPR 2023

  37. arXiv:2304.03391  [pdf, other]

    cs.CV

    Exposing and Mitigating Spurious Correlations for Cross-Modal Retrieval

    Authors: Jae Myung Kim, A. Sophia Koepke, Cordelia Schmid, Zeynep Akata

    Abstract: Cross-modal retrieval methods are the preferred tool to search databases for the text that best matches a query image and vice versa. However, image-text retrieval models commonly learn to memorize spurious correlations in the training data, such as frequent object co-occurrence, instead of looking at the actual underlying reasons for the prediction in the image. For image-text retrieval, this man…

    Submitted 6 April, 2023; originally announced April 2023.

    Comments: CVPR'23 MULA Workshop

  38. arXiv:2304.01804  [pdf, other]

    cs.CV

    Bridging the Gap between Model Explanations in Partially Annotated Multi-label Classification

    Authors: Youngwook Kim, Jae Myung Kim, Jieun Jeong, Cordelia Schmid, Zeynep Akata, Jungwoo Lee

    Abstract: Due to the expensive costs of collecting labels in multi-label classification datasets, partially annotated multi-label classification has become an emerging field in computer vision. One baseline approach to this task is to assume unobserved labels as negative labels, but this assumption induces label noise as a form of false negative. To understand the negative impact caused by false negative la…

    Submitted 4 April, 2023; originally announced April 2023.

    Comments: CVPR 2023 Camera-ready

  39. arXiv:2303.16501  [pdf, other]

    cs.CV cs.SD eess.AS

    AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR

    Authors: Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

    Abstract: Audiovisual automatic speech recognition (AV-ASR) aims to improve the robustness of a speech recognition system by incorporating visual information. Training fully supervised multimodal models for this task from scratch, however, is limited by the need for large labelled audiovisual datasets (in each downstream domain of interest). We present AVFormer, a simple method for augmenting audio-only mode…

    Submitted 29 March, 2023; originally announced March 2023.

    Comments: CVPR 2023

  40. arXiv:2302.14115  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

    Authors: Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, Cordelia Schmid

    Abstract: In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos which are readily-available at scale. The Vid2Seq architecture augments a language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. Such a unified model requires large-scale training data, w…

    Submitted 21 March, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

    Comments: CVPR 2023 Camera-Ready; Project Webpage: https://antoyang.github.io/vid2seq.html ; 18 pages; 6 figures

  41. Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation

    Authors: Matthieu Futeral, Cordelia Schmid, Ivan Laptev, Benoît Sagot, Rachel Bawden

    Abstract: One of the major challenges of machine translation (MT) is ambiguity, which can in some cases be resolved by accompanying context such as images. However, recent work in multimodal MT (MMT) has shown that obtaining improvements from images is challenging, limited not only by the difficulty of building effective cross-modal representations, but also by the lack of specific evaluation and training d…

    Submitted 26 May, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: Accepted to ACL 2023

  42. arXiv:2212.05922  [pdf, other]

    cs.CV cs.SD

    Audiovisual Masked Autoencoders

    Authors: Mariana-Iuliana Georgescu, Eduardo Fonseca, Radu Tudor Ionescu, Mario Lucic, Cordelia Schmid, Anurag Arnab

    Abstract: Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisu…

    Submitted 4 January, 2024; v1 submitted 9 December, 2022; originally announced December 2022.

    Comments: ICCV 2023

  43. arXiv:2212.05221  [pdf, other]

    cs.CV cs.AI

    REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory

    Authors: Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A. Ross, Alireza Fathi

    Abstract: In this paper, we propose an end-to-end Retrieval-Augmented Visual Language Model (REVEAL) that learns to encode world knowledge into a large-scale memory, and to retrieve from it to answer knowledge-intensive queries. REVEAL consists of four key components: the memory, the encoder, the retriever and the generator. The large-scale memory encodes various sources of multimodal world knowledge (e.g.…

    Submitted 3 April, 2023; v1 submitted 10 December, 2022; originally announced December 2022.

    Comments: Published at CVPR 2023

  44. arXiv:2212.02400  [pdf, other]

    cs.CV

    Location-Aware Self-Supervised Transformers for Semantic Segmentation

    Authors: Mathilde Caron, Neil Houlsby, Cordelia Schmid

    Abstract: Pixel-level labels are particularly expensive to acquire. Hence, pretraining is a critical step to improve models on a task like semantic segmentation. However, prominent algorithms for pretraining neural networks use image-level objectives, e.g. image classification, image-text alignment à la CLIP, or self-supervised contrastive learning. These objectives do not model spatial information, which m…

    Submitted 15 March, 2023; v1 submitted 5 December, 2022; originally announced December 2022.

  45. arXiv:2211.14308  [pdf, other]

    cs.CV

    WALDO: Future Video Synthesis using Object Layer Decomposition and Parametric Flow Prediction

    Authors: Guillaume Le Moing, Jean Ponce, Cordelia Schmid

    Abstract: This paper presents WALDO (WArping Layer-Decomposed Objects), a novel approach to the prediction of future video frames from past ones. Individual images are decomposed into multiple layers combining object masks and a small set of control points. The layer structure is shared across all frames in each video to build dense inter-frame connections. Complex scene motions are modeled by combining par…

    Submitted 29 August, 2023; v1 submitted 25 November, 2022; originally announced November 2022.

    Comments: Accepted to ICCV 2023

  46. arXiv:2211.09966  [pdf, ps, other]

    cs.CV cs.MM cs.SD eess.AS eess.IV

    AVATAR submission to the Ego4D AV Transcription Challenge

    Authors: Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

    Abstract: In this report, we describe our submission to the Ego4D AudioVisual (AV) Speech Transcription Challenge 2022. Our pipeline is based on AVATAR, a state-of-the-art encoder-decoder model for AV-ASR that performs early fusion of spectrograms and RGB images. We describe the datasets, experimental settings and ablations. Our final method achieves a WER of 68.40 on the challenge test set, outperforming t…

    Submitted 17 November, 2022; originally announced November 2022.

  47. arXiv:2211.09646  [pdf, other]

    cs.CV

    Language Conditioned Spatial Relation Reasoning for 3D Object Grounding

    Authors: Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

    Abstract: Localizing objects in 3D scenes based on natural language requires understanding and reasoning about spatial relations. In particular, it is often crucial to distinguish similar objects referred to by the text, such as "the left most chair" and "a chair next to the window". In this work we propose a language-conditioned transformer model for grounding 3D objects and their spatial relations. To this e…

    Submitted 17 November, 2022; originally announced November 2022.

    Comments: Accepted at NeurIPS 2022; Project website: https://cshizhe.github.io/projects/vil3dref.html

  48. arXiv:2211.09019  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    Learning Reward Functions for Robotic Manipulation by Observing Humans

    Authors: Minttu Alakuijala, Gabriel Dulac-Arnold, Julien Mairal, Jean Ponce, Cordelia Schmid

    Abstract: Observing a human demonstrator manipulate objects provides a rich, scalable and inexpensive source of data for learning robotic policies. However, transferring skills from human videos to a robotic manipulator poses several challenges, not least a difference in action and observation spaces. In this work, we use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-…

    Submitted 7 March, 2023; v1 submitted 16 November, 2022; originally announced November 2022.

  49. arXiv:2210.04485  [pdf, other]

    cs.CV cs.LG

    A Memory Transformer Network for Incremental Learning

    Authors: Ahmet Iscen, Thomas Bird, Mathilde Caron, Alireza Fathi, Cordelia Schmid

    Abstract: We study class-incremental learning, a training setup in which new classes of data are observed over time for the model to learn from. Despite the straightforward problem formulation, the naive application of classification models to class-incremental learning results in the "catastrophic forgetting" of previously seen classes. One of the most successful existing methods has been the use of a memo…

    Submitted 10 October, 2022; originally announced October 2022.

  50. arXiv:2209.09006  [pdf, other]

    cs.RO cs.LG

    Enforcing the consensus between Trajectory Optimization and Policy Learning for precise robot control

    Authors: Quentin Le Lidec, Wilson Jallet, Ivan Laptev, Cordelia Schmid, Justin Carpentier

    Abstract: Reinforcement learning (RL) and trajectory optimization (TO) present strong complementary advantages. On one hand, RL approaches are able to learn global control policies directly from data, but generally require large sample sizes to properly converge towards feasible policies. On the other hand, TO methods are able to exploit gradient-based information extracted from simulators to quickly conver…

    Submitted 16 February, 2023; v1 submitted 19 September, 2022; originally announced September 2022.