-
BootsTAP: Bootstrapped Training for Tracking-Any-Point
Authors:
Carl Doersch,
Pauline Luc,
Yi Yang,
Dilara Gokay,
Skanda Koppula,
Ankush Gupta,
Joseph Heyward,
Ignacio Rocco,
Ross Goroshin,
João Carreira,
Andrew Zisserman
Abstract:
To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes. This can be formalized as Tracking-Any-Point (TAP), which requires the algorithm to track any point on solid surfaces in a video, potentially densely in space and time. Large-scale ground-truth training data for TAP is only available in simulation, which currently has a limited variety of objects and motion. In this work, we demonstrate how large-scale, unlabeled, uncurated real-world data can improve a TAP model with minimal architectural changes, using a self-supervised student-teacher setup. We demonstrate state-of-the-art performance on the TAP-Vid benchmark, surpassing previous results by a wide margin: for example, TAP-Vid-DAVIS performance improves from 61.3% to 67.4%, and TAP-Vid-Kinetics from 57.2% to 62.5%. For visualizations, see our project webpage at https://bootstap.github.io/
Submitted 23 May, 2024; v1 submitted 1 February, 2024;
originally announced February 2024.
-
Making the Most of What You Have: Adapting Pre-trained Visual Language Models in the Low-data Regime
Authors:
Chuhan Zhang,
Antoine Miech,
Jiajun Shen,
Jean-Baptiste Alayrac,
Pauline Luc
Abstract:
Large-scale visual language models are widely used as pre-trained models and then adapted for various downstream tasks. While humans are known to efficiently learn new tasks from a few examples, deep learning models struggle with adaptation from few examples. In this work, we look into task adaptation in the low-data regime, and provide a thorough study of the existing adaptation methods for generative Visual Language Models. We also show important benefits of self-labelling, i.e. using the model's own predictions to self-improve when a larger number of unlabelled images from the same distribution is available. Our study demonstrates significant gains using our proposed task adaptation pipeline across a wide range of visual language tasks such as visual classification (ImageNet), visual captioning (COCO), detailed visual captioning (Localised Narratives) and visual question answering (VQAv2).
Submitted 3 May, 2023;
originally announced May 2023.
-
A Particle Finite Element Method based on Level Set functions
Authors:
Fernández Eduardo,
Février Simon,
Lacroix Martin,
Boman Romain,
Papeleux Luc,
Ponthot Jean-Philippe
Abstract:
Since the seminal work of Idelsohn, Oñate and Del-Pin (2004), the Particle Finite Element Method (PFEM) has relied on a Delaunay triangulation and the Alpha-Shape (AS) algorithm in the remeshing process. This approach guarantees a good quality of the Lagrangian mesh, but introduces a list of shortcomings that demand geometrical treatments tailored to each problem. In order to improve the remeshing process in PFEM, this work proposes the use of a Level-Set (LS) function instead of the Alpha-Shape algorithm. Since the Level-Set considers the boundary of the fluid and its interior, and not only a geometric criterion as does the Alpha-Shape, the proposed strategy (PFEM-LS) shows more robustness than the classical approach (PFEM-AS) owing to three main improvements. First, the LS function allows for a better control over the elements that are created during the fluid/fluid contact, which helps to reduce mass creation. Second, it helps to preserve the smoothness of the free surface and to reduce mass loss. Third, it allows the meshing of solitary particles that are detached from the free surface, which improves the representation of drops in PFEM. The methodology is presented and validated using free surface flow problems in 2D.
Submitted 29 April, 2023;
originally announced May 2023.
-
Zorro: the masked multimodal transformer
Authors:
Adrià Recasens,
Jason Lin,
João Carreira,
Drew Jaegle,
Luyu Wang,
Jean-Baptiste Alayrac,
Pauline Luc,
Antoine Miech,
Lucas Smaira,
Ross Hemsley,
Andrew Zisserman
Abstract:
Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network - thus requiring very little fusion engineering. The resulting representations are however fully entangled throughout the network, which may not always be desirable: in learning, contrastive audio-visual self-supervised learning requires independent audio and visual features to operate, otherwise learning collapses; in inference, evaluation of audio-visual models should be possible on benchmarks having just audio or just video. In this paper, we introduce Zorro, a technique that uses masks to control how inputs from each modality are routed inside Transformers, keeping some parts of the representation modality-pure. We apply this technique to three popular transformer-based architectures (ViT, Swin and HiP) and show that with contrastive pre-training Zorro achieves state-of-the-art results on most relevant benchmarks for multimodal tasks (AudioSet and VGGSound). Furthermore, the resulting models are able to perform unimodal inference on both video and audio benchmarks such as Kinetics-400 or ESC-50.
Submitted 22 February, 2023; v1 submitted 23 January, 2023;
originally announced January 2023.
-
Flamingo: a Visual Language Model for Few-Shot Learning
Authors:
Jean-Baptiste Alayrac,
Jeff Donahue,
Pauline Luc,
Antoine Miech,
Iain Barr,
Yana Hasson,
Karel Lenc,
Arthur Mensch,
Katie Millican,
Malcolm Reynolds,
Roman Ring,
Eliza Rutherford,
Serkan Cabi,
Tengda Han,
Zhitao Gong,
Sina Samangooei,
Marianne Monteiro,
Jacob Menick,
Sebastian Borgeaud,
Andrew Brock,
Aida Nematzadeh,
Sahand Sharifzadeh,
Mikolaj Binkowski,
Ricardo Barreira,
Oriol Vinyals
, et al. (2 additional authors not shown)
Abstract:
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.
Submitted 15 November, 2022; v1 submitted 29 April, 2022;
originally announced April 2022.
-
Towards Learning Universal Audio Representations
Authors:
Luyu Wang,
Pauline Luc,
Yan Wu,
Adria Recasens,
Lucas Smaira,
Andrew Brock,
Andrew Jaegle,
Jean-Baptiste Alayrac,
Sander Dieleman,
Joao Carreira,
Aaron van den Oord
Abstract:
The ability to learn universal audio representations that can solve diverse speech, music, and environment tasks can spur many applications that require general sound content understanding. In this work, we introduce a holistic audio representation evaluation suite (HARES) spanning 12 downstream tasks across audio domains and provide a thorough empirical study of recent sound representation learning systems on that benchmark. We discover that previous sound event classification or speech models do not generalize outside of their domains. We observe that more robust audio representations can be learned with the SimCLR objective; however, the model's transferability depends heavily on the model architecture. We find the Slowfast architecture is good at learning rich representations required by different domains, but its performance is affected by the normalization scheme. Based on these findings, we propose a novel normalizer-free Slowfast NFNet and achieve state-of-the-art performance across all domains.
Submitted 23 June, 2022; v1 submitted 23 November, 2021;
originally announced November 2021.
-
Multimodal Self-Supervised Learning of General Audio Representations
Authors:
Luyu Wang,
Pauline Luc,
Adria Recasens,
Jean-Baptiste Alayrac,
Aaron van den Oord
Abstract:
We present a multimodal framework to learn general audio representations from videos. Existing contrastive audio representation learning methods mainly focus on using the audio modality alone during training. In this work, we show that additional information contained in video can be utilized to greatly improve the learned features. First, we demonstrate that our contrastive framework does not require high resolution images to learn good audio features. This allows us to scale up the training batch size, while keeping the computational load incurred by the additional video modality to a reasonable level. Second, we use augmentations that mix together different samples. We show that this is effective to make the proxy task harder, which leads to substantial performance improvements when increasing the batch size. As a result, our audio model achieves a state-of-the-art of 42.4 mAP on the AudioSet classification downstream task, closing the gap between supervised and self-supervised methods trained on the same dataset. Moreover, we show that our method is advantageous on a broad range of non-semantic audio tasks, including speaker identification, keyword spotting, language identification, and music instrument classification.
Submitted 28 April, 2021; v1 submitted 26 April, 2021;
originally announced April 2021.
-
Broaden Your Views for Self-Supervised Video Learning
Authors:
Adrià Recasens,
Pauline Luc,
Jean-Baptiste Alayrac,
Luyu Wang,
Ross Hemsley,
Florian Strub,
Corentin Tallec,
Mateusz Malinowski,
Viorica Patraucean,
Florent Altché,
Michal Valko,
Jean-Bastien Grill,
Aäron van den Oord,
Andrew Zisserman
Abstract:
Most successful self-supervised learning methods are trained to align the representations of two independent views from the data. State-of-the-art methods in video are inspired by image techniques, where these two views are similarly extracted by cropping and augmenting the resulting crop. However, these methods miss a crucial element in the video domain: time. We introduce BraVe, a self-supervised learning framework for video. In BraVe, one of the views has access to a narrow temporal window of the video while the other view has a broad access to the video content. Our models learn to generalise from the narrow view to the general content of the video. Furthermore, BraVe processes the views with different backbones, enabling the use of alternative augmentations or modalities into the broad view such as optical flow, randomly convolved RGB frames, audio or their combinations. We demonstrate that BraVe achieves state-of-the-art results in self-supervised representation learning on standard video and audio classification benchmarks including UCF101, HMDB51, Kinetics, ESC-50 and AudioSet.
Submitted 19 October, 2021; v1 submitted 30 March, 2021;
originally announced March 2021.
-
Game Plan: What AI can do for Football, and What Football can do for AI
Authors:
Karl Tuyls,
Shayegan Omidshafiei,
Paul Muller,
Zhe Wang,
Jerome Connor,
Daniel Hennes,
Ian Graham,
William Spearman,
Tim Waskett,
Dafydd Steele,
Pauline Luc,
Adria Recasens,
Alexandre Galashov,
Gregory Thornton,
Romuald Elie,
Pablo Sprechmann,
Pol Moreno,
Kris Cao,
Marta Garnelo,
Praneet Dutta,
Michal Valko,
Nicolas Heess,
Alex Bridgland,
Julien Perolat,
Bart De Vylder
, et al. (11 additional authors not shown)
Abstract:
The rapid progress in artificial intelligence (AI) and machine learning has opened unprecedented analytics possibilities in various team and individual sports, including baseball, basketball, and tennis. More recently, AI techniques have been applied to football, due to a huge increase in data collection by professional teams, increased computational power, and advances in machine learning, with the goal of better addressing new scientific challenges involved in the analysis of both individual players' and coordinated teams' behaviors. The research challenges associated with predictive and prescriptive football analytics require new developments and progress at the intersection of statistical learning, game theory, and computer vision. In this paper, we provide an overarching perspective highlighting how the combination of these fields, in particular, forms a unique microcosm for AI research, while offering mutual benefits for professional teams, spectators, and broadcasters in the years to come. We illustrate that this duality makes football analytics a game changer of tremendous value, in terms of not only changing the game of football itself, but also in terms of what this domain can mean for the field of AI. We review the state-of-the-art and exemplify the types of analysis enabled by combining the aforementioned fields, including illustrative examples of counterfactual analysis using predictive models, and the combination of game-theoretic analysis of penalty kicks with statistical learning of player attributes. We conclude by highlighting envisioned downstream impacts, including possibilities for extensions to other sports (real and virtual).
Submitted 18 November, 2020;
originally announced November 2020.
-
Transformation-based Adversarial Video Prediction on Large-Scale Data
Authors:
Pauline Luc,
Aidan Clark,
Sander Dieleman,
Diego de Las Casas,
Yotam Doron,
Albin Cassirer,
Karen Simonyan
Abstract:
Recent breakthroughs in adversarial generative modeling have led to models capable of producing video samples of high quality, even on large and complex datasets of real-world video. In this work, we focus on the task of video prediction, where given a sequence of frames extracted from a video, the goal is to generate a plausible future sequence. We first improve the state of the art by performing a systematic empirical study of discriminator decompositions and proposing an architecture that yields faster convergence and higher performance than previous approaches. We then analyze recurrent units in the generator, and propose a novel recurrent unit which transforms its past hidden state according to predicted motion-like features, and refines it to handle dis-occlusions, scene changes and other complex behavior. We show that this recurrent unit consistently outperforms previous designs. Our final model leads to a leap in the state-of-the-art performance, obtaining a test set Frechet Video Distance of 25.7, down from 69.2, on the large-scale Kinetics-600 dataset.
Submitted 17 November, 2021; v1 submitted 9 March, 2020;
originally announced March 2020.
-
Fully Parallel Hyperparameter Search: Reshaped Space-Filling
Authors:
M. -L. Cauwet,
C. Couprie,
J. Dehos,
P. Luc,
J. Rapin,
M. Riviere,
F. Teytaud,
O. Teytaud
Abstract:
Space-filling designs such as scrambled-Hammersley, Latin Hypercube Sampling and Jittered Sampling have been proposed for fully parallel hyperparameter search, and were shown to be more effective than random or grid search. In this paper, we show that these designs only improve over random search by a constant factor. In contrast, we introduce a new approach based on reshaping the search distribution, which leads to substantial gains over random search, both theoretically and empirically. We propose two flavors of reshaping. First, when the distribution of the optimum is some known $P_0$, we propose Recentering, which uses as search distribution a modified version of $P_0$ tightened closer to the center of the domain, in a dimension-dependent and budget-dependent manner. Second, we show that in a wide range of experiments with $P_0$ unknown, using a proposed Cauchy transformation, which simultaneously has a heavier tail (for unbounded hyperparameters) and is closer to the boundaries (for bounded hyperparameters), leads to improved performances. Besides artificial experiments and simple real world tests on clustering or Salmon mappings, we check our proposed methods on expensive artificial intelligence tasks such as attend/infer/repeat, video next frame segmentation forecasting and progressive generative adversarial networks.
Submitted 20 January, 2020; v1 submitted 18 October, 2019;
originally announced October 2019.
-
Predicting Future Instance Segmentation by Forecasting Convolutional Features
Authors:
Pauline Luc,
Camille Couprie,
Yann LeCun,
Jakob Verbeek
Abstract:
Anticipating future events is an important prerequisite towards intelligent behavior. Video forecasting has been studied as a proxy task towards this goal. Recent work has shown that to predict semantic segmentation of future frames, forecasting at the semantic level is more effective than forecasting RGB frames and then segmenting these. In this paper we consider the more challenging problem of future instance segmentation, which additionally segments out individual objects. To deal with a varying number of output labels per image, we develop a predictive model in the space of fixed-sized convolutional features of the Mask R-CNN instance segmentation model. We apply the "detection head" of Mask R-CNN on the predicted features to produce the instance segmentation of future frames. Experiments show that this approach significantly improves over strong baselines based on optical flow and repurposed instance segmentation architectures.
Submitted 3 October, 2018; v1 submitted 30 March, 2018;
originally announced March 2018.
-
Predicting Deeper into the Future of Semantic Segmentation
Authors:
Pauline Luc,
Natalia Neverova,
Camille Couprie,
Jakob Verbeek,
Yann LeCun
Abstract:
The ability to predict and therefore to anticipate the future is an important attribute of intelligence. It is also of utmost importance in real-time systems, e.g. in robotics or autonomous driving, which depend on visual scene understanding for decision making. While prediction of the raw RGB pixel values in future video frames has been studied in previous work, here we introduce the novel task of predicting semantic segmentations of future frames. Given a sequence of video frames, our goal is to predict segmentation maps of not yet observed video frames that lie up to a second or further in the future. We develop an autoregressive convolutional neural network that learns to iteratively generate multiple frames. Our results on the Cityscapes dataset show that directly predicting future segmentations is substantially better than predicting and then segmenting future RGB frames. Prediction results up to half a second in the future are visually convincing and are much more accurate than those of a baseline based on warping semantic segmentations using optical flow.
Submitted 8 August, 2017; v1 submitted 22 March, 2017;
originally announced March 2017.
-
Semantic Segmentation using Adversarial Networks
Authors:
Pauline Luc,
Camille Couprie,
Soumith Chintala,
Jakob Verbeek
Abstract:
Adversarial training has been shown to produce state of the art results for generative image modeling. In this paper we propose an adversarial training approach to train semantic segmentation models. We train a convolutional semantic segmentation network along with an adversarial network that discriminates segmentation maps coming either from the ground truth or from the segmentation network. The motivation for our approach is that it can detect and correct higher-order inconsistencies between ground truth segmentation maps and the ones produced by the segmentation net. Our experiments show that our adversarial training approach leads to improved accuracy on the Stanford Background and PASCAL VOC 2012 datasets.
Submitted 25 November, 2016;
originally announced November 2016.
-
Sur l'algébrisation des tissus de rang maximal
Authors:
Pirio Luc,
Trépreau Jean-Marie
Abstract:
We show that a web of codimension at least two and of maximal rank is isomorphic to an algebraic web. This solves a problem first considered by Chern and Griffiths.
Submitted 13 February, 2013;
originally announced February 2013.
-
Study of a functional equation associated to the Kummer's equation of the trilogarithm. Applications
Authors:
Pirio Luc
Abstract:
In this paper we study a functional equation associated to the Kummer's equation (K) of the trilogarithm. Then we apply our results to web geometry and to characterize the functions solution of (K).
Submitted 18 June, 2002;
originally announced June 2002.