-
TAPVid-3D: A Benchmark for Tracking Any Point in 3D
Authors:
Skanda Koppula,
Ignacio Rocco,
Yi Yang,
Joe Heyward,
João Carreira,
Andrew Zisserman,
Gabriel Brostow,
Carl Doersch
Abstract:
We introduce a new benchmark, TAPVid-3D, for evaluating the task of long-range Tracking Any Point in 3D (TAP-3D). While point tracking in two dimensions (TAP) has many benchmarks measuring performance on real-world videos, such as TAPVid-DAVIS, three-dimensional point tracking has none. To this end, leveraging existing footage, we build a new benchmark for 3D point tracking featuring 4,000+ real-world videos, composed of three different data sources spanning a variety of object types, motion patterns, and indoor and outdoor environments. To measure performance on the TAP-3D task, we formulate a collection of metrics that extend the Jaccard-based metric used in TAP to handle the complexities of ambiguous depth scales across models, occlusions, and multi-track spatio-temporal smoothness. We manually verify a large sample of trajectories to ensure correct video annotations, and assess the current state of the TAP-3D task by constructing competitive baselines using existing tracking models. We anticipate this benchmark will serve as a guidepost to improve our ability to understand precise 3D motion and surface deformation from monocular video. Code for dataset download, generation, and model evaluation is available at https://tapvid3d.github.io
Submitted 27 August, 2024; v1 submitted 8 July, 2024;
originally announced July 2024.
-
BootsTAP: Bootstrapped Training for Tracking-Any-Point
Authors:
Carl Doersch,
Pauline Luc,
Yi Yang,
Dilara Gokay,
Skanda Koppula,
Ankush Gupta,
Joseph Heyward,
Ignacio Rocco,
Ross Goroshin,
João Carreira,
Andrew Zisserman
Abstract:
To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes. This can be formalized as Tracking-Any-Point (TAP), which requires the algorithm to track any point on solid surfaces in a video, potentially densely in space and time. Large-scale ground-truth training data for TAP is only available in simulation, which currently has a limited variety of objects and motion. In this work, we demonstrate how large-scale, unlabeled, uncurated real-world data can improve a TAP model with minimal architectural changes, using a self-supervised student-teacher setup. We demonstrate state-of-the-art performance on the TAP-Vid benchmark, surpassing previous results by a wide margin: for example, TAP-Vid-DAVIS performance improves from 61.3% to 67.4%, and TAP-Vid-Kinetics from 57.2% to 62.5%. For visualizations, see our project webpage at https://bootstap.github.io/
Submitted 23 May, 2024; v1 submitted 1 February, 2024;
originally announced February 2024.
-
Perception Test 2023: A Summary of the First Challenge And Outcome
Authors:
Joseph Heyward,
João Carreira,
Dima Damen,
Andrew Zisserman,
Viorica Pătrăucean
Abstract:
The First Perception Test challenge was held as a half-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2023, with the goal of benchmarking state-of-the-art video models on the recently proposed Perception Test benchmark. The challenge had six tracks covering low-level and high-level tasks, with both a language and non-language interface, across video, audio, and text modalities, and covering: object tracking, point tracking, temporal action localisation, temporal sound localisation, multiple-choice video question-answering, and grounded video question-answering. We summarise in this report the task descriptions, metrics, baselines, and results.
Submitted 20 December, 2023;
originally announced December 2023.
-
Learning from One Continuous Video Stream
Authors:
João Carreira,
Michael King,
Viorica Pătrăucean,
Dilara Gokay,
Cătălin Ionescu,
Yi Yang,
Daniel Zoran,
Joseph Heyward,
Carl Doersch,
Yusuf Aytar,
Dima Damen,
Andrew Zisserman
Abstract:
We introduce a framework for online learning from a single continuous video stream -- the way people and animals learn, without mini-batches, data augmentation or shuffling. This poses great challenges given the high correlation between consecutive video frames and there is very little prior work on it. Our framework allows us to do a first deep dive into the topic and includes a collection of streams and tasks composed from two existing video datasets, plus methodology for performance evaluation that considers both adaptation and generalization. We employ pixel-to-pixel modelling as a practical and flexible way to switch between pre-training and single-stream evaluation as well as between arbitrary tasks, without ever requiring changes to models and always using the same pixel loss. Equipped with this framework we obtained large single-stream learning gains from pre-training with a novel family of future prediction tasks, found that momentum hurts, and that the pace of weight updates matters. The combination of these insights leads to matching the performance of IID learning with batch size 1, when using the same architecture and without costly replay buffers.
Submitted 28 March, 2024; v1 submitted 1 December, 2023;
originally announced December 2023.
-
Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video
Authors:
Shashanka Venkataramanan,
Mamshad Nayeem Rizve,
João Carreira,
Yuki M. Asano,
Yannis Avrithis
Abstract:
Self-supervised learning has unlocked the potential of scaling up pretraining to billions of images, since annotation is unnecessary. But are we making the best use of data? How much more economical can we be? In this work, we attempt to answer this question by making two contributions. First, we investigate first-person videos and introduce a "Walking Tours" dataset. These videos are high-resolution, hours-long, captured in a single uninterrupted take, depicting a large number of objects and actions with natural scene transitions. They are unlabeled and uncurated, thus realistic for self-supervision and comparable with human learning.
Second, we introduce a novel self-supervised image pretraining method tailored for learning from continuous videos. Existing methods typically adapt image-based pretraining approaches to incorporate more frames. Instead, we advocate a "tracking to learn to recognize" approach. Our method, called DoRA, leads to attention maps that Discover and tRAck objects over time in an end-to-end manner, using transformer cross-attention. We derive multiple views from the tracks and use them in a classical self-supervised distillation loss. Using our novel approach, a single Walking Tours video remarkably becomes a strong competitor to ImageNet for several image and video downstream tasks.
Submitted 23 May, 2024; v1 submitted 12 October, 2023;
originally announced October 2023.
-
Disentangling photodoping, photoconductivity, and photosuperconductivity in the cuprates
Authors:
R. El Hage,
D. Sánchez-Manzano,
V. Humbert,
S. J. Carreira,
V. Rouco,
A Sander,
F. Cuellar,
K. Seurre,
A. Lagarrigue,
J. Briatico,
J. Trastoy,
J. Santamaría,
Javier E. Villegas
Abstract:
The normal-state conductivity and superconducting critical temperature of oxygen-deficient YBa2Cu3O7-x can be persistently enhanced by illumination. Strongly debated for years, the origin of those effects -- termed persistent photoconductivity (PPC) and photosuperconductivity (PPS) -- has remained an unsolved critical problem, whose comprehension may provide key insights to harness the origin of high-temperature superconductivity itself. Here we make essential steps toward understanding PPS. While the models proposed so far assume that it is caused by a carrier-density increase (photodoping) observed concomitantly, our experiments contradict such conventional belief: we demonstrate that it is instead linked to a photo-induced decrease of the electronic scattering rate. Furthermore, we find that the latter effect and photodoping are completely disconnected and originate from different microscopic mechanisms since they present different wavelength and oxygen-content dependencies as well as strikingly different relaxation dynamics. Besides helping disentangle photodoping, PPC, and PPS, our results provide new evidence for the intimate relation between critical temperature and scattering rate, a key ingredient in modern theories on high-temperature superconductivity.
Submitted 4 October, 2023;
originally announced October 2023.
-
TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement
Authors:
Carl Doersch,
Yi Yang,
Mel Vecerik,
Dilara Gokay,
Ankush Gupta,
Yusuf Aytar,
Joao Carreira,
Andrew Zisserman
Abstract:
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence. Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations. The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS. Our model facilitates fast inference on long and high-resolution video sequences. On a modern GPU, our implementation has the capacity to track points faster than real-time, and can be flexibly extended to higher-resolution videos. Given the high-quality trajectories extracted from a large dataset, we demonstrate a proof-of-concept diffusion model which generates trajectories from static images, enabling plausible animations. Visualizations, source code, and pretrained models can be found on our project webpage.
Submitted 30 August, 2023; v1 submitted 14 June, 2023;
originally announced June 2023.
-
Perception Test: A Diagnostic Benchmark for Multimodal Video Models
Authors:
Viorica Pătrăucean,
Lucas Smaira,
Ankush Gupta,
Adrià Recasens Continente,
Larisa Markeeva,
Dylan Banarse,
Skanda Koppula,
Joseph Heyward,
Mateusz Malinowski,
Yi Yang,
Carl Doersch,
Tatiana Matejovicova,
Yury Sulsky,
Antoine Miech,
Alex Frechette,
Hanna Klimczak,
Raphael Koster,
Junlin Zhang,
Stephanie Winkler,
Yusuf Aytar,
Simon Osindero,
Dima Damen,
Andrew Zisserman,
João Carreira
Abstract:
We propose a novel multimodal video benchmark - the Perception Test - to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, SeViLA, or GPT-4). Compared to existing benchmarks that focus on computational tasks (e.g. classification, detection or tracking), the Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities, to provide a comprehensive and efficient evaluation tool. The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime. For these purposes, the Perception Test introduces 11.6k real-world videos, 23s average length, designed to show perceptually interesting situations, filmed by around 100 participants worldwide. The videos are densely annotated with six types of labels (multiple-choice and grounded video question-answers, object and point tracks, temporal action and sound segments), enabling both language and non-language evaluations. The fine-tuning and validation splits of the benchmark are publicly available (CC-BY license), in addition to a challenge server with a held-out test split. Human baseline results compared to state-of-the-art video QA models show a substantial gap in performance (91.4% vs 46.2%), suggesting that there is significant room for improvement in multimodal video understanding.
Dataset, baseline code, and challenge server are available at https://github.com/deepmind/perception_test
Submitted 30 October, 2023; v1 submitted 23 May, 2023;
originally announced May 2023.
-
Chip-scale, CMOS-compatible, high energy passively Q-switched laser
Authors:
Neetesh Singh,
Jan Lorenzen,
Milan Sinobad,
Kai Wang,
Andreas C. Liapis,
Henry Frankis,
Stefanie Haugg,
Henry Francis,
Jose Carreira,
Michael Geiselmann,
Mahmoud A. Gaafar,
Tobias Herr,
Jonathan D. B. Bradley,
Zhipei Sun,
Sonia M Garcia-Blanco,
Franz X. Kartner
Abstract:
Chip-scale, high-energy optical pulse generation is becoming increasingly important as we expand activities into hard-to-reach areas such as space and deep ocean. Q-switching of the laser cavity is the best known technique for generating high-energy pulses, and typically such systems are in the realm of large bench-top solid-state lasers and fiber lasers, especially in the long wavelength range >1.8 µm, thanks to their large energy storage capacity. However, in integrated photonics, the very property of tight mode confinement that enables a small form factor becomes an impediment to high-energy applications due to the small optical mode cross-section. In this work, we demonstrate a complementary metal-oxide-semiconductor (CMOS) compatible, rare-earth gain based, large mode area (LMA) passively Q-switched laser in a compact footprint. We demonstrate high on-chip output pulse energy of >150 nJ in a single transverse fundamental mode in the eye-safe window (1.9 µm), with a slope efficiency of ~40% in a footprint of ~9 mm². The high-energy pulse generation demonstrated in this work is comparable to, and in many cases exceeds, that of Q-switched fiber lasers. This bodes well for field applications in medicine and space.
Submitted 1 March, 2023;
originally announced March 2023.
-
Zorro: the masked multimodal transformer
Authors:
Adrià Recasens,
Jason Lin,
João Carreira,
Drew Jaegle,
Luyu Wang,
Jean-baptiste Alayrac,
Pauline Luc,
Antoine Miech,
Lucas Smaira,
Ross Hemsley,
Andrew Zisserman
Abstract:
Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network - thus requiring very little fusion engineering. The resulting representations are however fully entangled throughout the network, which may not always be desirable: in learning, contrastive audio-visual self-supervised learning requires independent audio and visual features to operate, otherwise learning collapses; in inference, evaluation of audio-visual models should be possible on benchmarks having just audio or just video. In this paper, we introduce Zorro, a technique that uses masks to control how inputs from each modality are routed inside Transformers, keeping some parts of the representation modality-pure. We apply this technique to three popular transformer-based architectures (ViT, Swin and HiP) and show that with contrastive pre-training Zorro achieves state-of-the-art results on most relevant benchmarks for multimodal tasks (AudioSet and VGGSound). Furthermore, the resulting models are able to perform unimodal inference on both video and audio benchmarks such as Kinetics-400 or ESC-50.
Submitted 22 February, 2023; v1 submitted 23 January, 2023;
originally announced January 2023.
-
TAP-Vid: A Benchmark for Tracking Any Point in a Video
Authors:
Carl Doersch,
Ankush Gupta,
Larisa Markeeva,
Adrià Recasens,
Lucas Smaira,
Yusuf Aytar,
João Carreira,
Andrew Zisserman,
Yi Yang
Abstract:
Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move. This information is useful to make inferences about 3D shape, physical properties and object interactions. While the problem of tracking arbitrary physical points on surfaces over longer video clips has received some attention, no dataset or benchmark for evaluation existed, until now. In this paper, we first formalize the problem, naming it tracking any point (TAP). We introduce a companion benchmark, TAP-Vid, which is composed of both real-world videos with accurate human annotations of point tracks, and synthetic videos with perfect ground-truth point tracks. Central to the construction of our benchmark is a novel semi-automatic crowdsourced pipeline which uses optical flow estimates to compensate for easier, short-term motion like camera shake, allowing annotators to focus on harder sections of video. We validate our pipeline on synthetic data and propose a simple end-to-end point tracking model TAP-Net, showing that it outperforms all prior methods on our benchmark when trained on synthetic data.
Submitted 31 March, 2023; v1 submitted 7 November, 2022;
originally announced November 2022.
-
Self-supervised video pretraining yields human-aligned visual representations
Authors:
Nikhil Parthasarathy,
S. M. Ali Eslami,
João Carreira,
Olivier J. Hénaff
Abstract:
Humans learn powerful representations of objects and scenes by observing how they evolve over time. Yet, outside of specific tasks that require explicit temporal understanding, static image pretraining remains the dominant paradigm for learning visual foundation models. We question this mismatch, and ask whether video pretraining can yield visual representations that bear the hallmarks of human perception: generalisation across tasks, robustness to perturbations, and consistency with human judgements. To that end we propose a novel procedure for curating videos, and develop a contrastive framework which learns from the complex transformations therein. This simple paradigm for distilling knowledge from videos, called VITO, yields general representations that far outperform prior video pretraining methods on image understanding tasks, and image pretraining methods on video understanding tasks. Moreover, VITO representations are significantly more robust to natural and synthetic deformations than image-, video-, and adversarially-trained ones. Finally, VITO's predictions are strongly aligned with human judgements, surpassing models that were specifically trained for that purpose. Together, these results suggest that video pretraining could be a simple way of learning unified, robust, and human-aligned representations of the visual world.
Submitted 25 July, 2023; v1 submitted 12 October, 2022;
originally announced October 2022.
-
Compressed Vision for Efficient Video Understanding
Authors:
Olivia Wiles,
Joao Carreira,
Iain Barr,
Andrew Zisserman,
Mateusz Malinowski
Abstract:
Experience and reasoning occur across multiple temporal scales: milliseconds, seconds, hours or days. The vast majority of computer vision research, however, still focuses on individual images or short videos lasting only a few seconds. This is because handling longer videos requires more scalable approaches even to process them. In this work, we propose a framework enabling research on hour-long videos with the same hardware that can now process second-long videos. We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks. Operating on compressed videos improves efficiency at all pipeline levels -- data transfer, speed and memory -- making it possible to train models faster and on much longer videos. Processing compressed signals has, however, the downside of precluding standard augmentation techniques if done naively. We address that by introducing a small network that can apply transformations to latent codes corresponding to commonly used augmentations in the original video space. We demonstrate that with our compressed vision pipeline, we can train video models more efficiently on popular benchmarks such as Kinetics600 and COIN. We also perform proof-of-concept experiments with new tasks defined over hour-long videos at standard frame rates. Processing such long videos is impossible without using a compressed representation.
Submitted 6 October, 2022;
originally announced October 2022.
-
Where Should I Spend My FLOPS? Efficiency Evaluations of Visual Pre-training Methods
Authors:
Skanda Koppula,
Yazhe Li,
Evan Shelhamer,
Andrew Jaegle,
Nikhil Parthasarathy,
Relja Arandjelovic,
João Carreira,
Olivier Hénaff
Abstract:
Self-supervised methods have achieved remarkable success in transfer learning, often achieving the same or better accuracy than supervised pre-training. Most prior work has done so by increasing pre-training computation by adding complex data augmentation, multiple views, or lengthy training schedules. In this work, we investigate a related, but orthogonal question: given a fixed FLOP budget, what are the best datasets, models, and (self-)supervised training methods for obtaining high accuracy on representative visual tasks? Given the availability of large datasets, this setting is often more relevant for both academic and industry labs alike. We examine five large-scale datasets (JFT-300M, ALIGN, ImageNet-1K, ImageNet-21K, and COCO) and six pre-training methods (CLIP, DINO, SimCLR, BYOL, Masked Autoencoding, and supervised). In a like-for-like fashion, we characterize their FLOP and CO$_2$ footprints, relative to their accuracy when transferred to a canonical image segmentation task. Our analysis reveals strong disparities in the computational efficiency of pre-training methods and their dependence on dataset quality. In particular, our results call into question the commonly-held assumption that self-supervised methods inherently scale to large, uncurated data. We therefore advocate for (1) paying closer attention to dataset curation and (2) reporting of accuracies in context of the total computational cost.
Submitted 18 October, 2022; v1 submitted 30 September, 2022;
originally announced September 2022.
-
Transframer: Arbitrary Frame Prediction with Generative Models
Authors:
Charlie Nash,
João Carreira,
Jacob Walker,
Iain Barr,
Andrew Jaegle,
Mateusz Malinowski,
Peter Battaglia
Abstract:
We present a general-purpose framework for image modelling and vision tasks based on probabilistic frame prediction. Our approach unifies a broad range of tasks, from image segmentation, to novel view synthesis and video interpolation. We pair this framework with an architecture we term Transframer, which uses U-Net and Transformer components to condition on annotated context frames, and outputs sequences of sparse, compressed image features. Transframer is the state-of-the-art on a variety of video generation benchmarks, is competitive with the strongest models on few-shot view synthesis, and can generate coherent 30 second videos from a single image without any explicit geometric information. A single generalist Transframer simultaneously produces promising results on 8 tasks, including semantic segmentation, image classification and optical flow prediction with no task-specific architectural components, demonstrating that multi-task computer vision can be tackled using probabilistic image models. Our approach can in principle be applied to a wide range of applications that require learning the conditional structure of annotated image-formatted data.
Submitted 9 May, 2022; v1 submitted 17 March, 2022;
originally announced March 2022.
-
Object discovery and representation networks
Authors:
Olivier J. Hénaff,
Skanda Koppula,
Evan Shelhamer,
Daniel Zoran,
Andrew Jaegle,
Andrew Zisserman,
João Carreira,
Relja Arandjelović
Abstract:
The promise of self-supervised learning (SSL) is to leverage large amounts of unlabeled data to solve complex tasks. While there has been excellent progress with simple, image-level learning, recent methods have shown the advantage of including knowledge of image structure. However, by introducing hand-crafted image segmentations to define regions of interest, or specialized augmentation strategies, these methods sacrifice the simplicity and generality that makes SSL so powerful. Instead, we propose a self-supervised learning paradigm that discovers this image structure by itself. Our method, Odin, couples object discovery and representation networks to discover meaningful image segmentations without any supervision. The resulting learning paradigm is simpler, less brittle, and more general, and achieves state-of-the-art transfer learning results for object detection and instance segmentation on COCO, and semantic segmentation on PASCAL and Cityscapes, while strongly surpassing supervised pre-training for video segmentation on DAVIS.
Submitted 27 July, 2022; v1 submitted 16 March, 2022;
originally announced March 2022.
-
HiP: Hierarchical Perceiver
Authors:
Joao Carreira,
Skanda Koppula,
Daniel Zoran,
Adria Recasens,
Catalin Ionescu,
Olivier Henaff,
Evan Shelhamer,
Relja Arandjelovic,
Matt Botvinick,
Oriol Vinyals,
Karen Simonyan,
Andrew Zisserman,
Andrew Jaegle
Abstract:
General perception systems such as Perceivers can process arbitrary modalities in any combination and are able to handle up to a few hundred thousand inputs. They achieve this generality by using exclusively global attention operations. This however hinders them from scaling up to the inputs sizes required to process raw high-resolution images or video. In this paper, we show that some degree of l…
▽ More
General perception systems such as Perceivers can process arbitrary modalities in any combination and are able to handle up to a few hundred thousand inputs. They achieve this generality by using exclusively global attention operations. This however hinders them from scaling up to the inputs sizes required to process raw high-resolution images or video. In this paper, we show that some degree of locality can be introduced back into these models, greatly improving their efficiency while preserving their generality. To scale them further, we introduce a self-supervised approach that enables learning dense low-dimensional positional embeddings for very large signals. We call the resulting model a Hierarchical Perceiver (HiP). In sum our contributions are: 1) scaling Perceiver-type models to raw high-resolution images and audio+video, 2) showing the feasibility of learning 1M+ positional embeddings from scratch using masked auto-encoding, 3) demonstrating competitive performance on raw data from ImageNet, AudioSet, PASCAL VOC, ModelNet40 and Kinetics datasets with the same exact, unchanged model and without specialized preprocessing or any tokenization.
Submitted 3 November, 2022; v1 submitted 22 February, 2022;
originally announced February 2022.
-
General-purpose, long-context autoregressive modeling with Perceiver AR
Authors:
Curtis Hawthorne,
Andrew Jaegle,
Cătălina Cangea,
Sebastian Borgeaud,
Charlie Nash,
Mateusz Malinowski,
Sander Dieleman,
Oriol Vinyals,
Matthew Botvinick,
Ian Simon,
Hannah Sheahan,
Neil Zeghidour,
Jean-Baptiste Alayrac,
João Carreira,
Jesse Engel
Abstract:
Real-world data is high-dimensional: a book, image, or musical performance can easily contain hundreds of thousands of elements even after compression. However, the most commonly used autoregressive models, Transformers, are prohibitively expensive to scale to the number of inputs and layers needed to capture this long-range structure. We develop Perceiver AR, an autoregressive, modality-agnostic architecture which uses cross-attention to map long-range inputs to a small number of latents while also maintaining end-to-end causal masking. Perceiver AR can directly attend to over a hundred thousand tokens, enabling practical long-context density estimation without the need for hand-crafted sparsity patterns or memory mechanisms. When trained on images or music, Perceiver AR generates outputs with clear long-term coherence and structure. Our architecture also obtains state-of-the-art likelihood on long-sequence benchmarks, including 64 x 64 ImageNet images and PG-19 books.
Submitted 14 June, 2022; v1 submitted 15 February, 2022;
originally announced February 2022.
-
Input-level Inductive Biases for 3D Reconstruction
Authors:
Wang Yifan,
Carl Doersch,
Relja Arandjelović,
João Carreira,
Andrew Zisserman
Abstract:
Much of the recent progress in 3D vision has been driven by the development of specialized architectures that incorporate geometrical inductive biases. In this paper we tackle 3D reconstruction using a domain agnostic architecture and study how instead to inject the same type of inductive biases directly as extra inputs to the model. This approach makes it possible to apply existing general models, such as Perceivers, on this rich domain, without the need for architectural changes, while simultaneously maintaining data efficiency of bespoke models. In particular we study how to encode cameras, projective ray incidence and epipolar geometry as model inputs, and demonstrate competitive multi-view depth estimation performance on multiple benchmarks.
Submitted 19 March, 2022; v1 submitted 6 December, 2021;
originally announced December 2021.
-
Towards Learning Universal Audio Representations
Authors:
Luyu Wang,
Pauline Luc,
Yan Wu,
Adria Recasens,
Lucas Smaira,
Andrew Brock,
Andrew Jaegle,
Jean-Baptiste Alayrac,
Sander Dieleman,
Joao Carreira,
Aaron van den Oord
Abstract:
The ability to learn universal audio representations that can solve diverse speech, music, and environment tasks can spur many applications that require general sound content understanding. In this work, we introduce a holistic audio representation evaluation suite (HARES) spanning 12 downstream tasks across audio domains and provide a thorough empirical study of recent sound representation learning systems on that benchmark. We discover that previous sound event classification or speech models do not generalize outside of their domains. We observe that more robust audio representations can be learned with the SimCLR objective; however, the model's transferability depends heavily on the model architecture. We find the Slowfast architecture is good at learning rich representations required by different domains, but its performance is affected by the normalization scheme. Based on these findings, we propose a novel normalizer-free Slowfast NFNet and achieve state-of-the-art performance across all domains.
Submitted 23 June, 2022; v1 submitted 23 November, 2021;
originally announced November 2021.
-
Perceiver IO: A General Architecture for Structured Inputs & Outputs
Authors:
Andrew Jaegle,
Sebastian Borgeaud,
Jean-Baptiste Alayrac,
Carl Doersch,
Catalin Ionescu,
David Ding,
Skanda Koppula,
Daniel Zoran,
Andrew Brock,
Evan Shelhamer,
Olivier Hénaff,
Matthew M. Botvinick,
Andrew Zisserman,
Oriol Vinyals,
João Carreira
Abstract:
A central goal of machine learning is the development of systems that can solve many problems in as many data domains as possible. Current architectures, however, cannot be applied beyond a small set of stereotyped settings, as they bake in domain & task assumptions or scale poorly to large inputs or outputs. In this work, we propose Perceiver IO, a general-purpose architecture that handles data from arbitrary settings while scaling linearly with the size of inputs and outputs. Our model augments the Perceiver with a flexible querying mechanism that enables outputs of various sizes and semantics, doing away with the need for task-specific architecture engineering. The same architecture achieves strong results on tasks spanning natural language and visual understanding, multi-task and multi-modal reasoning, and StarCraft II. As highlights, Perceiver IO outperforms a Transformer-based BERT baseline on the GLUE language benchmark despite removing input tokenization and achieves state-of-the-art performance on Sintel optical flow estimation with no explicit mechanisms for multiscale correspondence.
Submitted 15 March, 2022; v1 submitted 30 July, 2021;
originally announced July 2021.
-
Gradient Forward-Propagation for Large-Scale Temporal Video Modelling
Authors:
Mateusz Malinowski,
Dimitrios Vytiniotis,
Grzegorz Swirszcz,
Viorica Patraucean,
Joao Carreira
Abstract:
How can neural networks be trained on large-volume temporal data efficiently? To compute the gradients required to update parameters, backpropagation blocks computations until the forward and backward passes are completed. For temporal signals, this introduces high latency and hinders real-time learning. It also creates a coupling between consecutive layers, which limits model parallelism and increases memory consumption. In this paper, we build upon Sideways, which avoids blocking by propagating approximate gradients forward in time, and we propose mechanisms for temporal integration of information based on different variants of skip connections. We also show how to decouple computation and delegate individual neural modules to different devices, allowing distributed and parallel training. The proposed Skip-Sideways achieves low latency training, model parallelism, and, importantly, is capable of extracting temporal features, leading to more stable training and improved performance on real-world action recognition video datasets such as HMDB51, UCF101, and the large-scale Kinetics-600. Finally, we also show that models trained with Skip-Sideways generate better future frames than Sideways models, and hence they can better utilize motion cues.
Submitted 12 July, 2021; v1 submitted 15 June, 2021;
originally announced June 2021.
-
Efficient Visual Pretraining with Contrastive Detection
Authors:
Olivier J. Hénaff,
Skanda Koppula,
Jean-Baptiste Alayrac,
Aaron van den Oord,
Oriol Vinyals,
João Carreira
Abstract:
Self-supervised pretraining has been shown to yield powerful representations for transfer learning. These performance gains come at a large computational cost, however, with state-of-the-art methods requiring an order of magnitude more computation than supervised pretraining. We tackle this computational bottleneck by introducing a new self-supervised objective, contrastive detection, which tasks representations with identifying object-level features across augmentations. This objective extracts a rich learning signal per image, leading to state-of-the-art transfer accuracy on a variety of downstream tasks, while requiring up to 10x less pretraining. In particular, our strongest ImageNet-pretrained model performs on par with SEER, one of the largest self-supervised systems to date, which uses 1000x more pretraining data. Finally, our objective seamlessly handles pretraining on more complex images such as those in COCO, closing the gap with supervised transfer learning from COCO to PASCAL.
Submitted 5 August, 2021; v1 submitted 19 March, 2021;
originally announced March 2021.
-
Perceiver: General Perception with Iterative Attention
Authors:
Andrew Jaegle,
Felix Gimeno,
Andrew Brock,
Andrew Zisserman,
Oriol Vinyals,
Joao Carreira
Abstract:
Biological systems perceive the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning on the other hand are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but also lock models to individual modalities. In this paper we introduce the Perceiver - a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets. The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to scale to handle very large inputs. We show that this architecture is competitive with or outperforms strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video, and video+audio. The Perceiver obtains performance comparable to ResNet-50 and ViT on ImageNet without 2D convolutions by directly attending to 50,000 pixels. It is also competitive in all modalities in AudioSet.
Submitted 22 June, 2021; v1 submitted 4 March, 2021;
originally announced March 2021.
-
A Short Note on the Kinetics-700-2020 Human Action Dataset
Authors:
Lucas Smaira,
João Carreira,
Eric Noland,
Ellen Clancy,
Amy Wu,
Andrew Zisserman
Abstract:
We describe the 2020 edition of the DeepMind Kinetics human action dataset, which replenishes and extends the Kinetics-700 dataset. In this new version, there are at least 700 video clips from different YouTube videos for each of the 700 classes. This paper details the changes introduced for this new release of the dataset and includes a comprehensive set of statistics as well as baseline results using the I3D network.
Submitted 21 October, 2020;
originally announced October 2020.
-
Spin pumping in d-wave superconductor/ferromagnet hybrids
Authors:
S. J. Carreira,
D. Sanchez-Manzano,
M. -W. Yoo,
V. Rouco,
A. Sander,
J. Santamaría,
A. Anane,
J. E. Villegas
Abstract:
Spin-pumping across ferromagnet/superconductor (F/S) interfaces has attracted much attention lately. Yet the focus has been mainly on s-wave superconductor-based systems, whereas (high-temperature) d-wave superconductors such as YBa2Cu3O7-d (YBCO) have received scarce attention despite their fundamental and technological interest. Here we use wideband ferromagnetic resonance to study spin-pumping effects in bilayers that combine a soft metallic Ni80Fe20 (Py) ferromagnet and YBCO. We evaluate the spin conductance in YBCO by analyzing the magnetization dynamics in Py. We find that the Gilbert damping exhibits a drastic drop as the heterostructures are cooled across the normal-superconducting transition and then, depending on the S/F interface morphology, either stays constant or shows a strong upturn. This unique behavior is explained by considering the quasiparticle density of states at the YBCO surface, and is a direct consequence of zero-gap nodes for particular directions in momentum space. Besides showing the fingerprint of d-wave superconductivity in spin-pumping, our results demonstrate the potential of high-temperature superconductors for fine tuning of the magnetization dynamics in ferromagnets using k-space degrees of freedom of d-wave/F interfaces.
Submitted 25 November, 2021; v1 submitted 7 September, 2020;
originally announced September 2020.
-
The AVA-Kinetics Localized Human Actions Video Dataset
Authors:
Ang Li,
Meghana Thotakuri,
David A. Ross,
João Carreira,
Alexander Vostrikov,
Andrew Zisserman
Abstract:
This paper describes the AVA-Kinetics localized human actions video dataset. The dataset is collected by annotating videos from the Kinetics-700 dataset using the AVA annotation protocol, and extending the original AVA dataset with these new AVA annotated Kinetics clips. The dataset contains over 230k clips annotated with the 80 AVA action classes for each of the humans in key-frames. We describe the annotation process and provide statistics about the new dataset. We also include a baseline evaluation using the Video Action Transformer Network on the AVA-Kinetics dataset, demonstrating improved performance for action classification on the AVA test set. The dataset can be downloaded from https://research.google.com/ava/
Submitted 20 May, 2020; v1 submitted 1 May, 2020;
originally announced May 2020.
-
Visual Grounding in Video for Unsupervised Word Translation
Authors:
Gunnar A. Sigurdsson,
Jean-Baptiste Alayrac,
Aida Nematzadeh,
Lucas Smaira,
Mateusz Malinowski,
João Carreira,
Phil Blunsom,
Andrew Zisserman
Abstract:
There are thousands of actively spoken languages on Earth, but a single visual world. Grounding in this visual world has the potential to bridge the gap between all these languages. Our goal is to use visual grounding to improve unsupervised word mapping between languages. The key idea is to establish a common visual representation between two languages by learning embeddings from unpaired instructional videos narrated in the native language. Given this shared embedding we demonstrate that (i) we can map words between the languages, particularly the 'visual' words; (ii) that the shared embedding provides a good initialization for existing unsupervised text-based word translation techniques, forming the basis for our proposed hybrid visual-text mapping algorithm, MUVE; and (iii) our approach achieves superior performance by addressing the shortcomings of text-based methods -- it is more robust, handles datasets with less commonality, and is applicable to low-resource languages. We apply these methods to translate words from English to French, Korean, and Japanese -- all without any parallel corpora and simply by watching many videos of people speaking while doing things.
Submitted 26 March, 2020; v1 submitted 10 March, 2020;
originally announced March 2020.
-
Sideways: Depth-Parallel Training of Video Models
Authors:
Mateusz Malinowski,
Grzegorz Swirszcz,
Joao Carreira,
Viorica Patraucean
Abstract:
We propose Sideways, an approximate backpropagation scheme for training video models. In standard backpropagation, the gradients and activations at every computation step through the model are temporally synchronized. The forward activations need to be stored until the backward pass is executed, preventing inter-layer (depth) parallelization. However, can we leverage smooth, redundant input streams such as videos to develop a more efficient training scheme? Here, we explore an alternative to backpropagation; we overwrite network activations whenever new ones, i.e., from new frames, become available. Such a more gradual accumulation of information from both passes breaks the precise correspondence between gradients and activations, leading to theoretically more noisy weight updates. Counter-intuitively, we show that Sideways training of deep convolutional video networks not only still converges, but can also potentially exhibit better generalization compared to standard synchronized backpropagation.
Submitted 30 March, 2020; v1 submitted 17 January, 2020;
originally announced January 2020.
-
Controllable Attention for Structured Layered Video Decomposition
Authors:
Jean-Baptiste Alayrac,
João Carreira,
Relja Arandjelović,
Andrew Zisserman
Abstract:
The objective of this paper is to be able to separate a video into its natural layers, and to control which of the separated layers to attend to. For example, to be able to separate reflections, transparency or object motion. We make the following three contributions: (i) we introduce a new structured neural network architecture that explicitly incorporates layers (as spatial masks) into its design. This improves separation performance over previous general purpose networks for this task; (ii) we demonstrate that we can augment the architecture to leverage external cues such as audio for controllability and to help disambiguation; and (iii) we experimentally demonstrate the effectiveness of our approach and training procedure with controlled experiments while also showing that the proposed model can be successfully applied to real-world applications such as reflection removal and action recognition in cluttered scenes.
Submitted 24 October, 2019;
originally announced October 2019.
-
Nanoscale magnetic and charge anisotropies at manganite interfaces
Authors:
Santiago J. Carreira,
Myriam H. Aguirre,
Javier Briatico,
Laura B. Steren
Abstract:
Strongly correlated manganites are still under intense research owing to their complex phase diagrams in terms of the Sr-doping and their sensitivity to intrinsic and extrinsic structural deformations. Here, we performed x-ray absorption spectroscopy measurements of manganite bilayers to explore the effects that a local Sr-doping gradient produces on the charge and antiferromagnetic anisotropies. In order to gradually tune the Sr-doping level along the axis perpendicular to the samples, we have grown a series of bilayers with different thicknesses of low-doped manganites (from 0 nm to 6 nm) deposited over a La0.7Sr0.3MnO3 metallic layer. This strategy permitted us to resolve with high accuracy the thickness region where the charge and spin anisotropies vary and the critical thickness tc over which the out-of-plane orbital asymmetry does not have any further modifications. We found that the antiferromagnetic spin axis points preferentially out of the sample plane regardless of the capping layer thickness. However, it tilts partially into the sample plane far from this critical thickness, owing to the joint contributions of the external structural strain and electron doping. Furthermore, we found that the doping level of the capping layer sensibly affects the critical thickness, giving clear evidence of the influence exerted by the electron doping on the orbital and magnetic configurations. These anisotropy changes induce subtle modifications on the domain reorientation of the La0.7Sr0.3MnO3, as evidenced from the magnetic hysteresis cycles.
Submitted 3 August, 2019;
originally announced August 2019.
-
A Short Note on the Kinetics-700 Human Action Dataset
Authors:
Joao Carreira,
Eric Noland,
Chloe Hillier,
Andrew Zisserman
Abstract:
We describe an extension of the DeepMind Kinetics human action dataset from 600 classes to 700 classes, where for each class there are at least 600 video clips from different YouTube videos. This paper details the changes introduced for this new release of the dataset, and includes a comprehensive set of statistics as well as baseline results using the I3D neural network architecture.
Submitted 17 October, 2022; v1 submitted 15 July, 2019;
originally announced July 2019.
-
Cloud Programming Simplified: A Berkeley View on Serverless Computing
Authors:
Eric Jonas,
Johann Schleier-Smith,
Vikram Sreekanti,
Chia-Che Tsai,
Anurag Khandelwal,
Qifan Pu,
Vaishaal Shankar,
Joao Carreira,
Karl Krauth,
Neeraja Yadwadkar,
Joseph E. Gonzalez,
Raluca Ada Popa,
Ion Stoica,
David A. Patterson
Abstract:
Serverless cloud computing handles virtually all the system administration operations needed to make it easier for programmers to use the cloud. It provides an interface that greatly simplifies cloud programming, and represents an evolution that parallels the transition from assembly language to high-level programming languages. This paper gives a quick history of cloud computing, including an accounting of the predictions of the 2009 Berkeley View of Cloud Computing paper, explains the motivation for serverless computing, describes applications that stretch the current limits of serverless, and then lists obstacles and research opportunities required for serverless computing to fulfill its full potential. Just as the 2009 paper identified challenges for the cloud and predicted they would be addressed and that cloud use would accelerate, we predict these issues are solvable and that serverless computing will grow to dominate the future of cloud computing.
Submitted 9 February, 2019;
originally announced February 2019.
-
Video Action Transformer Network
Authors:
Rohit Girdhar,
João Carreira,
Carl Doersch,
Andrew Zisserman
Abstract:
We introduce the Action Transformer model for recognizing and localizing human actions in video clips. We repurpose a Transformer-style architecture to aggregate features from the spatiotemporal context around the person whose actions we are trying to classify. We show that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others. Additionally its attention mechanism learns to emphasize hands and faces, which are often crucial to discriminate an action - all without explicit supervision other than boxes and class labels. We train and test our Action Transformer network on the Atomic Visual Actions (AVA) dataset, outperforming the state-of-the-art by a significant margin using only raw RGB frames as input.
Submitted 17 May, 2019; v1 submitted 6 December, 2018;
originally announced December 2018.
-
The Visual Centrifuge: Model-Free Layered Video Representations
Authors:
Jean-Baptiste Alayrac,
João Carreira,
Andrew Zisserman
Abstract:
True video understanding requires making sense of non-Lambertian scenes where the color of light arriving at the camera sensor encodes information about not just the last object it collided with, but about multiple mediums -- colored windows, dirty mirrors, smoke or rain. Layered video representations have the potential of accurately modelling realistic scenes but have so far required stringent assumptions on motion, lighting and shape. Here we propose a learning-based approach for multi-layered video representation: we introduce novel uncertainty-capturing 3D convolutional architectures and train them to separate blended videos. We show that these models then generalize to single videos, where they exhibit interesting abilities: color constancy, factoring out shadows and separating reflections. We present quantitative and qualitative results on real-world videos.
Submitted 4 April, 2019; v1 submitted 4 December, 2018;
originally announced December 2018.
-
A Short Note about Kinetics-600
Authors:
Joao Carreira,
Eric Noland,
Andras Banki-Horvath,
Chloe Hillier,
Andrew Zisserman
Abstract:
We describe an extension of the DeepMind Kinetics human action dataset from 400 classes, each with at least 400 video clips, to 600 classes, each with at least 600 video clips. In order to scale up the dataset we changed the data collection process so it uses multiple queries per class, with some of them in a language other than English -- Portuguese. This paper details the changes between the two versions of the dataset and includes a comprehensive set of statistics of the new version as well as baseline results using the I3D neural network architecture. The paper is a companion to the release of the ground truth labels for the public test set.
Submitted 3 August, 2018;
originally announced August 2018.
-
A Better Baseline for AVA
Authors:
Rohit Girdhar,
João Carreira,
Carl Doersch,
Andrew Zisserman
Abstract:
We introduce a simple baseline for action localization on the AVA dataset. The model builds upon the Faster R-CNN bounding box detection framework, adapted to operate on pure spatiotemporal features - in our case produced exclusively by an I3D model pretrained on Kinetics. This model obtains 21.9% average AP on the validation set of AVA v2.1, up from 14.5% for the best RGB spatiotemporal model used in the original AVA paper (which was pretrained on Kinetics and ImageNet), and up from 11.3% for the publicly available baseline using a ResNet101 image feature extractor that was pretrained on ImageNet. Our final model obtains 22.8%/21.9% mAP on the val/test sets and outperforms all submissions to the AVA challenge at CVPR 2018.
Submitted 26 July, 2018;
originally announced July 2018.
-
Massively Parallel Video Networks
Authors:
Joao Carreira,
Viorica Patraucean,
Laurent Mazare,
Andrew Zisserman,
Simon Osindero
Abstract:
We introduce a class of causal video understanding models that aims to improve efficiency of video processing by maximising throughput, minimising latency, and reducing the number of clock cycles. Leveraging operation pipelining and multi-rate clocks, these models perform a minimal amount of computation (e.g. as few as four convolutional layers) for each frame per timestep to produce an output. The models are still very deep, with dozens of such operations being performed but in a pipelined fashion that enables depth-parallel computation. We illustrate the proposed principles by applying them to existing image architectures and analyse their behaviour on two video tasks: action recognition and human keypoint localisation. The results show that a significant degree of parallelism, and implicitly speedup, can be achieved with little loss in performance.
Submitted 5 September, 2018; v1 submitted 11 June, 2018;
originally announced June 2018.
-
Tuning the Interfacial Charge, Orbital and Spin Polarization Properties in La0.67Sr0.33MnO3/ La1-xSrxMnO3 Bilayers
Authors:
Santiago J. Carreira,
Myriam H. Aguirre,
Javier Briatico,
Eugen Weschke,
Laura B. Steren
Abstract:
The possibility of controlling the interfacial properties of artificial oxide heterostructures is still attracting researchers in the field of materials engineering. Here, we used surface sensitive techniques and high-resolution transmission electron microscopy to investigate the evolution of the surface spin-polarization and lattice strains across the interfaces between La0.66Sr0.33MnO3 thin films and low-doped manganites as capping layers. We have been able to finely tune the interfacial spin-polarization by changing the capping layer thickness and composition. The spin-polarization was found to be highest at a critical capping thickness that depends on the Sr doping. We explain the non-trivial magnetic profile by the combined effect of two mechanisms. On one hand, the extra carriers supplied by the low-doped manganites that tend to compensate the overdoped interface, favouring locally a ferromagnetic double-exchange coupling. On the other hand, the evolution from a tensile-strained structure of the inner layers to a compressed structure at the surface that changes gradually the orbital occupation and hybridization of the 3d-Mn orbitals, being detrimental for the spin polarization. The finding of an intrinsic spin-polarization at the A-site cation observed in XMCD measurements reveals also the existence of a complex magnetic configuration at the interface, different from the magnetic phases observed at the inner layers.
Submitted 10 December, 2017;
originally announced December 2017.
-
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
Authors:
Joao Carreira,
Andrew Zisserman
Abstract:
The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks. This paper re-evaluates state-of-the-art architectures in light of the new Kinetics Human Action Video dataset. Kinetics has two orders of magnitude more data, with 400 human action classes and over 400 clips per class, and is collected from realistic, challenging YouTube videos. We provide an analysis on how current architectures fare on the task of action classification on this dataset and how much performance improves on the smaller benchmark datasets after pre-training on Kinetics.
We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on 2D ConvNet inflation: filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, making it possible to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and even their parameters. We show that, after pre-training on Kinetics, I3D models considerably improve upon the state-of-the-art in action classification, reaching 80.9% on HMDB-51 and 98.0% on UCF-101.
Submitted 12 February, 2018; v1 submitted 22 May, 2017;
originally announced May 2017.
-
The Kinetics Human Action Video Dataset
Authors:
Will Kay,
Joao Carreira,
Karen Simonyan,
Brian Zhang,
Chloe Hillier,
Sudheendra Vijayanarasimhan,
Fabio Viola,
Tim Green,
Trevor Back,
Paul Natsev,
Mustafa Suleyman,
Andrew Zisserman
Abstract:
We describe the DeepMind Kinetics human action video dataset. The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10s and is taken from a different YouTube video. The actions are human focussed and cover a broad range of classes including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands. We describe the statistics of the dataset, how it was collected, and give some baseline performance figures for neural network architectures trained and tested for human action classification on this dataset. We also carry out a preliminary analysis of whether imbalance in the dataset leads to bias in the classifiers.
Submitted 19 May, 2017;
originally announced May 2017.
-
Shape and Symmetry Induction for 3D Objects
Authors:
Shubham Tulsiani,
Abhishek Kar,
Qixing Huang,
João Carreira,
Jitendra Malik
Abstract:
Actions as simple as grasping an object or navigating around it require a rich understanding of that object's 3D shape from a given viewpoint. In this paper we repurpose powerful learning machinery, originally developed for object classification, to discover image cues relevant for recovering the 3D shape of potentially unfamiliar objects. We cast the problem as one of local prediction of surface normals and global detection of 3D reflection symmetry planes, which open the door for extrapolating occluded surfaces from visible ones. We demonstrate that our method is able to recover accurate 3D shape information for classes of objects it was not trained on, in both synthetic and real images.
Submitted 24 November, 2015; v1 submitted 24 November, 2015;
originally announced November 2015.
-
Amodal Completion and Size Constancy in Natural Scenes
Authors:
Abhishek Kar,
Shubham Tulsiani,
João Carreira,
Jitendra Malik
Abstract:
We consider the problem of enriching current object detection systems with veridical object sizes and relative depth estimates from a single image. There are several technical challenges to this, such as occlusions, lack of calibration data and the scale ambiguity between object size and distance. These have not been addressed in full generality in previous work. Here we propose to tackle these issues by building upon advances in object recognition and using recently created large-scale datasets. We first introduce the task of amodal bounding box completion, which aims to infer the full extent of the object instances in the image. We then propose a probabilistic framework for learning category-specific object size distributions from available annotations and leverage these in conjunction with amodal completion to infer veridical sizes in novel images. Finally, we introduce a focal length prediction approach that exploits scene recognition to overcome inherent scaling ambiguities and we demonstrate qualitative results on challenging real-world scenes.
Submitted 1 October, 2015; v1 submitted 27 September, 2015;
originally announced September 2015.
-
Human Pose Estimation with Iterative Error Feedback
Authors:
Joao Carreira,
Pulkit Agrawal,
Katerina Fragkiadaki,
Jitendra Malik
Abstract:
Hierarchical feature extractors such as Convolutional Networks (ConvNets) have achieved impressive performance on a variety of classification tasks using purely feedforward processing. Feedforward architectures can learn rich representations of the input space but do not explicitly model dependencies in the output spaces, which are quite structured for tasks such as articulated human pose estimation or object segmentation. Here we propose a framework that expands the expressive power of hierarchical feature extractors to encompass both input and output spaces, by introducing top-down feedback. Instead of directly predicting the outputs in one go, we use a self-correcting model that progressively changes an initial solution by feeding back error predictions, in a process we call Iterative Error Feedback (IEF). IEF shows excellent performance on the task of articulated pose estimation in the challenging MPII and LSP benchmarks, matching the state-of-the-art without requiring ground truth scale annotation.
Submitted 12 June, 2016; v1 submitted 23 July, 2015;
originally announced July 2015.
-
Learning to See by Moving
Authors:
Pulkit Agrawal,
Joao Carreira,
Jitendra Malik
Abstract:
The dominant paradigm for feature learning in computer vision relies on training neural networks for the task of object recognition using millions of hand-labelled images. Is it possible to learn useful features for a diverse set of visual tasks using any other form of supervision? In biology, living organisms developed the ability of visual perception for the purpose of moving and acting in the world. Drawing inspiration from this observation, in this work we investigate if the awareness of egomotion can be used as a supervisory signal for feature learning. As opposed to the knowledge of class labels, information about egomotion is freely available to mobile agents. We show that given the same number of training images, features learnt using egomotion as supervision compare favourably to features learnt using class labels as supervision on visual tasks of scene recognition, object recognition, visual odometry and keypoint matching.
Submitted 14 September, 2015; v1 submitted 7 May, 2015;
originally announced May 2015.
-
Pose Induction for Novel Object Categories
Authors:
Shubham Tulsiani,
João Carreira,
Jitendra Malik
Abstract:
We address the task of predicting pose for objects of unannotated object categories from a small seed set of annotated object classes. We present a generalized classifier that can reliably induce pose given a single instance of a novel category. In case of availability of a large collection of novel instances, our approach then jointly reasons over all instances to improve the initial estimates. We empirically validate the various components of our algorithm and quantitatively show that our method produces reliable pose estimates. We also show qualitative results on a diverse set of classes and further demonstrate the applicability of our system for learning shape models of novel object classes.
Submitted 28 September, 2015; v1 submitted 30 April, 2015;
originally announced May 2015.
-
Lifting Object Detection Datasets into 3D
Authors:
Joao Carreira,
Sara Vicente,
Lourdes Agapito,
Jorge Batista
Abstract:
While data has certainly taken the center stage in computer vision in recent years, it can still be difficult to obtain in certain scenarios. In particular, acquiring ground truth 3D shapes of objects pictured in 2D images remains a challenging feat and this has hampered progress in recognition-based object reconstruction from a single image. Here we propose to bypass previous solutions such as 3D scanning or manual design, which scale poorly, and instead populate object category detection datasets semi-automatically with dense, per-object 3D reconstructions, bootstrapped from: (i) class labels, (ii) ground truth figure-ground segmentations and (iii) a small set of keypoint annotations. Our proposed algorithm first estimates camera viewpoint using rigid structure-from-motion and then reconstructs object shapes by optimizing over visual hull proposals guided by loose within-class shape similarity assumptions. The visual hull sampling process attempts to intersect an object's projection cone with the cones of minimal subsets of other similar objects among those pictured from certain vantage points. We show that our method is able to produce convincing per-object 3D reconstructions and to accurately estimate camera viewpoints on one of the most challenging existing object-category detection datasets, PASCAL VOC. We hope that our results will re-stimulate interest in joint object recognition and 3D reconstruction from a single image.
Submitted 31 July, 2016; v1 submitted 22 March, 2015;
originally announced March 2015.
-
Virtual View Networks for Object Reconstruction
Authors:
João Carreira,
Abhishek Kar,
Shubham Tulsiani,
Jitendra Malik
Abstract:
All that structure from motion algorithms "see" are sets of 2D points. We show that these impoverished views of the world can be faked for the purpose of reconstructing objects in challenging settings, such as from a single image, or from a few ones far apart, by recognizing the object and getting help from a collection of images of other objects from the same class. We synthesize virtual views by computing geodesics on novel networks connecting objects with similar viewpoints, and introduce techniques to increase the specificity and robustness of factorization-based object reconstruction in this setting. We report accurate object shape reconstruction from a single image on challenging PASCAL VOC data, which suggests that the current domain of applications of rigid structure-from-motion techniques may be significantly extended.
Submitted 22 November, 2014;
originally announced November 2014.
-
Category-Specific Object Reconstruction from a Single Image
Authors:
Abhishek Kar,
Shubham Tulsiani,
João Carreira,
Jitendra Malik
Abstract:
Object reconstruction from a single image -- in the wild -- is a problem where we can make progress and get meaningful results today. This is the main message of this paper, which introduces an automated pipeline with pixels as inputs and 3D surfaces of various rigid categories as outputs in images of realistic scenes. At the core of our approach are deformable 3D models that can be learned from 2D annotations available in existing object detection datasets, that can be driven by noisy automatic object segmentations and which we complement with a bottom-up module for recovering high-frequency shape details. We perform a comprehensive quantitative analysis and ablation study of our approach using the recently introduced PASCAL 3D+ dataset and show very encouraging automatic reconstructions on PASCAL VOC.
Submitted 6 May, 2015; v1 submitted 21 November, 2014;
originally announced November 2014.
-
Vortex pinning vs superconducting wire network: origin of periodic oscillations induced by applied magnetic fields in superconducting films with arrays of nanomagnets
Authors:
A. Gomez,
J. del Valle,
E. M. Gonzalez,
C. E. Chiliotte,
S. J. Carreira,
V. Bekeris,
J. L. Prieto,
Ivan K. Schuller,
J. L. Vicent
Abstract:
Hybrid magnetic arrays embedded in superconducting films are ideal systems to study the competition between different physical length scales (such as the coherence length) and structural length scales (such as those available in artificially produced structures). This interplay leads to oscillations in many magnetically dependent superconducting properties such as the critical currents, resistivity and magnetization. These effects are generally analyzed using two distinct models based on vortex pinning or wire network. In this work, we show that for magnetic dot arrays, as opposed to antidot (i.e., hole) arrays, vortex pinning is the main mechanism for field-induced oscillations in resistance R(H), critical current Ic(H), magnetization M(H) and ac-susceptibility Xac(H) in a broad temperature range. Due to the coherence length divergence at Tc, a crossover to wire network behavior is experimentally found. While pinning occurs in a wide temperature range up to Tc, wire network behavior is only present in a very narrow temperature window close to Tc. In this temperature interval, contributions from both mechanisms are operational but can be experimentally distinguished.
Submitted 20 May, 2014; v1 submitted 26 November, 2013;
originally announced November 2013.