Search | arXiv e-print repository

arXiv:1901.04437 [pdf, ps, other]

doi 10.1080/00268976.2019.1622811

Quantum-State-Specific Reaction Rate Measurements for the Photo-induced Reaction Ca$^+$ + O$_2$ $\rightarrow$ CaO$^+$ + O

Authors: Philipp C. Schmid, Mikhail I. Miller, James Greenberg, Thanh L. Nguyen, John F. Stanton, H. J. Lewandowski

Abstract: Atoms and molecules often react at different rates depending on their internal quantum states. Thus, controlling which internal states are populated can be used to manipulate the reactivity and can lead to a more detailed understanding of reaction mechanisms. We demonstrate this control of reactions by studying the excited state reaction Ca$^+$ + O$_2$ $\rightarrow$ CaO$^+$ + O. This reaction is exothermic only if Ca$^+$ is in one of its excited electronic states. Using laser-cooling and electrodynamic trapping, we cool and trap Ca$^+$ at millikelvin temperatures for several minutes. We can then change the fraction of time the ions spend in each of the two excited states by adjusting the detunings of the cooling lasers. This allows us to disentangle the reactions that begin with Ca$^+$ in the $^2$P$_{1/2}$-state from the ones where Ca$^+$ is in the $^2$D$_{3/2}$-state. Using time-of-flight mass spectrometry, we determine independent reaction rate constants for Ca$^+$ in both electronically excited quantum states.

Submitted 14 January, 2019; originally announced January 2019.

Comments: 12 pages, 5 figures
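
The abstract of arXiv:1901.04437 describes separating two state-specific rate constants by varying how much time the ion spends in each excited state. One simple way to picture that is a linear model, k_eff = f_P * k_P + f_D * k_D, fitted over several laser settings. The sketch below is an illustration under that assumed model only; the function name `fit_state_rates`, the model itself, and the numbers in the usage note are hypothetical and are not taken from the paper.

```python
def fit_state_rates(f_p, f_d, k_eff):
    """Least-squares fit of k_eff ~ f_p * k_P + f_d * k_D; returns (k_P, k_D).

    f_p, f_d : assumed population fractions in the two excited states,
               one value per laser-detuning setting.
    k_eff    : assumed effective rate constant measured at each setting.
    """
    # Entries of the 2x2 normal-equation matrix A^T A and vector A^T b.
    a11 = sum(p * p for p in f_p)
    a12 = sum(p * d for p, d in zip(f_p, f_d))
    a22 = sum(d * d for d in f_d)
    b1 = sum(p * k for p, k in zip(f_p, k_eff))
    b2 = sum(d * k for d, k in zip(f_d, k_eff))

    det = a11 * a22 - a12 * a12
    if abs(det) <= 1e-12 * a11 * a22:
        # The two populations vary together, so the rates cannot be separated.
        raise ValueError("population fractions are (nearly) collinear")

    # Cramer's rule for the 2x2 system.
    k_p = (b1 * a22 - b2 * a12) / det
    k_d = (a11 * b2 - a12 * b1) / det
    return k_p, k_d
```

With made-up inputs such as `fit_state_rates([0.1, 0.2, 0.3], [0.3, 0.2, 0.1], [0.35, 0.5, 0.65])`, the fit recovers the two rates that generated the synthetic data. The collinearity check reflects why the detunings have to change the two populations differently for the separation to work.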

arXiv:1901.01342 [pdf, other]

AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection

Authors: Joseph Roth, Sourish Chaudhuri, Ondrej Klejch, Radhika Marvin, Andrew Gallagher, Liat Kaver, Sharadh Ramaswamy, Arkadiusz Stopczynski, Cordelia Schmid, Zhonghua Xi, Caroline Pantofaru

Abstract: Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual dataset for this task has constrained algorithm evaluations with respect to data diversity, environments, and accuracy. This has made comparisons and improvements difficult. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker) that will be released publicly to facilitate algorithm development and enable comparisons. The dataset contains temporally labeled face tracks in video, where each face instance is labeled as speaking or not, and whether the speech is audible. This dataset contains about 3.65 million human-labeled frames or about 38.5 hours of face tracks, and the corresponding audio. We also present a new audio-visual approach for active speaker detection, and analyze its performance, demonstrating both its strength and the contributions of the dataset.

Submitted 24 May, 2019; v1 submitted 4 January, 2019; originally announced January 2019.

arXiv:1901.01091 [pdf, other]

Adaptive Density Estimation for Generative Models

Authors: Thomas Lucas, Konstantin Shmelkov, Karteek Alahari, Cordelia Schmid, Jakob Verbeek

Abstract: Unsupervised learning of generative models has seen tremendous progress over recent years, in particular due to generative adversarial networks (GANs), variational autoencoders, and flow-based models. GANs have dramatically improved sample quality, but suffer from two drawbacks: (i) they mode-drop, i.e., do not cover the full support of the train data, and (ii) they do not allow for likelihood evaluations on held-out data. In contrast, likelihood-based training encourages models to cover the full support of the train data, but yields poorer samples. These mutual shortcomings can in principle be addressed by training generative latent variable models in a hybrid adversarial-likelihood manner. However, we show that commonly made parametric assumptions create a conflict between them, making successful hybrid models non-trivial. As a solution, we propose to use deep invertible transformations in the latent variable decoder. This approach allows for likelihood computations in image space, is more efficient than fully invertible models, and can take full advantage of adversarial training. We show that our model significantly improves over existing hybrid models: offering GAN-like samples, IS and FID scores that are competitive with fully adversarial models, and improved likelihood scores.

Submitted 3 January, 2020; v1 submitted 4 January, 2019; originally announced January 2019.

arXiv:1812.07673 [pdf, other]

Active learning for efficiently training emulators of computationally expensive mathematical models

Authors: Alexandra G. Ellis, Rowan Iskandar, Christopher H. Schmid, John B. Wong, Thomas A. Trikalinos

Abstract: An emulator is a fast-to-evaluate statistical approximation of a detailed mathematical model (simulator). When used in lieu of simulators, emulators can expedite tasks that require many repeated evaluations, such as sensitivity analyses, policy optimization, model calibration, and value-of-information analyses. Emulators are developed using the output of simulators at specific input values (design points). Developing an emulator that closely approximates the simulator can require many design points, which becomes computationally expensive. We describe a self-terminating active learning algorithm to efficiently develop emulators tailored to a specific emulation task, and compare it with algorithms that optimize geometric criteria (random Latin hypercube sampling and maximum projection designs) and other active learning algorithms (treed Gaussian Processes that optimize typical active learning criteria). We compared the algorithms' root mean square error (RMSE) and maximum absolute deviation from the simulator (MAX) for seven benchmark functions and in a prostate cancer screening model. In the empirical analyses, in simulators with greatly varying smoothness over the input domain, active learning algorithms resulted in emulators with smaller RMSE and MAX for the same number of design points. In all other cases, all algorithms performed comparably. The proposed algorithm attained satisfactory performance in all analyses, had smaller variability than the treed Gaussian Processes (it is deterministic), and, on average, performed similarly to or better than the treed Gaussian Processes in 6 out of 7 benchmark functions and in the prostate cancer model.

Submitted 3 January, 2020; v1 submitted 18 December, 2018; originally announced December 2018.

Comments: Counting appendix materials: 31 pages, 7 Figures, 3 Tables
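
The abstract of arXiv:1812.07673 compares emulators by root mean square error (RMSE) and maximum absolute deviation from the simulator (MAX). As a minimal sketch of those two standard metrics, assuming simulator and emulator outputs evaluated at the same test inputs (the function names below are illustrative, not from the paper):

```python
import math


def rmse(simulator_out, emulator_out):
    """Root mean square error between paired simulator and emulator outputs."""
    squared = [(s - e) ** 2 for s, e in zip(simulator_out, emulator_out)]
    return math.sqrt(sum(squared) / len(squared))


def max_abs_deviation(simulator_out, emulator_out):
    """Largest absolute difference between paired outputs (the MAX criterion)."""
    return max(abs(s - e) for s, e in zip(simulator_out, emulator_out))
```

RMSE summarizes the average fit over the input domain, while MAX exposes the single worst point, which is why the two can rank emulators differently when the simulator's smoothness varies across the domain.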

arXiv:1812.05736 [pdf, other]

Detecting unseen visual relations using analogies

Authors: Julia Peyre, Ivan Laptev, Cordelia Schmid, Josef Sivic

Abstract: We seek to detect visual relations in images of the form of triplets t = (subject, predicate, object), such as "person riding dog", where training examples of the individual entities are available but their combinations are unseen at training. This is an important set-up due to the combinatorial nature of visual relations: collecting sufficient training data for all possible triplets would be very hard. The contributions of this work are three-fold. First, we learn a representation of visual relations that combines (i) individual embeddings for subject, object and predicate together with (ii) a visual phrase embedding that represents the relation triplet. Second, we learn how to transfer visual phrase embeddings from existing training triplets to unseen test triplets using analogies between relations that involve similar objects. Third, we demonstrate the benefits of our approach on three challenging datasets: on HICO-DET, our model achieves significant improvement over a strong baseline for both frequent and unseen triplets, and we observe similar improvement for the retrieval of unseen triplets with out-of-vocabulary predicates on the COCO-a dataset as well as the challenging unusual triplets in the UnRel dataset.

Submitted 22 September, 2019; v1 submitted 13 December, 2018; originally announced December 2018.

arXiv:1812.03544 [pdf, other]

A Structured Model For Action Detection

Authors: Yubo Zhang, Pavel Tokmakov, Martial Hebert, Cordelia Schmid

Abstract: A dominant paradigm for learning-based approaches in computer vision is training generic models, such as ResNet for image recognition, or I3D for video understanding, on large datasets and allowing them to discover the optimal representation for the problem at hand. While this is an obviously attractive approach, it is not applicable in all scenarios. We claim that action detection is one such challenging problem: the models that need to be trained are large, and labeled data is expensive to obtain. To address this limitation, we propose to incorporate domain knowledge into the structure of the model, simplifying optimization. In particular, we augment a standard I3D network with a tracking module to aggregate long-term motion patterns, and use a graph convolutional network to reason about interactions between actors and objects. Evaluated on the challenging AVA dataset, the proposed approach improves over the I3D baseline by 5.5% mAP and over the state-of-the-art by 4.8% mAP.

Submitted 5 June, 2019; v1 submitted 9 December, 2018; originally announced December 2018.

arXiv:1812.00025 [pdf, other]

Modulated Policy Hierarchies

Authors: Alexander Pashevich, Danijar Hafner, James Davidson, Rahul Sukthankar, Cordelia Schmid

Abstract: Solving tasks with sparse rewards is a main challenge in reinforcement learning. While hierarchical controllers are an intuitive approach to this problem, current methods often require manual reward shaping, alternating training phases, or manually defined subtasks. We introduce modulated policy hierarchies (MPH), which can learn end-to-end to solve tasks from sparse rewards. To achieve this, we study different modulation signals and exploration for hierarchical controllers. Specifically, we find that communicating via bit-vectors is more efficient than selecting one out of multiple skills, as it enables mixing between them. To facilitate exploration, MPH uses its different time scales for temporally extended intrinsic motivation at each level of the hierarchy. We evaluate MPH on the robotics tasks of pushing and sparse block stacking, where it outperforms recent baselines.

Submitted 30 November, 2018; originally announced December 2018.

Comments: 8 pages, 5 figures

arXiv:1809.06396 [pdf, other]

Déjà Vu: an empirical evaluation of the memorization properties of ConvNets

Authors: Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou

Abstract: Convolutional neural networks memorize part of their training data, which is why strategies such as data augmentation and drop-out are employed to mitigate overfitting. This paper considers the related question of "membership inference", where the goal is to determine if an image was used during training. We consider it from three complementary angles. We show how to detect which dataset was used to train a model, and in particular whether some validation images were used at train time. We then analyze explicit memorization and extend classical random label experiments to the problem of learning a model that predicts if an image belongs to an arbitrary set. Finally, we propose a new approach to infer membership when a few of the top layers are not available or have been fine-tuned, and show that lower layers still carry information about the training samples. To support our findings, we conduct large-scale experiments on ImageNet and subsets of YFCC-100M with modern architectures such as VGG and ResNet.

Submitted 17 September, 2018; originally announced September 2018.

arXiv:1809.02492 [pdf, other]

On the Importance of Visual Context for Data Augmentation in Scene Understanding

Authors: Nikita Dvornik, Julien Mairal, Cordelia Schmid

Abstract: Performing data augmentation for learning deep neural networks is known to be important for training visual recognition systems. By artificially increasing the number of training examples, it helps reduce overfitting and improves generalization. While simple image transformations can already improve predictive performance in most vision tasks, larger gains can be obtained by leveraging task-specific prior knowledge. In this work, we consider object detection, semantic and instance segmentation and augment the training images by blending objects in existing scenes, using instance segmentation annotations. We observe that randomly pasting objects on images hurts the performance, unless the object is placed in the right context. To resolve this issue, we propose an explicit context model by using a convolutional neural network, which predicts whether an image region is suitable for placing a given object or not. In our experiments, we show that our approach is able to improve object detection, semantic and instance segmentation on the PASCAL VOC12 and COCO datasets, with significant gains in a limited annotation scenario, i.e. when only one category is annotated. We also show that the method is not limited to datasets that come with expensive pixel-wise instance annotations and can be used when only bounding boxes are available, by employing weakly-supervised learning for instance mask approximation.

Submitted 19 September, 2019; v1 submitted 6 September, 2018; originally announced September 2018.

Comments: Updated the experimental section. arXiv admin note: substantial text overlap with arXiv:1807.07428

arXiv:1807.10982 [pdf, other]

Actor-Centric Relation Network

Authors: Chen Sun, Abhinav Shrivastava, Carl Vondrick, Kevin Murphy, Rahul Sukthankar, Cordelia Schmid

Abstract: Current state-of-the-art approaches for spatio-temporal action localization rely on detections at the frame level and model temporal context with 3D ConvNets. Here, we go one step further and model spatio-temporal relations to capture the interactions between human actors, relevant objects and scene elements essential to differentiate similar human actions. Our approach is weakly supervised and mines the relevant elements automatically with an actor-centric relational network (ACRN). ACRN computes and accumulates pair-wise relation information from actor and global scene features, and generates relation features for action classification. It is implemented as neural networks and can be trained jointly with an existing action detection system. We show that ACRN outperforms alternative approaches which capture relation information, and that the proposed framework improves upon the state-of-the-art performance on JHMDB and AVA. A visualization of the learned relation features confirms that our approach is able to attend to the relevant relations for each action.

Submitted 28 July, 2018; originally announced July 2018.

Comments: ECCV 2018 camera ready

arXiv:1807.09536 [pdf, other]

End-to-End Incremental Learning

Authors: Francisco M. Castro, Manuel J. Marín-Jiménez, Nicolás Guil, Cordelia Schmid, Karteek Alahari

Abstract: Although deep learning approaches have stood out in recent years due to their state-of-the-art results, they continue to suffer from catastrophic forgetting, a dramatic decrease in overall performance when training with new classes added incrementally. This is due to current neural network architectures requiring the entire dataset, consisting of all the samples from the old as well as the new classes, to update the model, a requirement that easily becomes unsustainable as the number of classes grows. We address this issue with our approach to learn deep neural networks incrementally, using new data and only a small exemplar set corresponding to samples from the old classes. This is based on a loss composed of a distillation measure to retain the knowledge acquired from the old classes, and a cross-entropy loss to learn the new classes. Our incremental training is achieved while keeping the entire framework end-to-end, i.e., learning the data representation and the classifier jointly, unlike recent methods with no such guarantees. We evaluate our method extensively on the CIFAR-100 and ImageNet (ILSVRC 2012) image classification datasets, and show state-of-the-art performance.

Submitted 3 September, 2018; v1 submitted 25 July, 2018; originally announced July 2018.

Comments: To appear in ECCV 2018
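
The abstract of arXiv:1807.09536 describes a loss that combines a distillation term (to retain old-class knowledge) with a cross-entropy term (to learn new classes). The sketch below shows one common way such a combination is written for a single sample. It is an assumption-laden illustration, not the paper's formulation: the temperature value, the `weight` factor, the choice to distill only over the old-class logits, and all function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Standard cross-entropy of the predicted distribution against a class index."""
    return -math.log(softmax(logits)[label])


def distillation(new_logits_old_classes, old_model_logits, temperature=2.0):
    """Cross-entropy between softened outputs of the old and the updated model."""
    p_old = softmax(old_model_logits, temperature)
    p_new = softmax(new_logits_old_classes, temperature)
    return -sum(po * math.log(pn) for po, pn in zip(p_old, p_new))


def incremental_loss(logits, label, old_model_logits, temperature=2.0, weight=1.0):
    """Classification loss over all classes plus distillation over the old ones.

    Assumes the first len(old_model_logits) entries of `logits` correspond
    to the old classes, in the same order as in the previous model.
    """
    n_old = len(old_model_logits)
    return cross_entropy(logits, label) + weight * distillation(
        logits[:n_old], old_model_logits, temperature
    )
```

The distillation term is smallest when the updated model reproduces the old model's softened predictions on the old classes, which is the sense in which it "retains" earlier knowledge while the cross-entropy term fits the new labels.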

arXiv:1807.09499 [pdf, other]

How good is my GAN?

Authors: Konstantin Shmelkov, Cordelia Schmid, Karteek Alahari

Abstract: Generative adversarial networks (GANs) are one of the most popular methods for generating images today. While impressive results have been validated by visual inspection, a number of quantitative criteria have emerged only recently. We argue here that the existing ones are insufficient and need to be suited to the task at hand. In this paper we introduce two measures based on image classification, GAN-train and GAN-test, which approximate the recall (diversity) and precision (quality of the image) of GANs respectively. We evaluate a number of recent GAN approaches based on these two measures and demonstrate a clear difference in performance. Furthermore, we observe that the increasing difficulty of the dataset, from CIFAR10 over CIFAR100 to ImageNet, shows an inverse correlation with the quality of the GANs, as clearly evident from our measures.

Submitted 25 July, 2018; originally announced July 2018.

Comments: Accepted to ECCV2018

arXiv:1807.07428 [pdf, other]

Modeling Visual Context is Key to Augmenting Object Detection Datasets

Authors: Nikita Dvornik, Julien Mairal, Cordelia Schmid

Abstract: Performing data augmentation for learning deep neural networks is well known to be important for training visual recognition systems. By artificially increasing the number of training examples, it helps reduce overfitting and improves generalization. For object detection, classical approaches for data augmentation consist of generating images obtained by basic geometrical transformations and color changes of original training images. In this work, we go one step further and leverage segmentation annotations to increase the number of object instances present in the training data. For this approach to be successful, we show that appropriately modeling the visual context surrounding objects is crucial to place them in the right environment. Otherwise, we show that the previous strategy actually hurts. With our context model, we achieve significant mean average precision improvements when few labeled examples are available on the VOC'12 benchmark.

Submitted 19 July, 2018; originally announced July 2018.

Journal ref: ECCV2018, Sep 2018, Munich, Germany. 2018

arXiv:1807.02616 [pdf, other]

Effects of Predictive Real-Time Traffic Signal Information

Authors: Vadim Sokolov, David W. Etherington, Christian Schmid, Dominik Karbowski, Aymeric Rousseau, Muhammad Imran

Abstract: This paper analyzes the impact of providing car drivers with predictive information on traffic signal timing in real-time, including time-to-green and green-wave speed recommendations. Over a period of six months, the behavior of 121 drivers in everyday urban driving was analyzed with and without access to live traffic signal information. In a first period, drivers had the information-providing service disabled in order to establish a baseline behavior; after that initial phase, the service was activated. In both cases, data from smartphone and vehicle sensors was collected, including speed, acceleration, fuel rate, acceleration and brake pedal positions. We estimated the changes in the driving behavior which result from drivers receiving the traffic signal timing information by carefully comparing distributions of acceleration/deceleration patterns through statistical analysis. Our analysis demonstrates that there is a positive effect of providing traffic signal timing information to the drivers.

Submitted 9 November, 2018; v1 submitted 7 July, 2018; originally announced July 2018.

arXiv:1806.11328 [pdf, other]

A flexible model for training action localization with varying levels of supervision

Authors: Guilhem Chéron, Jean-Baptiste Alayrac, Ivan Laptev, Cordelia Schmid

Abstract: Spatio-temporal action detection in videos is typically addressed in a fully-supervised setup with manual annotation of training videos required at every frame. Since such annotation is extremely tedious and prohibits scalability, there is a clear need to minimize the amount of manual supervision. In this work we propose a unifying framework that can handle and combine varying types of less-demanding weak supervision. Our model is based on discriminative clustering and integrates different types of supervision as constraints on the optimization. We investigate applications of such a model to training setups with alternative supervisory signals ranging from video-level class labels to the full per-frame annotation of action bounding boxes. Experiments on the challenging UCF101-24 and DALY datasets demonstrate competitive performance of our method at a fraction of the supervision used by previous methods. The flexibility of our model enables joint learning from data with different levels of annotation. Experimental results demonstrate a significant gain by adding a few fully supervised examples to otherwise weakly labeled videos.

Submitted 27 November, 2018; v1 submitted 29 June, 2018; originally announced June 2018.

arXiv:1806.11008 [pdf, other]

Modeling Spatio-Temporal Human Track Structure for Action Localization

Authors: Guilhem Chéron, Anton Osokin, Ivan Laptev, Cordelia Schmid

Abstract: This paper addresses spatio-temporal localization of human actions in video. In order to localize actions in time, we propose a recurrent localization network (RecLNet) designed to model the temporal structure of actions on the level of person tracks. Our model is trained to simultaneously recognize and localize action classes in time and is based on two-layer gated recurrent units (GRU) applied separately to two streams, i.e. appearance and optical flow streams. When used together with state-of-the-art person detection and tracking, our model is shown to substantially improve spatio-temporal action localization in videos. The gain is shown to be mainly due to improved temporal localization. We evaluate our method on two recent datasets for spatio-temporal action localization, UCF101-24 and DALY, demonstrating a significant improvement of the state of the art.

Submitted 28 June, 2018; originally announced June 2018.

arXiv:1806.08652 [pdf, other]

doi 10.1051/0004-6361/201732119

Synthetic simulations of the extragalactic sky seen by eROSITA. I. Pre-launch selection functions from Monte-Carlo simulations

Authors: N. Clerc, M. E. Ramos-Ceja, J. Ridl, G. Lamer, H. Brunner, F. Hofmann, J. Comparat, F. Pacaud, F. Käfer, T. H. Reiprich, A. Merloni, C. Schmid, T. Brand, J. Wilms, P. Friedrich, A. Finoguenov, T. Dauser, I. Kreykenbohm

Abstract: Studies of galaxy clusters provide stringent constraints on models of structure formation. Provided that selection effects are under control, large X-ray surveys are well suited to derive cosmological parameters, in particular those governing the dark energy equation of state. We forecast the capabilities of the all-sky eROSITA (the extended ROentgen Survey with an Imaging Telescope Array) survey to be achieved by the early 2020s. We bring special attention to modeling the entire chain from photon emission to source detection and cataloguing. The selection function of galaxy clusters for the upcoming eROSITA mission is investigated by means of extensive and dedicated Monte-Carlo simulations. Employing a combination of accurate instrument characterization and a state-of-the-art source detection technique, we determine a cluster detection efficiency based on the cluster fluxes and sizes. Using this eROSITA cluster selection function, we find that eROSITA will detect a total of $\sim 10^5$ clusters in the extragalactic sky. This number of clusters will allow eROSITA to put stringent constraints on cosmological models. We show that incomplete assumptions on selection effects, such as neglecting the distribution of cluster sizes, induce a bias in the derived value of cosmological parameters. Synthetic simulations of the eROSITA sky capture the essential characteristics impacting the next-generation galaxy cluster surveys and they highlight parameters requiring tight monitoring in order to avoid biases in cosmological analyses.

Submitted 22 June, 2018; originally announced June 2018.

Comments: Accepted in A&A. Image quality degraded for arXiv submission

Journal ref: A&A 617, A92 (2018)

arXiv:1806.03198 [pdf, other]

Spreading vectors for similarity search

Authors: Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou

Abstract: Discretizing multi-dimensional data distributions is a fundamental step of modern indexing methods. State-of-the-art techniques learn parameters of quantizers on training data for optimal performance, thus adapting quantizers to the data. In this work, we propose to reverse this paradigm and adapt the data to the quantizer: we train a neural net whose last layer forms a fixed parameter-free quantizer, such as pre-defined points of a hyper-sphere. As a proxy objective, we design and train a neural network that favors uniformity in the spherical latent space, while preserving the neighborhood structure after the mapping. We propose a new regularizer derived from the Kozachenko--Leonenko differential entropy estimator to enforce uniformity and combine it with a locality-aware triplet loss. Experiments show that our end-to-end approach outperforms most learned quantization methods, and is competitive with the state of the art on widely adopted benchmarks. Furthermore, we show that training without the quantization step results in almost no difference in accuracy, but yields a generic catalyzer that can be applied with any subsequent quantizer.

Submitted 30 August, 2019; v1 submitted 8 June, 2018; originally announced June 2018.

Comments: Published at ICLR 2019

arXiv:1805.11155 [pdf, other]

Unsupervised Learning of Artistic Styles with Archetypal Style Analysis

Authors: Daan Wynen, Cordelia Schmid, Julien Mairal

Abstract: In this paper, we introduce an unsupervised learning approach to automatically discover, summarize, and manipulate artistic styles from large collections of paintings. Our method is based on archetypal analysis, which is an unsupervised learning technique akin to sparse coding with a geometric interpretation. When applied to deep image representations from a collection of artworks, it learns a dictionary of archetypal styles, which can be easily visualized. After training the model, the style of a new image, which is characterized by local statistics of deep visual features, is approximated by a sparse convex combination of archetypes. This enables us to interpret which archetypal styles are present in the input image, and in which proportion. Finally, our approach allows us to manipulate the coefficients of the latent archetypal decomposition, and achieve various special effects such as style enhancement, transfer, and interpolation between multiple archetypes.

Submitted 2 October, 2018; v1 submitted 28 May, 2018; originally announced May 2018.

Comments: Accepted at NIPS 2018, Montréal, Canada

arXiv:1804.09627 [pdf, other]

Actor and Observer: Joint Modeling of First and Third-Person Videos

Authors: Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, Karteek Alahari

Abstract: Several theories in cognitive neuroscience suggest that when people interact with the world, or simulate interactions, they do so from a first-person egocentric perspective, and seamlessly transfer knowledge between third-person (observer) and first-person (actor). Despite this, learning such models for human action recognition has not been achievable due to the lack of data. This paper takes a step in this direction, with the introduction of Charades-Ego, a large-scale dataset of paired first-person and third-person videos, involving 112 people, with 4000 paired videos. This enables learning the link between the two, actor and observer perspectives. Thereby, we address one of the biggest bottlenecks facing egocentric vision research, providing a link from first-person to the abundant third-person data on the web. We use this data to learn a joint representation of first and third-person videos, with only weak supervision, and show its effectiveness for transferring knowledge from the third-person to the first-person domain.

Submitted 25 April, 2018; originally announced April 2018.

Comments: CVPR 2018 spotlight presentation

Journal ref: CVPR 2018

arXiv:1804.09626 [pdf, other]

Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos

Authors: Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, Karteek Alahari

Abstract: In Actor and Observer we introduced a dataset linking the first and third-person video understanding domains, the Charades-Ego Dataset. In this paper we describe the egocentric aspect of the dataset and present annotations for Charades-Ego with 68,536 activity instances in 68.8 hours of first and third-person video, making it one of the largest and most diverse egocentric datasets available. Charades-Ego furthermore shares activity classes, scripts, and methodology with the Charades dataset, which consists of an additional 82.3 hours of third-person video with 66,500 activity instances. Charades-Ego has temporal annotations and textual descriptions, making it suitable for egocentric video classification, localization, captioning, and new tasks utilizing the cross-modal nature of the data.

Submitted 30 April, 2018; v1 submitted 25 April, 2018; originally announced April 2018.

arXiv:1804.04875 [pdf, other]

BodyNet: Volumetric Inference of 3D Human Body Shapes

Authors: Gül Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, Cordelia Schmid

Abstract: Human shape estimation is an important task for video editing, animation and the fashion industry. Predicting 3D human body shape from natural images, however, is highly challenging due to factors such as variation in human bodies, clothing and viewpoint. Prior methods addressing this problem typically attempt to fit parametric body models with certain priors on pose and shape. In this work we argue for an alternative representation and propose BodyNet, a neural network for direct inference of volumetric body shape from a single image. BodyNet is an end-to-end trainable network that benefits from (i) a volumetric 3D loss, (ii) a multi-view re-projection loss, and (iii) intermediate supervision of 2D pose, 2D body part segmentation, and 3D pose. Each of them results in performance improvement as demonstrated by our experiments. To evaluate the method, we fit the SMPL model to our network output and show state-of-the-art results on the SURREAL and Unite the People datasets, outperforming recent approaches. Besides achieving state-of-the-art performance, our method also enables volumetric body-part segmentation.

Submitted 18 August, 2018; v1 submitted 13 April, 2018; originally announced April 2018.

Comments: Appears in: European Conference on Computer Vision 2018 (ECCV 2018). 27 pages

arXiv:1803.00455 [pdf, other]

doi 10.1109/TPAMI.2019.2892985

LCR-Net++: Multi-person 2D and 3D Pose Detection in Natural Images

Authors: Gregory Rogez, Philippe Weinzaepfel, Cordelia Schmid

Abstract: We propose an end-to-end architecture for joint 2D and 3D human pose estimation in natural images. Key to our approach is the generation and scoring of a number of pose proposals per image, which allows us to predict 2D and 3D poses of multiple people simultaneously. Hence, our approach does not require an approximate localization of the humans for initialization. Our Localization-Classification-Regression architecture, named LCR-Net, contains 3 main components: 1) the pose proposal generator that suggests candidate poses at different locations in the image; 2) a classifier that scores the different pose proposals; and 3) a regressor that refines pose proposals both in 2D and 3D. All three stages share the convolutional feature layers and are trained jointly. The final pose estimation is obtained by integrating over neighboring pose hypotheses, which is shown to improve over a standard non-maximum suppression algorithm. Our method recovers full-body 2D and 3D poses, hallucinating plausible body parts when the persons are partially occluded or truncated by the image boundary. Our approach significantly outperforms the state of the art in 3D pose estimation on Human3.6M, a controlled environment. Moreover, it shows promising results on real images for both single and multi-person subsets of the MPII 2D pose benchmark and demonstrates satisfying 3D pose results even for multi-person images.

Submitted 13 January, 2019; v1 submitted 1 March, 2018; originally announced March 2018.

Comments: journal version of the CVPR 2017 paper, accepted to appear in IEEE Trans. PAMI

arXiv:1802.04216 [pdf, other]

Image-based Synthesis for Deep 3D Human Pose Estimation

Authors: Grégory Rogez, Cordelia Schmid

Abstract: This paper addresses the problem of 3D human pose estimation in the wild. A significant challenge is the lack of training data, i.e., 2D images of humans annotated with 3D poses. Such data is necessary to train state-of-the-art CNN architectures. Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations. We introduce an image-based synthesis engine that artificially augments a dataset of real images with 2D human pose annotations using 3D motion capture data. Given a candidate 3D pose, our algorithm selects for each joint an image whose 2D pose locally matches the projected 3D pose. The selected images are then combined to generate a new synthetic image by stitching local image patches in a kinematically constrained manner. The resulting images are used to train an end-to-end CNN for full-body 3D pose estimation. We cluster the training data into a large number of pose classes and tackle pose estimation as a $K$-way classification problem. Such an approach is viable only with large training sets such as ours. Our method outperforms most of the published works in terms of 3D pose estimation in controlled environments (Human3.6M) and shows promising results for real-world images (LSP). This demonstrates that CNNs trained on artificial images generalize well to real images. Compared to data generated from more classical rendering engines, our synthetic images do not require any domain adaptation or fine-tuning stage.

Submitted 12 February, 2018; originally announced February 2018.

Comments: accepted to appear in IJCV (with minor revisions). Follow-up to NIPS 2016 arXiv:1607.02046

arXiv:1712.01127 [pdf, other]

Learning to Segment Moving Objects

Authors: Pavel Tokmakov, Cordelia Schmid, Karteek Alahari

Abstract: We study the problem of segmenting moving objects in unconstrained videos. Given a video, the task is to segment all the objects that exhibit independent motion in at least one frame. We formulate this as a learning problem and design our framework with three cues: (i) independent object motion between a pair of frames, which complements object recognition, (ii) object appearance, which helps to correct errors in motion estimation, and (iii) temporal consistency, which imposes additional constraints on the segmentation. The framework is a two-stream neural network with an explicit memory module. The two streams encode appearance and motion cues in a video sequence respectively, while the memory module captures the evolution of objects over time, exploiting the temporal consistency. The motion stream is a convolutional neural network trained on synthetic videos to segment independently moving objects in the optical flow field. The module to build a 'visual memory' in video, i.e., a joint representation of all the video frames, is realized with a convolutional recurrent unit learned from a small number of training video sequences. For every pixel in a frame of a test video, our approach assigns an object or background label based on the learned spatio-temporal features as well as the 'visual memory' specific to the video. We evaluate our method extensively on three benchmarks, DAVIS, Freiburg-Berkeley motion segmentation dataset and SegTrack. In addition, we provide an extensive ablation study to investigate both the choice of the training data and the influence of each component in the proposed framework.

Submitted 1 December, 2017; originally announced December 2017.

Comments: arXiv admin note: text overlap with arXiv:1704.05737, arXiv:1612.07217

arXiv:1708.06977 [pdf, other]

Incremental Learning of Object Detectors without Catastrophic Forgetting

Authors: Konstantin Shmelkov, Cordelia Schmid, Karteek Alahari

Abstract: Despite their success for object detection, convolutional neural networks are ill-equipped for incremental learning, i.e., adapting the original model trained on a set of classes to additionally detect objects of new classes, in the absence of the initial training data. They suffer from "catastrophic forgetting" - an abrupt degradation of performance on the original set of classes, when the training objective is adapted to the new classes. We present a method to address this issue, and learn object detectors incrementally, when neither the original training data nor annotations for the original classes in the new training set are available. The core of our proposed solution is a loss function to balance the interplay between predictions on the new classes and a new distillation loss which minimizes the discrepancy between responses for old classes from the original and the updated networks. This incremental learning can be performed multiple times, for a new set of classes in each step, with a moderate drop in performance compared to the baseline network trained on the ensemble of data. We present object detection results on the PASCAL VOC 2007 and COCO datasets, along with a detailed empirical analysis of the approach.

Submitted 23 August, 2017; originally announced August 2017.

Comments: To appear in ICCV 2017

arXiv:1708.02813 [pdf, other]

BlitzNet: A Real-Time Deep Network for Scene Understanding

Authors: Nikita Dvornik, Konstantin Shmelkov, Julien Mairal, Cordelia Schmid

Abstract: Real-time scene understanding has become crucial in many applications such as autonomous driving. In this paper, we propose a deep architecture, called BlitzNet, that jointly performs object detection and semantic segmentation in one forward pass, allowing real-time computations. Besides the computational gain of having a single network to perform several tasks, we show that object detection and semantic segmentation benefit from each other in terms of accuracy. Experimental results for VOC and COCO datasets show state-of-the-art performance for object detection and segmentation among real-time systems.

Submitted 9 August, 2017; originally announced August 2017.

arXiv:1708.02598 [pdf, other]

Exponential Random Graph Models with Big Networks: Maximum Pseudolikelihood Estimation and the Parametric Bootstrap

Authors: Christian S. Schmid, Bruce A. Desmarais

Abstract: With the growth of interest in network data across fields, the Exponential Random Graph Model (ERGM) has emerged as the leading approach to the statistical analysis of network data. ERGM parameter estimation requires the approximation of an intractable normalizing constant. Simulation methods represent the state-of-the-art approach to approximating the normalizing constant, leading to estimation by Monte Carlo maximum likelihood (MCMLE). MCMLE is accurate when a large sample of networks is used to approximate the normalizing constant. However, MCMLE is computationally expensive, and may be prohibitively so if the size of the network is on the order of 1,000 nodes (i.e., one million potential ties) or greater. When the network is large, one option is maximum pseudolikelihood estimation (MPLE). The standard MPLE is simple and fast, but generally underestimates standard errors. We show that a resampling method (the parametric bootstrap) results in accurate coverage probabilities for confidence intervals. We find that bootstrapped MPLE can be run in 1/5th the time of MCMLE. We study the relative performance of MCMLE and MPLE with simulation studies, and illustrate the two different approaches by applying them to a network of bills introduced in the United States Senate.

Submitted 8 August, 2017; originally announced August 2017.

arXiv:1707.09472 [pdf, other]

Weakly-supervised learning of visual relations

Authors: Julia Peyre, Ivan Laptev, Cordelia Schmid, Josef Sivic

Abstract: This paper introduces a novel approach for modeling visual relations between pairs of objects. We call a relation a triplet of the form (subject, predicate, object) where the predicate is typically a preposition (e.g., 'under', 'in front of') or a verb ('hold', 'ride') that links a pair of objects (subject, object). Learning such relations is challenging as the objects have different spatial configurations and appearances depending on the relation in which they occur. Another major challenge comes from the difficulty of getting annotations, especially at box-level, for all possible triplets, which makes both learning and evaluation difficult. The contributions of this paper are threefold. First, we design strong yet flexible visual features that encode the appearance and spatial configuration for pairs of objects. Second, we propose a weakly-supervised discriminative clustering model to learn relations from image-level labels only. Third, we introduce a new challenging dataset of unusual relations (UnRel) together with an exhaustive annotation, that enables accurate evaluation of visual relation retrieval. We show experimentally that our model achieves state-of-the-art results on the visual relationship dataset, significantly improving performance on previously unseen relations (zero-shot learning), and confirm this observation on our newly introduced UnRel dataset.

Submitted 29 July, 2017; originally announced July 2017.

arXiv:1707.07036 [pdf, ps, other]

doi 10.1063/1.4996911

High resolution ion trap time-of-flight mass spectrometer for cold trapped ion experiments

Authors: Philipp C. Schmid, James Greenberg, Mikhail I. Miller, Kevin Loeffler, Heather J. Lewandowski

Abstract: Trapping molecular ions that have been sympathetically cooled with laser-cooled atomic ions is a useful platform for exploring cold ion chemistry. We designed and characterized a new experimental apparatus for probing chemical reaction dynamics between molecular cations and neutral radicals at temperatures below 1 K. The ions are trapped in a linear quadrupole radio-frequency trap and sympathetically cooled by co-trapped, laser-cooled, atomic ions. The ion trap is coupled to a time-of-flight mass spectrometer to readily identify product ion species, as well as to accurately determine trapped ion numbers. We discuss, and present in detail, the design of this ion trap time-of-flight mass spectrometer, as well as the electronics required for driving the trap and mass spectrometer. Furthermore, we measure the performance of this system, which yields mass resolutions of $m/\Delta m \geq 1100$ over a wide mass range, and discuss its relevance for future measurements in chemical reaction kinetics and dynamics.

Submitted 21 July, 2017; originally announced July 2017.

Comments: 9 pages, 9 figures

Journal ref: Rev. Sci. Instrum. 88 (2017) 123107

arXiv:1707.06005 [pdf, other]

Detecting Parts for Action Localization

Authors: Nicolas Chesneau, Grégory Rogez, Karteek Alahari, Cordelia Schmid

Abstract: In this paper, we propose a new framework for action localization that tracks people in videos and extracts full-body human tubes, i.e., spatio-temporal regions localizing actions, even in the case of occlusions or truncations. This is achieved by training a novel human part detector that scores visible parts while regressing full-body bounding boxes. The core of our method is a convolutional neural network which learns part proposals specific to certain body parts. These are then combined to detect people robustly in each frame. Our tracking algorithm connects the image detections temporally to extract full-body human tubes. We apply our new tube extraction method on the problem of human action localization, on the popular JHMDB dataset, and a very recent challenging dataset DALY (Daily Action Localization in YouTube), showing state-of-the-art results.

Submitted 21 July, 2017; v1 submitted 19 July, 2017; originally announced July 2017.

Comments: BMVC 2017

arXiv:1707.03993 [pdf]

Developing the Path Signature Methodology and its Application to Landmark-based Human Action Recognition

Authors: Weixin Yang, Terry Lyons, Hao Ni, Cordelia Schmid, Lianwen Jin

Abstract: Landmark-based human action recognition in videos is a challenging task in computer vision. One key step is to design a generic approach that generates discriminative features for the spatial structure and temporal dynamics. To this end, we regard the evolving landmark data as a high-dimensional path and apply non-linear path signature techniques to provide an expressive, robust, non-linear, and interpretable representation for the sequential events. We do not extract signature features from the raw path, rather we propose path disintegrations and path transformations as preprocessing steps. Path disintegrations turn a high-dimensional path linearly into a collection of lower-dimensional paths; some of these paths are in pose space while others are defined over a multiscale collection of temporal intervals. Path transformations decorate the paths with additional coordinates in standard ways to allow the truncated signatures of transformed paths to expose additional features. For spatial representation, we apply the signature transform to vectorize the paths that arise out of pose disintegration, and for temporal representation, we apply it again to describe this evolving vectorization. Finally, all the features are collected together to constitute the input vector of a linear single-hidden-layer fully-connected network for classification. Experimental results on four datasets demonstrated that the proposed feature set with only a linear shallow network and Dropconnect is effective and achieves comparable state-of-the-art results to the advanced deep networks, and meanwhile, is capable of interpretation.

Submitted 12 December, 2019; v1 submitted 13 July, 2017; originally announced July 2017.

Comments: 14 pages, 11 figures

arXiv:1705.08421 [pdf, other]

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

Authors: Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik

Abstract: This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) using movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips. We will release the dataset publicly. AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon the current state-of-the-art methods, and demonstrates better performance on JHMDB and UCF101-24 categories. While setting a new state of the art on existing datasets, the overall results on AVA are low at 15.6% mAP, underscoring the need for developing new approaches for video understanding.

Submitted 30 April, 2018; v1 submitted 23 May, 2017; originally announced May 2017.

Comments: To appear in CVPR 2018. Check dataset page https://research.google.com/ava/ for details

arXiv:1705.04043 [pdf, other]

SCNet: Learning Semantic Correspondence

Authors: Kai Han, Rafael S. Rezende, Bumsub Ham, Kwan-Yee K. Wong, Minsu Cho, Cordelia Schmid, Jean Ponce

Abstract: This paper addresses the problem of establishing semantic correspondences between images depicting different instances of the same object or scene category. Previous approaches focus on either combining a spatial regularizer with hand-crafted features, or learning a correspondence model for appearance only. We propose instead a convolutional neural network architecture, called SCNet, for learning a geometrically plausible model for semantic correspondence. SCNet uses region proposals as matching primitives, and explicitly incorporates geometric consistency in its loss function. It is trained on image pairs obtained from the PASCAL VOC 2007 keypoint dataset, and a comparative evaluation on several standard benchmarks demonstrates that the proposed approach substantially outperforms both recent deep learning architectures and previous methods based on hand-crafted features.

Submitted 17 August, 2017; v1 submitted 11 May, 2017; originally announced May 2017.

Comments: ICCV 2017

arXiv:1705.01861 [pdf, other]

Action Tubelet Detector for Spatio-Temporal Action Localization

Authors: Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, Cordelia Schmid

Abstract: Current state-of-the-art approaches for spatio-temporal action localization rely on detections at the frame level that are then linked or tracked across time. In this paper, we leverage the temporal continuity of videos instead of operating at the frame level. We propose the ACtion Tubelet detector (ACT-detector) that takes as input a sequence of frames and outputs tubelets, i.e., sequences of bounding boxes with associated scores. The same way state-of-the-art object detectors rely on anchor boxes, our ACT-detector is based on anchor cuboids. We build upon the SSD framework. Convolutional features are extracted for each frame, while scores and regressions are based on the temporal stacking of these features, thus exploiting information from a sequence. Our experimental results show that leveraging sequences of frames significantly improves detection performance over using individual frames. The gain of our tubelet detector can be explained by both more accurate scores and more precise localization. Our ACT-detector outperforms the state-of-the-art methods for frame-mAP and video-mAP on the J-HMDB and UCF-101 datasets, in particular at high overlap thresholds.

Submitted 21 August, 2017; v1 submitted 4 May, 2017; originally announced May 2017.

Comments: 9 pages

arXiv:1704.07804 [pdf, other]

SfM-Net: Learning of Structure and Motion from Video

Authors: Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, Katerina Fragkiadaki

Abstract: We propose SfM-Net, a geometry-aware neural network for motion estimation in videos that decomposes frame-to-frame pixel motion in terms of scene and object depth, camera motion and 3D object rotations and translations. Given a sequence of frames, SfM-Net predicts depth, segmentation, camera and rigid object motions, converts those into a dense frame-to-frame motion field (optical flow), differentiably warps frames in time to match pixels and back-propagates. The model can be trained with various degrees of supervision: 1) self-supervised by the re-projection photometric error (completely unsupervised), 2) supervised by ego-motion (camera motion), or 3) supervised by depth (e.g., as provided by RGBD sensors). SfM-Net extracts meaningful depth estimates and successfully estimates frame-to-frame camera rotations and translations. It often successfully segments the moving objects in the scene, even though such supervision is never provided.

Submitted 25 April, 2017; originally announced April 2017.

arXiv:1704.05737 [pdf, other]

Learning Video Object Segmentation with Visual Memory

Authors: Pavel Tokmakov, Karteek Alahari, Cordelia Schmid

Abstract: This paper addresses the task of segmenting moving objects in unconstrained videos. We introduce a novel two-stream neural network with an explicit memory module to achieve this. The two streams of the network encode spatial and temporal features in a video sequence respectively, while the memory module captures the evolution of objects over time. The module to build a "visual memory" in video, i.e., a joint representation of all the video frames, is realized with a convolutional recurrent unit learned from a small number of training video sequences. Given a video frame as input, our approach assigns each pixel an object or background label based on the learned spatio-temporal features as well as the "visual memory" specific to the video, acquired automatically without any manually-annotated frames. The visual memory is implemented with convolutional gated recurrent units, which allows spatial information to be propagated over time. We evaluate our method extensively on two benchmarks, DAVIS and Freiburg-Berkeley motion segmentation datasets, and show state-of-the-art results. For example, our approach outperforms the top method on the DAVIS dataset by nearly 6%. We also provide an extensive ablative analysis to investigate the influence of each component in the proposed framework.

Submitted 12 July, 2017; v1 submitted 19 April, 2017; originally announced April 2017.

arXiv:1703.07144 [pdf, other]

Proposal Flow: Semantic Correspondences from Object Proposals

Authors: Bumsub Ham, Minsu Cho, Cordelia Schmid, Jean Ponce

Abstract: Finding image correspondences remains a challenging problem in the presence of intra-class variations and large changes in scene layout. Semantic flow methods are designed to handle images depicting different instances of the same object or scene category. We introduce a novel approach to semantic flow, dubbed proposal flow, that establishes reliable correspondences using object proposals. Unlike prevailing semantic flow approaches that operate on pixels or regularly sampled local regions, proposal flow benefits from the characteristics of modern object proposals, that exhibit high repeatability at multiple scales, and can take advantage of both local and geometric consistency constraints among proposals. We also show that the corresponding sparse proposal flow can effectively be transformed into a conventional dense flow field. We introduce two new challenging datasets that can be used to evaluate both general semantic flow techniques and region-based approaches such as proposal flow. We use these benchmarks to compare different matching algorithms, object proposals, and region features within proposal flow, to the state of the art in semantic flow. This comparison, along with experiments on standard datasets, demonstrates that proposal flow significantly outperforms existing semantic flow methods in various settings.

Submitted 21 March, 2017; originally announced March 2017.

Comments: arXiv admin note: text overlap with arXiv:1511.05065

arXiv:1701.01370 [pdf, other]

doi 10.1109/CVPR.2017.492

Learning from Synthetic Humans

Authors: Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, Cordelia Schmid

Abstract: Estimating human pose, shape, and motion from images and videos are fundamental challenges with many applications. Recent advances in 2D human pose estimation use large amounts of manually-labeled training data for learning convolutional neural networks (CNNs). Such data is time consuming to acquire and difficult to extend. Moreover, manual labeling of 3D pose, depth and motion is impractical. In this work we present SURREAL (Synthetic hUmans foR REAL tasks): a new large-scale dataset with synthetically-generated but realistic images of people rendered from 3D sequences of human motion capture data. We generate more than 6 million frames together with ground truth pose, depth maps, and segmentation masks. We show that CNNs trained on our synthetic dataset allow for accurate human depth estimation and human part segmentation in real RGB images. Our results and the new dataset open up new possibilities for advancing person analysis using cheap and large-scale synthetic data.

Submitted 19 January, 2018; v1 submitted 5 January, 2017; originally announced January 2017.

Comments: Appears in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017). 9 pages

arXiv:1612.07217 [pdf, other]

Learning Motion Patterns in Videos

Authors: Pavel Tokmakov, Karteek Alahari, Cordelia Schmid

Abstract: The problem of determining whether an object is in motion, irrespective of camera motion, is far from being solved. We address this challenging task by learning motion patterns in videos. The core of our approach is a fully convolutional network, which is learned entirely from synthetic video sequences, and their ground-truth optical flow and motion segmentation. This encoder-decoder style architecture first learns a coarse representation of the optical flow field features, and then refines it iteratively to produce motion labels at the original high-resolution. We further improve this labeling with an objectness map and a conditional random field, to account for errors in optical flow, and also to focus on moving "things" rather than "stuff". The output label of each pixel denotes whether it has undergone independent motion, i.e., irrespective of camera motion. We demonstrate the benefits of this learning framework on the moving object segmentation task, where the goal is to segment all objects in motion. Our approach outperforms the top method on the recently released DAVIS benchmark dataset, comprising real-world sequences, by 5.6%. We also evaluate on the Berkeley motion segmentation database, achieving state-of-the-art results.

Submitted 10 April, 2017; v1 submitted 21 December, 2016; originally announced December 2016.

arXiv:1612.02008 [pdf, other]

Little String Defects and Bala-Carter Theory

Authors: Nathan Haouzi, Christian Schmid

Abstract: We give a physical realization of the Bala-Carter labels that classify nilpotent orbits of semi-simple Lie algebras, for the case $\mathfrak{g}=A,D,E$. We start from type IIB string theory compactified on an $ADE$ singularity and study the six-dimensional (2,0) $\mathfrak{g}$-type little string on a Riemann surface with punctures. The defects are introduced as D-branes wrapping the 2-cycles of the singularity. At low energies, the little string becomes the (2,0) conformal field theory of type $\mathfrak{g}$. As an application, we derive the full list of $E_n$ little string defects, and their Bala-Carter label in the CFT limit. Furthermore, we investigate new relations between the quiver gauge theory describing the D-brane defects at low energies, and the weighted Dynkin diagrams of $\mathfrak{g}$. We also give a physical version of the dimension formula of a nilpotent orbit based on its weighted Dynkin diagram.

Submitted 16 December, 2016; v1 submitted 6 December, 2016; originally announced December 2016.

Comments: 51 pages, 9 figures, 3 longtables. v2: Added proof of dimension formula, minor changes

arXiv:1612.01033 [pdf, other]

Areas of Attention for Image Captioning

Authors: Marco Pedersoli, Thomas Lucas, Cordelia Schmid, Jakob Verbeek

Abstract: We propose "Areas of Attention", a novel attention-based model for automatic image captioning. Our approach models the dependencies between image regions, caption words, and the state of an RNN language model, using three pairwise interactions. In contrast to previous attention-based approaches that associate image regions only to the RNN state, our method allows a direct association between caption words and image regions. During training these associations are inferred from image-level captions, akin to weakly-supervised object detector training. These associations help to improve captioning by localizing the corresponding regions during testing. We also propose and compare different ways of generating attention areas: CNN activation grids, object proposals, and spatial transformers nets applied in a convolutional fashion. Spatial transformers give the best results. They allow for image specific attention areas, and can be trained jointly with the rest of the network. Our attention mechanism and spatial transformer attention areas together yield state-of-the-art results on the MSCOCO dataset.

Submitted 25 August, 2017; v1 submitted 3 December, 2016; originally announced December 2016.

Comments: Accepted in ICCV 2017

arXiv:1609.02195 [pdf, ps, other]

Einstein's $R^{\hat{0} \hat{0}}$ equation for non-relativistic sources derived from Einstein's inertial motion and the Newtonian law for relative acceleration

Authors: Christoph Schmid

Abstract: With Einstein's inertial motion (free-falling and non-rotating relative to gyroscopes), geodesics for non-relativistic particles can intersect repeatedly, allowing one to compute the space-time curvature $R^{\hat{0} \hat{0}}$ exactly. Einstein's $R^{\hat{0} \hat{0}}$ for strong gravitational fields and for relativistic source-matter is identical with the Newtonian expression for the relative radial acceleration of neighboring free-falling test-particles, spherically averaged.--- Einstein's field equations follow from Newtonian experiments, local Lorentz-covariance, and energy-momentum conservation combined with the Bianchi identity.

Submitted 7 September, 2016; originally announced September 2016.

Comments: 4 pages. arXiv admin note: text overlap with arXiv:1607.08661

arXiv:1608.07279 [pdf, other]

doi 10.1007/JHEP05(2017)082

Little String Origin of Surface Defects

Authors: Nathan Haouzi, Christian Schmid

Abstract: We derive the codimension-two defects of 4d $\mathcal{N} = 4$ Super Yang-Mills (SYM) theory from the (2, 0) little string. The origin of the little string is type IIB theory compactified on an ADE singularity. The defects are D-branes wrapping the 2-cycles of the singularity. We use this construction to make contact with the description of SYM defects due to Gukov and Witten [arXiv:hep-th/0612073]. Furthermore, we derive from a geometric perspective the complete nilpotent orbit classification of codimension-two defects, and the connection to ADE-type Toda CFT. The only data needed to specify the defects is a set of weights of the algebra obeying certain constraints, which we give explicitly. We highlight the differences between the defect classification in the little string theory and its (2, 0) CFT limit.

Submitted 18 December, 2016; v1 submitted 25 August, 2016; originally announced August 2016.

Comments: 64 pages, 18 figures. v2: Minor fixes and clarifications

arXiv:1607.08661 [pdf, ps, other]

Einstein's equations from Einstein's inertial motion and Newton's law for relative acceleration

Authors: Christoph Schmid

Abstract: We show that Einstein's $R^{\hat{0} \hat{0}}$ equation for nonrelativistic matter and strong gravitational fields is identical with Newton's equation for relative radial acceleration of neighbouring freefalling particles, spherically averaged. These laws are explicitly identical with primary observer's (1) space-time slicing by radial 4-geodesics, (2) radially parallel Local Ortho-Normal Bases, LONBs, (3) Riemann normal 3-coordinates. Hats on indices denote LONBs. General relativity follows from Newton's law of relative acceleration, Einstein's inertial motion, Lorentz covariance, and energy-momentum conservation combined with Bianchi identity. The gravitational field equation of Newton-Gauss and Einstein's $R^{\hat{0} \hat{0}}$ equation are identical and linear in gravitational field for an inertial primary observer.--- Einstein's equivalence between fictitious forces and gravitational forces is formulated as equivalence theorem in the equations of motion. With this, the gravitational field equation of 19th-century Newtonian physics and Einstein's equation for $R^{\hat{0} \hat{0}}$ are identical and bilinear in the gravitational forces for non-inertial primary observers.--- $R^{\hat{0} \hat{0}} = - div \vec{E}_g$ and $2 R^{\hat{i} \hat{0}} = - (curl \vec{B}_g)^{\hat{i}}$ hold exactly for inertial primary observers, if one uses our LONB's. The gravitational $\vec{E}_g, \vec{B}_g $ are measured exactly with quasistatic particles via $(d/dt) p_{\hat{i}}$ and $(d/dt) S_{\hat{i}}$ in correspondence with the electromagnetic $\vec{E}$ and $\vec{B}$.
The $(\vec{E}_g, \vec{B}_g)$ are identical with the observer's Ricci connection along his worldline, $(\omega_{\hat{a} \hat{b}})_{\hat{0}}$.

Submitted 28 July, 2016; originally announced July 2016.

Comments: 17 pages

arXiv:1607.02046 [pdf, other]

MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild

Authors: Grégory Rogez, Cordelia Schmid

Abstract: This paper addresses the problem of 3D human pose estimation in the wild. A significant challenge is the lack of training data, i.e., 2D images of humans annotated with 3D poses. Such data is necessary to train state-of-the-art CNN architectures. Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations. We introduce an image-based synthesis engine that artificially augments a dataset of real images with 2D human pose annotations using 3D Motion Capture (MoCap) data. Given a candidate 3D pose our algorithm selects for each joint an image whose 2D pose locally matches the projected 3D pose. The selected images are then combined to generate a new synthetic image by stitching local image patches in a kinematically constrained manner. The resulting images are used to train an end-to-end CNN for full-body 3D pose estimation. We cluster the training data into a large number of pose classes and tackle pose estimation as a K-way classification problem. Such an approach is viable only with large training sets such as ours. Our method outperforms the state of the art in terms of 3D pose estimation in controlled environments (Human3.6M) and shows promising results for in-the-wild images (LSP). This demonstrates that CNNs trained on artificial images generalize well to real images.

Submitted 28 October, 2016; v1 submitted 7 July, 2016; originally announced July 2016.

Comments: 9 pages, accepted to appear in NIPS 2016

arXiv:1606.00043 [pdf]

The Detailed Science Case for the Maunakea Spectroscopic Explorer: the Composition and Dynamics of the Faint Universe

Authors: Alan McConnachie, Carine Babusiaux, Michael Balogh, Simon Driver, Pat Côté, Helene Courtois, Luke Davies, Laura Ferrarese, Sarah Gallagher, Rodrigo Ibata, Nicolas Martin, Aaron Robotham, Kim Venn, Eva Villaver, Jo Bovy, Alessandro Boselli, Matthew Colless, Johan Comparat, Kelly Denny, Pierre-Alain Duc, Sara Ellison, Richard de Grijs, Mirian Fernandez-Lorenzo, Ken Freeman, Raja Guhathakurta , et al. (152 additional authors not shown)

Abstract: MSE is an 11.25m aperture observatory with a 1.5 square degree field of view that will be fully dedicated to multi-object spectroscopy. More than 3200 fibres will feed spectrographs operating at low (R ~ 2000 - 3500) and moderate (R ~ 6000) spectral resolution, and approximately 1000 fibers will feed spectrographs operating at high (R ~ 40000) resolution. MSE is designed to enable transformational science in areas as diverse as tomographic mapping of the interstellar and intergalactic media; the in-situ chemical tagging of thick disk and halo stars; connecting galaxies to their large scale structure; measuring the mass functions of cold dark matter sub-halos in galaxy and cluster-scale hosts; reverberation mapping of supermassive black holes in quasars; next generation cosmological surveys using redshift space distortions and peculiar velocities. MSE is an essential follow-up facility to current and next generations of multi-wavelength imaging surveys, including LSST, Gaia, Euclid, WFIRST, PLATO, and the SKA, and is designed to complement and go beyond the science goals of other planned and current spectroscopic capabilities like VISTA/4MOST, WHT/WEAVE, AAT/HERMES and Subaru/PFS. It is an ideal feeder facility for E-ELT, TMT and GMT, and provides the missing link between wide field imaging and small field precision astronomy. MSE is optimized for high throughput, high signal-to-noise observations of the faintest sources in the Universe with high quality calibration and stability being ensured through the dedicated operational mode of the observatory.
(abridged)

Submitted 31 May, 2016; originally announced June 2016.

Comments: 210 pages, 91 figures. Exposure draft. Appendices to the Detailed Science Case can be found at http://mse.cfht.hawaii.edu/docs/

arXiv:1605.05197 [pdf, other]

Human Action Localization with Sparse Spatial Supervision

Authors: Philippe Weinzaepfel, Xavier Martin, Cordelia Schmid

Abstract: We introduce an approach for spatio-temporal human action localization using sparse spatial supervision. Our method leverages the large amount of annotated humans available today and extracts human tubes by combining a state-of-the-art human detector with a tracking-by-detection approach. Given these high-quality human tubes and temporal supervision, we select positive and negative tubes with very sparse spatial supervision, i.e., only one spatially annotated frame per instance. The selected tubes allow us to effectively learn a spatio-temporal action detector based on dense trajectories or CNNs. We conduct experiments on existing action localization benchmarks: UCF-Sports, J-HMDB and UCF-101. Our results show that our approach, despite using sparse spatial supervision, performs on par with methods using full supervision, i.e., one bounding box annotation per frame. To further validate our method, we introduce DALY (Daily Action Localization in YouTube), a dataset for realistic action localization in space and time. It contains high quality temporal and spatial annotations for 3.6k instances of 10 actions in 31 hours of videos (3.3M frames). It is an order of magnitude larger than existing datasets, with more diversity in appearance and long untrimmed videos.

Submitted 23 May, 2017; v1 submitted 17 May, 2016; originally announced May 2016.

arXiv:1604.04494 [pdf, other]

Long-term Temporal Convolutions for Action Recognition

Authors: Gül Varol, Ivan Laptev, Cordelia Schmid

Abstract: Typical human actions last several seconds and exhibit characteristic spatio-temporal structure. Recent methods attempt to capture this structure and learn action representations with convolutional neural networks. Such representations, however, are typically learned at the level of a few video frames, failing to model actions at their full temporal extent. In this work we learn video representations using neural networks with long-term temporal convolutions (LTC). We demonstrate that LTC-CNN models with increased temporal extents improve the accuracy of action recognition. We also study the impact of different low-level representations, such as raw values of video pixels and optical flow vector fields, and demonstrate the importance of high-quality optical flow estimation for learning accurate action models. We report state-of-the-art results on two challenging benchmarks for human action recognition: UCF101 (92.7%) and HMDB51 (67.2%).

Submitted 2 June, 2017; v1 submitted 15 April, 2016; originally announced April 2016.

arXiv:1603.08841 [pdf, other]

Optimisation of the Read-out Electronics of Muon Drift-Tube Chambers for Very High Background Rates at HL-LHC and Future Colliders

Authors: Sebastian Nowak, Sergey Abovyan, Philipp Gadow, Katharina Ecker, David Fink, Markus Fras, Oliver Kortner, Hubert Kroha, Felix Mueller, Robert Richter, Clemens Schmid, Korbinian Schmidt-Sommerfeld, Yazhou Zhao

Abstract: In the ATLAS Muon Spectrometer, Monitored Drift Tube (MDT) chambers and sMDT chambers with half of the tube diameter of the MDTs are used for precision muon track reconstruction. The sMDT chambers are designed for operation at high counting rates due to neutron and gamma background irradiation expected for the HL-LHC and future hadron colliders. The existing MDT read-out electronics uses bipolar signal shaping which causes an undershoot of opposite polarity and same charge after a signal pulse. At high counting rates and short electronics dead time used for the sMDTs, signal pulses pile up on the undershoot of preceding background pulses leading to a reduction of the signal amplitude and a jitter in the drift time measurement and, therefore, to a degradation of drift tube efficiency and spatial resolution. In order to further increase the rate capability of sMDT tubes, baseline restoration can be used in the read-out electronics to suppress the pile-up effects. A discrete bipolar shaping circuit with baseline restoration has been developed and used for reading out sMDT tubes under irradiation with a 24 MBq 90Sr source. The measurement results show a substantial improvement of the performance of the sMDT tubes at high counting rates.

Submitted 29 March, 2016; originally announced March 2016.

Report number: MPP-2015-284

Showing 151–200 of 278 results for author: Schmid, C