Deep Continuous Networks

Authors: Nergis Tomen, Silvia L. Pintea, Jan C. van Gemert

Abstract: CNNs and computational models of biological vision share some fundamental principles, which opened new avenues of research. However, fruitful cross-field research is hampered by conventional CNN architectures being based on spatially and depthwise discrete representations, which cannot accommodate certain aspects of biological complexity such as continuously varying receptive field sizes and dynamics of neuronal responses. Here we propose deep continuous networks (DCNs), which combine spatially continuous filters, with the continuous depth framework of neural ODEs. This allows us to learn the spatial support of the filters during training, as well as model the continuous evolution of feature maps, linking DCNs closely to biological models. We show that DCNs are versatile and highly applicable to standard image classification and reconstruction problems, where they improve parameter and data efficiency, and allow for meta-parametrization. We illustrate the biological plausibility of the scale distributions learned by DCNs and explore their performance in a neuroscientifically inspired pattern completion task. Finally, we investigate an efficient implementation of DCNs by changing input contrast.

Submitted 2 February, 2024; originally announced February 2024.

Comments: Presented at ICML 2021

Journal ref: In International Conference on Machine Learning 2021 Jul 1 (pp. 10324-10335). PMLR

arXiv:2308.10603 [pdf, other]

A step towards understanding why classification helps regression

Authors: Silvia L. Pintea, Yancong Lin, Jouke Dijkstra, Jan C. van Gemert

Abstract: A number of computer vision deep regression approaches report improved results when adding a classification loss to the regression loss. Here, we explore why this is useful in practice and when it is beneficial. To do so, we start from precisely controlled dataset variations and data samplings and find that the effect of adding a classification loss is the most pronounced for regression with imbalanced data. We explain these empirical findings by formalizing the relation between the balanced and imbalanced regression losses. Finally, we show that our findings hold on two real imbalanced image datasets for depth estimation (NYUD2-DIR), and age estimation (IMDB-WIKI-DIR), and on the problem of imbalanced video progress prediction (Breakfast). Our main takeaway is: for a regression task, if the data sampling is imbalanced, then add a classification loss.

Submitted 21 August, 2023; originally announced August 2023.

Comments: Accepted at ICCV-2023

arXiv:2308.05533 [pdf, other]

Is there progress in activity progress prediction?

Authors: Frans de Boer, Jan C. van Gemert, Jouke Dijkstra, Silvia L. Pintea

Abstract: Activity progress prediction aims to estimate what percentage of an activity has been completed. Currently this is done with machine learning approaches, trained and evaluated on complicated and realistic video datasets. The videos in these datasets vary drastically in length and appearance. And some of the activities have unanticipated developments, making activity progression difficult to estimate. In this work, we examine the results obtained by existing progress prediction methods on these datasets. We find that current progress prediction methods seem not to extract useful visual information for the progress prediction task. Therefore, these methods fail to exceed simple frame-counting baselines. We design a precisely controlled dataset for activity progress prediction and on this synthetic dataset we show that the considered methods can make use of the visual information, when this directly relates to the progress prediction. We conclude that the progress prediction task is ill-posed on the currently used real-world datasets. Moreover, to fairly measure activity progression we advise considering a simple but effective frame-counting baseline.

Submitted 10 August, 2023; originally announced August 2023.

Comments: Accepted at ICCVw-2023 (AI for Creative Video Editing and Understanding, ICCV workshop 2023)

arXiv:2308.04770 [pdf, other]

Objects do not disappear: Video object detection by single-frame object location anticipation

Authors: Xin Liu, Fatemeh Karimi Nejadasl, Jan C. van Gemert, Olaf Booij, Silvia L. Pintea

Abstract: Objects in videos are typically characterized by continuous smooth motion. We exploit continuous smooth motion in three ways. 1) Improved accuracy by using object motion as an additional source of supervision, which we obtain by anticipating object locations from a static keyframe. 2) Improved efficiency by only doing the expensive feature computations on a small subset of all frames. Because neighboring video frames are often redundant, we only compute features for a single static keyframe and predict object locations in subsequent frames. 3) Reduced annotation cost, where we only annotate the keyframe and use smooth pseudo-motion between keyframes. We demonstrate computational efficiency, annotation efficiency, and improved mean average precision compared to the state-of-the-art on four datasets: ImageNet VID, EPIC KITCHENS-55, YouTube-BoundingBoxes, and Waymo Open dataset. Our source code is available at https://github.com/L-KID/Videoobject-detection-by-location-anticipation.

Submitted 9 August, 2023; originally announced August 2023.

Comments: Accepted by ICCV 2023

arXiv:2203.08586 [pdf, other]

Deep vanishing point detection: Geometric priors make dataset variations vanish

Authors: Yancong Lin, Ruben Wiersma, Silvia L. Pintea, Klaus Hildebrandt, Elmar Eisemann, Jan C. van Gemert

Abstract: Deep learning has improved vanishing point detection in images. Yet, deep networks require expensive annotated datasets trained on costly hardware and do not generalize to even slightly different domains, and minor problem variants. Here, we address these issues by injecting deep vanishing point detection networks with prior knowledge. This prior knowledge no longer needs to be learned from data, saving valuable annotation efforts and compute, unlocking realistic few-sample scenarios, and reducing the impact of domain changes. Moreover, the interpretability of the priors allows adapting deep networks to minor problem variations such as switching between Manhattan and non-Manhattan worlds. We seamlessly incorporate two geometric priors: (i) Hough Transform -- mapping image pixels to straight lines, and (ii) Gaussian sphere -- mapping lines to great circles whose intersections denote vanishing points. Experimentally, we ablate our choices and show comparable accuracy to existing models in the large-data setting. We validate our model's improved data efficiency, robustness to domain changes, and adaptability to non-Manhattan settings.

Submitted 16 March, 2022; originally announced March 2022.

Comments: CVPR2022, code available at https://github.com/yanconglin/VanishingPoint_HoughTransform_GaussianSphere

arXiv:2112.12579 [pdf, other]

NeRD++: Improved 3D-mirror symmetry learning from a single image

Authors: Yancong Lin, Silvia-Laura Pintea, Jan van Gemert

Abstract: Many objects are naturally symmetric, and this symmetry can be exploited to infer unseen 3D properties from a single 2D image. Recently, NeRD was proposed for accurate 3D mirror plane estimation from a single image. Despite the unprecedented accuracy, it relies on large annotated datasets for training and suffers from slow inference. Here we aim to improve its data and compute efficiency. We do away with the computationally expensive 4D feature volumes and instead explicitly compute the feature correlation of the pixel correspondences across depth, thus creating a compact 3D volume. We also design multi-stage spherical convolutions to identify the optimal mirror plane on the hemisphere, whose inductive bias offers gains in data-efficiency. Experiments on both synthetic and real-world datasets show the benefit of our proposed changes for improved data efficiency and inference speed.

Submitted 7 October, 2022; v1 submitted 23 December, 2021; originally announced December 2021.

Comments: BMVC 2022

arXiv:2112.03406 [pdf, other]

Equal Bits: Enforcing Equally Distributed Binary Network Weights

Authors: Yunqiang Li, Silvia L. Pintea, Jan C. van Gemert

Abstract: Binary networks are extremely efficient as they use only two symbols to define the network: $\{+1,-1\}$. One can make the prior distribution of these symbols a design choice. The recent IR-Net of Qin et al. argues that imposing a Bernoulli distribution with equal priors (equal bit ratios) over the binary weights leads to maximum entropy and thus minimizes information loss. However, prior work cannot precisely control the binary weight distribution during training, and therefore cannot guarantee maximum entropy. Here, we show that quantizing using optimal transport can guarantee any bit ratio, including equal ratios. We investigate experimentally that equal bit ratios are indeed preferable and show that our method leads to optimization benefits. We show that our quantization method is effective when compared to state-of-the-art binarization methods, even when using binary weight pruning.

Submitted 6 March, 2022; v1 submitted 2 December, 2021; originally announced December 2021.
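
Illustration (not from the paper): a minimal sketch of how an exact equal bit ratio can be enforced for a flat list of weights, by ranking the weights and assigning +1 to the top half and -1 to the rest. The function name `binarize_equal_bits` and the sample weights are hypothetical, and this rank-based split is only assumed to be a simple stand-in for the optimal-transport quantization the abstract refers to.

```python
def binarize_equal_bits(weights):
    """Map real weights to {+1, -1} so that half of them become +1.

    Hypothetical helper: for an odd number of weights, the extra
    element is assigned -1.
    """
    n = len(weights)
    # Rank positions by weight value, largest first.
    order = sorted(range(n), key=lambda i: weights[i], reverse=True)
    binary = [-1] * n
    # The n // 2 largest weights get +1; all others stay -1.
    for i in order[: n // 2]:
        binary[i] = 1
    return binary


# Plain sign() would give five +1 and one -1 here; the ranked split is balanced.
w = [0.9, 0.1, 0.4, -0.3, 0.2, 0.05]
b = binarize_equal_bits(w)
print(b)
```

Unlike thresholding at zero, the split point here follows the median of the weights, so the bit ratio stays fixed regardless of how the weight distribution shifts.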

arXiv:2111.06660 [pdf, other]

Frequency learning for structured CNN filters with Gaussian fractional derivatives

Authors: Nikhil Saldanha, Silvia L. Pintea, Jan C. van Gemert, Nergis Tomen

Abstract: Frequency information lies at the base of discriminating between textures, and therefore between different objects. Classical CNN architectures limit the frequency learning through fixed filter sizes, and lack a way of explicitly controlling it. Here, we build on the structured receptive field filters with Gaussian derivative basis. Yet, rather than using predetermined derivative orders, which typically result in fixed frequency responses for the basis functions, we learn these. We show that by learning the order of the basis we can accurately learn the frequency of the filters, and hence adapt to the optimal frequencies for the underlying learning task. We investigate the well-founded mathematical formulation of fractional derivatives to adapt the filter frequencies during training. Our formulation leads to parameter savings and data efficiency when compared to the standard CNNs and the Gaussian derivative CNN filter networks that we build upon.

Submitted 12 November, 2021; originally announced November 2021.

Comments: Accepted at BMVC 2021

arXiv:2106.05094 [pdf, other]

Semi-supervised lane detection with Deep Hough Transform

Authors: Yancong Lin, Silvia-Laura Pintea, Jan van Gemert

Abstract: Current work on lane detection relies on large manually annotated datasets. We reduce the dependency on annotations by leveraging massive cheaply available unlabelled data. We propose a novel loss function exploiting geometric knowledge of lanes in Hough space, where a lane can be identified as a local maximum. By splitting lanes into separate channels, we can localize each lane via simple global max-pooling. The location of the maximum encodes the layout of a lane, while the intensity indicates the probability of a lane being present. Maximizing the log-probability of the maximal bins helps neural networks find lanes without labels. On the CULane and TuSimple datasets, we show that the proposed Hough Transform loss improves performance significantly by learning from large amounts of unlabelled images.

Submitted 9 June, 2021; originally announced June 2021.

Comments: ICIP2021
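
Illustration (not from the paper): a minimal sketch of the idea described in the abstract above, where each lane has its own Hough-space channel, a global max-pool finds the lane's bin, and the log-probability of that bin is maximized. The function names, the nested-list layout (rows as angle bins, columns as offset bins), and the assumption that bin values are already probabilities in (0, 1] are all hypothetical.

```python
import math


def lane_from_hough_channel(channel):
    """Global max-pool over one Hough-space channel.

    Returns ((row, col), value): the location of the maximal bin,
    which encodes the lane layout, and its value, read here as the
    probability that a lane is present.
    """
    value, row, col = max(
        ((v, r, c) for r, bins in enumerate(channel) for c, v in enumerate(bins)),
        key=lambda t: t[0],
    )
    return (row, col), value


def unlabelled_hough_loss(channels, eps=1e-8):
    """Negative mean log-probability of the maximal bin in each channel."""
    logs = [math.log(lane_from_hough_channel(ch)[1] + eps) for ch in channels]
    return -sum(logs) / len(logs)


channel = [[0.1, 0.2],
           [0.9, 0.3]]
location, prob = lane_from_hough_channel(channel)
loss = unlabelled_hough_loss([channel])
print(location, prob, loss)
```

Minimizing this loss is equivalent to maximizing the log-probability of the maximal bins, which requires no lane labels.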

arXiv:2106.03412 [pdf, other]

doi 10.1109/TIP.2021.3115001

Resolution learning in deep convolutional networks using scale-space theory

Authors: Silvia L. Pintea, Nergis Tomen, Stanley F. Goes, Marco Loog, Jan C. van Gemert

Abstract: Resolution in deep convolutional neural networks (CNNs) is typically bounded by the receptive field size through filter sizes, and subsampling layers or strided convolutions on feature maps. The optimal resolution may vary significantly depending on the dataset. Modern CNNs hard-code their resolution hyper-parameters in the network architecture which makes tuning such hyper-parameters cumbersome. We propose to do away with hard-coded resolution hyper-parameters and aim to learn the appropriate resolution from data. We use scale-space theory to obtain a self-similar parametrization of filters and make use of the N-Jet: a truncated Taylor series to approximate a filter by a learned combination of Gaussian derivative filters. The parameter sigma of the Gaussian basis controls both the amount of detail the filter encodes and the spatial extent of the filter. Since sigma is a continuous parameter, we can optimize it with respect to the loss. The proposed N-Jet layer achieves comparable performance when used in state-of-the-art architectures, while learning the correct resolution in each layer automatically. We evaluate our N-Jet layer on both classification and segmentation, and we show that learning sigma is especially beneficial for inputs at multiple sizes.

Submitted 24 October, 2023; v1 submitted 7 June, 2021; originally announced June 2021.

Comments: Preprint accepted by IEEE Transactions on Image Processing, 2021 (TIP). Link to final published article: https://ieeexplore.ieee.org/abstract/document/9552550

Journal ref: IEEE Transactions on Image Processing, vol. 30, pp. 8342-8353, 2021

arXiv:2103.15395 [pdf, other]

No frame left behind: Full Video Action Recognition

Authors: Xin Liu, Silvia L. Pintea, Fatemeh Karimi Nejadasl, Olaf Booij, Jan C. van Gemert

Abstract: Not all video frames are equally informative for recognizing an action. It is computationally infeasible to train deep networks on all video frames when actions develop over hundreds of frames. A common heuristic is uniformly sampling a small number of video frames and using these to recognize the action. Instead, here we propose full video action recognition and consider all video frames. To make this computationally tractable, we first cluster all frame activations along the temporal dimension based on their similarity with respect to the classification task, and then temporally aggregate the frames in the clusters into a smaller number of representations. Our method is end-to-end trainable and computationally efficient as it relies on temporally localized clustering in combination with fast Hamming distances in feature space. We evaluate on UCF101, HMDB51, Breakfast, and Something-Something V1 and V2, where we compare favorably to existing heuristic frame sampling methods.

Submitted 29 March, 2021; originally announced March 2021.

Comments: Accepted to CVPR 2021
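
Illustration (not from the paper): a minimal sketch of temporally localized grouping with Hamming distances, assuming each frame has already been reduced to a binary code stored as a Python integer. The greedy rule used here (compare each frame with the first frame of the current cluster) and the names `hamming` and `group_consecutive_frames` are hypothetical and are not claimed to match the authors' clustering.

```python
def hamming(a, b):
    """Number of differing bits between two integer-encoded binary codes."""
    return bin(a ^ b).count("1")


def group_consecutive_frames(codes, max_distance):
    """Greedily group consecutive frames into temporal clusters.

    A frame joins the current cluster if its code is within
    max_distance bits of the cluster's first frame; otherwise it
    starts a new cluster. Returns lists of frame indices.
    """
    clusters = []
    for t, code in enumerate(codes):
        if clusters and hamming(codes[clusters[-1][0]], code) <= max_distance:
            clusters[-1].append(t)
        else:
            clusters.append([t])
    return clusters


codes = [0b1100, 0b1101, 0b0011, 0b0111, 0b0110]
clusters = group_consecutive_frames(codes, max_distance=1)
print(clusters)
```

Because only neighboring frames are compared, the grouping runs in a single pass over the video, and each cluster could then be aggregated into one representation.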

arXiv:2007.09493 [pdf, other]

Deep Hough-Transform Line Priors

Authors: Yancong Lin, Silvia L. Pintea, Jan C. van Gemert

Abstract: Classical work on line segment detection is knowledge-based; it uses carefully designed geometric priors using either image gradients, pixel groupings, or Hough transform variants. Instead, current deep learning methods do away with all prior knowledge and replace priors by training deep networks on large manually annotated datasets. Here, we reduce the dependency on labeled data by building on the classic knowledge-based priors while using deep networks to learn features. We add line priors through a trainable Hough transform block into a deep network. Hough transform provides the prior knowledge about global line parameterizations, while the convolutional layers can learn the local gradient-like line features. On the Wireframe (ShanghaiTech) and York Urban datasets we show that adding prior knowledge improves data efficiency as line priors no longer need to be learned from data. Keywords: Hough transform; global line prior, line segment detection.

Submitted 18 July, 2020; originally announced July 2020.

Comments: ECCV 2020, code online: https://github.com/yanconglin/Deep-Hough-Transform-Line-Priors

arXiv:2004.07629 [pdf, other]

Top-Down Networks: A coarse-to-fine reimagination of CNNs

Authors: Ioannis Lelekas, Nergis Tomen, Silvia L. Pintea, Jan C. van Gemert

Abstract: Biological vision adopts a coarse-to-fine information processing pathway, from initial visual detection and binding of salient features of a visual scene, to the enhanced and preferential processing given relevant stimuli. On the contrary, CNNs employ a fine-to-coarse processing, moving from local, edge-detecting filters to more global ones extracting abstract representations of the input. In this paper we reverse the feature extraction part of standard bottom-up architectures and turn them upside-down: We propose top-down networks. Our proposed coarse-to-fine pathway, by blurring higher frequency information and restoring it only at later stages, offers a line of defence against adversarial attacks that introduce high frequency noise. Moreover, since we increase image resolution with depth, the high resolution of the feature map in the final convolutional layer contributes to the explainability of the network's decision making process. This favors object-driven decisions over context driven ones, and thus provides better localized class activation maps. This paper offers empirical evidence for the applicability of the top-down resolution processing to various existing architectures on multiple visual tasks.

Submitted 16 April, 2020; originally announced April 2020.

Comments: CVPR Workshop Deep Vision 2020

arXiv:2003.03528 [pdf]

Exploratory Study: Children's with Autism Awareness of being Imitated by Nao Robot

Authors: Andreea Peca, Adriana Tapus, Amir Aly, Cristina Pop, Lavinia Jisa, Sebastian Pintea, Alina Rusu, Daniel David

Abstract: This paper presents an exploratory study designed for children with Autism Spectrum Disorders (ASD) that investigates children's awareness of being imitated by a robot in a play/game scenario. The Nao robot imitates all the arm movement behaviors of the child in real-time in dyadic and triadic interactions. Different behavioral criteria (i.e., eye gaze, gaze shifting, initiation and imitation of arm movements, smile/laughter) were analyzed based on the video data of the interaction. The results confirm only parts of the research hypothesis. However, these results are promising for the future directions of this work.

Submitted 7 March, 2020; originally announced March 2020.

Comments: Proceedings of the 1st International Conference on Innovative Technologies for Autism Spectrum Disorders. ASD: Tools, Trends and Testimonials (ITASD), Spain, 2012

arXiv:2002.12360 [pdf]

Social Engagement of Children with Autism during Interaction with a Robot

Authors: Adriana Tapus, Andreea Peca, Amir Aly, Cristina Pop, Lavinia Jisa, Sebastian Pintea, Alina Rusu, Daniel David

Abstract: Imitation plays an important role in development, being one of the precursors of social cognition. Even though some children with autism imitate spontaneously and other children with autism can learn to imitate, the dynamics of imitation is affected in the large majority of cases. Existing studies from the literature suggest that robots can be used to teach children with autism basic interaction skills like imitation. Based on these findings, in this study, we investigate if children with autism show more social engagement when interacting with an imitative robot (Fig 1) compared to a human partner in a motor imitation task.

Submitted 27 February, 2020; originally announced February 2020.

Comments: Proceedings of the 2nd International Conference on Innovative Research in Autism (IRIA), France, 2012

arXiv:1809.03258 [pdf, other]

Using phase instead of optical flow for action recognition

Authors: Omar Hommos, Silvia L. Pintea, Pascal S. M. Mettes, Jan C. van Gemert

Abstract: Currently, the most common motion representation for action recognition is optical flow. Optical flow is based on particle tracking which adheres to a Lagrangian perspective on dynamics. In contrast to the Lagrangian perspective, the Eulerian model of dynamics does not track, but describes local changes. For video, an Eulerian phase-based motion representation, using complex steerable filters, has been successfully employed recently for motion magnification and video frame interpolation. Inspired by these previous works, here, we propose learning Eulerian motion representations in a deep architecture for action recognition. We learn filters in the complex domain in an end-to-end manner. We design these complex filters to resemble complex Gabor filters, typically employed for phase-information extraction. We propose a phase-information extraction module, based on these complex filters, that can be used in any network architecture for extracting Eulerian representations. We experimentally analyze the added value of Eulerian motion representations, as extracted by our proposed phase extraction module, and compare with existing motion representations based on optical flow, on the UCF101 dataset.

Submitted 14 September, 2018; v1 submitted 10 September, 2018; originally announced September 2018.

Comments: ECCV-2018 Workshop on "What is Optical Flow for?"

arXiv:1809.03218 [pdf, other]

Hand-tremor frequency estimation in videos

Authors: Silvia L. Pintea, Jian Zheng, Xilin Li, Paulina J. M. Bank, Jacobus J. van Hilten, Jan C. van Gemert

Abstract: We focus on the problem of estimating human hand-tremor frequency from input RGB video data. Estimating tremors from video is important for non-invasive monitoring, analyzing and diagnosing patients suffering from motor-disorders such as Parkinson's disease. We consider two approaches for hand-tremor frequency estimation: (a) a Lagrangian approach where we detect the hand at every frame in the video, and estimate the tremor frequency along the trajectory; and (b) an Eulerian approach where we first localize the hand, we subsequently remove the large motion along the movement trajectory of the hand, and we use the video information over time encoded as intensity values or phase information to estimate the tremor frequency. We estimate hand tremors on a new human tremor dataset, TIM-Tremor, containing static tasks as well as a multitude of more dynamic tasks, involving larger motion of the hands. The dataset has 55 tremor patient recordings together with: associated ground truth accelerometer data from the most affected hand, RGB video data, and aligned depth data.

Submitted 10 September, 2018; originally announced September 2018.

Comments: Best paper at ECCV-2018 Workshop on Observing and Understanding Hands in Action

arXiv:1805.07170 [pdf, other]

Recurrent knowledge distillation

Authors: Silvia L. Pintea, Yue Liu, Jan C. van Gemert

Abstract: Knowledge distillation compacts deep networks by letting a small student network learn from a large teacher network. The accuracy of knowledge distillation recently benefited from adding residual layers. We propose to reduce the size of the student network even further by recasting multiple residual layers in the teacher network into a single recurrent student layer. We propose three variants of adding recurrent connections into the student network, and show experimentally on CIFAR-10, Scenes and MiniPlaces, that we can reduce the number of parameters at little loss in accuracy.

Submitted 18 May, 2018; originally announced May 2018.

Comments: International Conference on Image Processing (ICIP), 2018

arXiv:1803.06962 [pdf, other]

Featureless: Bypassing feature extraction in action categorization

Authors: Silvia L. Pintea, Pascal S. Mettes, Jan C. van Gemert, Arnold W. M. Smeulders

Abstract: This method introduces an efficient manner of learning action categories without the need of feature estimation. The approach starts from low-level values, in a similar style to the successful CNN methods. However, rather than extracting general image features, we learn to predict specific video representations from raw video data. The benefit of such an approach is that at the same computational expense it can predict 2D video representations as well as 3D ones, based on motion. The proposed model relies on discriminative Waldboost, which we enhance to a multiclass formulation for the purpose of learning video representations. The suitability of the proposed approach as well as its time efficiency are tested on the UCF11 action recognition dataset.

Submitted 19 March, 2018; originally announced March 2018.

Comments: Published in the proceedings of the International Conference on Image Processing (ICIP), 2016

arXiv:1803.06952

Asymmetric kernel in Gaussian Processes for learning target variance

Authors: Silvia L. Pintea, Jan C. van Gemert, Arnold W. M. Smeulders

Abstract: This work incorporates the multi-modality of the data distribution into a Gaussian Process regression model. We approach the problem from a discriminative perspective by learning, jointly over the training data, the target space variance in the neighborhood of a certain sample through metric learning. We start by using data centers rather than all training samples. Subsequently, each center selects an individualized kernel metric. This enables each center to adjust the kernel space in its vicinity in correspondence with the topology of the targets: a multi-modal approach. We additionally add descriptiveness by allowing each center to learn a precision matrix. We demonstrate empirically the reliability of the model.

Submitted 19 March, 2018; originally announced March 2018.

Comments: Accepted in Pattern Recognition Letters, 2018

arXiv:1803.06951

Deja Vu: Motion Prediction in Static Images

Authors: Silvia L. Pintea, Jan C. van Gemert, Arnold W. M. Smeulders

Abstract: This paper proposes motion prediction in single still images by learning it from a set of videos. The building assumption is that similar motion is characterized by similar appearance. The proposed method learns local motion patterns given a specific appearance and adds the predicted motion in a number of applications. This work (i) introduces a novel method to predict motion from appearance in a single static image, (ii) to that end, extends the Structured Random Forest with regression derived from first principles, and (iii) shows the value of adding motion predictions in different tasks such as: weak frame-proposals containing unexpected events, action recognition, motion saliency. Illustrative results indicate that motion prediction is not only feasible, but also provides valuable information for a number of applications.

Submitted 21 March, 2018; v1 submitted 19 March, 2018; originally announced March 2018.

Comments: Published in the proceedings of the European Conference on Computer Vision (ECCV), 2014

arXiv:1704.04186

Video Acceleration Magnification

Authors: Yichao Zhang, Silvia L. Pintea, Jan C. van Gemert

Abstract: The ability to amplify or reduce subtle image changes over time is useful in contexts such as video editing, medical video analysis, product quality control and sports. In these contexts there is often large motion present which severely distorts current video amplification methods that magnify change linearly. In this work we propose a method to cope with large motions while still magnifying small changes. We make the following two observations: i) large motions are linear on the temporal scale of the small changes; ii) small changes deviate from this linearity. We ignore linear motion and propose to magnify acceleration. Our method is purely Eulerian and does not require any optical flow, temporal alignment or region annotations. We link temporal second-order derivative filtering to spatial acceleration magnification. We apply our method to moving objects where we show motion magnification and color magnification. We provide quantitative as well as qualitative evidence for our method while comparing to the state-of-the-art.

Submitted 22 April, 2017; v1 submitted 13 April, 2017; originally announced April 2017.

Comments: Accepted paper at CVPR 2017. Project webpage: http://acceleration-magnification.github.io/

arXiv:1702.04125

One-Step Time-Dependent Future Video Frame Prediction with a Convolutional Encoder-Decoder Neural Network

Authors: Vedran Vukotić, Silvia-Laura Pintea, Christian Raymond, Guillaume Gravier, Jan Van Gemert

Abstract: There is an inherent need for autonomous cars, drones, and other robots to have a notion of how their environment behaves and to anticipate changes in the near future. In this work, we focus on anticipating future appearance given the current frame of a video. Existing work focuses on either predicting the future appearance as the next frame of a video, or predicting future motion as optical flow or motion trajectories starting from a single video frame. This work stretches the ability of CNNs (Convolutional Neural Networks) to predict an anticipation of appearance at an arbitrarily given future time, not necessarily the next video frame. We condition our predicted future appearance on a continuous time variable that allows us to anticipate future frames at a given temporal distance, directly from the input video frame. We show that CNNs can learn an intrinsic representation of typical appearance changes over time and successfully generate realistic predictions at a deliberate time difference in the near future.

Submitted 24 July, 2017; v1 submitted 14 February, 2017; originally announced February 2017.

Comments: 11 pages, 1 figure, published in the International Conference of Image Analysis and Processing (ICIAP) 2017 and in the Netherlands Conference on Computer Vision (NCCV) 2016

arXiv:1609.01693

Making a Case for Learning Motion Representations with Phase

Authors: S. L. Pintea, J. C. van Gemert

Abstract: This work advocates Eulerian motion representation learning over the current standard Lagrangian optical flow model. Eulerian motion is well captured by using phase, as obtained by decomposing the image through a complex-steerable pyramid. We discuss the gain of Eulerian motion in a set of practical use cases: (i) action recognition, (ii) motion prediction in static images, (iii) motion transfer in static images, and (iv) motion transfer in video. For each task we motivate the phase-based direction and provide a possible approach.

Submitted 8 September, 2016; v1 submitted 6 September, 2016; originally announced September 2016.

Comments: ECCV 2016 Workshop on Brave new ideas for motion representations in videos

Showing 1–24 of 24 results for author: Pintea, S