Search | arXiv e-print repository

arXiv:1910.09469 [pdf, other]

Object landmark discovery through unsupervised adaptation

Authors: Enrique Sanchez, Georgios Tzimiropoulos

Abstract: This paper proposes a method to ease the unsupervised learning of object landmark detectors. Similarly to previous methods, our approach is fully unsupervised in a sense that it does not require or make any use of annotated landmarks for the target object category. Contrary to previous works, we do however assume that a landmark detector, which has already learned a structured representation for a… ▽ More This paper proposes a method to ease the unsupervised learning of object landmark detectors. Similarly to previous methods, our approach is fully unsupervised in a sense that it does not require or make any use of annotated landmarks for the target object category. Contrary to previous works, we do however assume that a landmark detector, which has already learned a structured representation for a given object category in a fully supervised manner, is available. Under this setting, our main idea boils down to adapting the given pre-trained network to the target object categories in a fully unsupervised manner. To this end, our method uses the pre-trained network as a core which remains frozen and does not get updated during training, and learns, in an unsupervised manner, only a projection matrix to perform the adaptation to the target categories. By building upon an existing structured representation learned in a supervised manner, the optimization problem solved by our method is much more constrained with significantly less parameters to learn which seems to be important for the case of unsupervised learning. We show that our method surpasses fully unsupervised techniques trained from scratch as well as a strong baseline based on fine-tuning, and produces state-of-the-art results on several datasets. Code can be found at https://github.com/ESanchezLozano/SAIC-Unsupervised-landmark-detection-NeurIPS2019 . △ Less

Submitted 21 October, 2019; originally announced October 2019.

Comments: NeurIPS 2019. Code is available https://github.com/ESanchezLozano/SAIC-Unsupervised-landmark-detection-NeurIPS2019

arXiv:1909.13863 [pdf, other]

XNOR-Net++: Improved Binary Neural Networks

Authors: Adrian Bulat, Georgios Tzimiropoulos

Abstract: This paper proposes an improved training algorithm for binary neural networks in which both weights and activations are binary numbers. A key but fairly overlooked feature of the current state-of-the-art method of XNOR-Net is the use of analytically calculated real-valued scaling factors for re-weighting the output of binary convolutions. We argue that analytic calculation of these factors is sub-… ▽ More This paper proposes an improved training algorithm for binary neural networks in which both weights and activations are binary numbers. A key but fairly overlooked feature of the current state-of-the-art method of XNOR-Net is the use of analytically calculated real-valued scaling factors for re-weighting the output of binary convolutions. We argue that analytic calculation of these factors is sub-optimal. Instead, in this work, we make the following contributions: (a) we propose to fuse the activation and weight scaling factors into a single one that is learned discriminatively via backpropagation. (b) More importantly, we explore several ways of constructing the shape of the scale factors while keeping the computational budget fixed. (c) We empirically measure the accuracy of our approximations and show that they are significantly more accurate than the analytically calculated one. (d) We show that our approach significantly outperforms XNOR-Net within the same computational budget when tested on the challenging task of ImageNet classification, offering up to 6\% accuracy gain. △ Less

Submitted 30 September, 2019; originally announced September 2019.

Comments: Accepted to BMVC 2019

arXiv:1909.04951 [pdf, other]

AnimalWeb: A Large-Scale Hierarchical Dataset of Annotated Animal Faces

Authors: Muhammad Haris Khan, John McDonagh, Salman Khan, Muhammad Shahabuddin, Aditya Arora, Fahad Shahbaz Khan, Ling Shao, Georgios Tzimiropoulos

Abstract: Being heavily reliant on animals, it is our ethical obligation to improve their well-being by understanding their needs. Several studies show that animal needs are often expressed through their faces. Though remarkable progress has been made towards the automatic understanding of human faces, this has regrettably not been the case with animal faces. There exists significant room and appropriate ne… ▽ More Being heavily reliant on animals, it is our ethical obligation to improve their well-being by understanding their needs. Several studies show that animal needs are often expressed through their faces. Though remarkable progress has been made towards the automatic understanding of human faces, this has regrettably not been the case with animal faces. There exists significant room and appropriate need to develop automatic systems capable of interpreting animal faces. Among many transformative impacts, such a technology will foster better and cheaper animal healthcare, and further advance animal psychology understanding. We believe the underlying research progress is mainly obstructed by the lack of an adequately annotated dataset of animal faces, covering a wide spectrum of animal species. To this end, we introduce a large-scale, hierarchical annotated dataset of animal faces, featuring 21.9K faces from 334 diverse species and 21 animal orders across biological taxonomy. These faces are captured `in-the-wild' conditions and are consistently annotated with 9 landmarks on key facial features. The proposed dataset is structured and scalable by design; its development underwent four systematic stages involving rigorous, manual annotation effort of over 6K man-hours. We benchmark it for face alignment using the existing art under novel problem settings. Results showcase its challenging nature, unique attributes and present definite prospects for novel, adaptive, and generalized face-oriented CV algorithms. We further benchmark the dataset for face detection and fine-grained recognition tasks, to demonstrate multi-task applications and room for improvement. Experiments indicate that this dataset will push the algorithmic advancements across many related CV tasks and encourage the development of novel systems for animal facial behaviour monitoring. We will make the dataset publicly available. △ Less

Submitted 11 September, 2019; originally announced September 2019.

Comments: 15 pages, 14 figures

arXiv:1904.07852 [pdf, other]

Matrix and tensor decompositions for training binary neural networks

Authors: Adrian Bulat, Jean Kossaifi, Georgios Tzimiropoulos, Maja Pantic

Abstract: This paper is on improving the training of binary neural networks in which both activations and weights are binary. While prior methods for neural network binarization binarize each filter independently, we propose to instead parametrize the weight tensor of each layer using matrix or tensor decomposition. The binarization process is then performed using this latent parametrization, via a quantiza… ▽ More This paper is on improving the training of binary neural networks in which both activations and weights are binary. While prior methods for neural network binarization binarize each filter independently, we propose to instead parametrize the weight tensor of each layer using matrix or tensor decomposition. The binarization process is then performed using this latent parametrization, via a quantization function (e.g. sign function) applied to the reconstructed weights. A key feature of our method is that while the reconstruction is binarized, the computation in the latent factorized space is done in the real domain. This has several advantages: (i) the latent factorization enforces a coupling of the filters before binarization, which significantly improves the accuracy of the trained models. (ii) while at training time, the binary weights of each convolutional layer are parametrized using real-valued matrix or tensor decomposition, during inference we simply use the reconstructed (binary) weights. As a result, our method does not sacrifice any advantage of binary networks in terms of model compression and speeding-up inference. As a further contribution, instead of computing the binary weight scaling factors analytically, as in prior work, we propose to learn them discriminatively via back-propagation. Finally, we show that our approach significantly outperforms existing methods when tested on the challenging tasks of (a) human pose estimation (more than 4% improvements) and (b) ImageNet classification (up to 5% performance gains). △ Less

Submitted 16 April, 2019; originally announced April 2019.

arXiv:1904.06345 [pdf, other]

Incremental multi-domain learning with network latent tensor factorization

Authors: Adrian Bulat, Jean Kossaifi, Georgios Tzimiropoulos, Maja Pantic

Abstract: The prominence of deep learning, large amount of annotated data and increasingly powerful hardware made it possible to reach remarkable performance for supervised classification tasks, in many cases saturating the training sets. However the resulting models are specialized to a single very specific task and domain. Adapting the learned classification to new domains is a hard problem due to at leas… ▽ More The prominence of deep learning, large amount of annotated data and increasingly powerful hardware made it possible to reach remarkable performance for supervised classification tasks, in many cases saturating the training sets. However the resulting models are specialized to a single very specific task and domain. Adapting the learned classification to new domains is a hard problem due to at least three reasons: (1) the new domains and the tasks might be drastically different; (2) there might be very limited amount of annotated data on the new domain and (3) full training of a new model for each new task is prohibitive in terms of computation and memory, due to the sheer number of parameters of deep CNNs. In this paper, we present a method to learn new-domains and tasks incrementally, building on prior knowledge from already learned tasks and without catastrophic forgetting. We do so by jointly parametrizing weights across layers using low-rank Tucker structure. The core is task agnostic while a set of task specific factors are learnt on each new domain. We show that leveraging tensor structure enables better performance than simply using matrix operations. Joint tensor modelling also naturally leverages correlations across different layers. Compared with previous methods which have focused on adapting each layer separately, our approach results in more compact representations for each new task/domain. We apply the proposed method to the 10 datasets of the Visual Decathlon Challenge and show that our method offers on average about 7.5x reduction in number of parameters and competitive performance in terms of both classification accuracy and Decathlon score. △ Less

Submitted 22 November, 2019; v1 submitted 12 April, 2019; originally announced April 2019.

Comments: AAAI20

arXiv:1904.05868 [pdf, other]

Improved training of binary networks for human pose estimation and image recognition

Authors: Adrian Bulat, Georgios Tzimiropoulos, Jean Kossaifi, Maja Pantic

Abstract: Big neural networks trained on large datasets have advanced the state-of-the-art for a large variety of challenging problems, improving performance by a large margin. However, under low memory and limited computational power constraints, the accuracy on the same problems drops considerable. In this paper, we propose a series of techniques that significantly improve the accuracy of binarized neural… ▽ More Big neural networks trained on large datasets have advanced the state-of-the-art for a large variety of challenging problems, improving performance by a large margin. However, under low memory and limited computational power constraints, the accuracy on the same problems drops considerable. In this paper, we propose a series of techniques that significantly improve the accuracy of binarized neural networks (i.e networks where both the features and the weights are binary). We evaluate the proposed improvements on two diverse tasks: fine-grained recognition (human pose estimation) and large-scale image recognition (ImageNet classification). Specifically, we introduce a series of novel methodological changes including: (a) more appropriate activation functions, (b) reverse-order initialization, (c) progressive quantization, and (d) network stacking and show that these additions improve existing state-of-the-art network binarization techniques, significantly. Additionally, for the first time, we also investigate the extent to which network binarization and knowledge distillation can be combined. When tested on the challenging MPII dataset, our method shows a performance improvement of more than 4% in absolute terms. Finally, we further validate our findings by applying the proposed techniques for large-scale object recognition on the Imagenet dataset, on which we report a reduction of error rate by 4%. △ Less

Submitted 11 April, 2019; originally announced April 2019.

arXiv:1904.02698 [pdf, other]

T-Net: Parametrizing Fully Convolutional Nets with a Single High-Order Tensor

Authors: Jean Kossaifi, Adrian Bulat, Georgios Tzimiropoulos, Maja Pantic

Abstract: Recent findings indicate that over-parametrization, while crucial for successfully training deep neural networks, also introduces large amounts of redundancy. Tensor methods have the potential to efficiently parametrize over-complete representations by leveraging this redundancy. In this paper, we propose to fully parametrize Convolutional Neural Networks (CNNs) with a single high-order, low-rank… ▽ More Recent findings indicate that over-parametrization, while crucial for successfully training deep neural networks, also introduces large amounts of redundancy. Tensor methods have the potential to efficiently parametrize over-complete representations by leveraging this redundancy. In this paper, we propose to fully parametrize Convolutional Neural Networks (CNNs) with a single high-order, low-rank tensor. Previous works on network tensorization have focused on parametrizing individual layers (convolutional or fully connected) only, and perform the tensorization layer-by-layer separately. In contrast, we propose to jointly capture the full structure of a neural network by parametrizing it with a single high-order tensor, the modes of which represent each of the architectural design parameters of the network (e.g. number of convolutional blocks, depth, number of stacks, input features, etc). This parametrization allows to regularize the whole network and drastically reduce the number of parameters. Our model is end-to-end trainable and the low-rank structure imposed on the weight tensor acts as an implicit regularization. We study the case of networks with rich structure, namely Fully Convolutional Networks (FCNs), which we propose to parametrize with a single 8th-order tensor. We show that our approach can achieve superior performance with small compression rates, and attain high compression rates with negligible drop in accuracy for the challenging task of human pose estimation. △ Less

Submitted 4 April, 2019; originally announced April 2019.

Comments: CVPR 2019

arXiv:1812.05082 [pdf, other]

Features Extraction Based on an Origami Representation of 3D Landmarks

Authors: Juan Manuel Fernandez Montenegro, Mahdi Maktab Dar Oghaz, Athanasios Gkelias, Georgios Tzimiropoulos, Vasileios Argyriou

Abstract: Feature extraction analysis has been widely investigated during the last decades in computer vision community due to the large range of possible applications. Significant work has been done in order to improve the performance of the emotion detection methods. Classification algorithms have been refined, novel preprocessing techniques have been applied and novel representations from images and vide… ▽ More Feature extraction analysis has been widely investigated during the last decades in computer vision community due to the large range of possible applications. Significant work has been done in order to improve the performance of the emotion detection methods. Classification algorithms have been refined, novel preprocessing techniques have been applied and novel representations from images and videos have been introduced. In this paper, we propose a preprocessing method and a novel facial landmarks' representation aiming to improve the facial emotion detection accuracy. We apply our novel methodology on the extended Cohn-Kanade (CK+) dataset and other datasets for affect classification based on Action Units (AU). The performance evaluation demonstrates an improvement on facial emotion classification (accuracy and F1 score) that indicates the superiority of the proposed methodology. △ Less

Submitted 12 December, 2018; originally announced December 2018.

arXiv:1812.02486 [pdf, other]

Learning to Infer the Depth Map of a Hand from its Color Image

Authors: Vassilis C. Nicodemou, Iason Oikonomidis, Georgios Tzimiropoulos, Antonis Argyros

Abstract: We propose the first approach to the problem of inferring the depth map of a human hand based on a single RGB image. We achieve this with a Convolutional Neural Network (CNN) that employs a stacked hourglass model as its main building block. Intermediate supervision is used in several outputs of the proposed architecture in a staged approach. To aid the process of training and inference, hand segm… ▽ More We propose the first approach to the problem of inferring the depth map of a human hand based on a single RGB image. We achieve this with a Convolutional Neural Network (CNN) that employs a stacked hourglass model as its main building block. Intermediate supervision is used in several outputs of the proposed architecture in a staged approach. To aid the process of training and inference, hand segmentation masks are also estimated in such an intermediate supervision step, and used to guide the subsequent depth estimation process. In order to train and evaluate the proposed method we compile and make publicly available HandRGBD, a new dataset of 20,601 views of hands, each consisting of an RGB image and an aligned depth map. Based on HandRGBD, we explore variants of the proposed approach in an ablative study and determine the best performing one. The results of an extensive experimental evaluation demonstrate that hand depth estimation from a single RGB frame can be achieved with an accuracy of 22mm, which is comparable to the accuracy achieved by contemporary low-cost depth cameras. Such a 3D reconstruction of hands based on RGB information is valuable as a final result on its own right, but also as an input to several other hand analysis and perception algorithms that require depth input. Essentially, in such a context, the proposed approach bridges the gap between RGB and RGBD, by making all existing RGBD-based methods applicable to RGB input. △ Less

Submitted 6 December, 2018; originally announced December 2018.

arXiv:1811.01194 [pdf, other]

Pushing the boundaries of audiovisual word recognition using Residual Networks and LSTMs

Authors: Themos Stafylakis, Muhammad Haris Khan, Georgios Tzimiropoulos

Abstract: Visual and audiovisual speech recognition are witnessing a renaissance which is largely due to the advent of deep learning methods. In this paper, we present a deep learning architecture for lipreading and audiovisual word recognition, which combines Residual Networks equipped with spatiotemporal input layers and Bidirectional LSTMs. The lipreading architecture attains 11.92% misclassification rat… ▽ More Visual and audiovisual speech recognition are witnessing a renaissance which is largely due to the advent of deep learning methods. In this paper, we present a deep learning architecture for lipreading and audiovisual word recognition, which combines Residual Networks equipped with spatiotemporal input layers and Bidirectional LSTMs. The lipreading architecture attains 11.92% misclassification rate on the challenging Lipreading-In-The-Wild database, which is composed of excerpts from BBC-TV, each containing one of the 500 target words. Audiovisual experiments are performed using both intermediate and late integration, as well as several types and levels of environmental noise, and notable improvements over the audio-only network are reported, even in the case of clean speech. A further analysis on the utility of target word boundaries is provided, as well as on the capacity of the network in modeling the linguistic context of the target word. Finally, we examine difficult word pairs and discuss how visual information helps towards attaining higher recognition accuracy. △ Less

Submitted 3 November, 2018; originally announced November 2018.

Comments: Accepted to Computer Vision and Image Understanding (Elsevier)

arXiv:1810.00108 [pdf, other]

Audio-Visual Speech Recognition With A Hybrid CTC/Attention Architecture

Authors: Stavros Petridis, Themos Stafylakis, Pingchuan Ma, Georgios Tzimiropoulos, Maja Pantic

Abstract: Recent works in speech recognition rely either on connectionist temporal classification (CTC) or sequence-to-sequence models for character-level recognition. CTC assumes conditional independence of individual characters, whereas attention-based models can provide nonsequential alignments. Therefore, we could use a CTC loss in combination with an attention-based model in order to force monotonic al… ▽ More Recent works in speech recognition rely either on connectionist temporal classification (CTC) or sequence-to-sequence models for character-level recognition. CTC assumes conditional independence of individual characters, whereas attention-based models can provide nonsequential alignments. Therefore, we could use a CTC loss in combination with an attention-based model in order to force monotonic alignments and at the same time get rid of the conditional independence assumption. In this paper, we use the recently proposed hybrid CTC/attention architecture for audio-visual recognition of speech in-the-wild. To the best of our knowledge, this is the first time that such a hybrid architecture architecture is used for audio-visual recognition of speech. We use the LRS2 database and show that the proposed audio-visual model leads to an 1.3% absolute decrease in word error rate over the audio-only model and achieves the new state-of-the-art performance on LRS2 database (7% word error rate). We also observe that the audio-visual model significantly outperforms the audio-based model (up to 32.9% absolute improvement in word error rate) for several different types of noise as the signal-to-noise ratio decreases. △ Less

Submitted 28 September, 2018; originally announced October 2018.

Comments: Accepted to IEEE SLT 2018

arXiv:1809.03770 [pdf, other]

3D Human Body Reconstruction from a Single Image via Volumetric Regression

Authors: Aaron S. Jackson, Chris Manafas, Georgios Tzimiropoulos

Abstract: This paper proposes the use of an end-to-end Convolutional Neural Network for direct reconstruction of the 3D geometry of humans via volumetric regression. The proposed method does not require the fitting of a shape model and can be trained to work from a variety of input types, whether it be landmarks, images or segmentation masks. Additionally, non-visible parts, either self-occluded or otherwis… ▽ More This paper proposes the use of an end-to-end Convolutional Neural Network for direct reconstruction of the 3D geometry of humans via volumetric regression. The proposed method does not require the fitting of a shape model and can be trained to work from a variety of input types, whether it be landmarks, images or segmentation masks. Additionally, non-visible parts, either self-occluded or otherwise, are still reconstructed, which is not the case with depth map regression. We present results that show that our method can handle both pose variation and detailed reconstruction given appropriate datasets for training. △ Less

Submitted 11 September, 2018; originally announced September 2018.

Comments: Accepted to ECCV Workshops (PeopleCap) 2018

arXiv:1808.04803 [pdf, other]

doi 10.1109/TPAMI.2018.2866051

Hierarchical binary CNNs for landmark localization with limited resources

Authors: Adrian Bulat, Georgios Tzimiropoulos

Abstract: Our goal is to design architectures that retain the groundbreaking performance of Convolutional Neural Networks (CNNs) for landmark localization and at the same time are lightweight, compact and suitable for applications with limited computational resources. To this end, we make the following contributions: (a) we are the first to study the effect of neural network binarization on localization tas… ▽ More Our goal is to design architectures that retain the groundbreaking performance of Convolutional Neural Networks (CNNs) for landmark localization and at the same time are lightweight, compact and suitable for applications with limited computational resources. To this end, we make the following contributions: (a) we are the first to study the effect of neural network binarization on localization tasks, namely human pose estimation and face alignment. We exhaustively evaluate various design choices, identify performance bottlenecks, and more importantly propose multiple orthogonal ways to boost performance. (b) Based on our analysis, we propose a novel hierarchical, parallel and multi-scale residual architecture that yields large performance improvement over the standard bottleneck block while having the same number of parameters, thus bridging the gap between the original network and its binarized counterpart. (c) We perform a large number of ablation studies that shed light on the properties and the performance of the proposed block. (d) We present results for experiments on the most challenging datasets for human pose estimation and face alignment, reporting in many cases state-of-the-art performance. (e) We further provide additional results for the problem of facial part segmentation. Code can be downloaded from https://www.adrianbulat.com/binary-cnn-landmark △ Less

Submitted 14 August, 2018; originally announced August 2018.

Comments: Accepted to IEEE TPAMI18: Best of ICCV 2017 SI. Previously portions of this work appeared as arXiv:1703.00862, which was the conference version

arXiv:1807.11458 [pdf, other]

To learn image super-resolution, use a GAN to learn how to do image degradation first

Authors: Adrian Bulat, Jing Yang, Georgios Tzimiropoulos

Abstract: This paper is on image and face super-resolution. The vast majority of prior work for this problem focus on how to increase the resolution of low-resolution images which are artificially generated by simple bilinear down-sampling (or in a few cases by blurring followed by down-sampling).We show that such methods fail to produce good results when applied to real-world low-resolution, low quality im… ▽ More This paper is on image and face super-resolution. The vast majority of prior work for this problem focus on how to increase the resolution of low-resolution images which are artificially generated by simple bilinear down-sampling (or in a few cases by blurring followed by down-sampling).We show that such methods fail to produce good results when applied to real-world low-resolution, low quality images. To circumvent this problem, we propose a two-stage process which firstly trains a High-to-Low Generative Adversarial Network (GAN) to learn how to degrade and downsample high-resolution images requiring, during training, only unpaired high and low-resolution images. Once this is achieved, the output of this network is used to train a Low-to-High GAN for image super-resolution using this time paired low- and high-resolution images. Our main result is that this network can be now used to efectively increase the quality of real-world low-resolution images. We have applied the proposed pipeline for the problem of face super-resolution where we report large improvement over baselines and prior work although the proposed method is potentially applicable to other object categories. △ Less

Submitted 30 July, 2018; originally announced July 2018.

Comments: Accepted to ECCV18

arXiv:1807.08469 [pdf, other]

Zero-shot keyword spotting for visual speech recognition in-the-wild

Authors: Themos Stafylakis, Georgios Tzimiropoulos

Abstract: Visual keyword spotting (KWS) is the problem of estimating whether a text query occurs in a given recording using only video information. This paper focuses on visual KWS for words unseen during training, a real-world, practical setting which so far has received no attention by the community. To this end, we devise an end-to-end architecture comprising (a) a state-of-the-art visual feature extract… ▽ More Visual keyword spotting (KWS) is the problem of estimating whether a text query occurs in a given recording using only video information. This paper focuses on visual KWS for words unseen during training, a real-world, practical setting which so far has received no attention by the community. To this end, we devise an end-to-end architecture comprising (a) a state-of-the-art visual feature extractor based on spatiotemporal Residual Networks, (b) a grapheme-to-phoneme model based on sequence-to-sequence neural networks, and (c) a stack of recurrent neural networks which learn how to correlate visual features with the keyword representation. Different to prior works on KWS, which try to learn word representations merely from sequences of graphemes (i.e. letters), we propose the use of a grapheme-to-phoneme encoder-decoder model which learns how to map words to their pronunciation. We demonstrate that our system obtains very promising visual-only KWS results on the challenging LRS2 database, for keywords unseen during training. We also show that our system outperforms a baseline which addresses KWS via automatic speech recognition (ASR), while it drastically improves over other recently proposed ASR-free KWS methods. △ Less

Submitted 25 July, 2018; v1 submitted 23 July, 2018; originally announced July 2018.

Comments: Accepted at ECCV-2018

arXiv:1805.03487 [pdf, other]

Joint Action Unit localisation and intensity estimation through heatmap regression

Authors: Enrique Sanchez-Lozano, Georgios Tzimiropoulos, Michel Valstar

Abstract: This paper proposes a supervised learning approach to jointly perform facial Action Unit (AU) localisation and intensity estimation. Contrary to previous works that try to learn an unsupervised representation of the Action Unit regions, we propose to directly and jointly estimate all AU intensities through heatmap regression, along with the location in the face where they cause visible changes. Ou… ▽ More This paper proposes a supervised learning approach to jointly perform facial Action Unit (AU) localisation and intensity estimation. Contrary to previous works that try to learn an unsupervised representation of the Action Unit regions, we propose to directly and jointly estimate all AU intensities through heatmap regression, along with the location in the face where they cause visible changes. Our approach aims to learn a pixel-wise regression function returning a score per AU, which indicates an AU intensity at a given spatial location. Heatmap regression then generates an image, or channel, per AU, in which each pixel indicates the corresponding AU intensity. To generate the ground-truth heatmaps for a target AU, the facial landmarks are first estimated, and a 2D Gaussian is drawn around the points where the AU is known to cause changes. The amplitude and size of the Gaussian is determined by the intensity of the AU. We show that using a single Hourglass network suffices to attain new state of the art results, demonstrating the effectiveness of such a simple approach. The use of heatmap regression allows learning of a shared representation between AUs without the need to rely on latent representations, as these are implicitly learned from the data. We validate the proposed approach on the BP4D dataset, showing a modest improvement on recent, complex, techniques, as well as robustness against misalignment errors. Code for testing and models will be available to download from https://github.com/ESanchezLozano/Action-Units-Heatmaps. △ Less

Submitted 20 July, 2018; v1 submitted 9 May, 2018; originally announced May 2018.

Comments: BMVC 2018. Code and model will be available to download from https://github.com/ESanchezLozano/Action-Units-Heatmaps

arXiv:1802.06424 [pdf, ps, other]

End-to-end Audiovisual Speech Recognition

Authors: Stavros Petridis, Themos Stafylakis, Pingchuan Ma, Feipeng Cai, Georgios Tzimiropoulos, Maja Pantic

Abstract: Several end-to-end deep learning approaches have been recently presented which extract either audio or visual features from the input images or audio signals and perform speech recognition. However, research on end-to-end audiovisual models is very limited. In this work, we present an end-to-end audiovisual model based on residual networks and Bidirectional Gated Recurrent Units (BGRUs). To the be… ▽ More Several end-to-end deep learning approaches have been recently presented which extract either audio or visual features from the input images or audio signals and perform speech recognition. However, research on end-to-end audiovisual models is very limited. In this work, we present an end-to-end audiovisual model based on residual networks and Bidirectional Gated Recurrent Units (BGRUs). To the best of our knowledge, this is the first audiovisual fusion model which simultaneously learns to extract features directly from the image pixels and audio waveforms and performs within-context word recognition on a large publicly available dataset (LRW). The model consists of two streams, one for each modality, which extract features directly from mouth regions and raw waveforms. The temporal dynamics in each stream/modality are modeled by a 2-layer BGRU and the fusion of multiple streams/modalities takes place via another 2-layer BGRU. A slight improvement in the classification rate over an end-to-end audio-only and MFCC-based model is reported in clean audio conditions and low levels of noise. In presence of high levels of noise, the end-to-end audiovisual model significantly outperforms both audio-only models. △ Less

Submitted 22 February, 2018; v1 submitted 18 February, 2018; originally announced February 2018.

Comments: Accepted to ICASSP 2018

arXiv:1712.02765 [pdf, other]

Super-FAN: Integrated facial landmark localization and super-resolution of real-world low resolution faces in arbitrary poses with GANs

Authors: Adrian Bulat, Georgios Tzimiropoulos

Abstract: This paper addresses 2 challenging tasks: improving the quality of low resolution facial images and accurately locating the facial landmarks on such poor resolution images. To this end, we make the following 5 contributions: (a) we propose Super-FAN: the very first end-to-end system that addresses both tasks simultaneously, i.e. both improves face resolution and detects the facial landmarks. The n… ▽ More This paper addresses 2 challenging tasks: improving the quality of low resolution facial images and accurately locating the facial landmarks on such poor resolution images. To this end, we make the following 5 contributions: (a) we propose Super-FAN: the very first end-to-end system that addresses both tasks simultaneously, i.e. both improves face resolution and detects the facial landmarks. The novelty or Super-FAN lies in incorporating structural information in a GAN-based super-resolution algorithm via integrating a sub-network for face alignment through heatmap regression and optimizing a novel heatmap loss. (b) We illustrate the benefit of training the two networks jointly by reporting good results not only on frontal images (as in prior work) but on the whole spectrum of facial poses, and not only on synthetic low resolution images (as in prior work) but also on real-world images. (c) We improve upon the state-of-the-art in face super-resolution by proposing a new residual-based architecture. (d) Quantitatively, we show large improvement over the state-of-the-art for both face super-resolution and alignment. (e) Qualitatively, we show for the first time good results on real-world low resolution images. △ Less

Submitted 27 March, 2018; v1 submitted 7 December, 2017; originally announced December 2017.

Comments: CVPR 2018 SPOTLIGHT

arXiv:1710.11201 [pdf, other]

Deep word embeddings for visual speech recognition

Authors: Themos Stafylakis, Georgios Tzimiropoulos

Abstract: In this paper we present a deep learning architecture for extracting word embeddings for visual speech recognition. The embeddings summarize the information of the mouth region that is relevant to the problem of word recognition, while suppressing other types of variability such as speaker, pose and illumination. The system is comprised of a spatiotemporal convolutional layer, a Residual Network a… ▽ More In this paper we present a deep learning architecture for extracting word embeddings for visual speech recognition. The embeddings summarize the information of the mouth region that is relevant to the problem of word recognition, while suppressing other types of variability such as speaker, pose and illumination. The system is comprised of a spatiotemporal convolutional layer, a Residual Network and bidirectional LSTMs and is trained on the Lipreading in-the-wild database. We first show that the proposed architecture goes beyond state-of-the-art on closed-set word identification, by attaining 11.92% error rate on a vocabulary of 500 words. We then examine the capacity of the embeddings in modelling words unseen during training. We deploy Probabilistic Linear Discriminant Analysis (PLDA) to model the embeddings and perform low-shot learning experiments on words unseen during training. The experiments demonstrate that word-level visual speech recognition is feasible even in cases where the target words are not included in the training set. △ Less

Submitted 30 October, 2017; originally announced October 2017.

arXiv:1703.07834 [pdf, other]

Large Pose 3D Face Reconstruction from a Single Image via Direct Volumetric CNN Regression

Authors: Aaron S. Jackson, Adrian Bulat, Vasileios Argyriou, Georgios Tzimiropoulos

Abstract: 3D face reconstruction is a fundamental Computer Vision problem of extraordinary difficulty. Current systems often assume the availability of multiple facial images (sometimes from the same subject) as input, and must address a number of methodological challenges such as establishing dense correspondences across large facial poses, expressions, and non-uniform illumination. In general these method… ▽ More 3D face reconstruction is a fundamental Computer Vision problem of extraordinary difficulty. Current systems often assume the availability of multiple facial images (sometimes from the same subject) as input, and must address a number of methodological challenges such as establishing dense correspondences across large facial poses, expressions, and non-uniform illumination. In general these methods require complex and inefficient pipelines for model building and fitting. In this work, we propose to address many of these limitations by training a Convolutional Neural Network (CNN) on an appropriate dataset consisting of 2D images and 3D facial models or scans. Our CNN works with just a single 2D facial image, does not require accurate alignment nor establishes dense correspondence between images, works for arbitrary facial poses and expressions, and can be used to reconstruct the whole 3D facial geometry (including the non-visible parts of the face) bypassing the construction (during training) and fitting (during testing) of a 3D Morphable Model. We achieve this via a simple CNN architecture that performs direct regression of a volumetric representation of the 3D facial geometry from a single 2D image. We also demonstrate how the related task of facial landmark localization can be incorporated into the proposed framework and help improve reconstruction quality, especially for the cases of large poses and facial expressions. Testing code will be made available online, along with pre-trained models http://aaronsplace.co.uk/papers/jackson2017recon △ Less

Submitted 8 September, 2017; v1 submitted 22 March, 2017; originally announced March 2017.

Comments: 10 pages, ICCV 2017

arXiv:1703.07332 [pdf, other]

doi 10.1109/ICCV.2017.116

How far are we from solving the 2D & 3D Face Alignment problem? (and a dataset of 230,000 3D facial landmarks)

Authors: Adrian Bulat, Georgios Tzimiropoulos

Abstract: This paper investigates how far a very deep neural network is from attaining close to saturating performance on existing 2D and 3D face alignment datasets. To this end, we make the following 5 contributions: (a) we construct, for the first time, a very strong baseline by combining a state-of-the-art architecture for landmark localization with a state-of-the-art residual block, train it on a very l… ▽ More This paper investigates how far a very deep neural network is from attaining close to saturating performance on existing 2D and 3D face alignment datasets. To this end, we make the following 5 contributions: (a) we construct, for the first time, a very strong baseline by combining a state-of-the-art architecture for landmark localization with a state-of-the-art residual block, train it on a very large yet synthetically expanded 2D facial landmark dataset and finally evaluate it on all other 2D facial landmark datasets. (b) We create a guided by 2D landmarks network which converts 2D landmark annotations to 3D and unifies all existing datasets, leading to the creation of LS3D-W, the largest and most challenging 3D facial landmark dataset to date ~230,000 images. (c) Following that, we train a neural network for 3D face alignment and evaluate it on the newly introduced LS3D-W. (d) We further look into the effect of all "traditional" factors affecting face alignment performance like large pose, initialization and resolution, and introduce a "new" one, namely the size of the network. (e) We show that both 2D and 3D face alignment networks achieve performance of remarkable accuracy which is probably close to saturating the datasets used. Training and testing code as well as the dataset can be downloaded from https://www.adrianbulat.com/face-alignment/ △ Less

Submitted 7 September, 2017; v1 submitted 21 March, 2017; originally announced March 2017.

Comments: accepted to ICCV 2017

arXiv:1703.04105 [pdf, other]

Combining Residual Networks with LSTMs for Lipreading

Authors: Themos Stafylakis, Georgios Tzimiropoulos

Abstract: We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system is a combination of spatiotemporal convolutional, residual and bidirectional Long Short-Term Memory networks. We train and evaluate it on the Lipreading In-The-Wild benchmark, a challenging database of 500-size target-words consisting of 1.28sec video excerpts from BBC TV broadcasts. The propos… ▽ More We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system is a combination of spatiotemporal convolutional, residual and bidirectional Long Short-Term Memory networks. We train and evaluate it on the Lipreading In-The-Wild benchmark, a challenging database of 500-size target-words consisting of 1.28sec video excerpts from BBC TV broadcasts. The proposed network attains word accuracy equal to 83.0, yielding 6.8 absolute improvement over the current state-of-the-art, without using information about word boundaries during training or testing. △ Less

Submitted 8 September, 2017; v1 submitted 12 March, 2017; originally announced March 2017.

Comments: Submitted to Interspeech 2017

arXiv:1703.00862 [pdf, other]

doi 10.1109/ICCV.2017.400

Binarized Convolutional Landmark Localizers for Human Pose Estimation and Face Alignment with Limited Resources

Authors: Adrian Bulat, Georgios Tzimiropoulos

Abstract: Our goal is to design architectures that retain the groundbreaking performance of CNNs for landmark localization and at the same time are lightweight, compact and suitable for applications with limited computational resources. To this end, we make the following contributions: (a) we are the first to study the effect of neural network binarization on localization tasks, namely human pose estimation… ▽ More Our goal is to design architectures that retain the groundbreaking performance of CNNs for landmark localization and at the same time are lightweight, compact and suitable for applications with limited computational resources. To this end, we make the following contributions: (a) we are the first to study the effect of neural network binarization on localization tasks, namely human pose estimation and face alignment. We exhaustively evaluate various design choices, identify performance bottlenecks, and more importantly propose multiple orthogonal ways to boost performance. (b) Based on our analysis, we propose a novel hierarchical, parallel and multi-scale residual architecture that yields large performance improvement over the standard bottleneck block while having the same number of parameters, thus bridging the gap between the original network and its binarized counterpart. (c) We perform a large number of ablation studies that shed light on the properties and the performance of the proposed block. (d) We present results for experiments on the most challenging datasets for human pose estimation and face alignment, reporting in many cases state-of-the-art performance. Code can be downloaded from https://www.adrianbulat.com/binary-cnn-landmarks △ Less

Submitted 7 August, 2017; v1 submitted 2 March, 2017; originally announced March 2017.

Comments: ICCV 2017 Oral

arXiv:1612.02203 [pdf, other]

doi 10.1109/TPAMI.2017.2745568

A Functional Regression approach to Facial Landmark Tracking

Authors: Enrique Sánchez-Lozano, Georgios Tzimiropoulos, Brais Martinez, Fernando De la Torre, Michel Valstar

Abstract: Linear regression is a fundamental building block in many face detection and tracking algorithms, typically used to predict shape displacements from image features through a linear mapping. This paper presents a Functional Regression solution to the least squares problem, which we coin Continuous Regression, resulting in the first real-time incremental face tracker. Contrary to prior work in Funct… ▽ More Linear regression is a fundamental building block in many face detection and tracking algorithms, typically used to predict shape displacements from image features through a linear mapping. This paper presents a Functional Regression solution to the least squares problem, which we coin Continuous Regression, resulting in the first real-time incremental face tracker. Contrary to prior work in Functional Regression, in which B-splines or Fourier series were used, we propose to approximate the input space by its first-order Taylor expansion, yielding a closed-form solution for the continuous domain of displacements. We then extend the continuous least squares problem to correlated variables, and demonstrate the generalisation of our approach. We incorporate Continuous Regression into the cascaded regression framework, and show its computational benefits for both training and testing. We then present a fast approach for incremental learning within Cascaded Continuous Regression, coined iCCR, and show that its complexity allows real-time face tracking, being 20 times faster than the state of the art. To the best of our knowledge, this is the first incremental face tracker that is shown to operate in real-time. We show that iCCR achieves state-of-the-art performance on the 300-VW dataset, the most recent, large-scale benchmark for face tracking. △ Less

Submitted 20 September, 2017; v1 submitted 7 December, 2016; originally announced December 2016.

Comments: Accepted at IEEE TPAMI. This is authors' version. 0162-8828 ©2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017

arXiv:1609.09642 [pdf, other]

A CNN Cascade for Landmark Guided Semantic Part Segmentation

Authors: Aaron Jackson, Michel Valstar, Georgios Tzimiropoulos

Abstract: This paper proposes a CNN cascade for semantic part segmentation guided by pose-specific information encoded in terms of a set of landmarks (or keypoints). There is large amount of prior work on each of these tasks separately, yet, to the best of our knowledge, this is the first time in literature that the interplay between pose estimation and semantic part segmentation is investigated. To address… ▽ More This paper proposes a CNN cascade for semantic part segmentation guided by pose-specific information encoded in terms of a set of landmarks (or keypoints). There is large amount of prior work on each of these tasks separately, yet, to the best of our knowledge, this is the first time in literature that the interplay between pose estimation and semantic part segmentation is investigated. To address this limitation of prior work, in this paper, we propose a CNN cascade of tasks that firstly performs landmark localisation and then uses this information as input for guiding semantic part segmentation. We applied our architecture to the problem of facial part segmentation and report large performance improvement over the standard unguided network on the most challenging face datasets. Testing code and models will be published online at http://cs.nott.ac.uk/~psxasj/. △ Less

Submitted 30 September, 2016; originally announced September 2016.

Comments: accepted to Geometry Meets Deep Learning ECCV 2016 Workshop

arXiv:1609.09545 [pdf, other]

doi 10.1007/978-3-319-48881-3_43

Two-stage Convolutional Part Heatmap Regression for the 1st 3D Face Alignment in the Wild (3DFAW) Challenge

Authors: Adrian Bulat, Georgios Tzimiropoulos

Abstract: This paper describes our submission to the 1st 3D Face Alignment in the Wild (3DFAW) Challenge. Our method builds upon the idea of convolutional part heatmap regression [1], extending it for 3D face alignment. Our method decomposes the problem into two parts: (a) X,Y (2D) estimation and (b) Z (depth) estimation. At the first stage, our method estimates the X,Y coordinates of the facial landmarks b… ▽ More This paper describes our submission to the 1st 3D Face Alignment in the Wild (3DFAW) Challenge. Our method builds upon the idea of convolutional part heatmap regression [1], extending it for 3D face alignment. Our method decomposes the problem into two parts: (a) X,Y (2D) estimation and (b) Z (depth) estimation. At the first stage, our method estimates the X,Y coordinates of the facial landmarks by producing a set of 2D heatmaps, one for each landmark, using convolutional part heatmap regression. Then, these heatmaps, alongside the input RGB image, are used as input to a very deep subnetwork trained via residual learning for regressing the Z coordinate. Our method ranked 1st in the 3DFAW Challenge, surpassing the second best result by more than 22%. △ Less

Submitted 29 September, 2016; originally announced September 2016.

Comments: Winner of 3D Face Alignment in the Wild (3DFAW) Challenge, ECCV 2016

arXiv:1609.01743 [pdf, other]

doi 10.1007/978-3-319-46478-7_44

Human pose estimation via Convolutional Part Heatmap Regression

Authors: Adrian Bulat, Georgios Tzimiropoulos

Abstract: This paper is on human pose estimation using Convolutional Neural Networks. Our main contribution is a CNN cascaded architecture specifically designed for learning part relationships and spatial context, and robustly inferring pose even for the case of severe part occlusions. To this end, we propose a detection-followed-by-regression CNN cascade. The first part of our cascade outputs part detectio… ▽ More This paper is on human pose estimation using Convolutional Neural Networks. Our main contribution is a CNN cascaded architecture specifically designed for learning part relationships and spatial context, and robustly inferring pose even for the case of severe part occlusions. To this end, we propose a detection-followed-by-regression CNN cascade. The first part of our cascade outputs part detection heatmaps and the second part performs regression on these heatmaps. The benefits of the proposed architecture are multi-fold: It guides the network where to focus in the image and effectively encodes part constraints and context. More importantly, it can effectively cope with occlusions because part detection heatmaps for occluded parts provide low confidence scores which subsequently guide the regression part of our network to rely on contextual information in order to predict the location of these parts. Additionally, we show that the proposed cascade is flexible enough to readily allow the integration of various CNN architectures for both detection and regression, including recent ones based on residual learning. Finally, we illustrate that our cascade achieves top performance on the MPII and LSP data sets. Code can be downloaded from http://www.cs.nott.ac.uk/~psxab5/ △ Less

Submitted 6 September, 2016; originally announced September 2016.

Comments: accepted to ECCV 2016

arXiv:1608.01137 [pdf, other]

Cascaded Continuous Regression for Real-time Incremental Face Tracking

Authors: Enrique Sánchez-Lozano, Brais Martinez, Georgios Tzimiropoulos, Michel Valstar

Abstract: This paper introduces a novel real-time algorithm for facial landmark tracking. Compared to detection, tracking has both additional challenges and opportunities. Arguably the most important aspect in this domain is updating a tracker's models as tracking progresses, also known as incremental (face) tracking. While this should result in more accurate localisation, how to do this online and in real… ▽ More This paper introduces a novel real-time algorithm for facial landmark tracking. Compared to detection, tracking has both additional challenges and opportunities. Arguably the most important aspect in this domain is updating a tracker's models as tracking progresses, also known as incremental (face) tracking. While this should result in more accurate localisation, how to do this online and in real time without causing a tracker to drift is still an important open research question. We address this question in the cascaded regression framework, the state-of-the-art approach for facial landmark localisation. Because incremental learning for cascaded regression is costly, we propose a much more efficient yet equally accurate alternative using continuous regression. More specifically, we first propose cascaded continuous regression (CCR) and show its accuracy is equivalent to the Supervised Descent Method. We then derive the incremental learning updates for CCR (iCCR) and show that it is an order of magnitude faster than standard incremental learning for cascaded regression, bringing the time required for the update from seconds down to a fraction of a second, thus enabling real-time tracking. Finally, we evaluate iCCR and show the importance of incremental learning in achieving state-of-the-art performance. Code for our iCCR is available from http://www.cs.nott.ac.uk/~psxes1 △ Less

Submitted 6 August, 2016; v1 submitted 3 August, 2016; originally announced August 2016.

Comments: ECCV 2016 accepted paper, with supplementary material included as appendices. References to Equations fixed

arXiv:1005.2715 [pdf, ps, other]

On the Subspace of Image Gradient Orientations

Authors: Georgios Tzimiropoulos, Stefanos Zafeiriou

Abstract: We introduce the notion of Principal Component Analysis (PCA) of image gradient orientations. As image data is typically noisy, but noise is substantially different from Gaussian, traditional PCA of pixel intensities very often fails to estimate reliably the low-dimensional subspace of a given data population. We show that replacing intensities with gradient orientations and the $\ell_2$ norm with… ▽ More We introduce the notion of Principal Component Analysis (PCA) of image gradient orientations. As image data is typically noisy, but noise is substantially different from Gaussian, traditional PCA of pixel intensities very often fails to estimate reliably the low-dimensional subspace of a given data population. We show that replacing intensities with gradient orientations and the $\ell_2$ norm with a cosine-based distance measure offers, to some extend, a remedy to this problem. Our scheme requires the eigen-decomposition of a covariance matrix and is as computationally efficient as standard $\ell_2$ PCA. We demonstrate some of its favorable properties on robust subspace estimation. △ Less

Submitted 15 May, 2010; originally announced May 2010.

Showing 51–79 of 79 results for author: Tzimiropoulos, G