Search | arXiv e-print repository

arXiv:2004.08769 [pdf, other]

doi 10.1007/s00138-020-01164-4

TriGAN: Image-to-Image Translation for Multi-Source Domain Adaptation

Authors: Subhankar Roy, Aliaksandr Siarohin, Enver Sangineto, Nicu Sebe, Elisa Ricci

Abstract: Most domain adaptation methods consider the problem of transferring knowledge to the target domain from a single source dataset. However, in practical applications, we typically have access to multiple sources. In this paper we propose the first approach for Multi-Source Domain Adaptation (MSDA) based on Generative Adversarial Networks. Our method is inspired by the observation that the appearance… ▽ More Most domain adaptation methods consider the problem of transferring knowledge to the target domain from a single source dataset. However, in practical applications, we typically have access to multiple sources. In this paper we propose the first approach for Multi-Source Domain Adaptation (MSDA) based on Generative Adversarial Networks. Our method is inspired by the observation that the appearance of a given image depends on three factors: the domain, the style (characterized in terms of low-level features variations) and the content. For this reason we propose to project the image features onto a space where only the dependence from the content is kept, and then re-project this invariant representation onto the pixel space using the target domain and style. In this way, new labeled images can be generated which are used to train a final target classifier. We test our approach using common MSDA benchmarks, showing that it outperforms state-of-the-art methods. △ Less

Submitted 19 April, 2020; originally announced April 2020.

Journal ref: Machine Vision and Applications 2021

arXiv:2004.05973 [pdf, other]

Speak2Label: Using Domain Knowledge for Creating a Large Scale Driver Gaze Zone Estimation Dataset

Authors: Shreya Ghosh, Abhinav Dhall, Garima Sharma, Sarthak Gupta, Nicu Sebe

Abstract: Labelling of human behavior analysis data is a complex and time consuming task. In this paper, a fully automatic technique for labelling an image based gaze behavior dataset for driver gaze zone estimation is proposed. Domain knowledge is added to the data recording paradigm and later labels are generated in an automatic manner using Speech To Text conversion (STT). In order to remove the noise in… ▽ More Labelling of human behavior analysis data is a complex and time consuming task. In this paper, a fully automatic technique for labelling an image based gaze behavior dataset for driver gaze zone estimation is proposed. Domain knowledge is added to the data recording paradigm and later labels are generated in an automatic manner using Speech To Text conversion (STT). In order to remove the noise in the STT process due to different illumination and ethnicity of subjects in our data, the speech frequency and energy are analysed. The resultant Driver Gaze in the Wild (DGW) dataset contains 586 recordings, captured during different times of the day including evenings. The large scale dataset contains 338 subjects with an age range of 18-63 years. As the data is recorded in different lighting conditions, an illumination robust layer is proposed in the Convolutional Neural Network (CNN). The extensive experiments show the variance in the dataset resembling real-world conditions and the effectiveness of the proposed CNN pipeline. The proposed network is also fine-tuned for the eye gaze prediction task, which shows the discriminativeness of the representation learnt by our network on the proposed DGW dataset. Project Page: https://sites.google.com/view/drivergazeprediction/home △ Less

Submitted 18 October, 2021; v1 submitted 13 April, 2020; originally announced April 2020.

arXiv:2004.05551 [pdf, other]

OpenMix: Reviving Known Knowledge for Discovering Novel Visual Categories in An Open World

Authors: Zhun Zhong, Linchao Zhu, Zhiming Luo, Shaozi Li, Yi Yang, Nicu Sebe

Abstract: In this paper, we tackle the problem of discovering new classes in unlabeled visual data given labeled data from disjoint classes. Existing methods typically first pre-train a model with labeled data, and then identify new classes in unlabeled data via unsupervised clustering. However, the labeled data that provide essential knowledge are often underexplored in the second step. The challenge is th… ▽ More In this paper, we tackle the problem of discovering new classes in unlabeled visual data given labeled data from disjoint classes. Existing methods typically first pre-train a model with labeled data, and then identify new classes in unlabeled data via unsupervised clustering. However, the labeled data that provide essential knowledge are often underexplored in the second step. The challenge is that the labeled and unlabeled examples are from non-overlapping classes, which makes it difficult to build the learning relationship between them. In this work, we introduce OpenMix to mix the unlabeled examples from an open set and the labeled examples from known classes, where their non-overlapping labels and pseudo-labels are simultaneously mixed into a joint label distribution. OpenMix dynamically compounds examples in two ways. First, we produce mixed training images by incorporating labeled examples with unlabeled examples. With the benefits of unique prior knowledge in novel class discovery, the generated pseudo-labels will be more credible than the original unlabeled predictions. As a result, OpenMix helps to prevent the model from overfitting on unlabeled samples that may be assigned with wrong pseudo-labels. Second, the first way encourages the unlabeled examples with high class-probabilities to have considerable accuracy. We introduce these examples as reliable anchors and further integrate them with unlabeled samples. This enables us to generate more combinations in unlabeled examples and exploit finer object relations among the new classes. Experiments on three classification datasets demonstrate the effectiveness of the proposed OpenMix, which is superior to state-of-the-art methods in novel class discovery. △ Less

Submitted 12 April, 2020; originally announced April 2020.

arXiv:2004.03333 [pdf, other]

doi 10.1016/j.patcog.2020.107281

Binary Neural Networks: A Survey

Authors: Haotong Qin, Ruihao Gong, Xianglong Liu, Xiao Bai, Jingkuan Song, Nicu Sebe

Abstract: The binary neural network, largely saving the storage and computation, serves as a promising technique for deploying deep models on resource-limited devices. However, the binarization inevitably causes severe information loss, and even worse, its discontinuity brings difficulty to the optimization of the deep network. To address these issues, a variety of algorithms have been proposed, and achieve… ▽ More The binary neural network, largely saving the storage and computation, serves as a promising technique for deploying deep models on resource-limited devices. However, the binarization inevitably causes severe information loss, and even worse, its discontinuity brings difficulty to the optimization of the deep network. To address these issues, a variety of algorithms have been proposed, and achieved satisfying progress in recent years. In this paper, we present a comprehensive survey of these algorithms, mainly categorized into the native solutions directly conducting binarization, and the optimized ones using techniques like minimizing the quantization error, improving the network loss function, and reducing the gradient error. We also investigate other practical aspects of binary neural networks such as the hardware-friendly design and the training tricks. Then, we give the evaluation and discussions on different tasks, including image classification, object detection and semantic segmentation. Finally, the challenges that may be faced in future research are prospected. △ Less

Submitted 31 March, 2020; originally announced April 2020.

Journal ref: Pattern Recognition (2020) 107281

arXiv:2004.03234 [pdf, other]

Motion-supervised Co-Part Segmentation

Authors: Aliaksandr Siarohin, Subhankar Roy, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, Nicu Sebe

Abstract: Recent co-part segmentation methods mostly operate in a supervised learning setting, which requires a large amount of annotated data for training. To overcome this limitation, we propose a self-supervised deep learning method for co-part segmentation. Differently from previous works, our approach develops the idea that motion information inferred from videos can be leveraged to discover meaningful… ▽ More Recent co-part segmentation methods mostly operate in a supervised learning setting, which requires a large amount of annotated data for training. To overcome this limitation, we propose a self-supervised deep learning method for co-part segmentation. Differently from previous works, our approach develops the idea that motion information inferred from videos can be leveraged to discover meaningful object parts. To this end, our method relies on pairs of frames sampled from the same video. The network learns to predict part segments together with a representation of the motion between two frames, which permits reconstruction of the target image. Through extensive experimental evaluation on publicly available video sequences we demonstrate that our approach can produce improved segmentation maps with respect to previous self-supervised co-part segmentation approaches. △ Less

Submitted 15 April, 2020; v1 submitted 7 April, 2020; originally announced April 2020.

Journal ref: ICPR 2021

arXiv:2004.03064 [pdf, other]

Coarse-to-Fine Gaze Redirection with Numerical and Pictorial Guidance

Authors: Jingjing Chen, Jichao Zhang, Enver Sangineto, Jiayuan Fan, Tao Chen, Nicu Sebe

Abstract: Gaze redirection aims at manipulating the gaze of a given face image with respect to a desired direction (i.e., a reference angle) and it can be applied to many real life scenarios, such as video-conferencing or taking group photos. However, previous work on this topic mainly suffers of two limitations: (1) Low-quality image generation and (2) Low redirection precision. In this paper, we propose t… ▽ More Gaze redirection aims at manipulating the gaze of a given face image with respect to a desired direction (i.e., a reference angle) and it can be applied to many real life scenarios, such as video-conferencing or taking group photos. However, previous work on this topic mainly suffers of two limitations: (1) Low-quality image generation and (2) Low redirection precision. In this paper, we propose to alleviate these problems by means of a novel gaze redirection framework which exploits both a numerical and a pictorial direction guidance, jointly with a coarse-to-fine learning strategy. Specifically, the coarse branch learns the spatial transformation which warps input image according to desired gaze. On the other hand, the fine-grained branch consists of a generator network with conditional residual image learning and a multi-task discriminator. This second branch reduces the gap between the previously warped image and the ground-truth image and recovers finer texture details. Moreover, we propose a numerical and pictorial guidance module~(NPG) which uses a pictorial gazemap description and numerical angles as an extra guide to further improve the precision of gaze redirection. Extensive experiments on a benchmark dataset show that the proposed method outperforms the state-of-the-art approaches in terms of both image quality and redirection precision. The code is available at https://github.com/jingjingchen777/CFGR △ Less

Submitted 26 November, 2020; v1 submitted 6 April, 2020; originally announced April 2020.

Comments: 12 pages, accepted by WACV 2021

arXiv:2003.13898 [pdf, other]

Edge Guided GANs with Contrastive Learning for Semantic Image Synthesis

Authors: Hao Tang, Xiaojuan Qi, Guolei Sun, Dan Xu, Nicu Sebe, Radu Timofte, Luc Van Gool

Abstract: We propose a novel ECGAN for the challenging semantic image synthesis task. Although considerable improvement has been achieved, the quality of synthesized images is far from satisfactory due to three largely unresolved challenges. 1) The semantic labels do not provide detailed structural information, making it difficult to synthesize local details and structures. 2) The widely adopted CNN operati… ▽ More We propose a novel ECGAN for the challenging semantic image synthesis task. Although considerable improvement has been achieved, the quality of synthesized images is far from satisfactory due to three largely unresolved challenges. 1) The semantic labels do not provide detailed structural information, making it difficult to synthesize local details and structures. 2) The widely adopted CNN operations such as convolution, down-sampling, and normalization usually cause spatial resolution loss and thus cannot fully preserve the original semantic information, leading to semantically inconsistent results. 3) Existing semantic image synthesis methods focus on modeling local semantic information from a single input semantic layout. However, they ignore global semantic information of multiple input semantic layouts, i.e., semantic cross-relations between pixels across different input layouts. To tackle 1), we propose to use edge as an intermediate representation which is further adopted to guide image generation via a proposed attention guided edge transfer module. Edge information is produced by a convolutional generator and introduces detailed structure information. To tackle 2), we design an effective module to selectively highlight class-dependent feature maps according to the original semantic layout to preserve the semantic information. To tackle 3), inspired by current methods in contrastive learning, we propose a novel contrastive learning method, which aims to enforce pixel embeddings belonging to the same semantic class to generate more similar image content than those from different classes. Doing so can capture more semantic relations by explicitly exploring the structures of labeled pixels from multiple input semantic layouts. Experiments on three challenging datasets show that our ECGAN achieves significantly better results than state-of-the-art methods. △ Less

Submitted 27 March, 2023; v1 submitted 30 March, 2020; originally announced March 2020.

arXiv:2003.06788 [pdf, other]

GMM-UNIT: Unsupervised Multi-Domain and Multi-Modal Image-to-Image Translation via Attribute Gaussian Mixture Modeling

Authors: Yahui Liu, Marco De Nadai, Jian Yao, Nicu Sebe, Bruno Lepri, Xavier Alameda-Pineda

Abstract: Unsupervised image-to-image translation (UNIT) aims at learning a mapping between several visual domains by using unpaired training images. Recent studies have shown remarkable success for multiple domains but they suffer from two main limitations: they are either built from several two-domain mappings that are required to be learned independently, or they generate low-diversity results, a problem… ▽ More Unsupervised image-to-image translation (UNIT) aims at learning a mapping between several visual domains by using unpaired training images. Recent studies have shown remarkable success for multiple domains but they suffer from two main limitations: they are either built from several two-domain mappings that are required to be learned independently, or they generate low-diversity results, a problem known as mode collapse. To overcome these limitations, we propose a method named GMM-UNIT, which is based on a content-attribute disentangled representation where the attribute space is fitted with a GMM. Each GMM component represents a domain, and this simple assumption has two prominent advantages. First, it can be easily extended to most multi-domain and multi-modal image-to-image translation tasks. Second, the continuous domain encoding allows for interpolation between domains and for extrapolation to unseen domains and translations. Additionally, we show how GMM-UNIT can be constrained down to different methods in the literature, meaning that GMM-UNIT is a unifying framework for unsupervised image-to-image translation. △ Less

Submitted 21 March, 2020; v1 submitted 15 March, 2020; originally announced March 2020.

Comments: 27 pages, 17 figures

arXiv:2003.03229 [pdf, other]

doi 10.1007/s10489-023-04921-w

Non-linear Neurons with Human-like Apical Dendrite Activations

Authors: Mariana-Iuliana Georgescu, Radu Tudor Ionescu, Nicolae-Catalin Ristea, Nicu Sebe

Abstract: In order to classify linearly non-separable data, neurons are typically organized into multi-layer neural networks that are equipped with at least one hidden layer. Inspired by some recent discoveries in neuroscience, we propose a new model of artificial neuron along with a novel activation function enabling the learning of nonlinear decision boundaries using a single neuron. We show that a standa… ▽ More In order to classify linearly non-separable data, neurons are typically organized into multi-layer neural networks that are equipped with at least one hidden layer. Inspired by some recent discoveries in neuroscience, we propose a new model of artificial neuron along with a novel activation function enabling the learning of nonlinear decision boundaries using a single neuron. We show that a standard neuron followed by our novel apical dendrite activation (ADA) can learn the XOR logical function with 100% accuracy. Furthermore, we conduct experiments on six benchmark data sets from computer vision, signal processing and natural language processing, i.e. MOROCO, UTKFace, CREMA-D, Fashion-MNIST, Tiny ImageNet and ImageNet, showing that the ADA and the leaky ADA functions provide superior results to Rectified Linear Units (ReLU), leaky ReLU, RBF and Swish, for various neural network architectures, e.g. one-hidden-layer or two-hidden-layer multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs) such as LeNet, VGG, ResNet and Character-level CNN. We obtain further performance improvements when we change the standard model of the neuron with our pyramidal neuron with apical dendrite activations (PyNADA). Our code is available at: https://github.com/raduionescu/pynada. △ Less

Submitted 10 August, 2023; v1 submitted 2 February, 2020; originally announced March 2020.

Comments: Accepted for publication in Applied Intelligence

arXiv:2003.00196 [pdf, other]

First Order Motion Model for Image Animation

Authors: Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, Nicu Sebe

Abstract: Image animation consists of generating a video sequence so that an object in a source image is animated according to the motion of a driving video. Our framework addresses this problem without using any annotation or prior information about the specific object to animate. Once trained on a set of videos depicting objects of the same category (e.g. faces, human bodies), our method can be applied to… ▽ More Image animation consists of generating a video sequence so that an object in a source image is animated according to the motion of a driving video. Our framework addresses this problem without using any annotation or prior information about the specific object to animate. Once trained on a set of videos depicting objects of the same category (e.g. faces, human bodies), our method can be applied to any object of this class. To achieve this, we decouple appearance and motion information using a self-supervised formulation. To support complex motions, we use a representation consisting of a set of learned keypoints along with their local affine transformations. A generator network models occlusions arising during target motions and combines the appearance extracted from the source image and the motion derived from the driving video. Our framework scores best on diverse benchmarks and on a variety of object categories. Our source code is publicly available. △ Less

Submitted 1 October, 2020; v1 submitted 29 February, 2020; originally announced March 2020.

Comments: NeurIPS 2019

arXiv:2002.01048 [pdf, other]

Multi-Channel Attention Selection GANs for Guided Image-to-Image Translation

Authors: Hao Tang, Philip H. S. Torr, Nicu Sebe

Abstract: We propose a novel model named Multi-Channel Attention Selection Generative Adversarial Network (SelectionGAN) for guided image-to-image translation, where we translate an input image into another while respecting an external semantic guidance. The proposed SelectionGAN explicitly utilizes the semantic guidance information and consists of two stages. In the first stage, the input image and the con… ▽ More We propose a novel model named Multi-Channel Attention Selection Generative Adversarial Network (SelectionGAN) for guided image-to-image translation, where we translate an input image into another while respecting an external semantic guidance. The proposed SelectionGAN explicitly utilizes the semantic guidance information and consists of two stages. In the first stage, the input image and the conditional semantic guidance are fed into a cycled semantic-guided generation network to produce initial coarse results. In the second stage, we refine the initial results by using the proposed multi-scale spatial pooling & channel selection module and the multi-channel attention selection module. Moreover, uncertainty maps automatically learned from attention maps are used to guide the pixel loss for better network optimization. Exhaustive experiments on four challenging guided image-to-image translation tasks (face, hand, body, and street view) demonstrate that our SelectionGAN is able to generate significantly better results than the state-of-the-art methods. Meanwhile, the proposed framework and modules are unified solutions and can be applied to solve other generation tasks such as semantic image synthesis. The code is available at https://github.com/Ha0Tang/SelectionGAN. △ Less

Submitted 6 October, 2022; v1 submitted 3 February, 2020; originally announced February 2020.

Comments: Accepted to TPAMI, an extended version of a paper published in CVPR2019. arXiv admin note: substantial text overlap with arXiv:1904.06807

arXiv:2001.00238 [pdf, other]

Low-Budget Label Query through Domain Alignment Enforcement

Authors: Jurandy Almeida, Cristiano Saltori, Paolo Rota, Nicu Sebe

Abstract: Deep learning revolution happened thanks to the availability of a massive amount of labelled data which have contributed to the development of models with extraordinary inference capabilities. Despite the public availability of a large quantity of datasets, to address specific requirements it is often necessary to generate a new set of labelled data. Quite often, the production of labels is costly… ▽ More Deep learning revolution happened thanks to the availability of a massive amount of labelled data which have contributed to the development of models with extraordinary inference capabilities. Despite the public availability of a large quantity of datasets, to address specific requirements it is often necessary to generate a new set of labelled data. Quite often, the production of labels is costly and sometimes it requires specific know-how to be fulfilled. In this work, we tackle a new problem named low-budget label query that consists in suggesting to the user a small (low budget) set of samples to be labelled, from a completely unlabelled dataset, with the final goal of maximizing the classification accuracy on that dataset. In this work we first improve an Unsupervised Domain Adaptation (UDA) method to better align source and target domains using consistency constraints, reaching the state of the art on a few UDA tasks. Finally, using the previously trained model as reference, we propose a simple yet effective selection method based on uniform sampling of the prediction consistency distribution, which is deterministic and steadily outperforms other baselines as well as competing models on a large variety of publicly available datasets. △ Less

Submitted 29 March, 2020; v1 submitted 1 January, 2020; originally announced January 2020.

arXiv:1912.12215 [pdf, other]

Local Class-Specific and Global Image-Level Generative Adversarial Networks for Semantic-Guided Scene Generation

Authors: Hao Tang, Dan Xu, Yan Yan, Philip H. S. Torr, Nicu Sebe

Abstract: In this paper, we address the task of semantic-guided scene generation. One open challenge in scene generation is the difficulty of the generation of small objects and detailed local texture, which has been widely observed in global image-level generation methods. To tackle this issue, in this work we consider learning the scene generation in a local context, and correspondingly design a local cla… ▽ More In this paper, we address the task of semantic-guided scene generation. One open challenge in scene generation is the difficulty of the generation of small objects and detailed local texture, which has been widely observed in global image-level generation methods. To tackle this issue, in this work we consider learning the scene generation in a local context, and correspondingly design a local class-specific generative network with semantic maps as a guidance, which separately constructs and learns sub-generators concentrating on the generation of different classes, and is able to provide more scene details. To learn more discriminative class-specific feature representations for the local generation, a novel classification module is also proposed. To combine the advantage of both the global image-level and the local class-specific generation, a joint generation network is designed with an attention fusion module and a dual-discriminator structure embedded. Extensive experiments on two scene image generation tasks show superior generation performance of the proposed model. The state-of-the-art results are established by large margins on both tasks and on challenging public benchmarks. The source code and trained models are available at https://github.com/Ha0Tang/LGGAN. △ Less

Submitted 30 March, 2020; v1 submitted 27 December, 2019; originally announced December 2019.

Comments: Accepted to CVPR 2020, camera ready (10 pages) + supplementary (18 pages)

arXiv:1912.06931 [pdf, other]

Asymmetric GANs for Image-to-Image Translation

Authors: Hao Tang, Nicu Sebe

Abstract: Existing models for unsupervised image translation with Generative Adversarial Networks (GANs) can learn the mapping from the source domain to the target domain using a cycle-consistency loss. However, these methods always adopt a symmetric network architecture to learn both forward and backward cycles. Because of the task complexity and cycle input difference between the source and target domains… ▽ More Existing models for unsupervised image translation with Generative Adversarial Networks (GANs) can learn the mapping from the source domain to the target domain using a cycle-consistency loss. However, these methods always adopt a symmetric network architecture to learn both forward and backward cycles. Because of the task complexity and cycle input difference between the source and target domains, the inequality in bidirectional forward-backward cycle translations is significant and the amount of information between two domains is different. In this paper, we analyze the limitation of existing symmetric GANs in asymmetric translation tasks, and propose an AsymmetricGAN model with both translation and reconstruction generators of unequal sizes and different parameter-sharing strategy to adapt to the asymmetric need in both unsupervised and supervised image translation tasks. Moreover, the training stage of existing methods has the common problem of model collapse that degrades the quality of the generated images, thus we explore different optimization losses for better training of AsymmetricGAN, making image translation with higher consistency and better stability. Extensive experiments on both supervised and unsupervised generative tasks with 8 datasets show that AsymmetricGAN achieves superior model capacity and better generation performance compared with existing GANs. To the best of our knowledge, we are the first to investigate the asymmetric GAN structure on both unsupervised and supervised image translation tasks. △ Less

Submitted 12 July, 2024; v1 submitted 14 December, 2019; originally announced December 2019.

Comments: Added more information

arXiv:1912.06112 [pdf, other]

doi 10.1109/TIP.2020.3021789

Unified Generative Adversarial Networks for Controllable Image-to-Image Translation

Authors: Hao Tang, Hong Liu, Nicu Sebe

Abstract: We propose a unified Generative Adversarial Network (GAN) for controllable image-to-image translation, i.e., transferring an image from a source to a target domain guided by controllable structures. In addition to conditioning on a reference image, we show how the model can generate images conditioned on controllable structures, e.g., class labels, object keypoints, human skeletons, and scene sema… ▽ More We propose a unified Generative Adversarial Network (GAN) for controllable image-to-image translation, i.e., transferring an image from a source to a target domain guided by controllable structures. In addition to conditioning on a reference image, we show how the model can generate images conditioned on controllable structures, e.g., class labels, object keypoints, human skeletons, and scene semantic maps. The proposed model consists of a single generator and a discriminator taking a conditional image and the target controllable structure as input. In this way, the conditional image can provide appearance information and the controllable structure can provide the structure information for generating the target result. Moreover, our model learns the image-to-image mapping through three novel losses, i.e., color loss, controllable structure guided cycle-consistency loss, and controllable structure guided self-content preserving loss. Also, we present the Fréchet ResNet Distance (FRD) to evaluate the quality of the generated images. Experiments on two challenging image translation tasks, i.e., hand gesture-to-gesture translation and cross-view image translation, show that our model generates convincing results, and significantly outperforms other state-of-the-art methods on both tasks. Meanwhile, the proposed framework is a unified solution, thus it can be applied to solving other controllable structure guided image translation tasks such as landmark guided facial expression translation and keypoint guided person image generation. To the best of our knowledge, we are the first to make one GAN framework work on all such controllable structure guided image translation tasks. Code is available at https://github.com/Ha0Tang/GestureGAN. △ Less

Submitted 2 September, 2020; v1 submitted 12 December, 2019; originally announced December 2019.

Comments: Accepted to TIP, an extended version of a paper published in ACM MM 2018. arXiv admin note: substantial text overlap with arXiv:1808.04859

arXiv:1911.11897 [pdf, other]

AttentionGAN: Unpaired Image-to-Image Translation using Attention-Guided Generative Adversarial Networks

Authors: Hao Tang, Hong Liu, Dan Xu, Philip H. S. Torr, Nicu Sebe

Abstract: State-of-the-art methods in image-to-image translation are capable of learning a mapping from a source domain to a target domain with unpaired image data. Though the existing methods have achieved promising results, they still produce visual artifacts, being able to translate low-level information but not high-level semantics of input images. One possible reason is that generators do not have the… ▽ More State-of-the-art methods in image-to-image translation are capable of learning a mapping from a source domain to a target domain with unpaired image data. Though the existing methods have achieved promising results, they still produce visual artifacts, being able to translate low-level information but not high-level semantics of input images. One possible reason is that generators do not have the ability to perceive the most discriminative parts between the source and target domains, thus making the generated images low quality. In this paper, we propose a new Attention-Guided Generative Adversarial Networks (AttentionGAN) for the unpaired image-to-image translation task. AttentionGAN can identify the most discriminative foreground objects and minimize the change of the background. The attention-guided generators in AttentionGAN are able to produce attention masks, and then fuse the generation output with the attention masks to obtain high-quality target images. Accordingly, we also design a novel attention-guided discriminator which only considers attended regions. Extensive experiments are conducted on several generative tasks with eight public datasets, demonstrating that the proposed method is effective to generate sharper and more realistic images compared with existing competitive models. The code is available at https://github.com/Ha0Tang/AttentionGAN. △ Less

Submitted 16 August, 2021; v1 submitted 26 November, 2019; originally announced November 2019.

Comments: Accepted to TNNLS, an extended version of a paper published in IJCNN2019. arXiv admin note: substantial text overlap with arXiv:1903.12296

arXiv:1911.08604 [pdf, ps, other]

doi 10.1016/j.laa.2020.09.034

Reduction of SISO H-infinity Output Feedback Control Problem

Authors: Hayato Waki, Yoshio Ebihara, Noboru Sebe

Abstract: We consider the linear matrix inequality (LMI) problem of $H_\infty$ output feedback control problem for a generalized plant whose control input, measured output, disturbance input, and controlled output are scalar. We provide an explicit form of the optimal value. This form is the unification of some results in the literature of $H_\infty$ performance limitation analysis. To obtain the form of th… ▽ More We consider the linear matrix inequality (LMI) problem of $H_\infty$ output feedback control problem for a generalized plant whose control input, measured output, disturbance input, and controlled output are scalar. We provide an explicit form of the optimal value. This form is the unification of some results in the literature of $H_\infty$ performance limitation analysis. To obtain the form of the optimal value, we focus on the non-uniqueness of perpendicular matrices, which appear in the LMI problem. We use the null vectors of invariant zeros associated with the dynamical system for the expression of the perpendicular matrices. This expression enables us to reduce and simplify the LMI problem. Our approach uses some well-known fundamental tools, e.g., the Schur complement, Lyapunov equation, Sylvester equation, and matrix completion. We use these techniques for the simplification of the LMI problem. Also, we investigate the structure of dual feasible solutions and reduce the size of the dual. This reduction is called a facial reduction in the literature of convex optimization. △ Less

Submitted 19 November, 2019; originally announced November 2019.

Comments: Submitted to a journal

MSC Class: 49K30; 90C22; 93C05; 34K35

arXiv:1911.06849 [pdf, other]

Curriculum Self-Paced Learning for Cross-Domain Object Detection

Authors: Petru Soviany, Radu Tudor Ionescu, Paolo Rota, Nicu Sebe

Abstract: Training (source) domain bias affects state-of-the-art object detectors, such as Faster R-CNN, when applied to new (target) domains. To alleviate this problem, researchers proposed various domain adaptation methods to improve object detection results in the cross-domain setting, e.g. by translating images with ground-truth labels from the source domain to the target domain using Cycle-GAN. On top… ▽ More Training (source) domain bias affects state-of-the-art object detectors, such as Faster R-CNN, when applied to new (target) domains. To alleviate this problem, researchers proposed various domain adaptation methods to improve object detection results in the cross-domain setting, e.g. by translating images with ground-truth labels from the source domain to the target domain using Cycle-GAN. On top of combining Cycle-GAN transformations and self-paced learning in a smart and efficient way, in this paper, we propose a novel self-paced algorithm that learns from easy to hard. Our method is simple and effective, without any overhead during inference. It uses only pseudo-labels for samples taken from the target domain, i.e. the domain adaptation is unsupervised. We conduct experiments on four cross-domain benchmarks, showing better results than the state of the art. We also perform an ablation study demonstrating the utility of each component in our framework. Additionally, we study the applicability of our framework to other object detectors. Furthermore, we compare our difficulty measure with other measures from the related literature, proving that it yields superior results and that it correlates well with the performance metric. △ Less

Submitted 20 January, 2021; v1 submitted 15 November, 2019; originally announced November 2019.

Comments: Accepted for publication in Computer Vision and Image Understanding

arXiv:1911.03884 [pdf, other]

Learning Koopman Operator under Dissipativity Constraints

Authors: Keita Hara, Masaki Inoue, Noboru Sebe

Abstract: This paper addresses a learning problem for nonlinear dynamical systems with incorporating any specified dissipativity property. The nonlinear systems are described by the Koopman operator, which is a linear operator defined on the infinite-dimensional lifted state space. The problem of learning the Koopman operator under specified quadratic dissipativity constraints is formulated and addressed. T… ▽ More This paper addresses a learning problem for nonlinear dynamical systems with incorporating any specified dissipativity property. The nonlinear systems are described by the Koopman operator, which is a linear operator defined on the infinite-dimensional lifted state space. The problem of learning the Koopman operator under specified quadratic dissipativity constraints is formulated and addressed. The learning problem is in a class of the non-convex optimization problem due to nonlinear constraints and is numerically intractable. By applying the change of variable technique and the convex overbounding approximation, the problem is reduced to sequential convex optimization and is solved in a numerically efficient manner. Finally, a numerical simulation is given, where high modeling accuracy achieved by the proposed approach including the specified dissipativity is demonstrated. △ Less

Submitted 10 November, 2019; originally announced November 2019.

arXiv:1909.07667 [pdf, other]

Progressive Fusion for Unsupervised Binocular Depth Estimation using Cycled Networks

Authors: Andrea Pilzer, Stéphane Lathuilière, Dan Xu, Mihai Marian Puscas, Elisa Ricci, Nicu Sebe

Abstract: Recent deep monocular depth estimation approaches based on supervised regression have achieved remarkable performance. However, they require costly ground truth annotations during training. To cope with this issue, in this paper we present a novel unsupervised deep learning approach for predicting depth maps. We introduce a new network architecture, named Progressive Fusion Network (PFN), that is… ▽ More Recent deep monocular depth estimation approaches based on supervised regression have achieved remarkable performance. However, they require costly ground truth annotations during training. To cope with this issue, in this paper we present a novel unsupervised deep learning approach for predicting depth maps. We introduce a new network architecture, named Progressive Fusion Network (PFN), that is specifically designed for binocular stereo depth estimation. This network is based on a multi-scale refinement strategy that combines the information provided by both stereo views. In addition, we propose to stack twice this network in order to form a cycle. This cycle approach can be interpreted as a form of data-augmentation since, at training time, the network learns both from the training set images (in the forward half-cycle) but also from the synthesized images (in the backward half-cycle). The architecture is jointly trained with adversarial learning. Extensive experiments on the publicly available datasets KITTI, Cityscapes and ApolloScape demonstrate the effectiveness of the proposed model which is competitive with other unsupervised deep learning methods for depth prediction. △ Less

Submitted 17 September, 2019; originally announced September 2019.

Comments: Accepted to TPAMI (SI RGB-D Vision), code https://github.com/andrea-pilzer/PFN-depth

arXiv:1908.05794 [pdf, other]

Structured Coupled Generative Adversarial Networks for Unsupervised Monocular Depth Estimation

Authors: Mihai Marian Puscas, Dan Xu, Andrea Pilzer, Nicu Sebe

Abstract: Inspired by the success of adversarial learning, we propose a new end-to-end unsupervised deep learning framework for monocular depth estimation consisting of two Generative Adversarial Networks (GAN), deeply coupled with a structured Conditional Random Field (CRF) model. The two GANs aim at generating distinct and complementary disparity maps and at improving the generation quality via exploiting… ▽ More Inspired by the success of adversarial learning, we propose a new end-to-end unsupervised deep learning framework for monocular depth estimation consisting of two Generative Adversarial Networks (GAN), deeply coupled with a structured Conditional Random Field (CRF) model. The two GANs aim at generating distinct and complementary disparity maps and at improving the generation quality via exploiting the adversarial learning strategy. The deep CRF coupling model is proposed to fuse the generative and discriminative outputs from the dual GAN nets. As such, the model implicitly constructs mutual constraints on the two network branches and between the generator and discriminator. This facilitates the optimization of the whole network for better disparity generation. Extensive experiments on the KITTI, Cityscapes, and Make3D datasets clearly demonstrate the effectiveness of the proposed approach and show superior performance compared to state of the art methods. The code and models are available at https://github.com/mihaipuscas/ 3dv---coupled-crf-disparity. △ Less

Submitted 15 August, 2019; originally announced August 2019.

Comments: Accepted at 3DV 2019 as ORAL

arXiv:1908.00999 [pdf, other]

Cycle In Cycle Generative Adversarial Networks for Keypoint-Guided Image Generation

Authors: Hao Tang, Dan Xu, Gaowen Liu, Wei Wang, Nicu Sebe, Yan Yan

Abstract: In this work, we propose a novel Cycle In Cycle Generative Adversarial Network (C$^2$GAN) for the task of keypoint-guided image generation. The proposed C$^2$GAN is a cross-modal framework exploring a joint exploitation of the keypoint and the image data in an interactive manner. C$^2$GAN contains two different types of generators, i.e., keypoint-oriented generator and image-oriented generator. Bo… ▽ More In this work, we propose a novel Cycle In Cycle Generative Adversarial Network (C$^2$GAN) for the task of keypoint-guided image generation. The proposed C$^2$GAN is a cross-modal framework exploring a joint exploitation of the keypoint and the image data in an interactive manner. C$^2$GAN contains two different types of generators, i.e., keypoint-oriented generator and image-oriented generator. Both of them are mutually connected in an end-to-end learnable fashion and explicitly form three cycled sub-networks, i.e., one image generation cycle and two keypoint generation cycles. Each cycle not only aims at reconstructing the input domain, and also produces useful output involving in the generation of another cycle. By so doing, the cycles constrain each other implicitly, which provides complementary information from the two different modalities and brings extra supervision across cycles, thus facilitating more robust optimization of the whole network. Extensive experimental results on two publicly available datasets, i.e., Radboud Faces and Market-1501, demonstrate that our approach is effective to generate more photo-realistic images compared with state-of-the-art models. △ Less

Submitted 15 April, 2020; v1 submitted 2 August, 2019; originally announced August 2019.

Comments: 9 pages, 8 figures, accepted to ACM MM 2019

Journal ref: ACM MM 2019

arXiv:1907.09679 [pdf, other]

doi 10.1109/IJCNN.2019.8852086

Effortless Deep Training for Traffic Sign Detection Using Templates and Arbitrary Natural Images

Authors: Lucas Tabelini Torres, Thiago M. Paixão, Rodrigo F. Berriel, Alberto F. De Souza, Claudine Badue, Nicu Sebe, Thiago Oliveira-Santos

Abstract: Deep learning has been successfully applied to several problems related to autonomous driving. Often, these solutions rely on large networks that require databases of real image samples of the problem (i.e., real world) for proper training. The acquisition of such real-world data sets is not always possible in the autonomous driving context, and sometimes their annotation is not feasible (e.g., ta… ▽ More Deep learning has been successfully applied to several problems related to autonomous driving. Often, these solutions rely on large networks that require databases of real image samples of the problem (i.e., real world) for proper training. The acquisition of such real-world data sets is not always possible in the autonomous driving context, and sometimes their annotation is not feasible (e.g., takes too long or is too expensive). Moreover, in many tasks, there is an intrinsic data imbalance that most learning-based methods struggle to cope with. It turns out that traffic sign detection is a problem in which these three issues are seen altogether. In this work, we propose a novel database generation method that requires only (i) arbitrary natural images, i.e., requires no real image from the domain of interest, and (ii) templates of the traffic signs, i.e., templates synthetically created to illustrate the appearance of the category of a traffic sign. The effortlessly generated training database is shown to be effective for the training of a deep detector (such as Faster R-CNN) on German traffic signs, achieving 95.66% of mAP on average. In addition, the proposed method is able to detect traffic signs with an average precision, recall and F1-score of about 94%, 91% and 93%, respectively. The experiments surprisingly show that detectors can be trained with simple data generation methods and without problem domain data for the background, which is in the opposite direction of the common sense for deep learning. △ Less

Submitted 22 July, 2019; originally announced July 2019.

arXiv:1907.08719 [pdf, other]

doi 10.1109/IJCNN.2019.8852008

Cross-Domain Car Detection Using Unsupervised Image-to-Image Translation: From Day to Night

Authors: Vinicius F. Arruda, Thiago M. Paixão, Rodrigo F. Berriel, Alberto F. De Souza, Claudine Badue, Nicu Sebe, Thiago Oliveira-Santos

Abstract: Deep learning techniques have enabled the emergence of state-of-the-art models to address object detection tasks. However, these techniques are data-driven, delegating the accuracy to the training dataset which must resemble the images in the target task. The acquisition of a dataset involves annotating images, an arduous and expensive process, generally requiring time and manual effort. Thus, a c… ▽ More Deep learning techniques have enabled the emergence of state-of-the-art models to address object detection tasks. However, these techniques are data-driven, delegating the accuracy to the training dataset which must resemble the images in the target task. The acquisition of a dataset involves annotating images, an arduous and expensive process, generally requiring time and manual effort. Thus, a challenging scenario arises when the target domain of application has no annotated dataset available, making tasks in such situation to lean on a training dataset of a different domain. Sharing this issue, object detection is a vital task for autonomous vehicles where the large amount of driving scenarios yields several domains of application requiring annotated data for the training process. In this work, a method for training a car detection system with annotated data from a source domain (day images) without requiring the image annotations of the target domain (night images) is presented. For that, a model based on Generative Adversarial Networks (GANs) is explored to enable the generation of an artificial dataset with its respective annotations. The artificial dataset (fake dataset) is created translating images from day-time domain to night-time domain. The fake dataset, which comprises annotated images of only the target domain (night images), is then used to train the car detector model. Experimental results showed that the proposed method achieved significant and consistent improvements, including the increasing by more than 10% of the detection performance when compared to the training with only the available annotated data (i.e., day images). △ Less

Submitted 19 July, 2019; originally announced July 2019.

Comments: 8 pages, 8 figures, https://github.com/viniciusarruda/cross-domain-car-detection and accepted at IJCNN 2019

arXiv:1907.05916 [pdf, other]

doi 10.1145/3343031.3351020

Gesture-to-Gesture Translation in the Wild via Category-Independent Conditional Maps

Authors: Yahui Liu, Marco De Nadai, Gloria Zen, Nicu Sebe, Bruno Lepri

Abstract: Recent works have shown Generative Adversarial Networks (GANs) to be particularly effective in image-to-image translations. However, in tasks such as body pose and hand gesture translation, existing methods usually require precise annotations, e.g. key-points or skeletons, which are time-consuming to draw. In this work, we propose a novel GAN architecture that decouples the required annotations in… ▽ More Recent works have shown Generative Adversarial Networks (GANs) to be particularly effective in image-to-image translations. However, in tasks such as body pose and hand gesture translation, existing methods usually require precise annotations, e.g. key-points or skeletons, which are time-consuming to draw. In this work, we propose a novel GAN architecture that decouples the required annotations into a category label - that specifies the gesture type - and a simple-to-draw category-independent conditional map - that expresses the location, rotation and size of the hand gesture. Our architecture synthesizes the target gesture while preserving the background context, thus effectively dealing with gesture translation in the wild. To this aim, we use an attention module and a rolling guidance approach, which loops the generated images back into the network and produces higher quality images compared to competing works. Thus, our GAN learns to generate new images from simple annotations without requiring key-points or skeleton labels. Results on two public datasets show that our method outperforms state of the art approaches both quantitatively and qualitatively. To the best of our knowledge, no work so far has addressed the gesture-to-gesture translation in the wild by requiring user-friendly annotations. △ Less

Submitted 31 July, 2019; v1 submitted 12 July, 2019; originally announced July 2019.

Comments: 15 pages, 12 figures

Journal ref: 27th ACM International Conference on Multimedia, 2019

arXiv:1906.03525 [pdf, other]

Pattern-Affinitive Propagation across Depth, Surface Normal and Semantic Segmentation

Authors: Zhenyu Zhang, Zhen Cui, Chunyan Xu, Yan Yan, Nicu Sebe, Jian Yang

Abstract: In this paper, we propose a novel Pattern-Affinitive Propagation (PAP) framework to jointly predict depth, surface normal and semantic segmentation. The motivation behind it comes from the statistic observation that pattern-affinitive pairs recur much frequently across different tasks as well as within a task. Thus, we can conduct two types of propagations, cross-task propagation and task-specific… ▽ More In this paper, we propose a novel Pattern-Affinitive Propagation (PAP) framework to jointly predict depth, surface normal and semantic segmentation. The motivation behind it comes from the statistic observation that pattern-affinitive pairs recur much frequently across different tasks as well as within a task. Thus, we can conduct two types of propagations, cross-task propagation and task-specific propagation, to adaptively diffuse those similar patterns. The former integrates cross-task affinity patterns to adapt to each task therein through the calculation on non-local relationships. Next the latter performs an iterative diffusion in the feature space so that the cross-task affinity patterns can be widely-spread within the task. Accordingly, the learning of each task can be regularized and boosted by the complementary task-level affinities. Extensive experiments demonstrate the effectiveness and the superiority of our method on the joint three tasks. Meanwhile, we achieve the state-of-the-art or competitive results on the three related datasets, NYUD-v2, SUN-RGBD and KITTI. △ Less

Submitted 8 June, 2019; originally announced June 2019.

Comments: 10 pages, 9 figures, CVPR 2019

arXiv:1906.00805 [pdf, other]

GazeCorrection:Self-Guided Eye Manipulation in the wild using Self-Supervised Generative Adversarial Networks

Authors: Jichao Zhang, Meng Sun, Jingjing Chen, Hao Tang, Yan Yan, Xueying Qin, Nicu Sebe

Abstract: Gaze correction aims to redirect the person's gaze into the camera by manipulating the eye region, and it can be considered as a specific image resynthesis problem. Gaze correction has a wide range of applications in real life, such as taking a picture with staring at the camera. In this paper, we propose a novel method that is based on the inpainting model to learn from the face image to fill in… ▽ More Gaze correction aims to redirect the person's gaze into the camera by manipulating the eye region, and it can be considered as a specific image resynthesis problem. Gaze correction has a wide range of applications in real life, such as taking a picture with staring at the camera. In this paper, we propose a novel method that is based on the inpainting model to learn from the face image to fill in the missing eye regions with new contents representing corrected eye gaze. Moreover, our model does not require the training dataset labeled with the specific head pose and eye angle information, thus, the training data is easy to collect. To retain the identity information of the eye region in the original input, we propose a self-guided pretrained model to learn the angle-invariance feature. Experiments show our model achieves very compelling gaze-corrected results in the wild dataset which is collected from the website and will be introduced in details. Code is available at https://github.com/zhangqianhui/GazeCorrection. △ Less

Submitted 3 June, 2019; originally announced June 2019.

arXiv:1905.10784 [pdf, other]

doi 10.1109/TAFFC.2021.3124142

Impact of facial landmark localization on facial expression recognition

Authors: Romain Belmonte, Benjamin Allaert, Pierre Tirilly, Ioan Marius Bilasco, Chaabane Djeraba, Nicu Sebe

Abstract: Although facial landmark localization (FLL) approaches are becoming increasingly accurate for characterizing facial regions, one question remains unanswered: what is the impact of these approaches on subsequent related tasks? In this paper, the focus is put on facial expression recognition (FER), where facial landmarks are used for face registration, which is a common usage. Since the most used da… ▽ More Although facial landmark localization (FLL) approaches are becoming increasingly accurate for characterizing facial regions, one question remains unanswered: what is the impact of these approaches on subsequent related tasks? In this paper, the focus is put on facial expression recognition (FER), where facial landmarks are used for face registration, which is a common usage. Since the most used datasets for facial landmark localization do not allow for a proper measurement of performance according to the different difficulties (e.g., pose, expression, illumination, occlusion, motion blur), we also quantify the performance of recent approaches in the presence of head pose variations and facial expressions. Finally, a study of the impact of these approaches on FER is conducted. We show that the landmark accuracy achieved so far optimizing the conventional Euclidean distance does not necessarily guarantee a gain in performance for FER. To deal with this issue, we propose a new evaluation metric for FLL adapted to FER. △ Less

Submitted 19 July, 2021; v1 submitted 26 May, 2019; originally announced May 2019.

arXiv:1905.06252 [pdf, other]

doi 10.1007/978-3-030-30642-7_20

Regularized Evolutionary Algorithm for Dynamic Neural Topology Search

Authors: Cristiano Saltori, Subhankar Roy, Nicu Sebe, Giovanni Iacca

Abstract: Designing neural networks for object recognition requires considerable architecture engineering. As a remedy, neuro-evolutionary network architecture search, which automatically searches for optimal network architectures using evolutionary algorithms, has recently become very popular. Although very effective, evolutionary algorithms rely heavily on having a large population of individuals (i.e., n… ▽ More Designing neural networks for object recognition requires considerable architecture engineering. As a remedy, neuro-evolutionary network architecture search, which automatically searches for optimal network architectures using evolutionary algorithms, has recently become very popular. Although very effective, evolutionary algorithms rely heavily on having a large population of individuals (i.e., network architectures) and is therefore memory expensive. In this work, we propose a Regularized Evolutionary Algorithm with low memory footprint to evolve a dynamic image classifier. In details, we introduce novel custom operators that regularize the evolutionary process of a micro-population of 10 individuals. We conduct experiments on three different digits datasets (MNIST, USPS, SVHN) and show that our evolutionary method obtains competitive results with the current state-of-the-art. △ Less

Submitted 19 August, 2019; v1 submitted 15 May, 2019; originally announced May 2019.

arXiv:1905.06242 [pdf, other]

doi 10.1109/ICCV.2019.00047

Budget-Aware Adapters for Multi-Domain Learning

Authors: Rodrigo Berriel, Stéphane Lathuilière, Moin Nabi, Tassilo Klein, Thiago Oliveira-Santos, Nicu Sebe, Elisa Ricci

Abstract: Multi-Domain Learning (MDL) refers to the problem of learning a set of models derived from a common deep architecture, each one specialized to perform a task in a certain domain (e.g., photos, sketches, paintings). This paper tackles MDL with a particular interest in obtaining domain-specific models with an adjustable budget in terms of the number of network parameters and computational complexity… ▽ More Multi-Domain Learning (MDL) refers to the problem of learning a set of models derived from a common deep architecture, each one specialized to perform a task in a certain domain (e.g., photos, sketches, paintings). This paper tackles MDL with a particular interest in obtaining domain-specific models with an adjustable budget in terms of the number of network parameters and computational complexity. Our intuition is that, as in real applications the number of domains and tasks can be very large, an effective MDL approach should not only focus on accuracy but also on having as few parameters as possible. To implement this idea we derive specialized deep models for each domain by adapting a pre-trained architecture but, differently from other methods, we propose a novel strategy to automatically adjust the computational complexity of the network. To this aim, we introduce Budget-Aware Adapters that select the most relevant feature channels to better handle data from a novel domain. Some constraints on the number of active switches are imposed in order to obtain a network respecting the desired complexity budget. Experimentally, we show that our approach leads to recognition accuracy competitive with state-of-the-art approaches but with much lighter networks both in terms of storage and computation. △ Less

Submitted 8 December, 2020; v1 submitted 15 May, 2019; originally announced May 2019.

Comments: ICCV 2019

arXiv:1905.05416 [pdf, other]

Expression Conditional GAN for Facial Expression-to-Expression Translation

Authors: Hao Tang, Wei Wang, Songsong Wu, Xinya Chen, Dan Xu, Nicu Sebe, Yan Yan

Abstract: In this paper, we focus on the facial expression translation task and propose a novel Expression Conditional GAN (ECGAN) which can learn the mapping from one image domain to another one based on an additional expression attribute. The proposed ECGAN is a generic framework and is applicable to different expression generation tasks where specific facial expression can be easily controlled by the con… ▽ More In this paper, we focus on the facial expression translation task and propose a novel Expression Conditional GAN (ECGAN) which can learn the mapping from one image domain to another one based on an additional expression attribute. The proposed ECGAN is a generic framework and is applicable to different expression generation tasks where specific facial expression can be easily controlled by the conditional attribute label. Besides, we introduce a novel face mask loss to reduce the influence of background changing. Moreover, we propose an entire framework for facial expression generation and recognition in the wild, which consists of two modules, i.e., generation and recognition. Finally, we evaluate our framework on several public face datasets in which the subjects have different races, illumination, occlusion, pose, color, content and background conditions. Even though these datasets are very diverse, both the qualitative and quantitative results demonstrate that our approach is able to generate facial expressions accurately and robustly. △ Less

Submitted 14 May, 2019; originally announced May 2019.

Comments: 5 pages, 5 figures, accepted to ICIP 2019

arXiv:1905.02655 [pdf, other]

Attention-based Fusion for Multi-source Human Image Generation

Authors: Stéphane Lathuilière, Enver Sangineto, Aliaksandr Siarohin, Nicu Sebe

Abstract: We present a generalization of the person-image generation task, in which a human image is generated conditioned on a target pose and a set X of source appearance images. In this way, we can exploit multiple, possibly complementary images of the same person which are usually available at training and at testing time. The solution we propose is mainly based on a local attention mechanism which sele… ▽ More We present a generalization of the person-image generation task, in which a human image is generated conditioned on a target pose and a set X of source appearance images. In this way, we can exploit multiple, possibly complementary images of the same person which are usually available at training and at testing time. The solution we propose is mainly based on a local attention mechanism which selects relevant information from different source image regions, avoiding the necessity to build specific generators for each specific cardinality of X. The empirical evaluation of our method shows the practical interest of addressing the person-image generation problem in a multi-source setting. △ Less

Submitted 7 May, 2019; originally announced May 2019.

Comments: 10 pages

arXiv:1905.00007 [pdf, other]

Appearance and Pose-Conditioned Human Image Generation using Deformable GANs

Authors: Aliaksandr Siarohin, Stéphane Lathuilière, Enver Sangineto, Nicu Sebe

Abstract: In this paper, we address the problem of generating person images conditioned on both pose and appearance information. Specifically, given an image xa of a person and a target pose P(xb), extracted from a different image xb, we synthesize a new image of that person in pose P(xb), while preserving the visual details in xa. In order to deal with pixel-to-pixel misalignments caused by the pose differ… ▽ More In this paper, we address the problem of generating person images conditioned on both pose and appearance information. Specifically, given an image xa of a person and a target pose P(xb), extracted from a different image xb, we synthesize a new image of that person in pose P(xb), while preserving the visual details in xa. In order to deal with pixel-to-pixel misalignments caused by the pose differences between P(xa) and P(xb), we introduce deformable skip connections in the generator of our Generative Adversarial Network. Moreover, a nearest-neighbour loss is proposed instead of the common L1 and L2 losses in order to match the details of the generated image with the target image. Quantitative and qualitative results, using common datasets and protocols recently proposed for this task, show that our approach is competitive with respect to the state of the art. Moreover, we conduct an extensive evaluation using off-the-shell person re-identification (Re-ID) systems trained with person-generation based augmented data, which is one of the main important applications for this task. Our experiments show that our Deformable GANs can significantly boost the Re-ID accuracy and are even better than data-augmentation methods specifically trained using Re-ID losses. △ Less

Submitted 14 October, 2019; v1 submitted 30 April, 2019; originally announced May 2019.

Comments: To appear on IEEE TPAMI. arXiv admin note: substantial text overlap with arXiv:1801.00055

arXiv:1904.08462 [pdf, other]

Online Adaptation through Meta-Learning for Stereo Depth Estimation

Authors: Zhenyu Zhang, Stéphane Lathuilière, Andrea Pilzer, Nicu Sebe, Elisa Ricci, Jian Yang

Abstract: In this work, we tackle the problem of online adaptation for stereo depth estimation, that consists in continuously adapting a deep network to a target video recordedin an environment different from that of the source training set. To address this problem, we propose a novel Online Meta-Learning model with Adaption (OMLA). Our proposal is based on two main contributions. First, to reducethe domain… ▽ More In this work, we tackle the problem of online adaptation for stereo depth estimation, that consists in continuously adapting a deep network to a target video recordedin an environment different from that of the source training set. To address this problem, we propose a novel Online Meta-Learning model with Adaption (OMLA). Our proposal is based on two main contributions. First, to reducethe domain-shift between source and target feature distributions we introduce an online feature alignment procedurederived from Batch Normalization. Second, we devise a meta-learning approach that exploits feature alignment forfaster convergence in an online learning setting. Additionally, we propose a meta-pre-training algorithm in order toobtain initial network weights on the source dataset whichfacilitate adaptation on future data streams. Experimentally, we show that both OMLA and meta-pre-training helpthe model to adapt faster to a new environment. Our proposal is evaluated on the wellestablished KITTI dataset,where we show that our online method is competitive withstate of the art algorithms trained in a batch setting. △ Less

Submitted 17 April, 2019; originally announced April 2019.

Comments: 12 pages

arXiv:1904.06807 [pdf, other]

Multi-Channel Attention Selection GAN with Cascaded Semantic Guidance for Cross-View Image Translation

Authors: Hao Tang, Dan Xu, Nicu Sebe, Yanzhi Wang, Jason J. Corso, Yan Yan

Abstract: Cross-view image translation is challenging because it involves images with drastically different views and severe deformation. In this paper, we propose a novel approach named Multi-Channel Attention SelectionGAN (SelectionGAN) that makes it possible to generate images of natural scenes in arbitrary viewpoints, based on an image of the scene and a novel semantic map. The proposed SelectionGAN exp… ▽ More Cross-view image translation is challenging because it involves images with drastically different views and severe deformation. In this paper, we propose a novel approach named Multi-Channel Attention SelectionGAN (SelectionGAN) that makes it possible to generate images of natural scenes in arbitrary viewpoints, based on an image of the scene and a novel semantic map. The proposed SelectionGAN explicitly utilizes the semantic information and consists of two stages. In the first stage, the condition image and the target semantic map are fed into a cycled semantic-guided generation network to produce initial coarse results. In the second stage, we refine the initial results by using a multi-channel attention selection mechanism. Moreover, uncertainty maps automatically learned from attentions are used to guide the pixel loss for better network optimization. Extensive experiments on Dayton, CVUSA and Ego2Top datasets show that our model is able to generate significantly better results than the state-of-the-art methods. The source code, data and trained models are available at https://github.com/Ha0Tang/SelectionGAN. △ Less

Submitted 16 April, 2019; v1 submitted 14 April, 2019; originally announced April 2019.

Comments: 20 pages, 16 figures, accepted to CVPR 2019 as an oral paper

Journal ref: CVPR 2019

arXiv:1904.01258 [pdf, other]

doi 10.1109/LGRS.2020.2974629

Metric-Learning based Deep Hashing Network for Content Based Retrieval of Remote Sensing Images

Authors: Subhankar Roy, Enver Sangineto, Begüm Demir, Nicu Sebe

Abstract: Hashing methods have been recently found very effective in retrieval of remote sensing (RS) images due to their computational efficiency and fast search speed. The traditional hashing methods in RS usually exploit hand-crafted features to learn hash functions to obtain binary codes, which can be insufficient to optimally represent the information content of RS images. To overcome this problem, in… ▽ More Hashing methods have been recently found very effective in retrieval of remote sensing (RS) images due to their computational efficiency and fast search speed. The traditional hashing methods in RS usually exploit hand-crafted features to learn hash functions to obtain binary codes, which can be insufficient to optimally represent the information content of RS images. To overcome this problem, in this paper we introduce a metric-learning based hashing network, which learns: 1) a semantic-based metric space for effective feature representation; and 2) compact binary hash codes for fast archive search. Our network considers an interplay of multiple loss functions that allows to jointly learn a metric based semantic space facilitating similar images to be clustered together in that target space and at the same time producing compact final activations that lose negligible information when binarized. Experiments carried out on two benchmark RS archives point out that the proposed network significantly improves the retrieval performance under the same retrieval time when compared to the state-of-the-art hashing methods in RS. △ Less

Submitted 6 January, 2021; v1 submitted 2 April, 2019; originally announced April 2019.

Comments: Accepted to IEEE Geoscience and Remote Sensing Letters. For code visit: https://github.com/MLEnthusiast/MHCLN

arXiv:1903.12296 [pdf, other]

Attention-Guided Generative Adversarial Networks for Unsupervised Image-to-Image Translation

Authors: Hao Tang, Dan Xu, Nicu Sebe, Yan Yan

Abstract: The state-of-the-art approaches in Generative Adversarial Networks (GANs) are able to learn a mapping function from one image domain to another with unpaired image data. However, these methods often produce artifacts and can only be able to convert low-level information, but fail to transfer high-level semantic part of images. The reason is mainly that generators do not have the ability to detect… ▽ More The state-of-the-art approaches in Generative Adversarial Networks (GANs) are able to learn a mapping function from one image domain to another with unpaired image data. However, these methods often produce artifacts and can only be able to convert low-level information, but fail to transfer high-level semantic part of images. The reason is mainly that generators do not have the ability to detect the most discriminative semantic part of images, which thus makes the generated images with low-quality. To handle the limitation, in this paper we propose a novel Attention-Guided Generative Adversarial Network (AGGAN), which can detect the most discriminative semantic object and minimize changes of unwanted part for semantic manipulation problems without using extra data and models. The attention-guided generators in AGGAN are able to produce attention masks via a built-in attention mechanism, and then fuse the input image with the attention mask to obtain a target image with high-quality. Moreover, we propose a novel attention-guided discriminator which only considers attended regions. The proposed AGGAN is trained by an end-to-end fashion with an adversarial loss, cycle-consistency loss, pixel loss and attention loss. Both qualitative and quantitative results demonstrate that our approach is effective to generate sharper and more accurate images than existing models. The code is available at https://github.com/Ha0Tang/AttentionGAN. △ Less

Submitted 27 August, 2019; v1 submitted 28 March, 2019; originally announced March 2019.

Comments: 8 pages, 7 figures, Accepted to IJCNN 2019

Journal ref: IJCNN 2019

arXiv:1903.04202 [pdf, other]

Refine and Distill: Exploiting Cycle-Inconsistency and Knowledge Distillation for Unsupervised Monocular Depth Estimation

Authors: Andrea Pilzer, Stéphane Lathuilière, Nicu Sebe, Elisa Ricci

Abstract: Nowadays, the majority of state of the art monocular depth estimation techniques are based on supervised deep learning models. However, collecting RGB images with associated depth maps is a very time consuming procedure. Therefore, recent works have proposed deep architectures for addressing the monocular depth prediction task as a reconstruction problem, thus avoiding the need of collecting groun… ▽ More Nowadays, the majority of state of the art monocular depth estimation techniques are based on supervised deep learning models. However, collecting RGB images with associated depth maps is a very time consuming procedure. Therefore, recent works have proposed deep architectures for addressing the monocular depth prediction task as a reconstruction problem, thus avoiding the need of collecting ground-truth depth. Following these works, we propose a novel self-supervised deep model for estimating depth maps. Our framework exploits two main strategies: refinement via cycle-inconsistency and distillation. Specifically, first a \emph{student} network is trained to predict a disparity map such as to recover from a frame in a camera view the associated image in the opposite view. Then, a backward cycle network is applied to the generated image to re-synthesize back the input image, estimating the opposite disparity. A third network exploits the inconsistency between the original and the reconstructed input frame in order to output a refined depth map. Finally, knowledge distillation is exploited, such as to transfer information from the refinement network to the student. Our extensive experimental evaluation demonstrate the effectiveness of the proposed framework which outperforms state of the art unsupervised methods on the KITTI benchmark. △ Less

Submitted 20 April, 2019; v1 submitted 11 March, 2019; originally announced March 2019.

Comments: Accepted at CVPR2019

arXiv:1903.03215 [pdf, other]

Unsupervised Domain Adaptation using Feature-Whitening and Consensus Loss

Authors: Subhankar Roy, Aliaksandr Siarohin, Enver Sangineto, Samuel Rota Bulo, Nicu Sebe, Elisa Ricci

Abstract: A classifier trained on a dataset seldom works on other datasets obtained under different conditions due to domain shift. This problem is commonly addressed by domain adaptation methods. In this work we introduce a novel deep learning framework which unifies different paradigms in unsupervised domain adaptation. Specifically, we propose domain alignment layers which implement feature whitening for… ▽ More A classifier trained on a dataset seldom works on other datasets obtained under different conditions due to domain shift. This problem is commonly addressed by domain adaptation methods. In this work we introduce a novel deep learning framework which unifies different paradigms in unsupervised domain adaptation. Specifically, we propose domain alignment layers which implement feature whitening for the purpose of matching source and target feature distributions. Additionally, we leverage the unlabeled target data by proposing the Min-Entropy Consensus loss, which regularizes training while avoiding the adoption of many user-defined hyper-parameters. We report results on publicly available datasets, considering both digit classification and object recognition tasks. We show that, in most of our experiments, our approach improves upon previous methods, setting new state-of-the-art performances. △ Less

Submitted 16 February, 2020; v1 submitted 7 March, 2019; originally announced March 2019.

Comments: CVPR 2019

arXiv:1901.09774 [pdf, other]

Attribute-Guided Sketch Generation

Authors: Hao Tang, Xinya Chen, Wei Wang, Dan Xu, Jason J. Corso, Nicu Sebe, Yan Yan

Abstract: Facial attributes are important since they provide a detailed description and determine the visual appearance of human faces. In this paper, we aim at converting a face image to a sketch while simultaneously generating facial attributes. To this end, we propose a novel Attribute-Guided Sketch Generative Adversarial Network (ASGAN) which is an end-to-end framework and contains two pairs of generato… ▽ More Facial attributes are important since they provide a detailed description and determine the visual appearance of human faces. In this paper, we aim at converting a face image to a sketch while simultaneously generating facial attributes. To this end, we propose a novel Attribute-Guided Sketch Generative Adversarial Network (ASGAN) which is an end-to-end framework and contains two pairs of generators and discriminators, one of which is used to generate faces with attributes while the other one is employed for image-to-sketch translation. The two generators form a W-shaped network (W-net) and they are trained jointly with a weight-sharing constraint. Additionally, we also propose two novel discriminators, the residual one focusing on attribute generation and the triplex one helping to generate realistic looking sketches. To validate our model, we have created a new large dataset with 8,804 images, named the Attribute Face Photo & Sketch (AFPS) dataset which is the first dataset containing attributes associated to face sketch images. The experimental results demonstrate that the proposed network (i) generates more photo-realistic faces with sharper facial attributes than baselines and (ii) has good generalization capability on different generative tasks. △ Less

Submitted 14 April, 2019; v1 submitted 28 January, 2019; originally announced January 2019.

Comments: 7 pages, 6 figures, accepted to FG 2019

arXiv:1901.04622 [pdf, other]

Fast and Robust Dynamic Hand Gesture Recognition via Key Frames Extraction and Feature Fusion

Authors: Hao Tang, Hong Liu, Wei Xiao, Nicu Sebe

Abstract: Gesture recognition is a hot topic in computer vision and pattern recognition, which plays a vitally important role in natural human-computer interface. Although great progress has been made recently, fast and robust hand gesture recognition remains an open problem, since the existing methods have not well balanced the performance and the efficiency simultaneously. To bridge it, this work combines… ▽ More Gesture recognition is a hot topic in computer vision and pattern recognition, which plays a vitally important role in natural human-computer interface. Although great progress has been made recently, fast and robust hand gesture recognition remains an open problem, since the existing methods have not well balanced the performance and the efficiency simultaneously. To bridge it, this work combines image entropy and density clustering to exploit the key frames from hand gesture video for further feature extraction, which can improve the efficiency of recognition. Moreover, a feature fusion strategy is also proposed to further improve feature representation, which elevates the performance of recognition. To validate our approach in a "wild" environment, we also introduce two new datasets called HandGesture and Action3D datasets. Experiments consistently demonstrate that our strategy achieves competitive results on Northwestern University, Cambridge, HandGesture and Action3D hand gesture datasets. Our code and datasets will release at https://github.com/Ha0Tang/HandGestureRecognition. △ Less

Submitted 14 January, 2019; originally announced January 2019.

Comments: 11 pages, 3 figures, accepted to NeuroComputing

arXiv:1901.04604 [pdf, other]

Dual Generator Generative Adversarial Networks for Multi-Domain Image-to-Image Translation

Authors: Hao Tang, Dan Xu, Wei Wang, Yan Yan, Nicu Sebe

Abstract: State-of-the-art methods for image-to-image translation with Generative Adversarial Networks (GANs) can learn a mapping from one domain to another domain using unpaired image data. However, these methods require the training of one specific model for every pair of image domains, which limits the scalability in dealing with more than two image domains. In addition, the training stage of these metho… ▽ More State-of-the-art methods for image-to-image translation with Generative Adversarial Networks (GANs) can learn a mapping from one domain to another domain using unpaired image data. However, these methods require the training of one specific model for every pair of image domains, which limits the scalability in dealing with more than two image domains. In addition, the training stage of these methods has the common problem of model collapse that degrades the quality of the generated images. To tackle these issues, we propose a Dual Generator Generative Adversarial Network (G$^2$GAN), which is a robust and scalable approach allowing to perform unpaired image-to-image translation for multiple domains using only dual generators within a single model. Moreover, we explore different optimization losses for better training of G$^2$GAN, and thus make unpaired image-to-image translation with higher consistency and better stability. Extensive experiments on six publicly available datasets with different scenarios, i.e., architectural buildings, seasons, landscape and human faces, demonstrate that the proposed G$^2$GAN achieves superior model capacity and better generation performance comparing with existing image-to-image translation GAN models. △ Less

Submitted 14 January, 2019; originally announced January 2019.

Comments: 16 pages, 7 figures, accepted to ACCV 2018

arXiv:1901.01868 [pdf, other]

Low-Shot Learning from Imaginary 3D Model

Authors: Frederik Pahde, Mihai Puscas, Jannik Wolff, Tassilo Klein, Nicu Sebe, Moin Nabi

Abstract: Since the advent of deep learning, neural networks have demonstrated remarkable results in many visual recognition tasks, constantly pushing the limits. However, the state-of-the-art approaches are largely unsuitable in scarce data regimes. To address this shortcoming, this paper proposes employing a 3D model, which is derived from training images. Such a model can then be used to hallucinate nove… ▽ More Since the advent of deep learning, neural networks have demonstrated remarkable results in many visual recognition tasks, constantly pushing the limits. However, the state-of-the-art approaches are largely unsuitable in scarce data regimes. To address this shortcoming, this paper proposes employing a 3D model, which is derived from training images. Such a model can then be used to hallucinate novel viewpoints and poses for the scarce samples of the few-shot learning scenario. A self-paced learning approach allows for the selection of a diverse set of high-quality images, which facilitates the training of a classifier. The performance of the proposed approach is showcased on the fine-grained CUB-200-2011 dataset in a few-shot setting and significantly improves our baseline accuracy. △ Less

Submitted 4 January, 2019; originally announced January 2019.

Comments: To appear at WACV 2019. arXiv admin note: text overlap with arXiv:1811.09192

arXiv:1812.11771 [pdf, other]

Predicting Group Cohesiveness in Images

Authors: Shreya Ghosh, Abhinav Dhall, Nicu Sebe, Tom Gedeon

Abstract: The cohesiveness of a group is an essential indicator of the emotional state, structure and success of a group of people. We study the factors that influence the perception of group-level cohesion and propose methods for estimating the human-perceived cohesion on the group cohesiveness scale. In order to identify the visual cues (attributes) for cohesion, we conducted a user survey. Image analysis… ▽ More The cohesiveness of a group is an essential indicator of the emotional state, structure and success of a group of people. We study the factors that influence the perception of group-level cohesion and propose methods for estimating the human-perceived cohesion on the group cohesiveness scale. In order to identify the visual cues (attributes) for cohesion, we conducted a user survey. Image analysis is performed at a group-level via a multi-task convolutional neural network. For analyzing the contribution of facial expressions of the group members for predicting the Group Cohesion Score (GCS), a capsule network is explored. We add GCS to the Group Affect database and propose the `GAF-Cohesion database'. The proposed model performs well on the database and is able to achieve near human-level performance in predicting a group's cohesion score. It is interesting to note that group cohesion as an attribute, when jointly trained for group-level emotion prediction, helps in increasing the performance for the later task. This suggests that group-level emotion and cohesion are correlated. △ Less

Submitted 7 April, 2019; v1 submitted 31 December, 2018; originally announced December 2018.

arXiv:1812.08861 [pdf, other]

Animating Arbitrary Objects via Deep Motion Transfer

Authors: Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, Nicu Sebe

Abstract: This paper introduces a novel deep learning framework for image animation. Given an input image with a target object and a driving video sequence depicting a moving object, our framework generates a video in which the target object is animated according to the driving sequence. This is achieved through a deep architecture that decouples appearance and motion information. Our framework consists of… ▽ More This paper introduces a novel deep learning framework for image animation. Given an input image with a target object and a driving video sequence depicting a moving object, our framework generates a video in which the target object is animated according to the driving sequence. This is achieved through a deep architecture that decouples appearance and motion information. Our framework consists of three main modules: (i) a Keypoint Detector unsupervisely trained to extract object keypoints, (ii) a Dense Motion prediction network for generating dense heatmaps from sparse keypoints, in order to better encode motion information and (iii) a Motion Transfer Network, which uses the motion heatmaps and appearance information extracted from the input image to synthesize the output frames. We demonstrate the effectiveness of our method on several benchmark datasets, spanning a wide variety of object appearances, and show that our approach outperforms state-of-the-art image animation and video generation methods. Our source code is publicly available. △ Less

Submitted 30 August, 2019; v1 submitted 20 December, 2018; originally announced December 2018.

Comments: CVPR-2019 (oral)

arXiv:1812.00717 [pdf, other]

Enhancing Perceptual Attributes with Bayesian Style Generation

Authors: Aliaksandr Siarohin, Gloria Zen, Nicu Sebe, Elisa Ricci

Abstract: Deep learning has brought an unprecedented progress in computer vision and significant advances have been made in predicting subjective properties inherent to visual data (e.g., memorability, aesthetic quality, evoked emotions, etc.). Recently, some research works have even proposed deep learning approaches to modify images such as to appropriately alter these properties. Following this research l… ▽ More Deep learning has brought an unprecedented progress in computer vision and significant advances have been made in predicting subjective properties inherent to visual data (e.g., memorability, aesthetic quality, evoked emotions, etc.). Recently, some research works have even proposed deep learning approaches to modify images such as to appropriately alter these properties. Following this research line, this paper introduces a novel deep learning framework for synthesizing images in order to enhance a predefined perceptual attribute. Our approach takes as input a natural image and exploits recent models for deep style transfer and generative adversarial networks to change its style in order to modify a specific high-level attribute. Differently from previous works focusing on enhancing a specific property of a visual content, we propose a general framework and demonstrate its effectiveness in two use cases, i.e. increasing image memorability and generating scary pictures. We evaluate the proposed approach on publicly available benchmarks, demonstrating its advantages over state of the art methods. △ Less

Submitted 3 December, 2018; originally announced December 2018.

Comments: ACCV-2018

arXiv:1809.04185 [pdf, other]

Deep Micro-Dictionary Learning and Coding Network

Authors: Hao Tang, Heng Wei, Wei Xiao, Wei Wang, Dan Xu, Yan Yan, Nicu Sebe

Abstract: In this paper, we propose a novel Deep Micro-Dictionary Learning and Coding Network (DDLCN). DDLCN has most of the standard deep learning layers (pooling, fully, connected, input/output, etc.) but the main difference is that the fundamental convolutional layers are replaced by novel compound dictionary learning and coding layers. The dictionary learning layer learns an over-complete dictionary for… ▽ More In this paper, we propose a novel Deep Micro-Dictionary Learning and Coding Network (DDLCN). DDLCN has most of the standard deep learning layers (pooling, fully, connected, input/output, etc.) but the main difference is that the fundamental convolutional layers are replaced by novel compound dictionary learning and coding layers. The dictionary learning layer learns an over-complete dictionary for the input training data. At the deep coding layer, a locality constraint is added to guarantee that the activated dictionary bases are close to each other. Next, the activated dictionary atoms are assembled together and passed to the next compound dictionary learning and coding layers. In this way, the activated atoms in the first layer can be represented by the deeper atoms in the second dictionary. Intuitively, the second dictionary is designed to learn the fine-grained components which are shared among the input dictionary atoms. In this way, a more informative and discriminative low-level representation of the dictionary atoms can be obtained. We empirically compare the proposed DDLCN with several dictionary learning methods and deep learning architectures. The experimental results on four popular benchmark datasets demonstrate that the proposed DDLCN achieves competitive results compared with state-of-the-art approaches. △ Less

Submitted 25 December, 2018; v1 submitted 11 September, 2018; originally announced September 2018.

Comments: 10 page, 8 figures, accepted to WACV 2019

arXiv:1808.04859 [pdf, other]

GestureGAN for Hand Gesture-to-Gesture Translation in the Wild

Authors: Hao Tang, Wei Wang, Dan Xu, Yan Yan, Nicu Sebe

Abstract: Hand gesture-to-gesture translation in the wild is a challenging task since hand gestures can have arbitrary poses, sizes, locations and self-occlusions. Therefore, this task requires a high-level understanding of the mapping between the input source gesture and the output target gesture. To tackle this problem, we propose a novel hand Gesture Generative Adversarial Network (GestureGAN). GestureGA… ▽ More Hand gesture-to-gesture translation in the wild is a challenging task since hand gestures can have arbitrary poses, sizes, locations and self-occlusions. Therefore, this task requires a high-level understanding of the mapping between the input source gesture and the output target gesture. To tackle this problem, we propose a novel hand Gesture Generative Adversarial Network (GestureGAN). GestureGAN consists of a single generator $G$ and a discriminator $D$, which takes as input a conditional hand image and a target hand skeleton image. GestureGAN utilizes the hand skeleton information explicitly, and learns the gesture-to-gesture mapping through two novel losses, the color loss and the cycle-consistency loss. The proposed color loss handles the issue of "channel pollution" while back-propagating the gradients. In addition, we present the Fréchet ResNet Distance (FRD) to evaluate the quality of generated images. Extensive experiments on two widely used benchmark datasets demonstrate that the proposed GestureGAN achieves state-of-the-art performance on the unconstrained hand gesture-to-gesture translation task. Meanwhile, the generated images are in high-quality and are photo-realistic, allowing them to be used as data augmentation to improve the performance of a hand gesture classifier. Our model and code are available at https://github.com/Ha0Tang/GestureGAN. △ Less

Submitted 19 July, 2019; v1 submitted 14 August, 2018; originally announced August 2018.

Comments: 9 pages, 7 figures, accepted to ACM MM 2018 as an oral paper, fix typos

arXiv:1807.10915 [pdf, other]

Unsupervised Adversarial Depth Estimation using Cycled Generative Networks

Authors: Andrea Pilzer, Dan Xu, Mihai Marian Puscas, Elisa Ricci, Nicu Sebe

Abstract: While recent deep monocular depth estimation approaches based on supervised regression have achieved remarkable performance, costly ground truth annotations are required during training. To cope with this issue, in this paper we present a novel unsupervised deep learning approach for predicting depth maps and show that the depth estimation task can be effectively tackled within an adversarial lear… ▽ More While recent deep monocular depth estimation approaches based on supervised regression have achieved remarkable performance, costly ground truth annotations are required during training. To cope with this issue, in this paper we present a novel unsupervised deep learning approach for predicting depth maps and show that the depth estimation task can be effectively tackled within an adversarial learning framework. Specifically, we propose a deep generative network that learns to predict the correspondence field i.e. the disparity map between two image views in a calibrated stereo camera setting. The proposed architecture consists of two generative sub-networks jointly trained with adversarial learning for reconstructing the disparity map and organized in a cycle such as to provide mutual constraints and supervision to each other. Extensive experiments on the publicly available datasets KITTI and Cityscapes demonstrate the effectiveness of the proposed model and competitive results with state of the art methods. The code and trained model are available on https://github.com/andrea-pilzer/unsup-stereo-depthGAN. △ Less

Submitted 28 July, 2018; originally announced July 2018.

Comments: To appear in 3DV 2018. Code is available on GitHub

arXiv:1806.00420 [pdf, other]

Whitening and Coloring batch transform for GANs

Authors: Aliaksandr Siarohin, Enver Sangineto, Nicu Sebe

Abstract: Batch Normalization (BN) is a common technique used to speed-up and stabilize training. On the other hand, the learnable parameters of BN are commonly used in conditional Generative Adversarial Networks (cGANs) for representing class-specific information using conditional Batch Normalization (cBN). In this paper we propose to generalize both BN and cBN using a Whitening and Coloring based batch no… ▽ More Batch Normalization (BN) is a common technique used to speed-up and stabilize training. On the other hand, the learnable parameters of BN are commonly used in conditional Generative Adversarial Networks (cGANs) for representing class-specific information using conditional Batch Normalization (cBN). In this paper we propose to generalize both BN and cBN using a Whitening and Coloring based batch normalization. We show that our conditional Coloring can represent categorical conditioning information which largely helps the cGAN qualitative results. Moreover, we show that full-feature whitening is important in a general GAN scenario in which the training process is known to be highly unstable. We test our approach on different datasets and using different GAN networks and training protocols, showing a consistent improvement in all the tested frameworks. Our CIFAR-10 conditioned results are higher than all previous works on this dataset. △ Less

Submitted 25 February, 2019; v1 submitted 1 June, 2018; originally announced June 2018.

Comments: ICLR 2019

Showing 151–200 of 227 results for author: Sebe, N