Search | arXiv e-print repository

Geometry-guided Feature Learning and Fusion for Indoor Scene Reconstruction

Authors: Ruihong Yin, Sezer Karaoglu, Theo Gevers

Abstract: In addition to color and textural information, geometry provides important cues for 3D scene reconstruction. However, current reconstruction methods only include geometry at the feature level thus not fully exploiting the geometric information. In contrast, this paper proposes a novel geometry integration mechanism for 3D scene reconstruction. Our approach incorporates 3D geometry at three level… ▽ More In addition to color and textural information, geometry provides important cues for 3D scene reconstruction. However, current reconstruction methods only include geometry at the feature level thus not fully exploiting the geometric information. In contrast, this paper proposes a novel geometry integration mechanism for 3D scene reconstruction. Our approach incorporates 3D geometry at three levels, i.e. feature learning, feature fusion, and network supervision. First, geometry-guided feature learning encodes geometric priors to contain view-dependent information. Second, a geometry-guided adaptive feature fusion is introduced which utilizes the geometric priors as a guidance to adaptively generate weights for multiple views. Third, at the supervision level, taking the consistency between 2D and 3D normals into account, a consistent 3D normal loss is designed to add local constraints. Large-scale experiments are conducted on the ScanNet dataset, showing that volumetric methods with our geometry integration mechanism outperform state-of-the-art methods quantitatively as well as qualitatively. Volumetric methods with ours also show good generalization on the 7-Scenes and TUM RGB-D datasets. △ Less

Submitted 28 August, 2024; originally announced August 2024.

Comments: Accepted by ICCV2023

arXiv:2408.15524 [pdf, other]

Ray-Distance Volume Rendering for Neural Scene Reconstruction

Authors: Ruihong Yin, Yunlu Chen, Sezer Karaoglu, Theo Gevers

Abstract: Existing methods in neural scene reconstruction utilize the Signed Distance Function (SDF) to model the density function. However, in indoor scenes, the density computed from the SDF for a sampled point may not consistently reflect its real importance in volume rendering, often due to the influence of neighboring objects. To tackle this issue, our work proposes a novel approach for indoor scene re… ▽ More Existing methods in neural scene reconstruction utilize the Signed Distance Function (SDF) to model the density function. However, in indoor scenes, the density computed from the SDF for a sampled point may not consistently reflect its real importance in volume rendering, often due to the influence of neighboring objects. To tackle this issue, our work proposes a novel approach for indoor scene reconstruction, which instead parameterizes the density function with the Signed Ray Distance Function (SRDF). Firstly, the SRDF is predicted by the network and transformed to a ray-conditioned density function for volume rendering. We argue that the ray-specific SRDF only considers the surface along the camera ray, from which the derived density function is more consistent to the real occupancy than that from the SDF. Secondly, although SRDF and SDF represent different aspects of scene geometries, their values should share the same sign indicating the underlying spatial occupancy. Therefore, this work introduces a SRDF-SDF consistency loss to constrain the signs of the SRDF and SDF outputs. Thirdly, this work proposes a self-supervised visibility task, introducing the physical visibility geometry to the reconstruction task. The visibility task combines prior from predicted SRDF and SDF as pseudo labels, and contributes to generating more accurate 3D geometry. Our method implemented with different representations has been validated on indoor datasets, achieving improved performance in both reconstruction and view synthesis. △ Less

Submitted 28 August, 2024; originally announced August 2024.

Comments: Accepted by ECCV2024

arXiv:2408.10906 [pdf, other]

ShapeSplat: A Large-scale Dataset of Gaussian Splats and Their Self-Supervised Pretraining

Authors: Qi Ma, Yue Li, Bin Ren, Nicu Sebe, Ender Konukoglu, Theo Gevers, Luc Van Gool, Danda Pani Paudel

Abstract: 3D Gaussian Splatting (3DGS) has become the de facto method of 3D representation in many vision tasks. This calls for the 3D understanding directly in this representation space. To facilitate the research in this direction, we first build a large-scale dataset of 3DGS using the commonly used ShapeNet and ModelNet datasets. Our dataset ShapeSplat consists of 65K objects from 87 unique categories, w… ▽ More 3D Gaussian Splatting (3DGS) has become the de facto method of 3D representation in many vision tasks. This calls for the 3D understanding directly in this representation space. To facilitate the research in this direction, we first build a large-scale dataset of 3DGS using the commonly used ShapeNet and ModelNet datasets. Our dataset ShapeSplat consists of 65K objects from 87 unique categories, whose labels are in accordance with the respective datasets. The creation of this dataset utilized the compute equivalent of 2 GPU years on a TITAN XP GPU. We utilize our dataset for unsupervised pretraining and supervised finetuning for classification and segmentation tasks. To this end, we introduce \textbf{\textit{Gaussian-MAE}}, which highlights the unique benefits of representation learning from Gaussian parameters. Through exhaustive experiments, we provide several valuable insights. In particular, we show that (1) the distribution of the optimized GS centroids significantly differs from the uniformly sampled point cloud (used for initialization) counterpart; (2) this change in distribution results in degradation in classification but improvement in segmentation tasks when using only the centroids; (3) to leverage additional Gaussian parameters, we propose Gaussian feature grouping in a normalized feature space, along with splats pooling layer, offering a tailored solution to effectively group and embed similar Gaussians, which leads to notable improvement in finetuning tasks. △ Less

Submitted 20 August, 2024; originally announced August 2024.

arXiv:2407.20785 [pdf, other]

Retinex-Diffusion: On Controlling Illumination Conditions in Diffusion Models via Retinex Theory

Authors: Xiaoyan Xing, Vincent Tao Hu, Jan Hendrik Metzen, Konrad Groh, Sezer Karaoglu, Theo Gevers

Abstract: This paper introduces a novel approach to illumination manipulation in diffusion models, addressing the gap in conditional image generation with a focus on lighting conditions. We conceptualize the diffusion model as a black-box image render and strategically decompose its energy function in alignment with the image formation model. Our method effectively separates and controls illumination-relate… ▽ More This paper introduces a novel approach to illumination manipulation in diffusion models, addressing the gap in conditional image generation with a focus on lighting conditions. We conceptualize the diffusion model as a black-box image render and strategically decompose its energy function in alignment with the image formation model. Our method effectively separates and controls illumination-related properties during the generative process. It generates images with realistic illumination effects, including cast shadow, soft shadow, and inter-reflections. Remarkably, it achieves this without the necessity for learning intrinsic decomposition, finding directions in latent space, or undergoing additional training with new datasets. △ Less

Submitted 28 July, 2024; originally announced July 2024.

arXiv:2407.20727 [pdf, other]

SceneTeller: Language-to-3D Scene Generation

Authors: Başak Melis Öcal, Maxim Tatarchenko, Sezer Karaoglu, Theo Gevers

Abstract: Designing high-quality indoor 3D scenes is important in many practical applications, such as room planning or game development. Conventionally, this has been a time-consuming process which requires both artistic skill and familiarity with professional software, making it hardly accessible for layman users. However, recent advances in generative AI have established solid foundation for democratizin… ▽ More Designing high-quality indoor 3D scenes is important in many practical applications, such as room planning or game development. Conventionally, this has been a time-consuming process which requires both artistic skill and familiarity with professional software, making it hardly accessible for layman users. However, recent advances in generative AI have established solid foundation for democratizing 3D design. In this paper, we propose a pioneering approach for text-based 3D room design. Given a prompt in natural language describing the object placement in the room, our method produces a high-quality 3D scene corresponding to it. With an additional text prompt the users can change the appearance of the entire scene or of individual objects in it. Built using in-context learning, CAD model retrieval and 3D-Gaussian-Splatting-based stylization, our turnkey pipeline produces state-of-the-art 3D scenes, while being easy to use even for novices. Our project page is available at https://sceneteller.github.io/. △ Less

Submitted 30 July, 2024; originally announced July 2024.

Comments: ECCV'24 camera-ready version

arXiv:2406.09126 [pdf, other]

Auto-Vocabulary Segmentation for LiDAR Points

Authors: Weijie Wei, Osman Ülger, Fatemeh Karimi Nejadasl, Theo Gevers, Martin R. Oswald

Abstract: Existing perception methods for autonomous driving fall short of recognizing unknown entities not covered in the training data. Open-vocabulary methods offer promising capabilities in detecting any object but are limited by user-specified queries representing target classes. We propose AutoVoc3D, a framework for automatic object class recognition and open-ended segmentation. Evaluation on nuScenes… ▽ More Existing perception methods for autonomous driving fall short of recognizing unknown entities not covered in the training data. Open-vocabulary methods offer promising capabilities in detecting any object but are limited by user-specified queries representing target classes. We propose AutoVoc3D, a framework for automatic object class recognition and open-ended segmentation. Evaluation on nuScenes showcases AutoVoc3D's ability to generate precise semantic classes and accurate point-wise segmentation. Moreover, we introduce Text-Point Semantic Similarity, a new metric to assess the semantic similarity between text and point cloud without eliminating novel classes. △ Less

Submitted 25 July, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

Comments: Accepted by CVPR 2024 OpenSun3D Workshop

arXiv:2403.20092 [pdf, other]

Modeling Weather Uncertainty for Multi-weather Co-Presence Estimation

Authors: Qi Bi, Shaodi You, Theo Gevers

Abstract: Images from outdoor scenes may be taken under various weather conditions. It is well studied that weather impacts the performance of computer vision algorithms and needs to be handled properly. However, existing algorithms model weather condition as a discrete status and estimate it using multi-label classification. The fact is that, physically, specifically in meteorology, weather are modeled as… ▽ More Images from outdoor scenes may be taken under various weather conditions. It is well studied that weather impacts the performance of computer vision algorithms and needs to be handled properly. However, existing algorithms model weather condition as a discrete status and estimate it using multi-label classification. The fact is that, physically, specifically in meteorology, weather are modeled as a continuous and transitional status. Instead of directly implementing hard classification as existing multi-weather classification methods do, we consider the physical formulation of multi-weather conditions and model the impact of physical-related parameter on learning from the image appearance. In this paper, we start with solid revisit of the physics definition of weather and how it can be described as a continuous machine learning and computer vision task. Namely, we propose to model the weather uncertainty, where the level of probability and co-existence of multiple weather conditions are both considered. A Gaussian mixture model is used to encapsulate the weather uncertainty and a uncertainty-aware multi-weather learning scheme is proposed based on prior-posterior learning. A novel multi-weather co-presence estimation transformer (MeFormer) is proposed. In addition, a new multi-weather co-presence estimation (MePe) dataset, along with 14 fine-grained weather categories and 16,078 samples, is proposed to benchmark both conventional multi-label weather classification task and multi-weather co-presence estimation task. Large scale experiments show that the proposed method achieves state-of-the-art performance and substantial generalization capabilities on both the conventional multi-label weather classification task and the proposed multi-weather co-presence estimation task. Besides, modeling weather uncertainty also benefits adverse-weather semantic segmentation. △ Less

Submitted 29 March, 2024; originally announced March 2024.

Comments: Work in progress

arXiv:2312.10217 [pdf, other]

T-MAE: Temporal Masked Autoencoders for Point Cloud Representation Learning

Authors: Weijie Wei, Fatemeh Karimi Nejadasl, Theo Gevers, Martin R. Oswald

Abstract: The scarcity of annotated data in LiDAR point cloud understanding hinders effective representation learning. Consequently, scholars have been actively investigating efficacious self-supervised pre-training paradigms. Nevertheless, temporal information, which is inherent in the LiDAR point cloud sequence, is consistently disregarded. To better utilize this property, we propose an effective pre-trai… ▽ More The scarcity of annotated data in LiDAR point cloud understanding hinders effective representation learning. Consequently, scholars have been actively investigating efficacious self-supervised pre-training paradigms. Nevertheless, temporal information, which is inherent in the LiDAR point cloud sequence, is consistently disregarded. To better utilize this property, we propose an effective pre-training strategy, namely Temporal Masked Auto-Encoders (T-MAE), which takes as input temporally adjacent frames and learns temporal dependency. A SiamWCA backbone, containing a Siamese encoder and a windowed cross-attention (WCA) module, is established for the two-frame input. Considering that the movement of an ego-vehicle alters the view of the same instance, temporal modeling also serves as a robust and natural data augmentation, enhancing the comprehension of target objects. SiamWCA is a powerful architecture but heavily relies on annotated data. Our T-MAE pre-training strategy alleviates its demand for annotated data. Comprehensive experiments demonstrate that T-MAE achieves the best performance on both Waymo and ONCE datasets among competitive self-supervised approaches. Codes will be released at https://github.com/codename1995/T-MAE △ Less

Submitted 22 July, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

Comments: Accepted to ECCV 2024

arXiv:2312.10070 [pdf, other]

Gaussian-SLAM: Photo-realistic Dense SLAM with Gaussian Splatting

Authors: Vladimir Yugay, Yue Li, Theo Gevers, Martin R. Oswald

Abstract: We present a dense simultaneous localization and mapping (SLAM) method that uses 3D Gaussians as a scene representation. Our approach enables interactive-time reconstruction and photo-realistic rendering from real-world single-camera RGBD videos. To this end, we propose a novel effective strategy for seeding new Gaussians for newly explored areas and their effective online optimization that is ind… ▽ More We present a dense simultaneous localization and mapping (SLAM) method that uses 3D Gaussians as a scene representation. Our approach enables interactive-time reconstruction and photo-realistic rendering from real-world single-camera RGBD videos. To this end, we propose a novel effective strategy for seeding new Gaussians for newly explored areas and their effective online optimization that is independent of the scene size and thus scalable to larger scenes. This is achieved by organizing the scene into sub-maps which are independently optimized and do not need to be kept in memory. We further accomplish frame-to-model camera tracking by minimizing photometric and geometric losses between the input and rendered frames. The Gaussian representation allows for high-quality photo-realistic real-time rendering of real-world scenes. Evaluation on synthetic and real-world datasets demonstrates competitive or superior performance in mapping, tracking, and rendering compared to existing neural dense SLAM methods. △ Less

Submitted 22 March, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

arXiv:2310.07669 [pdf, other]

HaarNet: Large-scale Linear-Morphological Hybrid Network for RGB-D Semantic Segmentation

Authors: Rick Groenendijk, Leo Dorst, Theo Gevers

Abstract: Signals from different modalities each have their own combination algebra which affects their sampling processing. RGB is mostly linear; depth is a geometric signal following the operations of mathematical morphology. If a network obtaining RGB-D input has both kinds of operators available in its layers, it should be able to give effective output with fewer parameters. In this paper, morphological… ▽ More Signals from different modalities each have their own combination algebra which affects their sampling processing. RGB is mostly linear; depth is a geometric signal following the operations of mathematical morphology. If a network obtaining RGB-D input has both kinds of operators available in its layers, it should be able to give effective output with fewer parameters. In this paper, morphological elements in conjunction with more familiar linear modules are used to construct a mixed linear-morphological network called HaarNet. This is the first large-scale linear-morphological hybrid, evaluated on a set of sizeable real-world datasets. In the network, morphological Haar sampling is applied to both feature channels in several layers, which splits extreme values and high-frequency information such that both can be processed to improve both modalities. Moreover, morphologically parameterised ReLU is used, and morphologically-sound up-sampling is applied to obtain a full-resolution output. Experiments show that HaarNet is competitive with a state-of-the-art CNN, implying that morphological networks are a promising research direction for geometry-based learning tasks. △ Less

Submitted 11 October, 2023; originally announced October 2023.

ACM Class: I.4.6; I.2.6

arXiv:2310.07573 [pdf, other]

Relational Prior Knowledge Graphs for Detection and Instance Segmentation

Authors: Osman Ülger, Yu Wang, Ysbrand Galama, Sezer Karaoglu, Theo Gevers, Martin R. Oswald

Abstract: Humans have a remarkable ability to perceive and reason about the world around them by understanding the relationships between objects. In this paper, we investigate the effectiveness of using such relationships for object detection and instance segmentation. To this end, we propose a Relational Prior-based Feature Enhancement Model (RP-FEM), a graph transformer that enhances object proposal featu… ▽ More Humans have a remarkable ability to perceive and reason about the world around them by understanding the relationships between objects. In this paper, we investigate the effectiveness of using such relationships for object detection and instance segmentation. To this end, we propose a Relational Prior-based Feature Enhancement Model (RP-FEM), a graph transformer that enhances object proposal features using relational priors. The proposed architecture operates on top of scene graphs obtained from initial proposals and aims to concurrently learn relational context modeling for object detection and instance segmentation. Experimental evaluations on COCO show that the utilization of scene graphs, augmented with relational priors, offer benefits for object detection and instance segmentation. RP-FEM demonstrates its capacity to suppress improbable class predictions within the image while also preventing the model from generating duplicate predictions, leading to improvements over the baseline model on which it is built. △ Less

Submitted 11 October, 2023; originally announced October 2023.

Comments: Published in ICCV2023 SG2RL Workshop

arXiv:2309.17162 [pdf, other]

APNet: Urban-level Scene Segmentation of Aerial Images and Point Clouds

Authors: Weijie Wei, Martin R. Oswald, Fatemeh Karimi Nejadasl, Theo Gevers

Abstract: In this paper, we focus on semantic segmentation method for point clouds of urban scenes. Our fundamental concept revolves around the collaborative utilization of diverse scene representations to benefit from different context information and network architectures. To this end, the proposed network architecture, called APNet, is split into two branches: a point cloud branch and an aerial image bra… ▽ More In this paper, we focus on semantic segmentation method for point clouds of urban scenes. Our fundamental concept revolves around the collaborative utilization of diverse scene representations to benefit from different context information and network architectures. To this end, the proposed network architecture, called APNet, is split into two branches: a point cloud branch and an aerial image branch which input is generated from a point cloud. To leverage the different properties of each branch, we employ a geometry-aware fusion module that is learned to combine the results of each branch. Additional separate losses for each branch avoid that one branch dominates the results, ensure the best performance for each branch individually and explicitly define the input domain of the fusion network assuring it only performs data fusion. Our experiments demonstrate that the fusion output consistently outperforms the individual network branches and that APNet achieves state-of-the-art performance of 65.2 mIoU on the SensatUrban dataset. Upon acceptance, the source code will be made accessible. △ Less

Submitted 29 September, 2023; originally announced September 2023.

Comments: Accepted by ICCV Workshop 2023 and selected as an oral

arXiv:2307.10924 [pdf, other]

Intrinsic Image Decomposition Using Point Cloud Representation

Authors: Xiaoyan Xing, Konrad Groh, Sezer Karaoglu, Theo Gevers

Abstract: The purpose of intrinsic decomposition is to separate an image into its albedo (reflective properties) and shading components (illumination properties). This is challenging because it's an ill-posed problem. Conventional approaches primarily concentrate on 2D imagery and fail to fully exploit the capabilities of 3D data representation. 3D point clouds offer a more comprehensive format for represen… ▽ More The purpose of intrinsic decomposition is to separate an image into its albedo (reflective properties) and shading components (illumination properties). This is challenging because it's an ill-posed problem. Conventional approaches primarily concentrate on 2D imagery and fail to fully exploit the capabilities of 3D data representation. 3D point clouds offer a more comprehensive format for representing scenes, as they combine geometric and color information effectively. To this end, in this paper, we introduce Point Intrinsic Net (PoInt-Net), which leverages 3D point cloud data to concurrently estimate albedo and shading maps. The merits of PoInt-Net include the following aspects. First, the model is efficient, achieving consistent performance across point clouds of any size with training only required on small-scale point clouds. Second, it exhibits remarkable robustness; even when trained exclusively on datasets comprising individual objects, PoInt-Net demonstrates strong generalization to unseen objects and scenes. Third, it delivers superior accuracy over conventional 2D approaches, demonstrating enhanced performance across various metrics on different datasets. (Code Released) △ Less

Submitted 28 March, 2024; v1 submitted 20 July, 2023; originally announced July 2023.

Comments: Code: https://github.com/xyxingx/PoInt-Net

arXiv:2307.00371 [pdf, other]

Learning Content-enhanced Mask Transformer for Domain Generalized Urban-Scene Segmentation

Authors: Qi Bi, Shaodi You, Theo Gevers

Abstract: Domain-generalized urban-scene semantic segmentation (USSS) aims to learn generalized semantic predictions across diverse urban-scene styles. Unlike domain gap challenges, USSS is unique in that the semantic categories are often similar in different urban scenes, while the styles can vary significantly due to changes in urban landscapes, weather conditions, lighting, and other factors. Existing ap… ▽ More Domain-generalized urban-scene semantic segmentation (USSS) aims to learn generalized semantic predictions across diverse urban-scene styles. Unlike domain gap challenges, USSS is unique in that the semantic categories are often similar in different urban scenes, while the styles can vary significantly due to changes in urban landscapes, weather conditions, lighting, and other factors. Existing approaches typically rely on convolutional neural networks (CNNs) to learn the content of urban scenes. In this paper, we propose a Content-enhanced Mask TransFormer (CMFormer) for domain-generalized USSS. The main idea is to enhance the focus of the fundamental component, the mask attention mechanism, in Transformer segmentation models on content information. To achieve this, we introduce a novel content-enhanced mask attention mechanism. It learns mask queries from both the image feature and its down-sampled counterpart, as lower-resolution image features usually contain more robust content information and are less sensitive to style variations. These features are fused into a Transformer decoder and integrated into a multi-resolution content-enhanced mask attention learning scheme. Extensive experiments conducted on various domain-generalized urban-scene segmentation datasets demonstrate that the proposed CMFormer significantly outperforms existing CNN-based methods for domain-generalized semantic segmentation, achieving improvements of up to 14.00\% in terms of mIoU (mean intersection over union). The source code is publicly available at \url{https://github.com/BiQiWHU/CMFormer}. △ Less

Submitted 17 December, 2023; v1 submitted 1 July, 2023; originally announced July 2023.

Comments: Accepted by AAAI 2024. Camera-ready version with available source code

arXiv:2211.14037 [pdf, other]

MorphPool: Efficient Non-linear Pooling & Unpooling in CNNs

Authors: Rick Groenendijk, Leo Dorst, Theo Gevers

Abstract: Pooling is essentially an operation from the field of Mathematical Morphology, with max pooling as a limited special case. The more general setting of MorphPooling greatly extends the tool set for building neural networks. In addition to pooling operations, encoder-decoder networks used for pixel-level predictions also require unpooling. It is common to combine unpooling with convolution or deconv… ▽ More Pooling is essentially an operation from the field of Mathematical Morphology, with max pooling as a limited special case. The more general setting of MorphPooling greatly extends the tool set for building neural networks. In addition to pooling operations, encoder-decoder networks used for pixel-level predictions also require unpooling. It is common to combine unpooling with convolution or deconvolution for up-sampling. However, using its morphological properties, unpooling can be generalised and improved. Extensive experimentation on two tasks and three large-scale datasets shows that morphological pooling and unpooling lead to improved predictive performance at much reduced parameter counts. △ Less

Submitted 25 November, 2022; originally announced November 2022.

Comments: Accepted paper at the British Machine Vision Conference (BMVC) 2022

ACM Class: I.4.10; I.3.5; I.4.6

arXiv:2208.14369 [pdf, other]

SIGNet: Intrinsic Image Decomposition by a Semantic and Invariant Gradient Driven Network for Indoor Scenes

Authors: Partha Das, Sezer Karaoglu, Arjan Gijsenij, Theo Gevers

Abstract: Intrinsic image decomposition (IID) is an under-constrained problem. Therefore, traditional approaches use hand crafted priors to constrain the problem. However, these constraints are limited when coping with complex scenes. Deep learning-based approaches learn these constraints implicitly through the data, but they often suffer from dataset biases (due to not being able to include all possible im… ▽ More Intrinsic image decomposition (IID) is an under-constrained problem. Therefore, traditional approaches use hand crafted priors to constrain the problem. However, these constraints are limited when coping with complex scenes. Deep learning-based approaches learn these constraints implicitly through the data, but they often suffer from dataset biases (due to not being able to include all possible imaging conditions). In this paper, a combination of the two is proposed. Component specific priors like semantics and invariant features are exploited to obtain semantically and physically plausible reflectance transitions. These transitions are used to steer a progressive CNN with implicit homogeneity constraints to decompose reflectance and shading maps. An ablation study is conducted showing that the use of the proposed priors and progressive CNN increase the IID performance. State of the art performance on both our proposed dataset and the standard real-world IIW dataset shows the effectiveness of the proposed method. Code is made available at https://github.com/Morpheus3000/SIGNet △ Less

Submitted 30 August, 2022; originally announced August 2022.

arXiv:2204.04076 [pdf, other]

doi 10.1364/JOSAA.414682

Invariant Descriptors for Intrinsic Reflectance Optimization

Authors: Anil S. Baslamisli, Theo Gevers

Abstract: Intrinsic image decomposition aims to factorize an image into albedo (reflectance) and shading (illumination) sub-components. Being ill-posed and under-constrained, it is a very challenging computer vision problem. There are infinite pairs of reflectance and shading images that can reconstruct the same input. To address the problem, Intrinsic Images in the Wild provides an optimization framework b… ▽ More Intrinsic image decomposition aims to factorize an image into albedo (reflectance) and shading (illumination) sub-components. Being ill-posed and under-constrained, it is a very challenging computer vision problem. There are infinite pairs of reflectance and shading images that can reconstruct the same input. To address the problem, Intrinsic Images in the Wild provides an optimization framework based on a dense conditional random field (CRF) formulation that considers long-range material relations. We improve upon their model by introducing illumination invariant image descriptors: color ratios. The color ratios and the reflectance intrinsic are both invariant to illumination and thus are highly correlated. Through detailed experiments, we provide ways to inject the color ratios into the dense CRF optimization. Our approach is physics-based, learning-free and leads to more accurate and robust reflectance decompositions. △ Less

Submitted 8 April, 2022; originally announced April 2022.

Journal ref: Journal of the Optical Society of America A, Vol. 38, Issue 6, pp. 887-896 (2021)

arXiv:2203.16670 [pdf, other]

PIE-Net: Photometric Invariant Edge Guided Network for Intrinsic Image Decomposition

Authors: Partha Das, Sezer Karaoglu, Theo Gevers

Abstract: Intrinsic image decomposition is the process of recovering the image formation components (reflectance and shading) from an image. Previous methods employ either explicit priors to constrain the problem or implicit constraints as formulated by their losses (deep learning). These methods can be negatively influenced by strong illumination conditions causing shading-reflectance leakages. Therefore… ▽ More Intrinsic image decomposition is the process of recovering the image formation components (reflectance and shading) from an image. Previous methods employ either explicit priors to constrain the problem or implicit constraints as formulated by their losses (deep learning). These methods can be negatively influenced by strong illumination conditions causing shading-reflectance leakages. Therefore, in this paper, an end-to-end edge-driven hybrid CNN approach is proposed for intrinsic image decomposition. Edges correspond to illumination invariant gradients. To handle hard negative illumination transitions, a hierarchical approach is taken including global and local refinement layers. We make use of attention layers to further strengthen the learning process. An extensive ablation study and large scale experiments are conducted showing that it is beneficial for edge-driven hybrid IID networks to make use of illumination invariant descriptors and that separating global and local cues helps in improving the performance of the network. Finally, it is shown that the proposed method obtains state of the art performance and is able to generalise well to real world images. The project page with pretrained models, finetuned models and network code can be found at https://ivi.fnwi.uva.nl/cv/pienet/. △ Less

Submitted 2 May, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

arXiv:2109.00863 [pdf, other]

Generative Models for Multi-Illumination Color Constancy

Authors: Partha Das, Yang Liu, Sezer Karaoglu, Theo Gevers

Abstract: In this paper, the aim is multi-illumination color constancy. However, most of the existing color constancy methods are designed for single light sources. Furthermore, datasets for learning multiple illumination color constancy are largely missing. We propose a seed (physics driven) based multi-illumination color constancy method. GANs are exploited to model the illumination estimation problem as… ▽ More In this paper, the aim is multi-illumination color constancy. However, most of the existing color constancy methods are designed for single light sources. Furthermore, datasets for learning multiple illumination color constancy are largely missing. We propose a seed (physics driven) based multi-illumination color constancy method. GANs are exploited to model the illumination estimation problem as an image-to-image domain translation problem. Additionally, a novel multi-illumination data augmentation method is proposed. Experiments on single and multi-illumination datasets show that our methods outperform sota methods. △ Less

Submitted 2 September, 2021; originally announced September 2021.

Comments: Accepted in International Conference on Computer Vision Workshop (ICCVW) 2021

arXiv:2011.04389 [pdf, other]

EDEN: Multimodal Synthetic Dataset of Enclosed GarDEN Scenes

Authors: Hoang-An Le, Thomas Mensink, Partha Das, Sezer Karaoglu, Theo Gevers

Abstract: Multimodal large-scale datasets for outdoor scenes are mostly designed for urban driving problems. The scenes are highly structured and semantically different from scenarios seen in nature-centered scenes such as gardens or parks. To promote machine learning methods for nature-oriented applications, such as agriculture and gardening, we propose the multimodal synthetic dataset for Enclosed garDEN… ▽ More Multimodal large-scale datasets for outdoor scenes are mostly designed for urban driving problems. The scenes are highly structured and semantically different from scenarios seen in nature-centered scenes such as gardens or parks. To promote machine learning methods for nature-oriented applications, such as agriculture and gardening, we propose the multimodal synthetic dataset for Enclosed garDEN scenes (EDEN). The dataset features more than 300K images captured from more than 100 garden models. Each image is annotated with various low/high-level vision modalities, including semantic segmentation, depth, surface normals, intrinsic colors, and optical flow. Experimental results on the state-of-the-art methods for semantic segmentation and monocular depth prediction, two important tasks in computer vision, show positive impact of pre-training deep networks on our dataset for unstructured natural scenes. The dataset and related materials will be available at https://lhoangan.github.io/eden. △ Less

Submitted 10 November, 2020; v1 submitted 9 November, 2020; originally announced November 2020.

Comments: Accepted for publishing at WACV 2021

arXiv:2010.11844 [pdf, other]

Spatio-temporal Features for Generalized Detection of Deepfake Videos

Authors: Ipek Ganiyusufoglu, L. Minh Ngô, Nedko Savov, Sezer Karaoglu, Theo Gevers

Abstract: For deepfake detection, video-level detectors have not been explored as extensively as image-level detectors, which do not exploit temporal data. In this paper, we empirically show that existing approaches on image and sequence classifiers generalize poorly to new manipulation techniques. To this end, we propose spatio-temporal features, modeled by 3D CNNs, to extend the generalization capabilitie… ▽ More For deepfake detection, video-level detectors have not been explored as extensively as image-level detectors, which do not exploit temporal data. In this paper, we empirically show that existing approaches on image and sequence classifiers generalize poorly to new manipulation techniques. To this end, we propose spatio-temporal features, modeled by 3D CNNs, to extend the generalization capabilities to detect new sorts of deepfake videos. We show that spatial features learn distinct deepfake-method-specific attributes, while spatio-temporal features capture shared attributes between deepfake methods. We provide an in-depth analysis of how the sequential and spatio-temporal video encoders are utilizing temporal information using DFDC dataset arXiv:2006.07397. Thus, we unravel that our approach captures local spatio-temporal relations and inconsistencies in the deepfake videos while existing sequence encoders are indifferent to it. Through large scale experiments conducted on the FaceForensics++ arXiv:1901.08971 and Deeper Forensics arXiv:2001.03024 datasets, we show that our approach outperforms existing methods in terms of generalization capabilities. △ Less

Submitted 22 October, 2020; originally announced October 2020.

Comments: Submitted to Computer Vision and Image Understanding (CVIU)

arXiv:2009.08321 [pdf, other]

Novel View Synthesis from Single Images via Point Cloud Transformation

Authors: Hoang-An Le, Thomas Mensink, Partha Das, Theo Gevers

Abstract: In this paper the argument is made that for true novel view synthesis of objects, where the object can be synthesized from any viewpoint, an explicit 3D shape representation isdesired. Our method estimates point clouds to capture the geometry of the object, which can be freely rotated into the desired view and then projected into a new image. This image, however, is sparse by nature and hence this… ▽ More In this paper the argument is made that for true novel view synthesis of objects, where the object can be synthesized from any viewpoint, an explicit 3D shape representation isdesired. Our method estimates point clouds to capture the geometry of the object, which can be freely rotated into the desired view and then projected into a new image. This image, however, is sparse by nature and hence this coarse view is used as the input of an image completion network to obtain the dense target view. The point cloud is obtained using the predicted pixel-wise depth map, estimated from a single RGB input image,combined with the camera intrinsics. By using forward warping and backward warpingbetween the input view and the target view, the network can be trained end-to-end without supervision on depth. The benefit of using point clouds as an explicit 3D shape for novel view synthesis is experimentally validated on the 3D ShapeNet benchmark. Source code and data will be available at https://lhoangan.github.io/pc4novis/. △ Less

Submitted 18 September, 2020; v1 submitted 17 September, 2020; originally announced September 2020.

Comments: Accepted at British Machine Vision Conference 2020

arXiv:2009.01717 [pdf, other]

Multi-Loss Weighting with Coefficient of Variations

Authors: Rick Groenendijk, Sezer Karaoglu, Theo Gevers, Thomas Mensink

Abstract: Many interesting tasks in machine learning and computer vision are learned by optimising an objective function defined as a weighted linear combination of multiple losses. The final performance is sensitive to choosing the correct (relative) weights for these losses. Finding a good set of weights is often done by adopting them into the set of hyper-parameters, which are set using an extensive grid… ▽ More Many interesting tasks in machine learning and computer vision are learned by optimising an objective function defined as a weighted linear combination of multiple losses. The final performance is sensitive to choosing the correct (relative) weights for these losses. Finding a good set of weights is often done by adopting them into the set of hyper-parameters, which are set using an extensive grid search. This is computationally expensive. In this paper, we propose a weighting scheme based on the coefficient of variations and set the weights based on properties observed while training the model. The proposed method incorporates a measure of uncertainty to balance the losses, and as a result the loss weights evolve during training without requiring another (learning based) optimisation. In contrast to many loss weighting methods in literature, we focus on single-task multi-loss problems, such as monocular depth estimation and semantic segmentation, and show that multi-task approaches for loss weighting do not work on those single-tasks. The validity of the approach is shown empirically for depth estimation and semantic segmentation on multiple datasets. △ Less

Submitted 10 November, 2020; v1 submitted 3 September, 2020; originally announced September 2020.

Comments: Paper was accepted at the IEEE Winter Conference on Applications of Computer Vision 2021 (WACV2021)

MSC Class: 68T45 ACM Class: I.4

arXiv:2009.01540 [pdf, other]

Physics-based Shading Reconstruction for Intrinsic Image Decomposition

Authors: Anil S. Baslamisli, Yang Liu, Sezer Karaoglu, Theo Gevers

Abstract: We investigate the use of photometric invariance and deep learning to compute intrinsic images (albedo and shading). We propose albedo and shading gradient descriptors which are derived from physics-based models. Using the descriptors, albedo transitions are masked out and an initial sparse shading map is calculated directly from the corresponding RGB image gradients in a learning-free unsupervise… ▽ More We investigate the use of photometric invariance and deep learning to compute intrinsic images (albedo and shading). We propose albedo and shading gradient descriptors which are derived from physics-based models. Using the descriptors, albedo transitions are masked out and an initial sparse shading map is calculated directly from the corresponding RGB image gradients in a learning-free unsupervised manner. Then, an optimization method is proposed to reconstruct the full dense shading map. Finally, we integrate the generated shading map into a novel deep learning framework to refine it and also to predict corresponding albedo image to achieve intrinsic image decomposition. By doing so, we are the first to directly address the texture and intensity ambiguity problems of the shading estimations. Large scale experiments show that our approach steered by physics-based invariant descriptors achieve superior results on MIT Intrinsics, NIR-RGB Intrinsics, Multi-Illuminant Intrinsic Images, Spectral Intrinsic Images, As Realistic As Possible, and competitive results on Intrinsic Images in the Wild datasets while achieving state-of-the-art shading estimations. △ Less

Submitted 3 September, 2020; originally announced September 2020.

Comments: Submitted to Computer Vision and Image Understanding (CVIU)

arXiv:2004.06382 [pdf, other]

Kinship Identification through Joint Learning Using Kinship Verification Ensembles

Authors: Wei Wang, Shaodi You, Sezer Karaoglu, Theo Gevers

Abstract: Kinship verification is a well-explored task: identifying whether or not two persons are kin. In contrast, kinship identification has been largely ignored so far. Kinship identification aims to further identify the particular type of kinship. An extension to kinship verification run short to properly obtain identification, because existing verification networks are individually trained on specific… ▽ More Kinship verification is a well-explored task: identifying whether or not two persons are kin. In contrast, kinship identification has been largely ignored so far. Kinship identification aims to further identify the particular type of kinship. An extension to kinship verification run short to properly obtain identification, because existing verification networks are individually trained on specific kinships and do not consider the context between different kinship types. Also, existing kinship verification datasets have biased positive-negative distributions which are different than real-world distributions. To this end, we propose a novel kinship identification approach based on joint training of kinship verification ensembles and classification modules. We propose to rebalance the training dataset to become more realistic. Large scale experiments demonstrate the appealing performance on kinship identification. The experiments further show significant performance improvement of kinship verification when trained on the same dataset with more realistic distributions. △ Less

Submitted 24 August, 2020; v1 submitted 14 April, 2020; originally announced April 2020.

Comments: 14 pages, 7 figures

arXiv:1912.04023 [pdf, other]

ShadingNet: Image Intrinsics by Fine-Grained Shading Decomposition

Authors: Anil S. Baslamisli, Partha Das, Hoang-An Le, Sezer Karaoglu, Theo Gevers

Abstract: In general, intrinsic image decomposition algorithms interpret shading as one unified component including all photometric effects. As shading transitions are generally smoother than reflectance (albedo) changes, these methods may fail in distinguishing strong photometric effects from reflectance variations. Therefore, in this paper, we propose to decompose the shading component into direct (illumi… ▽ More In general, intrinsic image decomposition algorithms interpret shading as one unified component including all photometric effects. As shading transitions are generally smoother than reflectance (albedo) changes, these methods may fail in distinguishing strong photometric effects from reflectance variations. Therefore, in this paper, we propose to decompose the shading component into direct (illumination) and indirect shading (ambient light and shadows) subcomponents. The aim is to distinguish strong photometric effects from reflectance variations. An end-to-end deep convolutional neural network (ShadingNet) is proposed that operates in a fine-to-coarse manner with a specialized fusion and refinement unit exploiting the fine-grained shading model. It is designed to learn specific reflectance cues separated from specific photometric effects to analyze the disentanglement capability. A large-scale dataset of scene-level synthetic images of outdoor natural environments is provided with fine-grained intrinsic image ground-truths. Large scale experiments show that our approach using fine-grained shading decompositions outperforms state-of-the-art algorithms utilizing unified shading on NED, MPI Sintel, GTA V, IIW, MIT Intrinsic Images, 3DRMS and SRD datasets. △ Less

Submitted 21 January, 2021; v1 submitted 9 December, 2019; originally announced December 2019.

Comments: Submitted to International Journal of Computer Vision (IJCV)

arXiv:1910.13340 [pdf, other]

doi 10.1016/j.cviu.2019.102848

On the Benefit of Adversarial Training for Monocular Depth Estimation

Authors: Rick Groenendijk, Sezer Karaoglu, Theo Gevers, Thomas Mensink

Abstract: In this paper we address the benefit of adding adversarial training to the task of monocular depth estimation. A model can be trained in a self-supervised setting on stereo pairs of images, where depth (disparities) are an intermediate result in a right-to-left image reconstruction pipeline. For the quality of the image reconstruction and disparity prediction, a combination of different losses is… ▽ More In this paper we address the benefit of adding adversarial training to the task of monocular depth estimation. A model can be trained in a self-supervised setting on stereo pairs of images, where depth (disparities) are an intermediate result in a right-to-left image reconstruction pipeline. For the quality of the image reconstruction and disparity prediction, a combination of different losses is used, including L1 image reconstruction losses and left-right disparity smoothness. These are local pixel-wise losses, while depth prediction requires global consistency. Therefore, we extend the self-supervised network to become a Generative Adversarial Network (GAN), by including a discriminator which should tell apart reconstructed (fake) images from real images. We evaluate Vanilla GANs, LSGANs and Wasserstein GANs in combination with different pixel-wise reconstruction losses. Based on extensive experimental evaluation, we conclude that adversarial training is beneficial if and only if the reconstruction loss is not too constrained. Even though adversarial training seems promising because it promotes global consistency, non-adversarial training outperforms (or is on par with) any method trained with a GAN when a constrained reconstruction loss is used in combination with batch normalisation. Based on the insights of our experimental evaluation we obtain state-of-the art monocular depth estimation results by using batch normalisation and different output scales. △ Less

Submitted 29 October, 2019; originally announced October 2019.

Comments: 11 pages, 8 tables, 5 figures, accepted at CVIU

MSC Class: 68T45

arXiv:1812.10558 [pdf, other]

Deception Detection by 2D-to-3D Face Reconstruction from Videos

Authors: Minh Ngô, Burak Mandira, Selim Fırat Yılmaz, Ward Heij, Sezer Karaoglu, Henri Bouma, Hamdi Dibeklioglu, Theo Gevers

Abstract: Lies and deception are common phenomena in society, both in our private and professional lives. However, humans are notoriously bad at accurate deception detection. Based on the literature, human accuracy of distinguishing between lies and truthful statements is 54% on average, in other words it is slightly better than a random guess. While people do not much care about this issue, in high-stakes… ▽ More Lies and deception are common phenomena in society, both in our private and professional lives. However, humans are notoriously bad at accurate deception detection. Based on the literature, human accuracy of distinguishing between lies and truthful statements is 54% on average, in other words it is slightly better than a random guess. While people do not much care about this issue, in high-stakes situations such as interrogations for series crimes and for evaluating the testimonies in court cases, accurate deception detection methods are highly desirable. To achieve a reliable, covert, and non-invasive deception detection, we propose a novel method that jointly extracts reliable low- and high-level facial features namely, 3D facial geometry, skin reflectance, expression, head pose, and scene illumination in a video sequence. Then these features are modeled using a Recurrent Neural Network to learn temporal characteristics of deceptive and honest behavior. We evaluate the proposed method on the Real-Life Trial (RLT) dataset that contains high-stake deceptive and honest videos recorded in courtrooms. Our results show that the proposed method (with an accuracy of 72.8%) improves the state of the art as well as outperforming the use of manually coded facial attributes 67.6%) in deception detection. △ Less

Submitted 26 December, 2018; originally announced December 2018.

Comments: 9 pages, 3 figures

arXiv:1812.07363 [pdf, other]

Improving Face Detection Performance with 3D-Rendered Synthetic Data

Authors: Jian Han, Sezer Karaoglu, Hoang-An Le, Theo Gevers

Abstract: In this paper, we provide a synthetic data generator methodology with fully controlled, multifaceted variations based on a new 3D face dataset (3DU-Face). We customized synthetic datasets to address specific types of variations (scale, pose, occlusion, blur, etc.), and systematically investigate the influence of different variations on face detection performances. We examine whether and how these… ▽ More In this paper, we provide a synthetic data generator methodology with fully controlled, multifaceted variations based on a new 3D face dataset (3DU-Face). We customized synthetic datasets to address specific types of variations (scale, pose, occlusion, blur, etc.), and systematically investigate the influence of different variations on face detection performances. We examine whether and how these factors contribute to better face detection performances. We validate our synthetic data augmentation for different face detectors (Faster RCNN, SSH and HR) on various face datasets (MAFA, UFDD and Wider Face). △ Less

Submitted 27 November, 2019; v1 submitted 18 December, 2018; originally announced December 2018.

Comments: 11 pages. Submitted to Pattern Recognition Letters

arXiv:1812.03085 [pdf, other]

Color Constancy by GANs: An Experimental Survey

Authors: Partha Das, Anil S. Baslamisli, Yang Liu, Sezer Karaoglu, Theo Gevers

Abstract: In this paper, we formulate the color constancy task as an image-to-image translation problem using GANs. By conducting a large set of experiments on different datasets, an experimental survey is provided on the use of different types of GANs to solve for color constancy i.e. CC-GANs (Color Constancy GANs). Based on the experimental review, recommendations are given for the design of CC-GAN archit… ▽ More In this paper, we formulate the color constancy task as an image-to-image translation problem using GANs. By conducting a large set of experiments on different datasets, an experimental survey is provided on the use of different types of GANs to solve for color constancy i.e. CC-GANs (Color Constancy GANs). Based on the experimental review, recommendations are given for the design of CC-GAN architectures based on different criteria, circumstances and datasets. △ Less

Submitted 7 December, 2018; originally announced December 2018.

arXiv:1812.01946 [pdf, other]

doi 10.1016/j.cviu.2021.103274

Automatic Generation of Dense Non-rigid Optical Flow

Authors: Hoàng-Ân Lê, Tushar Nimbhorkar, Thomas Mensink, Anil S. Baslamisli, Sezer Karaoglu, Theo Gevers

Abstract: There hardly exists any large-scale datasets with dense optical flow of non-rigid motion from real-world imagery as of today. The reason lies mainly in the required setup to derive ground truth optical flows: a series of images with known camera poses along its trajectory, and an accurate 3D model from a textured scene. Human annotation is not only too tedious for large databases, it can simply ha… ▽ More There hardly exists any large-scale datasets with dense optical flow of non-rigid motion from real-world imagery as of today. The reason lies mainly in the required setup to derive ground truth optical flows: a series of images with known camera poses along its trajectory, and an accurate 3D model from a textured scene. Human annotation is not only too tedious for large databases, it can simply hardly contribute to accurate optical flow. To circumvent the need for manual annotation, we propose a framework to automatically generate optical flow from real-world videos. The method extracts and matches objects from video frames to compute initial constraints, and applies a deformation over the objects of interest to obtain dense optical flow fields. We propose several ways to augment the optical flow variations. Extensive experimental results show that training on our automatically generated optical flow outperforms methods that are trained on rigid synthetic data using FlowNet-S, LiteFlowNet, PWC-Net, and RAFT. Datasets and implementation of our optical flow generation framework are released at https://github.com/lhoangan/arap_flow △ Less

Submitted 7 September, 2021; v1 submitted 5 December, 2018; originally announced December 2018.

Comments: The paper is accepted for publication for Computer Vision and Image Understanding (CVIU)

Journal ref: Volume 212, November 2021, 103274

arXiv:1812.01402 [pdf, other]

Inferring Point Clouds from Single Monocular Images by Depth Intermediation

Authors: Wei Zeng, Sezer Karaoglu, Theo Gevers

Abstract: In this paper, we propose a pipeline to generate 3D point cloud of an object from a single-view RGB image. Most previous work predict the 3D point coordinates from single RGB images directly. We decompose this problem into depth estimation from single images and point cloud completion from partial point clouds. Our method sequentially predicts the depth maps from images and then infers the compl… ▽ More In this paper, we propose a pipeline to generate 3D point cloud of an object from a single-view RGB image. Most previous work predict the 3D point coordinates from single RGB images directly. We decompose this problem into depth estimation from single images and point cloud completion from partial point clouds. Our method sequentially predicts the depth maps from images and then infers the complete 3D object point clouds based on the predicted partial point clouds. We explicitly impose the camera model geometrical constraint in our pipeline and enforce the alignment of the generated point clouds and estimated depth maps. Experimental results for the single image 3D object reconstruction task show that the proposed method outperforms existing state-of-the-art methods. Both the qualitative and quantitative results demonstrate the generality and suitability of our method. △ Less

Submitted 26 October, 2020; v1 submitted 4 December, 2018; originally announced December 2018.

Comments: Statement: This paper is under consideration at Computer Vision and Image Understanding

arXiv:1807.11857 [pdf, other]

Joint Learning of Intrinsic Images and Semantic Segmentation

Authors: Anil S. Baslamisli, Thomas T. Groenestege, Partha Das, Hoang-An Le, Sezer Karaoglu, Theo Gevers

Abstract: Semantic segmentation of outdoor scenes is problematic when there are variations in imaging conditions. It is known that albedo (reflectance) is invariant to all kinds of illumination effects. Thus, using reflectance images for semantic segmentation task can be favorable. Additionally, not only segmentation may benefit from reflectance, but also segmentation may be useful for reflectance computati… ▽ More Semantic segmentation of outdoor scenes is problematic when there are variations in imaging conditions. It is known that albedo (reflectance) is invariant to all kinds of illumination effects. Thus, using reflectance images for semantic segmentation task can be favorable. Additionally, not only segmentation may benefit from reflectance, but also segmentation may be useful for reflectance computation. Therefore, in this paper, the tasks of semantic segmentation and intrinsic image decomposition are considered as a combined process by exploring their mutual relationship in a joint fashion. To that end, we propose a supervised end-to-end CNN architecture to jointly learn intrinsic image decomposition and semantic segmentation. We analyze the gains of addressing those two problems jointly. Moreover, new cascade CNN architectures for intrinsic-for-segmentation and segmentation-for-intrinsic are proposed as single tasks. Furthermore, a dataset of 35K synthetic images of natural environments is created with corresponding albedo and shading (intrinsics), as well as semantic labels (segmentation) assigned to each object/scene. The experiments show that joint learning of intrinsic image decomposition and semantic segmentation is beneficial for both tasks for natural scenes. Dataset and models are available at: https://ivi.fnwi.uva.nl/cv/intrinseg △ Less

Submitted 31 July, 2018; originally announced July 2018.

Comments: ECCV 2018

arXiv:1807.07473 [pdf, other]

Three for one and one for three: Flow, Segmentation, and Surface Normals

Authors: Hoang-An Le, Anil S. Baslamisli, Thomas Mensink, Theo Gevers

Abstract: Optical flow, semantic segmentation, and surface normals represent different information modalities, yet together they bring better cues for scene understanding problems. In this paper, we study the influence between the three modalities: how one impacts on the others and their efficiency in combination. We employ a modular approach using a convolutional refinement network which is trained supervi… ▽ More Optical flow, semantic segmentation, and surface normals represent different information modalities, yet together they bring better cues for scene understanding problems. In this paper, we study the influence between the three modalities: how one impacts on the others and their efficiency in combination. We employ a modular approach using a convolutional refinement network which is trained supervised but isolated from RGB images to enforce joint modality features. To assist the training process, we create a large-scale synthetic outdoor dataset that supports dense annotation of semantic segmentation, optical flow, and surface normals. The experimental results show positive influence among the three modalities, especially for objects' boundaries, region consistency, and scene structures. △ Less

Submitted 19 July, 2018; originally announced July 2018.

Comments: BMVC 2018

arXiv:1804.01792 [pdf, other]

TrimBot2020: an outdoor robot for automatic gardening

Authors: Nicola Strisciuglio, Radim Tylecek, Michael Blaich, Nicolai Petkov, Peter Bieber, Jochen Hemming, Eldert van Henten, Torsten Sattler, Marc Pollefeys, Theo Gevers, Thomas Brox, Robert B. Fisher

Abstract: Robots are increasingly present in modern industry and also in everyday life. Their applications range from health-related situations, for assistance to elderly people or in surgical operations, to automatic and driver-less vehicles (on wheels or flying) or for driving assistance. Recently, an interest towards robotics applied in agriculture and gardening has arisen, with applications to automatic… ▽ More Robots are increasingly present in modern industry and also in everyday life. Their applications range from health-related situations, for assistance to elderly people or in surgical operations, to automatic and driver-less vehicles (on wheels or flying) or for driving assistance. Recently, an interest towards robotics applied in agriculture and gardening has arisen, with applications to automatic seeding and cropping or to plant disease control, etc. Autonomous lawn mowers are succesful market applications of gardening robotics. In this paper, we present a novel robot that is developed within the TrimBot2020 project, funded by the EU H2020 program. The project aims at prototyping the first outdoor robot for automatic bush trimming and rose pruning. △ Less

Submitted 15 May, 2018; v1 submitted 5 April, 2018; originally announced April 2018.

Comments: Accepted for publication at International Sympsium on Robotics 2018

arXiv:1712.01056 [pdf, other]

CNN based Learning using Reflection and Retinex Models for Intrinsic Image Decomposition

Authors: Anil S. Baslamisli, Hoang-An Le, Theo Gevers

Abstract: Most of the traditional work on intrinsic image decomposition rely on deriving priors about scene characteristics. On the other hand, recent research use deep learning models as in-and-out black box and do not consider the well-established, traditional image formation process as the basis of their intrinsic learning process. As a consequence, although current deep learning approaches show superior… ▽ More Most of the traditional work on intrinsic image decomposition rely on deriving priors about scene characteristics. On the other hand, recent research use deep learning models as in-and-out black box and do not consider the well-established, traditional image formation process as the basis of their intrinsic learning process. As a consequence, although current deep learning approaches show superior performance when considering quantitative benchmark results, traditional approaches are still dominant in achieving high qualitative results. In this paper, the aim is to exploit the best of the two worlds. A method is proposed that (1) is empowered by deep learning capabilities, (2) considers a physics-based reflection model to steer the learning process, and (3) exploits the traditional approach to obtain intrinsic images by exploiting reflectance and shading gradient information. The proposed model is fast to compute and allows for the integration of all intrinsic components. To train the new model, an object centered large-scale datasets with intrinsic ground-truth images are created. The evaluation results demonstrate that the new model outperforms existing methods. Visual inspection shows that the image formation loss function augments color reproduction and the use of gradient information produces sharper edges. Datasets, models and higher resolution images are available at https://ivi.fnwi.uva.nl/cv/retinet. △ Less

Submitted 3 April, 2018; v1 submitted 4 December, 2017; originally announced December 2017.

Comments: CVPR 2018

arXiv:1711.11379 [pdf, other]

3DContextNet: K-d Tree Guided Hierarchical Learning of Point Clouds Using Local and Global Contextual Cues

Authors: Wei Zeng, Theo Gevers

Abstract: Classification and segmentation of 3D point clouds are important tasks in computer vision. Because of the irregular nature of point clouds, most of the existing methods convert point clouds into regular 3D voxel grids before they are used as input for ConvNets. Unfortunately, voxel representations are highly insensitive to the geometrical nature of 3D data. More recent methods encode point clouds… ▽ More Classification and segmentation of 3D point clouds are important tasks in computer vision. Because of the irregular nature of point clouds, most of the existing methods convert point clouds into regular 3D voxel grids before they are used as input for ConvNets. Unfortunately, voxel representations are highly insensitive to the geometrical nature of 3D data. More recent methods encode point clouds to higher dimensional features to cover the global 3D space. However, these models are not able to sufficiently capture the local structures of point clouds. Therefore, in this paper, we propose a method that exploits both local and global contextual cues imposed by the k-d tree. The method is designed to learn representation vectors progressively along the tree structure. Experiments on challenging benchmarks show that the proposed model provides discriminative point set features. For the task of 3D scene semantic segmentation, our method significantly outperforms the state-of-the-art on the Stanford Large-Scale 3D Indoor Spaces Dataset (S3DIS). △ Less

Submitted 4 December, 2018; v1 submitted 30 November, 2017; originally announced November 2017.

Comments: 15 pages, 5 figures

arXiv:1412.7957 [pdf, other]

doi 10.1109/TIP.2015.2499702

Detect2Rank : Combining Object Detectors Using Learning to Rank

Authors: Sezer Karaoglu, Yang Liu, Theo Gevers

Abstract: Object detection is an important research area in the field of computer vision. Many detection algorithms have been proposed. However, each object detector relies on specific assumptions of the object appearance and imaging conditions. As a consequence, no algorithm can be considered as universal. With the large variety of object detectors, the subsequent question is how to select and combine them… ▽ More Object detection is an important research area in the field of computer vision. Many detection algorithms have been proposed. However, each object detector relies on specific assumptions of the object appearance and imaging conditions. As a consequence, no algorithm can be considered as universal. With the large variety of object detectors, the subsequent question is how to select and combine them. In this paper, we propose a framework to learn how to combine object detectors. The proposed method uses (single) detectors like DPM, CN and EES, and exploits their correlation by high level contextual features to yield a combined detection list. Experiments on the PASCAL VOC07 and VOC10 datasets show that the proposed method significantly outperforms single object detectors, DPM (8.4%), CN (6.8%) and EES (17.0%) on VOC07 and DPM (6.5%), CN (5.5%) and EES (16.2%) on VOC10. △ Less

Submitted 26 December, 2014; originally announced December 2014.

arXiv:1412.3506 [pdf, other]

Road Detection by One-Class Color Classification: Dataset and Experiments

Authors: Jose M. Alvarez, Theo Gevers, Antonio M. Lopez

Abstract: Detecting traversable road areas ahead a moving vehicle is a key process for modern autonomous driving systems. A common approach to road detection consists of exploiting color features to classify pixels as road or background. These algorithms reduce the effect of lighting variations and weather conditions by exploiting the discriminant/invariant properties of different color representations. Fur… ▽ More Detecting traversable road areas ahead a moving vehicle is a key process for modern autonomous driving systems. A common approach to road detection consists of exploiting color features to classify pixels as road or background. These algorithms reduce the effect of lighting variations and weather conditions by exploiting the discriminant/invariant properties of different color representations. Furthermore, the lack of labeled datasets has motivated the development of algorithms performing on single images based on the assumption that the bottom part of the image belongs to the road surface. In this paper, we first introduce a dataset of road images taken at different times and in different scenarios using an onboard camera. Then, we devise a simple online algorithm and conduct an exhaustive evaluation of different classifiers and the effect of using different color representation to characterize pixels. △ Less

Submitted 17 December, 2014; v1 submitted 10 December, 2014; originally announced December 2014.

Comments: 10 pages

Showing 1–39 of 39 results for author: Gevers, T