
RoIPoly: Vectorized Building Outline Extraction Using Vertex and Logit Embeddings

Authors: Weiqin Jiao, Hao Cheng, Claudio Persello, George Vosselman

Abstract: Polygonal building outlines are crucial for geographic and cartographic applications. The existing approaches for outline extraction from aerial or satellite imagery are typically decomposed into subtasks, e.g., building masking and vectorization, or treat this task as a sequence-to-sequence prediction of ordered vertices. The former lacks efficiency, and the latter often generates redundant vertices, both resulting in suboptimal performance. To handle these issues, we propose a novel Region-of-Interest (RoI) query-based approach called RoIPoly. Specifically, we formulate each vertex as a query and constrain the query attention on the most relevant regions of a potential building, yielding reduced computational overhead and more efficient vertex level interaction. Moreover, we introduce a novel learnable logit embedding to facilitate vertex classification on the attention map; thus, no post-processing is needed for redundant vertex removal. We evaluated our method on the vectorized building outline extraction dataset CrowdAI and the 2D floorplan reconstruction dataset Structured3D. On the CrowdAI dataset, RoIPoly with a ResNet50 backbone outperforms existing methods with the same or better backbones on most MS-COCO metrics, especially on small buildings, and achieves competitive results in polygon quality and vertex redundancy without any post-processing. On the Structured3D dataset, our method achieves the second-best performance on most metrics among existing methods dedicated to 2D floorplan reconstruction, demonstrating our cross-domain generalization capability. The code will be released upon acceptance of this paper.

Submitted 20 July, 2024; originally announced July 2024.

arXiv:2407.14912 [pdf, other]

PolyR-CNN: R-CNN for end-to-end polygonal building outline extraction

Authors: Weiqin Jiao, Claudio Persello, George Vosselman

Abstract: Polygonal building outline extraction has been a research focus in recent years. Most existing methods have addressed this challenging task by decomposing it into several subtasks and employing carefully designed architectures. Despite their accuracy, such pipelines often introduce inefficiencies during training and inference. This paper presents an end-to-end framework, denoted as PolyR-CNN, which offers an efficient and fully integrated approach to predict vectorized building polygons and bounding boxes directly from remotely sensed images. Notably, PolyR-CNN leverages solely the features of the Region of Interest (RoI) for the prediction, thereby mitigating the necessity for complex designs. Furthermore, we propose a novel scheme with PolyR-CNN to extract detailed outline information from polygon vertex coordinates, termed vertex proposal feature, to guide the RoI features to predict more regular buildings. PolyR-CNN demonstrates the capacity to deal with buildings with holes through a simple post-processing method on the Inria dataset. Comprehensive experiments conducted on the CrowdAI dataset show that PolyR-CNN achieves competitive accuracy compared to state-of-the-art methods while significantly improving computational efficiency, i.e., achieving 79.2 Average Precision (AP), exhibiting a 15.9 AP gain and operating 2.5 times faster and four times lighter than the well-established end-to-end method PolyWorld. Replacing the backbone with a simple ResNet-50, PolyR-CNN maintains a 71.1 AP while running four times faster than PolyWorld.

Submitted 20 July, 2024; originally announced July 2024.

arXiv:2406.11472 [pdf, other]

Learning from Exemplars for Interactive Image Segmentation

Authors: Kun Li, Hao Cheng, George Vosselman, Michael Ying Yang

Abstract: Interactive image segmentation enables users to interact minimally with a machine, facilitating the gradual refinement of the segmentation mask for a target of interest. Previous studies have demonstrated impressive performance in extracting a single target mask through interactive segmentation. However, the information cues of previously interacted objects have been overlooked in the existing methods, which can be further explored to speed up interactive segmentation for multiple targets in the same category. To this end, we introduce novel interactive segmentation frameworks for both a single object and multiple objects in the same category. Specifically, our model leverages transformer backbones to extract interaction-focused visual features from the image and the interactions to obtain a satisfactory mask of a target as an exemplar. For multiple objects, we propose an exemplar-informed module to enhance the learning of similarities among the objects of the target category. To combine attended features from different modules, we incorporate cross-attention blocks followed by a feature fusion module. Experiments conducted on mainstream benchmarks demonstrate that our models achieve superior performance compared to previous methods. Particularly, our model reduces users' labor by around 15%, requiring two fewer clicks to achieve target IoUs 85% and 90%. The results highlight our models' potential as a flexible and practical annotation tool. The source code will be released after publication.

Submitted 17 June, 2024; originally announced June 2024.

Comments: Under review

arXiv:2403.12848 [pdf, other]

Planner3D: LLM-enhanced graph prior meets 3D indoor scene explicit regularization

Authors: Yao Wei, Martin Renqiang Min, George Vosselman, Li Erran Li, Michael Ying Yang

Abstract: Compositional 3D scene synthesis has diverse applications across a spectrum of industries such as robotics, films, and video games, as it closely mirrors the complexity of real-world multi-object environments. Conventional works typically employ shape retrieval based frameworks which naturally suffer from limited shape diversity. Recent progress has been made in object shape generation with generative models such as diffusion models, which increases shape fidelity. However, these approaches separately treat 3D shape generation and layout generation. The synthesized scenes are usually hampered by layout collision, which suggests that the scene-level fidelity is still under-explored. In this paper, we aim at generating realistic and reasonable 3D indoor scenes from scene graphs. To enrich the priors of the given scene graph inputs, a large language model is utilized to aggregate the global-wise features with local node-wise and edge-wise features. With a unified graph encoder, graph features are extracted to guide joint layout-shape generation. Additional regularization is introduced to explicitly constrain the produced 3D layouts. Benchmarked on the SG-FRONT dataset, our method achieves better 3D scene synthesis, especially in terms of scene-level fidelity. The source code will be released after publication.

Submitted 26 August, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

Comments: 16 pages, 10 figures

arXiv:2402.03896 [pdf, other]

Convincing Rationales for Visual Question Answering Reasoning

Authors: Kun Li, George Vosselman, Michael Ying Yang

Abstract: Visual Question Answering (VQA) is a challenging task of predicting the answer to a question about the content of an image. It requires deep understanding of both the textual question and visual image. Prior works directly evaluate the answering models by simply calculating the accuracy of the predicted answers. However, the inner reasoning behind the prediction is disregarded in such a "black box" system, and we do not even know if one can trust the predictions. In some cases, the models still get the correct answers even when they focus on irrelevant visual regions or textual tokens, which makes the models unreliable and illogical. To generate both visual and textual rationales next to the predicted answer to the given image/question pair, we propose Convincing Rationales for VQA, CRVQA. Considering the extra annotations brought by the new outputs, CRVQA is trained and evaluated by samples converted from some existing VQA datasets and their visual labels. The extensive experiments demonstrate that the visual and textual rationales support the prediction of the answers, and further improve the accuracy. Furthermore, CRVQA achieves competitive performance on generic VQA datasets in the zero-shot evaluation setting. The dataset and source code will be released under https://github.com/lik1996/CRVQA2024.

Submitted 6 February, 2024; originally announced February 2024.

Comments: under review

arXiv:2309.00158 [pdf, other]

BuilDiff: 3D Building Shape Generation using Single-Image Conditional Point Cloud Diffusion Models

Authors: Yao Wei, George Vosselman, Michael Ying Yang

Abstract: 3D building generation with low data acquisition costs, such as single image-to-3D, becomes increasingly important. However, most of the existing single image-to-3D building creation works are restricted to those images with specific viewing angles, hence they are difficult to scale to general-view images that commonly appear in practical cases. To fill this gap, we propose a novel 3D building shape generation method exploiting point cloud diffusion models with image conditioning schemes, which demonstrates flexibility to the input images. By cooperating two conditional diffusion models and introducing a regularization strategy during denoising process, our method is able to synthesize building roofs while maintaining the overall structures. We validate our framework on two newly built datasets and extensive experiments show that our method outperforms previous works in terms of building generation quality.

Submitted 31 August, 2023; originally announced September 2023.

Comments: 10 pages, 6 figures, accepted to ICCVW2023

arXiv:2308.05515 [pdf]

Mono-hydra: Real-time 3D scene graph construction from monocular camera input with IMU

Authors: U. V. B. L. Udugama, G. Vosselman, F. Nex

Abstract: The ability of robots to autonomously navigate through 3D environments depends on their comprehension of spatial concepts, ranging from low-level geometry to high-level semantics, such as objects, places, and buildings. To enable such comprehension, 3D scene graphs have emerged as a robust tool for representing the environment as a layered graph of concepts and their relationships. However, building these representations using monocular vision systems in real-time remains a difficult task that has not been explored in depth. This paper puts forth a real-time spatial perception system Mono-Hydra, combining a monocular camera and an IMU sensor setup, focusing on indoor scenarios. However, the proposed approach is adaptable to outdoor applications, offering flexibility in its potential uses. The system employs a suite of deep learning algorithms to derive depth and semantics. It uses a robocentric visual-inertial odometry (VIO) algorithm based on square-root information, thereby ensuring consistent visual odometry with an IMU and a monocular camera. This system achieves sub-20 cm error in real-time processing at 15 fps, enabling real-time 3D scene graph construction using a laptop GPU (NVIDIA 3080). This enhances decision-making efficiency and effectiveness in simple camera setups, augmenting robotic system agility. We make Mono-Hydra publicly available at: https://github.com/UAV-Centre-ITC/Mono_Hydra

Submitted 10 August, 2023; originally announced August 2023.

Comments: 7 pages, 5 figures, GSW 2023 conference paper

arXiv:2307.02280 [pdf, other]

Interactive Image Segmentation with Cross-Modality Vision Transformers

Authors: Kun Li, George Vosselman, Michael Ying Yang

Abstract: Interactive image segmentation aims to segment the target from the background with manual guidance, which takes as input multimodal data such as images, clicks, scribbles, and bounding boxes. Recently, vision transformers have achieved great success in several downstream visual tasks, and a few efforts have been made to bring this powerful architecture to the interactive segmentation task. However, the previous works neglect the relations between two modalities and directly mock the way of processing purely visual information with self-attentions. In this paper, we propose a simple yet effective network for click-based interactive segmentation with cross-modality vision transformers. Cross-modality transformers exploit mutual information to better guide the learning process. The experiments on several benchmarks show that the proposed method achieves superior performance in comparison to the previous state-of-the-art models. The stability of our method in terms of avoiding failure cases shows its potential to be a practical annotation tool. The code and pretrained models will be released under https://github.com/lik1996/iCMFormer.

Submitted 5 July, 2023; originally announced July 2023.

Comments: 16 pages

arXiv:2303.10386 [pdf, other]

Channel-Aware Distillation Transformer for Depth Estimation on Nano Drones

Authors: Ning Zhang, Francesco Nex, George Vosselman, Norman Kerle

Abstract: Autonomous navigation of drones using computer vision has achieved promising performance. Nano-sized drones based on edge computing platforms are lightweight, flexible, and cheap, thus suitable for exploring narrow spaces. However, due to their extremely limited computing power and storage, vision algorithms designed for high-performance GPU platforms cannot be used for nano drones. To address this issue this paper presents a lightweight CNN depth estimation network deployed on nano drones for obstacle avoidance. Inspired by Knowledge Distillation (KD), a Channel-Aware Distillation Transformer (CADiT) is proposed to facilitate the small network to learn knowledge from a larger network. The proposed method is validated on the KITTI dataset and tested on a nano drone Crazyflie, with an ultra-low power microprocessor GAP8.

Submitted 18 March, 2023; originally announced March 2023.

arXiv:2301.09460 [pdf, other]

HRVQA: A Visual Question Answering Benchmark for High-Resolution Aerial Images

Authors: Kun Li, George Vosselman, Michael Ying Yang

Abstract: Visual question answering (VQA) is an important and challenging multimodal task in computer vision. Recently, a few efforts have been made to bring the VQA task to aerial images, due to its potential real-world applications in disaster monitoring, urban planning, and digital earth product generation. However, not only the huge variation in the appearance, scale and orientation of the concepts in aerial images, but also the scarcity of well-annotated datasets restrict the development of VQA in this domain. In this paper, we introduce a new dataset, HRVQA, which provides 53512 collected aerial images of 1024×1024 pixels and 1070240 semi-automatically generated QA pairs. To benchmark the understanding capability of VQA models for aerial images, we evaluate the relevant methods on HRVQA. Moreover, we propose a novel model, GFTransformer, with gated attention modules and a mutual fusion module. The experiments show that the proposed dataset is quite challenging, especially for specific attribute-related questions. Our method achieves superior performance in comparison to the previous state-of-the-art approaches. The dataset and the source code will be released at https://hrvqa.nl/.

Submitted 23 January, 2023; originally announced January 2023.

arXiv:2211.13202 [pdf, other]

Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation

Authors: Ning Zhang, Francesco Nex, George Vosselman, Norman Kerle

Abstract: Self-supervised monocular depth estimation that does not require ground truth for training has attracted attention in recent years. It is of high interest to design lightweight but effective models so that they can be deployed on edge devices. Many existing architectures benefit from using heavier backbones at the expense of model sizes. This paper achieves comparable results with a lightweight architecture. Specifically, the efficient combination of CNNs and Transformers is investigated, and a hybrid architecture called Lite-Mono is presented. A Consecutive Dilated Convolutions (CDC) module and a Local-Global Features Interaction (LGFI) module are proposed. The former is used to extract rich multi-scale local features, and the latter takes advantage of the self-attention mechanism to encode long-range global information into the features. Experiments demonstrate that Lite-Mono outperforms Monodepth2 by a large margin in accuracy, with about 80% fewer trainable parameters.

Submitted 15 March, 2023; v1 submitted 23 November, 2022; originally announced November 2022.

Comments: Accepted to CVPR2023

arXiv:2210.04072 [pdf, other]

Flow-based GAN for 3D Point Cloud Generation from a Single Image

Authors: Yao Wei, George Vosselman, Michael Ying Yang

Abstract: Generating a 3D point cloud from a single 2D image is of great importance for 3D scene understanding applications. To reconstruct the whole 3D shape of the object shown in the image, the existing deep learning based approaches use either explicit or implicit generative modeling of point clouds, which, however, suffer from limited quality. In this work, we aim to alleviate this issue by introducing a hybrid explicit-implicit generative modeling scheme, which inherits the flow-based explicit generative models for sampling point clouds with arbitrary resolutions while improving the detailed 3D structures of point clouds by leveraging the implicit generative adversarial networks (GANs). We evaluate on the large-scale synthetic dataset ShapeNet, with the experimental results demonstrating the superior performance of the proposed method. In addition, the generalization ability of our method is demonstrated by performing on cross-category synthetic images as well as by testing on real images from PASCAL3D+ dataset.

Submitted 8 October, 2022; originally announced October 2022.

Comments: 13 pages, 5 figures, accepted to BMVC2022

arXiv:2102.03099 [pdf, other]

Bidirectional Multi-scale Attention Networks for Semantic Segmentation of Oblique UAV Imagery

Authors: Ye Lyu, George Vosselman, Gui-Song Xia, Michael Ying Yang

Abstract: Semantic segmentation for aerial platforms has been one of the fundamental scene understanding tasks for earth observation. Most of the semantic segmentation research focused on scenes captured in nadir view, in which objects have relatively smaller scale variation compared with scenes captured in oblique view. The huge scale variation of objects in oblique images limits the performance of deep neural networks (DNN) that process images in a single scale fashion. In order to tackle the scale variation issue, in this paper, we propose the novel bidirectional multi-scale attention networks, which fuse features from multiple scales bidirectionally for more adaptive and effective feature extraction. The experiments are conducted on the UAVid2020 dataset and have shown the effectiveness of our method. Our model achieved the state-of-the-art (SOTA) result with a mean intersection over union (mIoU) score of 70.80%.

Submitted 5 February, 2021; originally announced February 2021.

arXiv:2012.10192 [pdf]

LGENet: Local and Global Encoder Network for Semantic Segmentation of Airborne Laser Scanning Point Clouds

Authors: Yaping Lin, George Vosselman, Yanpeng Cao, Michael Ying Yang

Abstract: Interpretation of Airborne Laser Scanning (ALS) point clouds is a critical procedure for producing various geo-information products like 3D city models, digital terrain models and land use maps. In this paper, we present a local and global encoder network (LGENet) for semantic segmentation of ALS point clouds. Adapting the KPConv network, we first extract features by both 2D and 3D point convolutions to allow the network to learn more representative local geometry. Then global encoders are used in the network to exploit contextual information at the object and point level. We design a segment-based Edge Conditioned Convolution to encode the global context between segments. We apply a spatial-channel attention module at the end of the network, which not only captures the global interdependencies between points but also models interactions between channels. We evaluate our method on two ALS datasets, namely the ISPRS benchmark dataset and the DFC2019 dataset. For the ISPRS benchmark dataset, our model achieves state-of-the-art results with an overall accuracy of 0.845 and an average F1 score of 0.737. With regards to the DFC2019 dataset, our proposed network achieves an overall accuracy of 0.984 and an average F1 score of 0.834.

Submitted 18 December, 2020; originally announced December 2020.

Comments: Submitted to ISPRS Journal of Photogrammetry and Remote Sensing

arXiv:2003.00981 [pdf, other]

Plug & Play Convolutional Regression Tracker for Video Object Detection

Authors: Ye Lyu, Michael Ying Yang, George Vosselman, Gui-Song Xia

Abstract: Video object detection aims to simultaneously localize the bounding boxes of the objects and identify their classes in a given video. One challenge for video object detection is to consistently detect all objects across the whole video. As the appearance of objects may deteriorate in some frames, features or detections from the other frames are commonly used to enhance the prediction. In this paper, we propose a Plug & Play scale-adaptive convolutional regression tracker for the video object detection task, which could be easily and compatibly implanted into the current state-of-the-art detection networks. As the tracker reuses the features from the detector, it is a very light-weighted increment to the detection network. The whole network performs at a speed close to that of a standard object detector. With our new video object detection pipeline design, image object detectors can be easily turned into efficient video object detectors without modifying any parameters. The performance is evaluated on the large-scale ImageNet VID dataset. Our Plug & Play design improves the mAP score of the image detector by around 5% with only a small drop in speed.

Submitted 2 March, 2020; originally announced March 2020.

arXiv:1910.00032 [pdf, other]

LIP: Learning Instance Propagation for Video Object Segmentation

Authors: Ye Lyu, George Vosselman, Gui-Song Xia, Michael Ying Yang

Abstract: In recent years, the task of segmenting foreground objects from background in a video, i.e. video object segmentation (VOS), has received considerable attention. In this paper, we propose a single end-to-end trainable deep neural network, convolutional gated recurrent Mask-RCNN, for tackling the semi-supervised VOS task. We take advantage of both the instance segmentation network (Mask-RCNN) and the visual memory module (Conv-GRU) to tackle the VOS task. The instance segmentation network predicts masks for instances, while the visual memory module learns to selectively propagate information for multiple instances simultaneously, which handles the appearance change, the variation of scale and pose and the occlusions between objects. After offline and online training under purely instance segmentation losses, our approach is able to achieve satisfactory results without any post-processing or synthetic video data augmentation. Experimental results on DAVIS 2016 dataset and DAVIS 2017 dataset have demonstrated the effectiveness of our method for video object segmentation task.

Submitted 30 September, 2019; originally announced October 2019.

Comments: ICCVW19

arXiv:1904.12586 [pdf]

Robust object extraction from remote sensing data

Authors: Sophie Crommelinck, Mila Koeva, Michael Ying Yang, George Vosselman

Abstract: The extraction of object outlines has been a research topic during the last decades. In spite of advances in photogrammetry, remote sensing and computer vision, this task remains challenging due to object and data complexity. The development of object extraction approaches is promoted through publicly available benchmark datasets and evaluation frameworks. Many aspects of performance evaluation have already been studied. This study collects the best practices from literature, puts the various aspects in one evaluation framework, and demonstrates its usefulness to a case study on mapping object outlines. The evaluation framework includes five dimensions: the robustness to changes in resolution, input, location, parameters, and application. Examples for investigating these dimensions are provided, as well as accuracy measures for their qualitative analysis. The measures consist of time efficiency and a procedure for line-based accuracy assessment regarding quantitative completeness and spatial correctness. The delineation approach to which the evaluation framework is applied was previously introduced and is substantially improved in this study.

Submitted 3 April, 2019; originally announced April 2019.

Comments: unpublished study (15 pages)

arXiv:1904.03692 [pdf, other]

Unsupervised Domain Adaptation for Multispectral Pedestrian Detection

Authors: Dayan Guan, Xing Luo, Yanpeng Cao, Jiangxin Yang, Yanlong Cao, George Vosselman, Michael Ying Yang

Abstract: Multimodal information (e.g., visible and thermal) can generate robust pedestrian detections to facilitate around-the-clock computer vision applications, such as autonomous driving and video surveillance. However, it still remains a crucial challenge to train a reliable detector working well in different multispectral pedestrian datasets without manual annotations. In this paper, we propose a novel unsupervised domain adaptation framework for multispectral pedestrian detection, by iteratively generating pseudo annotations and updating the parameters of our designed multispectral pedestrian detector on target domain. Pseudo annotations are generated using the detector trained on source domain, and then updated by fixing the parameters of detector and minimizing the cross entropy loss without back-propagation. Training labels are generated using the pseudo annotations by considering the characteristics of similarity and complementarity between well-aligned visible and infrared image pairs. The parameters of detector are updated using the generated labels by minimizing our defined multi-detection loss function with back-propagation. The optimal parameters of detector can be obtained after iteratively updating the pseudo annotations and parameters. Experimental results show that our proposed unsupervised multimodal domain adaptation method achieves significantly higher detection performance than the approach without domain adaptation, and is competitive with the supervised multispectral pedestrian detectors.

Submitted 7 April, 2019; originally announced April 2019.

arXiv:1810.10438 [pdf, other]

UAVid: A Semantic Segmentation Dataset for UAV Imagery

Authors: Ye Lyu, George Vosselman, Guisong Xia, Alper Yilmaz, Michael Ying Yang

Abstract: Semantic segmentation has been one of the leading research interests in computer vision recently. It serves as a perception foundation for many fields, such as robotics and autonomous driving. The fast development of semantic segmentation is attributable in large part to large-scale datasets, especially for deep learning methods. There already exist several semantic segmentation datasets for comparison among semantic segmentation methods in complex urban scenes, such as the Cityscapes and CamVid datasets, where the side views of the objects are captured with a camera mounted on the driving car. There also exist semantic labeling datasets for the airborne images and the satellite images, where the top views of the objects are captured. However, only a few datasets capture urban scenes from an oblique Unmanned Aerial Vehicle (UAV) perspective, where both the top view and the side view of the objects can be observed, providing more information for object recognition. In this paper, we introduce our UAVid dataset, a new high-resolution UAV semantic segmentation dataset as a complement, which brings new challenges, including large scale variation, moving object recognition and temporal consistency preservation. Our UAV dataset consists of 30 video sequences capturing 4K high-resolution images in slanted views. In total, 300 images have been densely labeled with 8 classes for the semantic labeling task. We have provided several deep learning baseline methods with pre-training, among which the proposed Multi-Scale-Dilation net performs the best via multi-scale feature extraction. Our UAVid website and the labeling tool have been published at https://uavid.nl/.

Submitted 18 May, 2020; v1 submitted 24 October, 2018; originally announced October 2018.

Comments: Accepted by ISPRS Journal of Photogrammetry and Remote Sensing

arXiv:1807.09562 [pdf, other]

Change Detection between Multimodal Remote Sensing Data Using Siamese CNN

Authors: Zhenchao Zhang, George Vosselman, Markus Gerke, Devis Tuia, Michael Ying Yang

Abstract: Detecting topographic changes in the urban environment has always been an important task for urban planning and monitoring. In practice, remote sensing data are often available in different modalities and at different time epochs. Change detection between multimodal data can be very challenging since the data show different characteristics. Given 3D laser scanning point clouds and 2D imagery from different epochs, this paper presents a framework to detect building and tree changes. First, the 2D and 3D data are transformed to image patches, respectively. A Siamese CNN is then employed to detect candidate changes between the two epochs. Finally, the candidate patch-based changes are grouped and verified as individual object changes. Experiments on the urban data show that 86.4% of patch pairs can be correctly classified by the model.

Submitted 25 July, 2018; originally announced July 2018.

arXiv:1807.09546 [pdf]

Patch-based Evaluation of Dense Image Matching Quality

Authors: Zhenchao Zhang, Markus Gerke, George Vosselman, Michael Ying Yang

Abstract: Airborne laser scanning and photogrammetry are two main techniques to obtain 3D data representing the object surface. Due to the high cost of laser scanning, we want to explore the potential of using point clouds derived by dense image matching (DIM) as effective alternatives to laser scanning data. We present a framework to evaluate point clouds from dense image matching and derived Digital Surface Models (DSM) based on automatically extracted sample patches. Dense matching error and noise level are evaluated quantitatively at both the local level and whole block level. Experiments show that the optimal vertical accuracy achieved by dense matching is as follows: the mean offset to the reference data is 0.1 Ground Sampling Distance (GSD); the maximum offset goes up to 1.0 GSD. When additional oblique images are used in dense matching, the mean deviation, the variation of mean deviation and the level of random noise all improve. We also detect a bias between the point cloud and DSM from a single photogrammetric workflow. This framework also reveals inhomogeneity in the distribution of the dense matching errors due to an over-fitted BBA network. Meanwhile, suggestions are given on the photogrammetric quality control.

Submitted 25 July, 2018; originally announced July 2018.

Comments: 16 pages

Journal ref: International Journal of Applied Earth Observation and Geoinformation, 2018

arXiv:1709.01813 [pdf]

Towards Automated Cadastral Boundary Delineation from UAV Data

Authors: Sophie Crommelinck, Michael Ying Yang, Mila Koeva, Markus Gerke, Rohan Bennett, George Vosselman

Abstract: Unmanned aerial vehicles (UAV) are evolving as an alternative tool to acquire land tenure data. UAVs can capture geospatial data at high quality and resolution in a cost-effective, transparent and flexible manner, from which visible land parcel boundaries, i.e., cadastral boundaries, are delineable. This delineation is to no extent automated, even though physical objects automatically retrievable through image analysis methods mark a large portion of cadastral boundaries. This study proposes (i) a workflow that automatically extracts candidate cadastral boundaries from UAV orthoimages and (ii) a tool for their semi-automatic processing to delineate final cadastral boundaries. The workflow consists of two state-of-the-art computer vision methods, namely gPb contour detection and SLIC superpixels, that are transferred to remote sensing in this study. The tool combines the two methods, allows a semi-automatic final delineation and is implemented as a publicly available QGIS plugin. The approach does not yet aim to provide a comparable alternative to manual cadastral mapping procedures. However, the methodological development of the tool towards this goal is developed in this paper. A study with 13 volunteers investigates the design and implementation of the approach and gathers initial qualitative as well as quantitative results. The study revealed points for improvement, which are prioritized based on the study results and which will be addressed in future work.

Submitted 6 September, 2017; originally announced September 2017.

Comments: Report on current state (August 2017) of PhD work of first author. Further info: https://its4land.com/automate-it-wp5/

Showing 1–22 of 22 results for author: Vosselman, G