Article

An Online 3D Modeling Method for Pose Measurement under Uncertain Dynamic Occlusion Based on Binocular Camera

1 State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
2 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
3 State Key Laboratory for Turbulence and Complex Systems, Department of Advanced Manufacturing and Robotics, Beijing Innovation Center for Engineering Science and Advanced Technology (BIC-ESAT), College of Engineering, Peking University, Beijing 100871, China
* Author to whom correspondence should be addressed.
Sensors 2023, 23(5), 2871; https://doi.org/10.3390/s23052871
Submission received: 3 February 2023 / Revised: 24 February 2023 / Accepted: 3 March 2023 / Published: 6 March 2023
(This article belongs to the Special Issue Recent Advances in Robotics and Intelligent Mechatronics Systems)

Abstract

3D modeling plays a significant role in many industrial applications that require geometric information for pose measurement, such as grasping and spraying. Because the poses of workpieces on a production line change randomly, the demand for online 3D modeling has increased and many researchers have focused on it. However, online 3D modeling has not been fully solved due to the occlusion of uncertain dynamic objects that disturb the modeling process. In this study, we propose an online 3D modeling method under uncertain dynamic occlusion based on a binocular camera. Firstly, focusing on uncertain dynamic objects, a novel dynamic object segmentation method based on motion consistency constraints is proposed, which achieves segmentation by random sampling and pose hypothesis clustering without any prior knowledge about the objects. Then, in order to better register the incomplete point cloud of each frame, an optimization method based on local constraints of overlapping view regions and a global loop closure is introduced. It establishes constraints in the covisibility regions between adjacent frames to optimize the registration of each frame, and it also establishes them between global closed-loop frames to jointly optimize the entire 3D model. Finally, a confirmatory experimental workspace is designed and built to verify and evaluate our method. Our method achieves online 3D modeling under uncertain dynamic occlusion and acquires an entire 3D model. The pose measurement results further demonstrate its effectiveness.

1. Introduction

With the increasing demands of intelligent manufacturing and industrial automation, digital three-dimensional (3D) models of workpieces play an increasingly important role in production processes [1,2], such as grasping and spraying. Here, the 3D model refers to the whole 3D point cloud rather than the 2.5D point cloud of a single frame captured by a depth sensor.
3D modeling applied in factories can be performed offline or online [3,4,5]. Offline 3D modeling methods can achieve high accuracy and are often used in high-precision applications, such as defect detection; however, their efficiency is low and their processing is performed ex situ. Online 3D modeling is often used to measure the pose of the workpiece as a standard guide [6,7,8]. For example, the pose of the workpiece is necessary for robotic spray painting, and the 3D model is an important prerequisite for pose measurement. Because random pose changes often occur during workpiece processing on the production line, online 3D modeling becomes more important for pose measurement. Moreover, an entire 3D model contains rich geometric information, which can guarantee the accuracy of pose measurement, whereas the 2.5D point cloud obtained from a single view is not sensitive to the shape of the workpiece, resulting in a decrease in pose accuracy. Although there are many online 3D modeling methods, online 3D modeling has not been fully solved because the workpiece is often occluded by uncertain dynamic objects, such as tools. Occlusion also makes the workpiece point cloud of each frame incomplete, which affects the accuracy and reliability of registration.
Common online 3D modeling methods can be divided into two categories. One is to arrange multiple depth sensors around the workpiece, and register point clouds collected from multiple perspectives [2,7,9,10]. Wang et al. [2] arranged multiple laser scanners around the aircraft to obtain point clouds from multiple perspectives and calculated the initial poses by detecting markers placed around the aircraft in advance. They also proposed a density-invariant area-based method to improve registration accuracy. Finally, the 3D model was used to measure the poses of the key parts of aircraft. Gao et al. [7] modeled the workpiece by registering point clouds captured by four precisely calibrated RGB-D cameras arranged around the workpiece. Bi et al. [9] also utilized multiple laser scanners and registered point clouds from multiple views to acquire the entire point clouds for different kinds of complex workpieces. Then, 3D models were utilized to estimate the relative poses to the standard position for the corresponding spraying operations. These methods are convenient and fast, but they cannot handle fixed occlusion since all sensors are fixed. Furthermore, their cost is very high and the accuracy is sensitive to the sensor calibration parameters. In many cases, sensors need to be frequently calibrated. The other online 3D modeling method is to only use one depth sensor moving around the workpiece and fuse the depth data of each frame into the global implicit surface model of the workpiece [1,11,12]. KinectFusion [11] is a typical representative, which registers the point cloud of each frame with a continuously updated model based on the truncated signed distance function voxel approach. Inspired by KinectFusion, Lin et al. [12] used a depth camera rotating around the workpiece to collect depth images and used the KinectFusion algorithm to build an entire 3D model for estimating the current pose and adjusting the robot path. These methods can achieve good performance, but their accuracy will decrease owing to occlusion. Both categories can achieve online 3D modeling, but neither can handle the possible dynamic occlusion.
Recently, many researchers have started to focus on the occlusion problem caused by dynamic objects and to find ways to segment them. Existing methods can be classified into three categories: optical flow-based, point cloud clustering-based, and learning-based methods. A traditional segmentation approach is optical flow [13], where motion can be segmented by the inconsistency of feature point velocities. FlowFusion [13] applies the initial pose estimated from intensity and depth image pairs and the dense optical flow of the scene obtained by the Convolutional Neural Network for optical flow using Pyramid, Warping, and Cost volume (PWC-Net) [14] to calculate the 2D scene flow generated by the movement of objects, so as to segment dynamic objects. The problem is that it cannot handle large motion due to the assumption of grayscale invariance. Considering that points on the same object have similar properties, some researchers have adopted point cloud clustering methods to segment possible dynamic and static objects [15,16]; these methods are good at mining a small set of points whose motion patterns are significantly different from the rest of the point cloud. StaticFusion [15] adds K-means clustering [17] and static background segmentation of point clouds, treating the static probability of a pixel in the current frame as a continuous variable that is jointly optimized with the poses. Joint Visual Odometry-Scene Flow (Joint-VO-SF) [16] performs K-means clustering [17] in the 3D point cloud and treats each cluster as a rigid body for pose estimation; the initial pose estimates are used for dynamic and static object segmentation. Nevertheless, since the motion and size of objects in applications are uncertain, these methods are not suitable for scenarios with high uncertainty. With the development of deep learning techniques, some object detection and segmentation networks have been utilized to extract dynamic objects [18,19,20]. Co-Fusion [18] applies SharpMask [21] and conditional random field-based motion segmentation to detect moving objects. Mid-Fusion [19] applies Mask Regions with Convolutional Neural Network Features (Mask R-CNN) [22] instance segmentation and MaskFusion [23] edge refinement to detect dynamic objects and implements static dense reconstruction based on ElasticFusion [24]. PoseFusion [20] mainly targets human living scenarios; it uses the OpenPose network [25] to identify human joints as priors to perform a minimum cut, propose human body regions, and achieve background reconstruction. These methods are suitable for scenes where the objects are known beforehand. However, the number of object samples in many scenarios often does not meet the training requirements of learning-based methods, and the appearance of objects offline (during training) and online often differs. In conclusion, learning-based methods rely on priors and have limited adaptability, while the first two categories of methods lack robustness. The above methods cannot meet the needs of 3D modeling under uncertain dynamic occlusion.
In this paper, aiming at the uncertain dynamic occlusion problem, an online 3D modeling method based on a single binocular camera is proposed. The method does not rely on prior information about the objects; it achieves robust segmentation of feature points and corresponding point clouds (2D-3D) solely by analyzing the motion patterns of adjacent frames. By exploring the distribution of pose hypotheses calculated from 2D-3D point pairs obtained by repeated random sampling, the 2D-3D point pairs corresponding to dense regions of the pose distribution are chosen as the segmentation results. Initial poses between adjacent frames can be acquired from these 2D-3D point pairs. On this basis, considering that the point cloud of the occluded target object obtained after segmentation is incomplete, an optimization method based on local constraints of covisibility regions and global loop closure is proposed to optimize the local registration and the entire 3D model. Furthermore, we design and build an experimental workspace to verify the feasibility and effectiveness of our method.

2. Methods

Aiming at the task of 3D modeling under uncertain dynamic occlusion, we propose an online 3D modeling method, as shown in Figure 1. It mainly includes four steps: raw data processing and data association, dynamic object segmentation, 3D modeling of the target object and dynamic objects, and closed-loop optimization (bundle adjustment). For the original images captured by the binocular camera, stereo rectification [26] is performed to align the epipolar lines of the left and right images. Then, the depth information of a single frame can be calculated by disparity estimation methods. After that, data association between adjacent frames is established through feature point matching, image block correlation, or auxiliary texture. That is to say, it is necessary to obtain the image point-image point correspondences (2D-2D) and the image point-3D point correspondences (2D-3D) between adjacent frames. On this basis, dynamic object segmentation is carried out to divide these 2D-3D correspondences into different objects with different motions. The initial pose of each object is estimated from its 2D-3D correspondences. Subsequently, the initial poses and points are optimized by local constraints between adjacent frames. Constraints between multiple closed-loop frames with covisibility regions are also added to the optimization. Finally, the poses and 3D models of the target object and dynamic objects are acquired. In the following, uncertain dynamic object segmentation and 3D modeling based on closed-loop optimization are described in detail.

2.1. Object Segmentation by Clustering

Model for Dynamic Objects Segmentation. Given two consecutive images, the feature point correspondences p i and q i can be obtained first. For the dynamic object segmentation problem, we need to divide feature point correspondences so that the feature points in each divided point set conform to one motion model, the total number of motion models should be as small as possible, and the number of points in each set should be as many as possible. Therefore, the multiple dynamic object segmentation problem can be modeled as
S j = 1 n τ * * = argmin n τ j = 1 n τ p i S j r p i , M S j + V S j ,
where S j is j -th segmented feature point set, M S j means the motion model satisfied by points in S j , n τ denotes the number of divided objects, and V S j represents the reciprocal of the number of points in S j . r p i , M S j represents the reprojection error of a point p i on the motion model M S j . It is defined as follows:
r p i , M S j = p i 1 s i K M S j P i 2 ,
where s i is the depth of p i , P i is a 3D point that can be computed by triangulation.
This model combines the complexity of the segmentation, the number of inliers, and the motion consistency of each object, where $n_\tau$ reflects the complexity of the segmentation and $V_{S_j}$ reflects the number of inliers in each set. Motion consistency means that feature points belonging to the same object satisfy the same epipolar constraint:
$$p_i^T K^{-T} [t]_\times R K^{-1} q_i = 0, \tag{3}$$
where $K$ is the camera intrinsic matrix, $[t]_\times$ is the antisymmetric matrix of the translation vector $t$, and $(R, t)$ is the pose transformation.
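As a concrete illustration of the motion-consistency check in (3), the following minimal Python/NumPy sketch evaluates the epipolar residual for one candidate point pair under a toy pose hypothesis; the intrinsic matrix and the pose used here are assumed values, not parameters from our setup.

```python
import numpy as np

def epipolar_residual(p, q, K, R, t):
    """Residual of the epipolar constraint p^T K^-T [t]_x R K^-1 q for a pair
    of homogeneous pixel coordinates (p, q) under a pose hypothesis (R, t).
    A small |residual| means the pair is consistent with this motion model."""
    tx = np.array([[0.0, -t[2], t[1]],
                   [t[2], 0.0, -t[0]],
                   [-t[1], t[0], 0.0]])        # antisymmetric matrix [t]_x
    K_inv = np.linalg.inv(K)
    F = K_inv.T @ tx @ R @ K_inv               # fundamental matrix implied by (R, t)
    return float(p @ F @ q)

# Toy usage with an assumed intrinsic matrix and a small lateral translation.
K = np.array([[1000.0, 0.0, 960.0], [0.0, 1000.0, 600.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.1, 0.0, 0.0])
p = np.array([980.0, 610.0, 1.0])
q = np.array([975.0, 610.0, 1.0])
print(epipolar_residual(p, q, K, R, t))
```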
The best segmentation $\{S_j^*\}_{j=1}^{n_\tau^*}$ can be obtained by solving (1). To solve (1), we propose a segmentation method based on cluster analysis below.
Cluster Analysis. A pose hypothesis can be obtained from one sample of point correspondences, where one sample means drawing the smallest set of point pairs sufficient to compute a pose hypothesis. Pose hypotheses calculated from point pairs sampled on the same object must coincide in the pose parameter space. Based on this idea, cluster analysis can be used to solve (1): by finding the dense regions among the pose hypotheses, the object segmentation can be acquired.
(1) Pose Hypothesis Calculation. To compute a pose hypothesis $M_{S_j}$, the fundamental matrix $F$ and the homography matrix $H$ can be considered; the eight-point method [27] and the four-point method [28] can be used to calculate $F$ and $H$, respectively. Therefore, we have
$$\xi = g(P, Q), \tag{4}$$
where $\xi \in \mathfrak{se}(3)$ is the Lie algebra form of $M_{S_j}$, $P$ and $Q$ are the sampled matching point sets, and $g(\cdot,\cdot)$ represents the algorithm for solving the pose. The method needs to be selected according to its applicability, limitations, and possible degeneration [29]. In this paper, because the depth of the points can be obtained, the perspective-n-point (PnP) method [30] is adopted to calculate the pose hypotheses. Here, the PnP algorithm requires at least four 2D-3D point pairs to compute a pose.
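As an illustration of this step, the sketch below draws one random sample of 2D-3D pairs and turns it into a pose hypothesis with OpenCV's solvePnP; the sample size of six and the EPnP flag are assumptions made for the toy example rather than the exact settings used in our system.

```python
import numpy as np
import cv2

def sample_pose_hypothesis(pts3d, pts2d, K, sample_size=6, rng=None):
    """Draw one random sample of 2D-3D correspondences and compute a pose
    hypothesis via PnP. Returns a 6-vector (rotation vector, translation),
    i.e. the parameterization that is later clustered in pose space.
    pts3d: (N,3) points from the last frame; pts2d: (N,2) pixels in the
    current frame; K: 3x3 intrinsic matrix."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(pts3d), size=sample_size, replace=False)
    ok, rvec, tvec = cv2.solvePnP(
        pts3d[idx].astype(np.float64), pts2d[idx].astype(np.float64),
        K, None, flags=cv2.SOLVEPNP_EPNP)
    return np.hstack([rvec.ravel(), tvec.ravel()]) if ok else None

# Repeated sampling yields many hypotheses; hypotheses generated from pairs
# on the same rigid object concentrate in the pose parameter space:
# hypotheses = [sample_pose_hypothesis(P3, P2, K) for _ in range(500)]
```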
(2) Pose Hypotheses Clustering. Based on these pose hypotheses, a clustering algorithm is carried out. The Euclidean distance in the Lie algebra coordinate system is adopted to measure the distance between two pose hypotheses:
$$d_{ij} = \left\| \xi_i - \xi_j \right\|. \tag{5}$$
Considering that the number of objects in the scene is unknown, the clustering method proposed in [31] is introduced. This method can not only automatically and accurately determine the number of cluster centers, but is also very fast. After clustering, we select the neighborhood of each cluster center as a region $\Omega_l$. Usually, the number of regions is the same as the number of objects. We only need to re-use the 2D-3D point pairs that generated the pose hypotheses in each region $\Omega_l$ to compute a new pose, which is the pose of the corresponding object. However, due to the randomness of sampling, there may be regions made up of noise. To remove noise regions, we adopt the idea of the ordered residual kernel (ORK) [32]. We sort the number $\eta_l$ of pose hypotheses in each region $\Omega_l$ in descending order $\hat{\eta}_{l_1}, \hat{\eta}_{l_2}, \ldots, \hat{\eta}_{l_L}$, where $\forall k: \hat{\eta}_{l_k} \geq \hat{\eta}_{l_{k+1}}$. For one region $\Omega_{l_k}$ with $\hat{\eta}_{l_k}$ pose hypotheses, the 2D-3D point pairs used to calculate the pose hypotheses $\xi_{\Omega_{l_k}}^i$ in $\Omega_{l_k}$ are taken out and considered as effective points:
$$\left( \hat{P}_{\Omega_{l_k}}^i, \hat{Q}_{\Omega_{l_k}}^i \right) = G\!\left( \xi_{\Omega_{l_k}}^i \right), \quad \xi_{\Omega_{l_k}}^i \in \Omega_{l_k}, \tag{6}$$
where $i = 1, 2, \ldots, \hat{\eta}_{l_k}$, $\hat{P}_{l_k} = \{ \hat{P}_{\Omega_{l_k}}^i \}$, and $\hat{Q}_{l_k} = \{ \hat{Q}_{\Omega_{l_k}}^i \}$. $\xi_{\Omega_{l_k}}^i$ denotes a pose hypothesis in $\Omega_{l_k}$, and $G(\cdot)$ means taking out the 2D-3D point pairs $(\hat{P}_{l_k}, \hat{Q}_{l_k})$ that generated all hypotheses in $\Omega_{l_k}$. Within $(\hat{P}_{l_k}, \hat{Q}_{l_k})$, the number of point pairs coming from different objects has already been significantly reduced; in other words, this operation causes most of these 2D-3D point pairs to come from the same object. We carry out multiple sampling in $(\hat{P}_{l_k}, \hat{Q}_{l_k})$ once more and calculate pose hypotheses. The clustering algorithm in [31] is used again to find the cluster centers. The regions with only one cluster center are kept, and the regions with more than one cluster center are rejected. Finally, each remaining region corresponds to one object, and the pose of each object can be calculated from the 2D-3D point pairs corresponding to the pose hypotheses in its region. The above process is summarized in Algorithm 1.
Algorithm 1. Pose Estimation for Each Cluster Region $\Omega_{l_k}$
Input: feature points $(\hat{P}_{l_k}, \hat{Q}_{l_k})$ taken out from $\Omega_{l_k}$.
Output: rigid body pose $\xi_t^*$ and corresponding points $(\hat{P}_t^*, \hat{Q}_t^*)$.
1. Carry out multiple sampling in $(\hat{P}_{l_k}, \hat{Q}_{l_k})$ and obtain $(P_j, Q_j)$, $j = 1, \ldots, N$;
2. $\xi_j \leftarrow g(P_j, Q_j)$, $j = 1, 2, \ldots, N$;
3. Cluster $\xi_j$, $j = 1, 2, \ldots, N$, by [31];
4. Obtain cluster centers $\xi_c^{\Omega_l}$, regions $\Omega_l$, and poses $\xi_{k_l}$, $k_l = 1, \ldots, \eta_l$, in each $\Omega_l$, $l = 1, \ldots, L$;
5. Sort all regions in descending order according to $\eta_l$: $\Omega_{l_1}, \Omega_{l_2}, \Omega_{l_3}, \ldots$;
6. if there is only one cluster center, that is, $L = 1$ and it is significant, then
7.   $(\hat{P}_t, \hat{Q}_t) \leftarrow G(\xi_{k_l})$, $k_l = 1, \ldots, \eta_l$, $\xi_{k_l} \in \Omega_l$;
8.   $\xi_t^* \leftarrow g(\hat{P}_t, \hat{Q}_t)$ and select the inliers $(\hat{P}_t^*, \hat{Q}_t^*)$ that satisfy $\xi_t^*$;
9.   return $\xi_t^*$ and $(\hat{P}_t^*, \hat{Q}_t^*)$;
10. end if
11. Reject the region $\Omega_{l_k}$.
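The clustering in steps 3-4 of Algorithm 1 follows the density-peaks idea of [31]. The simplified sketch below applies it to 6-vector pose hypotheses (rotation vector plus translation); the cutoff distance and the center-selection rule are illustrative simplifications, not the tuned values used in our experiments.

```python
import numpy as np

def density_peak_clusters(xi, d_c=0.05, center_gamma=0.5):
    """Simplified density-peaks clustering [31] of pose hypotheses.
    xi: (N,6) array of hypotheses (rotation vector + translation).
    d_c: cutoff distance; center_gamma: relative threshold on rho*delta.
    Returns (center_indices, labels), labels[i] being the cluster of xi[i]."""
    D = np.linalg.norm(xi[:, None, :] - xi[None, :, :], axis=-1)   # pairwise distances
    rho = (D < d_c).sum(axis=1) - 1                  # local density of each hypothesis
    order = np.argsort(-rho)                         # indices by decreasing density
    delta = np.empty(len(xi))                        # distance to nearest denser point
    nearest_denser = np.full(len(xi), -1)
    delta[order[0]] = D[order[0]].max()
    for rank, i in enumerate(order[1:], start=1):
        denser = order[:rank]
        j = denser[np.argmin(D[i, denser])]
        delta[i], nearest_denser[i] = D[i, j], j
    gamma = rho * delta                              # centers combine high density and isolation
    centers = np.where(gamma >= center_gamma * gamma.max())[0]
    if order[0] not in centers:                      # the densest point is always kept as a center
        centers = np.append(centers, order[0])
    labels = np.full(len(xi), -1)
    labels[centers] = np.arange(len(centers))
    for i in order:                                  # propagate labels from denser neighbors
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    return centers, labels
```

Each resulting cluster region then collects the 2D-3D pairs whose hypotheses lie near its center, from which the object pose is recomputed as in steps 7-8.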

2.2. 3D Modeling Based on Graph Optimization

Notation. We take the coordinate system of object $O_1$ as the reference coordinate system and illustrate the notation in Figure 2. Each object coordinate system is established at the centroid of the object. Let $T_{O_1 C_k}, T_{O_1 O_l,k} \in SE(3)$ denote the pose transformation matrices of the camera $C$ and object $O_l$ in the $O_1$ coordinate system at the $k$-th frame, respectively, where $k \in F$ and $l = 2, \ldots, N_L$ ($l \neq 1$) indexes the $l$-th object; $N_L$ is the total number of objects tracked in the $k$-th frame. The camera pose is taken as the pose of object $O_1$ itself. In the $k$-th frame $I_k$, an image feature point on $O_1$ is denoted by $p_{I_k}^i = (p_u^i, p_v^i, 1)_{I_k}^T$, $i = 1, 2, \ldots, N_{uv}$, a homogeneous coordinate representation. Similarly, an image feature point on $O_l$ in $I_k$ is represented by $p_{O_l I_k}^j$, $j = 1, 2, \ldots, N_{uv}^{O_l}$. $P_{O_1,k}^i$ and $P_{O_l O_1,k}^j$ denote the $i$-th 3D point on $O_1$ and the $j$-th 3D point on $O_l$ in the $O_1$ coordinate system in frame $I_k$, respectively. We use the notation $T_{k-1,k}^{O_l O_1}$ to denote the motion of $O_l$ from $I_{k-1}$ to $I_k$ expressed in the $O_1$ coordinate system.
Constraints and Factor Graph Establishment. After object segmentation, the 2D-3D corresponding point pairs and the initial pose of each object can be obtained. Because it is unknown which object is the target object, all objects need to be estimated. Moreover, these initial poses alone cannot register the point clouds well because the point cloud of each frame is incomplete due to occlusion. Therefore, an optimization method based on local constraints and global loop closure is introduced to acquire accurate poses.
(1) Reprojection Constraints. For a 3D point $P_{O_1,k-1}^i$ on $O_1$ in the $O_1$ coordinate system, according to the camera projection model, we have
$$p_{I_k}^i = \pi\!\left( T_{O_1 C_k}^{-1} P_{O_1,k-1}^i \right) = K \left[ T_{O_1 C_k}^{-1} P_{O_1,k-1}^i \right]_{1:3}, \tag{7}$$
where $\pi$ is the projection function, $K$ is the $3 \times 3$ camera intrinsic matrix, and $[\cdot]_{1:3}$ means taking the first three rows of the matrix; in other words, we take the first three components of the $4 \times 1$ column vector $T_{O_1 C_k}^{-1} P_{O_1,k-1}^i$. Given a set of 3D points $\{ P_{O_1,k-1}^i \mid k \in F, i \in \Gamma \}$ and its corresponding 2D image points $\{ \tilde{p}_{I_k}^i \mid k \in F, i \in \Gamma \}$ observed in $I_k$, where $\Gamma$ is the index set of corresponding point pairs, we can compute the reprojection error:
$$e_{I_k,\mathrm{reproj}}^i = \tilde{p}_{I_k}^i - \pi\!\left( T_{O_1 C_k}^{-1} P_{O_1,k-1}^i \right), \quad i \in \Gamma. \tag{8}$$
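A minimal sketch of the reprojection residual (8), assuming a 4 × 4 homogeneous camera pose $T_{O_1 C_k}$ and normalization by depth in pixel space, is given below.

```python
import numpy as np

def reprojection_error(p_obs, P_o1, T_o1_ck, K):
    """Reprojection residual of Equation (8): observed pixel minus the
    projection into frame I_k of a 3D point given in the O_1 frame.
    p_obs: (2,) pixel; P_o1: (3,) point; T_o1_ck: 4x4 camera pose in the
    O_1 frame; K: 3x3 intrinsic matrix."""
    P_h = np.append(P_o1, 1.0)                     # homogeneous 3D point
    P_cam = (np.linalg.inv(T_o1_ck) @ P_h)[:3]     # transform into the camera frame
    uvw = K @ P_cam                                # pinhole projection
    return p_obs - uvw[:2] / uvw[2]
```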
For a 3D point $P_{O_l O_1,k}^i$ on object $O_l$, which can be acquired as $T_{O_1 C_k} P_{O_l C_k}^i$, where $P_{O_l C_k}^i$ is the corresponding 3D point on $O_l$ in the camera coordinate system, the object motion consistency implies that the coordinates of the point in the $O_l$ coordinate system do not change no matter how object $O_l$ moves. Therefore, we have
$$P_{O_l O_l,\varphi}^i = T_{O_1 O_l,k}^{-1} P_{O_l O_1,k}^i = T_{O_1 O_l,\varphi}^{-1} P_{O_l O_1,\varphi}^i, \quad \varphi \in F, \tag{9}$$
where $P_{O_l O_l,\varphi}^i$ denotes a 3D point of object $O_l$ in the object $O_l$ coordinate system and $\varphi \in F$ can be any frame. When $\varphi = k-1$, the motion of the 3D point can be derived by multiplying both sides of the second equality in (9) by $T_{O_1 O_l,k}$ on the left:
$$P_{O_l O_1,k}^i = T_{O_1 O_l,k}\, T_{O_1 O_l,k-1}^{-1}\, P_{O_l O_1,k-1}^i = T_{k-1,k}^{O_l O_1}\, P_{O_l O_1,k-1}^i, \tag{10}$$
where $T_{k-1,k}^{O_l O_1}$ is the motion of object $O_l$, which transforms 3D points on object $O_l$ from frame $k-1$ to frame $k$ in the $O_1$ coordinate system. On this basis, the reprojection constraints of object $O_l$ can be established. Given the 2D image points $\{ \tilde{p}_{O_l I_k}^i \mid i \in \Gamma_{O_l} \}$ on $O_l$ in $I_k$ and the 3D points $\{ P_{O_l O_1,k-1}^i \mid k \in F, i \in \Gamma_{O_l} \}$ in $I_{k-1}$ in the $O_1$ coordinate system, the reprojection error is
$$e_{O_l I_k,\mathrm{reproj}}^i = \tilde{p}_{O_l I_k}^i - \pi\!\left( T_{O_1 C_k}^{-1}\, T_{k-1,k}^{O_l O_1}\, P_{O_l O_1,k-1}^i \right) = \tilde{p}_{O_l I_k}^i - \pi\!\left( T_{k-1,k}^{O_l C}\, P_{O_l O_1,k-1}^i \right). \tag{11}$$
Here, $T_{k-1,k}^{O_l O_1} = T_{O_1 C_k}\, T_{k-1,k}^{O_l C}$ can also be refined through the optimization of $T_{O_1 C_k}$ and $T_{k-1,k}^{O_l C}$.
(2) Object Motion Constraints. On the basis of the object motion $T_{k-1,k}^{O_l O_1}$, the 3D point in $I_{k-1}$ after this motion transformation should be the same as the corresponding 3D point in $I_k$. Accordingly, we establish the object motion constraints:
$$e_{O_l O_1,\mathrm{motion}}^i = P_{O_l O_1,k}^i - T_{k-1,k}^{O_l O_1}\, P_{O_l O_1,k-1}^i. \tag{12}$$
All points on object $O_l$ satisfy the same matrix $T_{k-1,k}^{O_l O_1}$.
(3) Smoothness Constraints. Taking into account the smoothness of the motion of object $O_l$ between adjacent camera frames, the smoothness constraint is defined as
$$e_{O_l,\mathrm{smooth}}^{k,k-1,k-2} = \left( T_{k-2,k-1}^{O_l O_1} \right)^{-1} T_{k-1,k}^{O_l O_1}. \tag{13}$$
That is to say, the moving speed of a dynamic object between adjacent frames is assumed to be roughly constant.
All of the above constraints are added to the factor graph. The variables that need to be optimized are $\Psi = \{ T_{O_1 C_k}, P_{O_1,k}^i, T_{k-1,k}^{O_l O_1}, P_{O_l O_1,k}^i \}$. The bundle adjustment (BA) problem can then be written as
$$\Psi^* = \operatorname*{argmin}_{\Psi} \sum_{k \in F} \left( \left\| e_{I_k,\mathrm{reproj}}^i \right\|^2 + \left\| e_{O_l I_k,\mathrm{reproj}}^i \right\|^2 + \left\| e_{O_l O_1,\mathrm{motion}}^i \right\|^2 + \left\| e_{O_l,\mathrm{smooth}}^{k,k-1,k-2} \right\|^2 \right). \tag{14}$$
Problem (14) can be solved using the Levenberg-Marquardt method [33] to optimize the factor graph.
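As a stripped-down illustration of how (14) can be minimized with Levenberg-Marquardt, the sketch below optimizes only the camera poses and the 3D points of $O_1$ (the first reprojection term) with SciPy; in the paper the full factor graph, including the object-motion and smoothness terms, is built in g2o, so the parameterization and variable layout here are assumptions.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation as Rot

def project(pose6, P, K):
    """Project 3D points P (M,3) with a 6-vector pose (rotation vector, t)
    that maps O_1-frame points into the camera frame (the role of the
    inverse of T_{O1 Ck}); returns (M,2) pixel coordinates."""
    Pc = Rot.from_rotvec(pose6[:3]).apply(P) + pose6[3:]
    uvw = Pc @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def ba_residuals(x, obs, K, n_frames, n_points):
    """Stack the reprojection residuals of all frames (only the first term
    of (14)); obs[k] is the (n_points, 2) array of pixels observed in frame k."""
    poses = x[:6 * n_frames].reshape(n_frames, 6)
    pts = x[6 * n_frames:].reshape(n_points, 3)
    res = [(uv - project(poses[k], pts, K)).ravel() for k, uv in obs.items()]
    return np.concatenate(res)

# x0 stacks the initial poses (from segmentation/PnP) and triangulated points;
# method="lm" selects SciPy's Levenberg-Marquardt solver.
# sol = least_squares(ba_residuals, x0, method="lm", args=(obs, K, n_frames, n_points))
```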
Here, we not only establish the above constraint relationships between adjacent frames, but also establish constraints between closed-loop frames with covisibility regions, so as to further optimize relevant variables. After optimization, 3D models and object poses can be acquired at last.

3. Results

Our method was verified in an experimental workspace that we built ourselves. We modeled three objects under uncertain dynamic occlusion in the workspace and used the modeling results to estimate poses. The three objects were a complex surface part ($W_a$), a 3D-printed block ($W_b$), and an earphone case ($W_c$). Their sizes were 39.52 mm × 24.61 mm × 38.00 mm, 40.00 mm × 36.00 mm × 30.47 mm, and 47.10 mm × 55.24 mm × 28.67 mm, respectively. In the following experiments, we use $W_a$ as an example for illustration.

3.1. Introduction of Confirmatory Experimental Workspace

As shown in Figure 3, our workspace was mainly composed of a machining system, a positioner, and a vision system. The machining system consisted of a UR5 robotic arm and an end effector. Tools or other objects were clamped by the end effector, and the robotic arm drove the tools to perform the related operations, such as grasping or spraying. Of course, many kinds of tools were not known in advance. The positioner comprised a turntable and a rotating shaft, and the target object was installed in the slot of the turntable. These two devices could drive the target object to change its pose, and they consisted of two high-precision servo motors that enabled accurate rotational control.
The vision system mainly included two high-resolution Basler acA1920-50gm cameras (2.3 megapixels) and an auxiliary texture device. The two cameras were symmetrically arranged with a large baseline in order to improve the accuracy of the depth measurements. By adjusting the focal length, the two cameras could focus on the central axis of the turntable. The auxiliary texture device was installed between the two cameras; it needed to be synchronized with the cameras to provide texture when untextured objects were measured. After the cameras and the auxiliary texture device were installed and fixed, the intrinsic and extrinsic parameters of the two cameras were calibrated using the stereo calibration method [34]. After that, stereo rectification [26] was implemented to align the epipolar lines of the left and right images. On this basis, the depth information in a single frame could be obtained by disparity or depth estimation methods, and the texture of the objects or the auxiliary texture could be used to acquire point correspondences. The cameras were also mounted on a platform controlled by servo motors and could be driven to move.
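The calibration and rectification stage can be sketched with standard OpenCV calls as follows; the checkerboard corner lists, the flag choices, and the return layout are placeholders, not the exact procedure used in the workspace.

```python
import cv2

def calibrate_and_rectify(obj_pts, img_pts_l, img_pts_r, image_size=(1920, 1200)):
    """Stereo calibration [34] followed by rectification [26] so that the
    epipolar lines of the left/right images become horizontal.
    obj_pts: list of (N,3) board points; img_pts_*: lists of (N,1,2) corners."""
    _, K1, d1, _, _ = cv2.calibrateCamera(obj_pts, img_pts_l, image_size, None, None)
    _, K2, d2, _, _ = cv2.calibrateCamera(obj_pts, img_pts_r, image_size, None, None)
    # Estimate the rotation/translation between the cameras with intrinsics fixed.
    _, K1, d1, K2, d2, Rm, T, E, F = cv2.stereoCalibrate(
        obj_pts, img_pts_l, img_pts_r, K1, d1, K2, d2, image_size,
        flags=cv2.CALIB_FIX_INTRINSIC)
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, d1, K2, d2, image_size, Rm, T)
    map_l = cv2.initUndistortRectifyMap(K1, d1, R1, P1, image_size, cv2.CV_32FC1)
    map_r = cv2.initUndistortRectifyMap(K2, d2, R2, P2, image_size, cv2.CV_32FC1)
    return map_l, map_r, Q   # map_* are (map1, map2) pairs for cv2.remap; Q reprojects disparity
```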
In our experiment, the cameras were fixed and the target object was rotated by the positioner during modeling. To test the effectiveness of the proposed method, we kept the tool moving between the target object and the cameras so that the target object was occluded throughout the whole process. The auxiliary texture was utilized to obtain the corresponding points of the left and right images. The resolution of the captured images was 1920 × 1200. Since the background was removed, the actual image region occupied by the target object and the tool did not exceed 1000 × 1200 pixels. The vision system continuously observed the object and generated 2.5D point clouds within the field of view.
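The per-frame 2.5D point cloud can then be generated from a rectified pair, for instance with semi-global matching [36] and the reprojection matrix Q from rectification, as in the following sketch; the SGBM parameters are placeholders.

```python
import cv2
import numpy as np

def frame_point_cloud(rect_left, rect_right, Q, num_disp=160, block=5):
    """Disparity by semi-global matching [36] on a rectified grayscale pair,
    then reprojection into a 2.5D point cloud (one 3D point per valid pixel)."""
    sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=num_disp,
                                 blockSize=block,
                                 P1=8 * block * block, P2=32 * block * block)
    disp = sgbm.compute(rect_left, rect_right).astype(np.float32) / 16.0
    pts = cv2.reprojectImageTo3D(disp, Q)          # (H,W,3); invalid where disp <= 0
    return pts[disp > 0]
```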

3.2. Demonstration of the Clustering Process

Dynamic object segmentation can be implemented by the algorithm described in Section 2.1. Before segmentation, the image corresponding points (2D-2D) of the current frame and the last frame are first obtained using the Grid-based Motion Statistics (GMS) method [35] and image block correlation [36]. Because 3D points corresponding to the image points of the last frame are known, 2D-2D point pairs can be converted into the correspondences between the image points in the current frame and the 3D points in the last frame (2D-3D point pairs). Based on these 2D-3D point pairs, the segmentation method is carried out.
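As an illustration of this data-association step, the sketch below matches ORB features between the two frames and filters them with GMS [35]; matchGMS ships with opencv-contrib (the module path may vary across OpenCV versions), and ORB is used here only as a convenient detector for the sketch.

```python
import cv2

def gms_matches(img_prev, img_curr, n_features=5000):
    """ORB features matched by brute-force Hamming distance, then filtered
    with Grid-based Motion Statistics (GMS) [35]. Each surviving match links
    a pixel in the last frame (whose depth is known) to a pixel in the
    current frame, i.e. one 2D-3D correspondence."""
    orb = cv2.ORB_create(n_features)
    kp1, des1 = orb.detectAndCompute(img_prev, None)
    kp2, des2 = orb.detectAndCompute(img_curr, None)
    raw = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=False).match(des1, des2)
    good = cv2.xfeatures2d.matchGMS(img_prev.shape[:2][::-1], img_curr.shape[:2][::-1],
                                    kp1, kp2, raw, withRotation=False, withScale=False)
    return [(kp1[m.queryIdx].pt, kp2[m.trainIdx].pt) for m in good]
```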
In order to present the clustering process more graphically, we chose the translation vector of the pose transformation to represent the pose hypothesis. The 3D pose parameter space is displayed in a 3D Cartesian coordinate system, and the translations of the camera and objects have a real scale. The intermediate results of the segmentation of $W_a$ and the tool are shown in Figure 4. Red circles and blue circles in each grayscale image represent the points on $W_a$ and the tool, respectively. The black part of the grayscale image indicates that the background has been removed. In the pose parameter space, shades of color show the distribution of pose hypotheses: the darker the color (blue), the denser it is; the lighter the color (yellow), the sparser it is. It can be seen from Figure 4 that there are two obvious cluster centers, which fully reflects the existence of two objects with inconsistent motion. The 2D-3D point pairs of each object are thus acquired.
In our experiment, dynamic object segmentation is only performed at the beginning and when the number of points on the objects is less than a certain threshold. In other frames, the points on each object are separately tracked.

3.3. Modeling Results and Running Time

After object segmentation, the initial pose of each object can be acquired. Then, the factor graph optimization described in Section 2.2 is implemented to refine the pose of each object. The outputs are the camera poses and the 3D models of the target object and the dynamic objects, as shown in Figure 5. Usually, we take the object with the largest number of 2D-3D point pairs as object $O_1$. In our experiment, the target object has the largest number of point pairs, so it is object $O_1$, and the camera poses are the poses of the target object point clouds.
In Figure 5, all results are represented in the reference coordinate system. The blue line denotes the trajectory of the camera, and each magenta pyramid denotes the 6-DOF pose of each frame. For a clearer presentation, we display the images of the 5th, 15th, 21st, and 32nd frames captured by the camera. In each image, the green and blue marks indicate the tracked feature points on the target object and the tool, respectively. The 3D model of $W_a$ is shown in the center of Figure 5, surrounded by the 3D points of the tool. The different colors of the tool's 3D point clouds indicate the observations of the tool in different frames. With the movement of the target object and the tool, occluded parts are observed and modeled in other frames, as shown by the differently colored regions in the middle of the target object's 3D model. The colors on the 3D model correspond to the colors of the tool, demonstrating that the parts occluded in other frames are observed in the current frame. It can be seen that the occluded parts are well modeled after optimization.
We also tested the running time of the optimization. The experiments were performed on an Intel Core i7-8700 3.2 GHz desktop with 24 GB RAM. For the three objects $W_a$, $W_b$, and $W_c$ to be modeled, we measured the time consumed by each important part of the program in each image frame and calculated the average running time, as shown in the first four rows of Table 1. The last row of Table 1 gives the time for the closed-loop optimization of all frames when closed-loop frames exist. In our experiment, we tracked 2000 points on the target object and 500 points on the moving tool. Local BA and global BA were implemented using the g2o library [37].

3.4. Comparison Experiment (Accuracy Evaluation)

In order to evaluate the registration accuracy of the incomplete point cloud of each frame under uncertain dynamic occlusion, we compared our method with three other methods. The first was an incremental open-loop registering method, in which the method in [38] was utilized to calibrate the extrinsic parameters between the rotation axis and the camera, and the point clouds were then registered. The other two methods were the classic Iterative Closest Point (ICP) registration method [39] and an advanced method, Correntropy-based Bidirectional Generalized Iterative Closest Point (CoBigICP) [40]. These three methods are good registration methods in the absence of moving objects, but they cannot solve 3D modeling under uncertain dynamic occlusion. Specifically, when we used the latter two methods to stitch point clouds that included the dynamic objects, they often did not converge or converged to wrong solutions, which indicates the necessity of uncertain dynamic object segmentation. Therefore, we first used our object segmentation method to segment the target object and the moving tool in each frame, and then stitched the incomplete point clouds of the target object with the above three methods. Finally, the registration results of the three objects $W_a$, $W_b$, and $W_c$ obtained using the three methods and our method were compared with their ground truths. The ground truths were sets of single-frame 2.5D point clouds captured from multiple perspectives by a high-precision 3D camera with a nominal accuracy of 0.02 mm. The model obtained by each method was aligned with the ground truth. Then, the registration error (root mean squared error of the Euclidean distance from each point in the 3D model to its nearest point in the ground truth) and the uncertainty (standard deviation) were calculated, as shown in Table 2, where the values in brackets are the corresponding standard deviations. It can be seen from the table that our method performed better than the other three methods.
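The registration-error metric reported in Table 2 can be computed as in the following sketch, assuming the reconstructed model has already been aligned with the ground-truth cloud.

```python
import numpy as np
from scipy.spatial import cKDTree

def registration_error(model_pts, gt_pts):
    """Distance from every model point to its nearest ground-truth point,
    reported as RMSE and standard deviation; both clouds (in millimeters)
    are assumed to be expressed in the same, already-aligned frame."""
    d, _ = cKDTree(gt_pts).query(model_pts)        # per-point nearest-neighbor distances
    return float(np.sqrt(np.mean(d ** 2))), float(np.std(d))
```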
The pose measurement results for the target object were also evaluated. The initial entire 3D model was first acquired. Then, the target object was driven by the turntable to perform pre-defined relative pose transformations. After each pose transformation, the 3D model was acquired again at the transformed position and registered with the initial entire 3D model to estimate the relative pose transformation. The pre-defined pose transformations could be regarded as the ground truths of the pose transformations between two positions. Here, we provide three pre-defined relative pose transformations $T_k = (\theta_k, t_k)$ in the camera coordinate system, where $k = 1, 2, 3$, as shown by the ground truths (GT) in Table 3; $\theta_k = (\theta_k^x, \theta_k^y, \theta_k^z)$ denotes the Euler angles about the $x$-, $y$-, and $z$-axes (in degrees) and $t_k = (t_k^x, t_k^y, t_k^z)$ denotes the translation (in millimeters). These three pre-defined relative pose transformations are $T_1 = (0°, 10°, 0°, 10, 0, 50)$, $T_2 = (0°, 10°, 0°, 0, 0, 50)$, and $T_3 = (0°, 20°, 0°, 20, 0, 80)$, respectively. At each pre-defined pose, we performed 3D modeling multiple times and measured the relative pose transformations. In this experiment, the expected translation error and angle error were within 0.1 mm and 0.4 degrees, respectively. We selected the pose measurement results of $W_a$ corresponding to the pre-defined poses $T_1$ and $T_3$ and show the error of each measurement relative to the ground truth in Figure 6. Figure 6a,b shows the rotation and translation errors corresponding to $T_1$, and Figure 6c,d shows those corresponding to $T_3$. It can be seen that the angle errors are mostly within 0.4 degrees and the translation errors are mostly within 0.1 mm, with only a few results exceeding these values; our method performed well. The pose measurement results of the three target objects $W_a$, $W_b$, and $W_c$ at the positions transformed by $T_k$ are given as statistical values in Table 3. In Table 3, $\bar{\theta}_k$ and $\sigma_{\theta_k}$ denote the average and standard deviation (uncertainty) of the measured Euler angles, $\mathrm{Med}(\theta_k)$ contains the median values of $\theta^x$, $\theta^y$, and $\theta^z$ in the measured results, and $|\Delta\theta_k|_{\max}$ and $|\Delta\theta_k|_{\mathrm{mean}}$ are the maximum and mean of the absolute errors of $\theta^x$, $\theta^y$, and $\theta^z$, where $\Delta\theta_k = \tilde{\theta}_k - \theta_k^{gt}$, and $\tilde{\theta}_k$ and $\theta_k^{gt}$ are the measured value and ground truth, respectively. Similarly, $\bar{t}_k$ and $\sigma_{t_k}$ denote the average and standard deviation (uncertainty) of the measured translations, $\mathrm{Med}(t_k)$ contains the median values of $t^x$, $t^y$, and $t^z$, and $|\Delta t_k|_{\max}$ and $|\Delta t_k|_{\mathrm{mean}}$ are the maximum and mean of the absolute errors of $t^x$, $t^y$, and $t^z$, where $\Delta t_k = \tilde{t}_k - t_k^{gt}$, and $\tilde{t}_k$ and $t_k^{gt}$ are the measured value and ground truth. As can be seen from Table 3, the average values of the pose measurements are very close to the ground truths. Although the maximum absolute errors are relatively large, the standard deviations, median values, and mean absolute errors show that the measurement results are stable.
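A minimal sketch of the pose-error evaluation against a pre-defined transformation, assuming an x-y-z Euler convention and 4 × 4 homogeneous matrices, is given below.

```python
import numpy as np
from scipy.spatial.transform import Rotation as Rot

def pose_error(T_meas, T_gt):
    """Per-axis rotation error (degrees, x-y-z Euler convention assumed) and
    translation error (mm) between a measured and a ground-truth 4x4
    relative pose transformation."""
    ang_meas = Rot.from_matrix(T_meas[:3, :3]).as_euler("xyz", degrees=True)
    ang_gt = Rot.from_matrix(T_gt[:3, :3]).as_euler("xyz", degrees=True)
    return ang_meas - ang_gt, T_meas[:3, 3] - T_gt[:3, 3]

# Example against the pre-defined T1 = (0°, 10°, 0°, 10, 0, 50):
T_gt = np.eye(4)
T_gt[:3, :3] = Rot.from_euler("xyz", [0, 10, 0], degrees=True).as_matrix()
T_gt[:3, 3] = [10, 0, 50]
rot_err, trans_err = pose_error(T_gt, T_gt)        # zero error for identical poses
```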
In order to show the registration results of the incremental open-loop registering method, ICP, and our method more vividly, we present the results for $W_a$ in Figure 7. Figure 7a is the result of the incremental open-loop registering method; due to accumulated errors, the 3D model does not close well, and a gap can be seen at the edge of the 3D model. The registration result of the ICP method is displayed in Figure 7b. ICP does not perform well; a possible reason is that $W_a$ has a complex curved structure and the ICP method tends to converge to a local optimum. As can be seen from the figure, the accumulation of errors in the registering process makes it difficult for later point clouds to register well with the previous point clouds. Our method performs well, as shown in Figure 7c.

4. Discussion

In this paper, we propose a segmentation method based on motion consistency constraints that requires no prior knowledge and achieves robust segmentation solely by analyzing the motion patterns of adjacent frames. The demonstration of the clustering process clearly shows that pose hypotheses calculated from point pairs sampled on the same object coincide in the pose parameter space, and that the number of cluster centers represents the number of possible dynamic objects with inconsistent motion. Compared with deep learning-based methods, our method does not require object samples for training, and compared with point cloud clustering-based and optical flow-based methods, it is insensitive to the size and motion of the moving objects. It can therefore meet the needs of 3D modeling under uncertain dynamic occlusion. The experiments also further emphasize the importance of dynamic object segmentation: ICP and other methods have difficulty registering point clouds without segmenting the dynamic objects. For the registration of incomplete point clouds, we propose an optimization method based on local constraints and global loop closure constraints. Table 2 and Figure 7 show that the optimization method plays an important role in obtaining the entire 3D model. It can effectively improve the accuracy of 3D modeling, although it increases the computation and time cost; as the number of feature points increases, the calculation speed decreases. Finally, the pose measurement experiment shows that the rich geometric information of the entire 3D model makes the pose measurement more accurate and reflects the accuracy of the 3D modeling results.

5. Conclusions

Aiming at the task of 3D modeling under uncertain dynamic occlusion, we propose an online 3D modeling method for pose measurement. Firstly, to solve the uncertain dynamic object occlusion problem, we propose a dynamic object segmentation method based on motion consistency constraints that requires no prior knowledge about the objects and segments them by random sampling and pose hypothesis clustering. The initial pose and 2D-3D point pairs of each object between two adjacent frames can then be obtained. After that, because the incomplete point cloud of the target object in each frame needs to be registered, and inspired by the constraint relationships of the covisibility regions, an optimization method based on local constraints of overlapping view regions and global loop closure is introduced to optimize the registration of adjacent frames and the entire 3D model; constraints between global closed-loop frames are also established to refine the results. Finally, based on the proposed framework and method, we design and build a confirmatory experimental workspace, in which experiments are carefully carried out to test and verify the performance of our method. For the modeling of three objects, our method outperforms three other classical methods. The 3D model is utilized to measure poses, which are further compared with the ground truths. The performance of our method meets the expected requirements.
In future work, we will mainly focus on improving the optimization speed and solving the object reflection problem. The optimization slows down as the number of points increases; Graphics Processing Unit (GPU) acceleration and filter-based methods will be considered to address this problem. Reflection is also a challenging problem in computer vision that traditional methods cannot solve well at present; we intend to introduce learning-based techniques to deal with it.

Author Contributions

Conceptualization, M.T. and J.Y.; methodology, X.G. and J.Y.; validation, X.G.; resources, J.Y. and M.T.; writing—original draft preparation, X.G.; writing—review and editing, J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Ge, J.; Li, J.; Peng, Y.; Lu, H.; Li, S.; Zhang, H.; Xiao, C.; Wang, Y. Online 3-D Modeling of Complex Workpieces for the Robotic Spray Painting with Low-cost RGB-D Cameras. IEEE Trans. Instrum. Meas. 2021, 70, 1–13.
2. Wang, Y.; Liu, Y.; Xie, Q.; Wu, Q.; Guo, X.; Yu, Z.; Wang, J. Density-Invariant Registration of Multiple Scans for Aircraft Measurement. IEEE Trans. Instrum. Meas. 2020, 70, 1–15.
3. Caruso, L.; Russo, R.; Savino, S. Microsoft Kinect V2 Vision System in A Manufacturing Application. Robot. Comput.-Integr. Manuf. 2017, 48, 174–181.
4. Chai, X.; Wen, F.; Cao, X.; Yuan, K. A Fast 3D Surface Reconstruction Method for Spraying Robot with Time-of-Flight Camera. In Proceedings of the 2013 IEEE International Conference on Mechatronics and Automation, Takamatsu, Kagawa, Japan, 4–7 August 2013; pp. 57–62.
5. Xu, Y.; Nan, L.; Zhou, L.; Wang, J.; Wang, C.C. HRBF-Fusion: Accurate 3D Reconstruction from RGB-D Data Using on-the-fly Implicits. ACM Trans. Graph. 2022, 41, 35.
6. Wang, Z.; Fan, J.; Jing, F.; Liu, Z. A Pose Estimation System Based on Deep Neural Network and ICP Registration for Robotic Spray Painting Application. Int. J. Adv. Manuf. Technol. 2019, 104, 285–299.
7. Gao, H.; Ye, C.; Lin, W.; Qiu, J. Complex Workpiece Positioning System with Nonrigid Registration Method for 6-DoFs Automatic Spray Painting Robot. IEEE Trans. Syst. Man Cybern. Syst. 2020, 51, 7305–7313.
8. Xie, Q.; Zhang, Y.; Cao, X.; Xu, Y.; Lu, D.; Chen, H.; Wang, J. Part-in-Whole Point Cloud Registration for Aircraft Partial Scan Automated Localization. Comput.-Aided Des. 2021, 137, 103042.
9. Bi, Z.M.; Lang, S.Y.T. A Framework for CAD- and Sensor-based Robotic Coating Automation. IEEE Trans. Ind. Inform. 2007, 3, 84–91.
10. Ruchay, A.N.; Dorofeev, K.A.; Kolpakov, V.I. Fusion of Information from Multiple Kinect Sensors for 3D Object Reconstruction. Comput. Opt. 2018, 42, 898–903.
11. Newcombe, R.A.; Izadi, S.; Hilliges, O.; Molyneaux, D.; Kim, D.; Davison, A.J.; Kohi, P.; Shotton, J.; Hodges, S.; Fitzgibbon, A. KinectFusion: Real-Time Dense Surface Mapping and Tracking. In Proceedings of the 2011 10th IEEE International Symposium on Mixed and Augmented Reality, Basel, Switzerland, 26–29 October 2011; pp. 127–136.
12. Lin, C.Y.; Abebe, Z.A.; Chang, S.H. Advanced Spraying Task Strategy for Bicycle-Frame Based on Geometrical Data of Workpiece. In Proceedings of the 2015 International Conference on Advanced Robotics (ICAR), Istanbul, Turkey, 27–31 July 2015; pp. 277–282.
13. Zhang, T.; Zhang, H.; Li, Y. FlowFusion: Dynamic Dense RGB-D SLAM based on Optical Flow. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 7322–7328.
14. Sun, D.; Yang, X.; Liu, M.Y. PWC-Net: CNNs for Optical Flow using Pyramid, Warping, and Cost Volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8934–8943.
15. Scona, R.; Jaimez, M.; Petillot, Y.R. StaticFusion: Background Reconstruction for Dense RGB-D SLAM in Dynamic Environments. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 3849–3856.
16. Jaimez, M.; Kerl, C.; Jimenez, J.G. Fast Odometry and Scene Flow from RGB-D Cameras based on Geometric Clustering. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 3992–3999.
17. Hartigan, J.A.; Wong, M.A. Algorithm AS 136: A K-means Clustering Algorithm. J. R. Stat. Soc. Ser. C (Appl. Stat.) 1979, 28, 100–108.
18. Rünz, M.; Agapito, L. Co-Fusion: Real-time Segmentation, Tracking and Fusion of Multiple Objects. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 4471–4478.
19. Xu, B.; Li, W.; Tzoumanikas, D. Mid-Fusion: Octree-based Object-level Multi-instance Dynamic SLAM. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 5231–5237.
20. Zhang, T.; Nakamura, Y. PoseFusion: Dense RGB-D SLAM in Dynamic Human Environments. In Proceedings of the 2018 International Symposium on Experimental Robotics; Springer: Cham, Switzerland, 2018; pp. 772–780.
21. Pinheiro, P.O.; Lin, T.Y.; Collobert, R. Learning to Refine Object Segments. In Computer Vision—ECCV 2016; Springer: Cham, Switzerland, 2016; pp. 75–91.
22. He, K.; Gkioxari, G.; Dollár, P. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969.
23. Runz, M.; Buffier, M.; Agapito, L. MaskFusion: Real-time Recognition, Tracking and Reconstruction of Multiple Moving Objects. In Proceedings of the 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Munich, Germany, 16–20 October 2018; pp. 10–20.
24. Whelan, T.; Salas-Moreno, R.F.; Glocker, B. ElasticFusion: Real-Time Dense SLAM and Light Source Estimation. Int. J. Robot. Res. 2016, 35, 1697–1716.
25. Cao, Z.; Hidalgo, G.; Simon, T. OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 172–186.
26. Bouguet, J.Y. Camera Calibration Toolbox for Matlab. Caltech. 2000. Available online: http://www.vision.caltech.edu/bouguetj/calibdoc/index.html (accessed on 5 November 2022).
27. Hartley, R.I. In Defense of the Eight-point Algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 1997, 19, 580–593.
28. Faugeras, O.D.; Lustman, F. Motion and Structure from Motion in A Piecewise Planar Environment. Int. J. Pattern Recognit. Artif. Intell. 1988, 2, 485–508.
29. Xu, X.; Cheong, L.F.; Li, Z. Motion Segmentation by Exploiting Complementary Geometric Models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 2859–2867.
30. Lepetit, V.; Noguer, F.M.; Fua, P. EPnP: An Accurate O(n) Solution to the PnP Problem. Int. J. Comput. Vis. 2009, 81, 155–166.
31. Rodriguez, A.; Laio, A. Clustering by Fast Search and Find of Density Peaks. Science 2014, 344, 1492–1496.
32. Chin, T.J.; Wang, H.; Suter, D. The Ordered Residual Kernel for Robust Motion Subspace Clustering. Adv. Neural Inf. Process. Syst. 2009, 22, 333–341.
33. Yamashita, N.; Fukushima, M. On the Rate of Convergence of the Levenberg-Marquardt Method. In Topics in Numerical Analysis: With Special Emphasis on Nonlinear Problems; Springer: Vienna, Austria, 2001; pp. 239–249.
34. Zhang, Z. A Flexible New Technique for Camera Calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334.
35. Bian, J.W.; Lin, W.Y.; Matsushita, Y.; Zhang, L.; Yeung, S.-K.; Cheng, M.-M. GMS: Grid-based Motion Statistics for Fast, Ultra-robust Feature Correspondence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4181–4190.
36. Hirschmuller, H. Stereo Processing by Semiglobal Matching and Mutual Information. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 30, 328–341.
37. Grisetti, G.; Kümmerle, R.; Strasdat, H. g2o: A General Framework for (Hyper) Graph Optimization. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation (ICRA 2011), Shanghai, China, 9–13 May 2011; pp. 9–13.
38. Ye, J.; Xia, G.; Liu, F.; Cheng, Q. 3D Reconstruction of Line-Structured Light Based on Binocular Vision Calibration Rotary Axis. Appl. Opt. 2020, 59, 8272–8278.
39. Besl, P.J.; McKay, N.D. Method for Registration of 3-D Shapes. Int. Soc. Opt. Photonics 1992, 1611, 586–606.
40. Yin, P.; Wang, D.; Du, S. CoBigICP: Robust and Precise Point Set Registration using Correntropy Metrics and Bidirectional Correspondence. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October–24 January 2020; pp. 4692–4699.
Figure 1. An illustration of our pipeline. (1) Vision system captures left and right images. Stereo rectification is first performed to align the epipolar lines of the left and right images by rotating two images. The depth of the points can be calculated by disparity. (2) When two consecutive frames arrive, data association is established by Grid-based Motion Statistics (GMS) and image block correlation. Then, object segmentation is carried out by random sampling and pose hypotheses clustering. (3) The segmentation results, which include initial poses and 2D-3D point pairs, are sent to the 3D modeling part. It includes the target object tracking module and the dynamic object tracking module. The target object poses (namely, camera pose) and the dynamic object motions will be estimated and optimized. (4) Finally, local BA and global BA are carried out to optimize and update these variables. The outputs of our pipeline are poses of target and dynamic objects, a 3D model, and 3D point cloud of dynamic objects.
Figure 2. Notation and coordinate systems. The big black and green circles represent a 3D point on $O_1$ and a 3D point on $O_l$ in the $O_1$ coordinate system, respectively. The small black and green circles denote the corresponding observations in the images of adjacent frames. We establish the camera coordinate system at the camera optical center and each object coordinate system at the centroid of the object.
Figure 3. The structure of the workspace. The vision system mainly included two symmetrically arranged high-resolution cameras and an auxiliary texture device in the middle. The workpiece was placed in the white slot on the turntable. The turntable and rotating shaft could change the pose of the object. Tools could be clamped by the end effector and be moved by the robotic arm.
Figure 4. The results of segmentation and clustering. In the grayscale images, the last and current frame represent two adjacent frames and the circles indicate the matched points. The circles with different colors indicate that the feature points are located on different objects. In the three-dimensional pose parameter space, the translation vectors with the real scale are demonstrated. The shade of the color means the density is different. These two cluster centers correspond to two objects with different motions.
Figure 5. The output of our pipeline. (1) The trajectory of the camera is represented by the blue line. The magenta pyramid denotes the pose of each frame. For a clearer presentation, the images show the pictures captured by the camera at frames 5, 15, 21, and 32. In each image, the green and blue marks mean the tracked feature points on the target object and moving tool, respectively. (2) 3D model. The blue point cloud in the center of the figure is the 3D model of $W_a$, surrounded by the moving tool. Here, different colors show the tool observed in each frame. With the rotation of the camera and the movement of the tool, the occluded parts will be observed and modeled in other frames, as shown in the middle of the 3D model with different colors. Here, different colors represent that the occluded parts of other frames are observed in the current frame.
Figure 6. Pose errors between the measured poses and their corresponding ground truths, consisting of rotation errors (about the x-, y-, and z-axes, in degrees) and translation errors (along the x-, y-, and z-axes, in millimeters). We chose the pose measurement results of T_1 and T_3 to show the error variations. (a,b) are the rotation and translation errors of the measurement results of T_1, respectively; (c,d) are those of T_3.
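The per-axis errors plotted in Figure 6 can be computed from a measured pose and its ground truth as in the short sketch below. It relies on assumptions the paper does not spell out (4 × 4 homogeneous matrices, an intrinsic x-y-z Euler decomposition, translations already in millimeters), so it should be read as an illustration rather than the authors' evaluation code.

```python
# Illustrative computation of per-axis rotation/translation errors between a measured
# pose and its ground truth, assuming 4x4 homogeneous matrices and x-y-z Euler angles.
import numpy as np
from scipy.spatial.transform import Rotation as R

def pose_error(T_meas, T_gt):
    """Return (rotation error in degrees per axis, translation error in mm per axis)."""
    e_meas = R.from_matrix(T_meas[:3, :3]).as_euler("xyz", degrees=True)
    e_gt = R.from_matrix(T_gt[:3, :3]).as_euler("xyz", degrees=True)
    return e_meas - e_gt, T_meas[:3, 3] - T_gt[:3, 3]

# Hypothetical example shaped like T_1 in Table 3: 10 deg about y and (-10, 0, 50) mm.
T_gt = np.eye(4)
T_gt[:3, :3] = R.from_euler("xyz", [0.0, 10.0, 0.0], degrees=True).as_matrix()
T_gt[:3, 3] = [-10.0, 0.0, 50.0]
T_meas = T_gt.copy()
T_meas[:3, 3] += [0.01, -0.02, 0.03]           # pretend the measurement deviates slightly
rot_err, trans_err = pose_error(T_meas, T_gt)  # ~[0, 0, 0] deg, [0.01, -0.02, 0.03] mm
```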
Figure 7. 3D modeling results obtained by three methods. (a) is the result of the incremental open-loop registering method. In (b), the object structure cannot be recovered because of the limitations of ICP. Our method performs well in (c).
Table 1. Runtime statistics (ms) for important parts of the 3D modeling process for three objects W_a, W_b, W_c.

Stage | W_a | W_b | W_c
Camera Tracking and Pose Optimization | 16.32 | 40.46 | 33.83
Dynamic Object Tracking | 1.27 | 1.31 | 1.07
Dynamic Object Pose Estimation and Optimization | 19.34 | 17.10 | 8.90
Update | 11.96 | 18.11 | 22.69
Local Bundle Adjustment (in Sliding Window) | 21.77 | 37.79 | 55.23
Closed-loop Optimization (All Frames from the Loop) | 855.72 | 1432.44 | 2060.28
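Per-stage runtimes such as those in Table 1 can be collected by wrapping each stage of the pipeline with a wall-clock timer. The sketch below is only a generic illustration: the StageTimer class is introduced here, and the stage names and callables mirror the table but are hypothetical.

```python
# Generic per-stage timing harness (illustrative only); stage names mirror Table 1.
import time
from collections import defaultdict

class StageTimer:
    def __init__(self):
        self.samples_ms = defaultdict(list)

    def run(self, name, fn, *args, **kwargs):
        """Call fn and record its wall-clock duration in milliseconds under `name`."""
        t0 = time.perf_counter()
        out = fn(*args, **kwargs)
        self.samples_ms[name].append((time.perf_counter() - t0) * 1e3)
        return out

    def report(self):
        """Mean runtime (ms) per stage over all recorded frames."""
        return {k: sum(v) / len(v) for k, v in self.samples_ms.items()}

# Usage with hypothetical stage functions:
# timer = StageTimer()
# pose = timer.run("Camera Tracking and Pose Optimization", track_camera, frame)
# timer.run("Dynamic Object Tracking", track_dynamic_objects, frame)
# print(timer.report())
```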
Table 2. Registration error (mm) and uncertainty (standard deviation, in brackets) of three objects W_a, W_b, W_c compared with three registration methods.

Method | W_a | W_b | W_c
Incremental Open-loop Registering Method | 0.22 (7.81 × 10⁻²) | 0.35 (0.27) | 0.18 (7.55 × 10⁻²)
ICP | 0.54 (0.27) | 0.32 (0.21) | 0.19 (9.70 × 10⁻²)
CoBigICP [40] | 0.13 (4.58 × 10⁻²) | 0.18 (5.39 × 10⁻²) | 0.11 (5.74 × 10⁻²)
Ours | 0.06 (3.03 × 10⁻²) | 0.07 (2.70 × 10⁻²) | 0.08 (2.96 × 10⁻²)
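Table 2 reports each error with its standard deviation in brackets; one metric consistent with that format, used here only as an assumption, is the nearest-neighbor distance from the reconstructed model to a reference point cloud. The sketch below implements that reading and is not necessarily the paper's evaluation procedure.

```python
# Illustrative registration-error metric: mean and standard deviation of the
# nearest-neighbor distance from the reconstructed model to a reference cloud.
import numpy as np
from scipy.spatial import cKDTree

def registration_error(model_pts, reference_pts):
    """model_pts, reference_pts: (N, 3) arrays in millimeters."""
    dists, _ = cKDTree(reference_pts).query(model_pts)   # closest reference point per model point
    return dists.mean(), dists.std()

# Usage: err_mm, sigma_mm = registration_error(reconstructed_cloud, reference_scan)
```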
Table 3. Averages, standard deviations, medians, maximum absolute errors, and mean absolute errors of the pose measurement results in the camera coordinate system. T_1, T_2, T_3 are three pre-defined relative pose transformations. 3D modeling is carried out at the initial position and at the positions obtained by applying these three pre-defined transformation matrices; the pose transformation between two positions is then measured from the corresponding 3D models and compared with the ground truths T_1, T_2, T_3.
Measured Poses | W_a | W_b | W_c | GT
T_1: θ_1 (°) | (0.03, 9.99, 0.03) | (0.06, 10.03, 0.01) | (0.02, 9.97, 0.03) | (0, 10, 0)
  σ_θ1 (°) | (0.11, 0.19, 0.09) | (0.19, 0.15, 0.19) | (0.10, 0.20, 0.09) | -
  Med θ_1 (°) | (0.03, 9.98, 0.03) | (0.07, 10.02, 0.06) | (0.02, 9.97, 0.02) | -
  θ_1 max (°) | (0.35, 0.37, 0.26) | (0.85, 0.73, 0.49) | (0.31, 0.45, 0.26) | -
  θ_1 mean (°) | (0.09, 0.15, 0.07) | (0.13, 0.09, 0.15) | (0.08, 0.16, 0.07) | -
  t_1 (mm) | (−10.00, 0.01, 50.01) | (−9.99, 0.01, 50.02) | (−9.98, 0.02, 50.01) | (−10, 0, 50)
  σ_t1 (mm) | (0.05, 0.04, 0.04) | (0.05, 0.05, 0.04) | (0.05, 0.05, 0.06) | -
  Med t_1 (mm) | (−10.01, 0.001, 50.01) | (−9.97, 0.02, 50.03) | (−9.99, 0.01, 50.03) | -
  t_1 max (mm) | (0.10, 0.11, 0.11) | (0.16, 0.12, 0.11) | (0.15, 0.13, 0.14) | -
  t_1 mean (mm) | (0.04, 0.03, 0.04) | (0.04, 0.05, 0.03) | (0.04, 0.04, 0.05) | -
T_2: θ_2 (°) | (0.01, 9.96, 0.01) | (0.02, 9.98, 0.04) | (0.01, 9.98, 0.04) | (0, 10, 0)
  σ_θ2 (°) | (0.12, 0.11, 0.19) | (0.11, 0.21, 0.09) | (0.12, 0.22, 0.09) | -
  Med θ_2 (°) | (0.03, 9.95, 0.07) | (0.01, 9.97, 0.03) | (0.03, 9.96, 0.03) | -
  θ_2 max (°) | (0.44, 0.31, 0.52) | (0.31, 0.41, 0.26) | (0.29, 0.64, 0.24) | -
  θ_2 mean (°) | (0.08, 0.09, 0.14) | (0.08, 0.16, 0.07) | (0.09, 0.17, 0.07) | -
  t_2 (mm) | (0.01, 0.03, −49.97) | (0.01, 0.01, −49.98) | (0.01, 0.02, −50.02) | (0, 0, −50)
  σ_t2 (mm) | (0.04, 0.05, 0.05) | (0.05, 0.04, 0.07) | (0.04, 0.04, 0.06) | -
  Med t_2 (mm) | (0.01, 0.02, −49.98) | (0.03, 0.01, −49.97) | (0.01, 0.01, −50.01) | -
  t_2 max (mm) | (0.09, 0.16, 0.15) | (0.15, 0.12, 0.15) | (0.10, 0.10, 0.11) | -
  t_2 mean (mm) | (0.03, 0.04, 0.05) | (0.05, 0.03, 0.04) | (0.03, 0.04, 0.06) | -
T_3: θ_3 (°) | (0.07, 19.97, 0.06) | (0.12, 20.05, 0.01) | (0.05, 19.95, 0.07) | (0, 20, 0)
  σ_θ3 (°) | (0.16, 0.31, 0.12) | (0.23, 0.18, 0.33) | (0.16, 0.23, 0.11) | -
  Med θ_3 (°) | (0.12, 19.91, 0.06) | (0.15, 20.03, 0.12) | (0.06, 19.90, 0.07) | -
  θ_3 max (°) | (0.38, 0.69, 0.35) | (0.61, 0.88, 0.94) | (0.36, 0.77, 0.35) | -
  θ_3 mean (°) | (0.15, 0.26, 0.10) | (0.20, 0.12, 0.26) | (0.13, 0.27, 0.10) | -
  t_3 (mm) | (−19.99, 0.02, 80.01) | (−20.01, −0.01, 79.99) | (−20.02, −0.02, 79.99) | (−20, 0, 80)
  σ_t3 (mm) | (0.06, 0.04, 0.05) | (0.05, 0.04, 0.04) | (0.05, 0.03, 0.04) | -
  Med t_3 (mm) | (−19.98, 0.02, 79.99) | (−20.01, −0.02, 80.00) | (−20.02, −0.02, 79.98) | -
  t_3 max (mm) | (0.10, 0.12, 0.14) | (0.11, 0.09, 0.13) | (0.12, 0.10, 0.09) | -
  t_3 mean (mm) | (0.05, 0.04, 0.07) | (0.04, 0.04, 0.06) | (0.04, 0.05, 0.03) | -
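As a rough illustration of how the relative poses compared with T_1–T_3 in Table 3 could be obtained from two reconstructed models, the sketch below registers the model built at one position to the model built at the other and reads off the resulting transform. Open3D's point-to-plane ICP, the voxel size, the 5 mm correspondence threshold, and the assumption that the two models are already roughly aligned are all choices made here for the example; the paper's pose measurement pipeline is not necessarily ICP-based.

```python
# Rough sketch (not the paper's pipeline): estimate the relative pose between two
# modeling positions by registering the two reconstructed point clouds with ICP.
import numpy as np
import open3d as o3d

def to_cloud(pts):
    """Wrap an (N, 3) numpy array (millimeters) as an Open3D point cloud."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(np.asarray(pts, dtype=np.float64))
    return pcd

def measure_relative_pose(model_a_pts, model_b_pts, voxel=1.0):
    """Return the 4x4 transform mapping the model at position A onto the model at position B."""
    pa = to_cloud(model_a_pts).voxel_down_sample(voxel)
    pb = to_cloud(model_b_pts).voxel_down_sample(voxel)
    pa.estimate_normals()
    pb.estimate_normals()
    result = o3d.pipelines.registration.registration_icp(
        pa, pb, 5.0, np.eye(4),   # source, target, max correspondence distance (mm), initial guess
        o3d.pipelines.registration.TransformationEstimationPointToPlane())
    return result.transformation

# The rotation part of the returned matrix can be decomposed into per-axis Euler angles
# (as in the Figure 6 sketch) and compared, together with the translation column, against
# the ground truths T_1, T_2, T_3.
```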