CoFiI2P: Coarse-to-Fine Correspondences-Based Image to Point Cloud Registration

Shuhao Kang*, Youqi Liao*, Jianping Li, Fuxun Liang, Yuhao Li, Xianghong Zou,
Fangning Li, Xieyuanli Chen, Zhen Dong, Bisheng Yang
Manuscript received: May. 14, 2024; Revised: Aug. 8, 2024; Accepted: Sep. 9, 2024. This paper was recommended for publication by Editor Cesar Cadena Lerma upon evaluation of the Associate Editor and Reviewers’ comments. Digital Object Identifier (DOI): see top of this page. This study was jointly supported by the National Natural Science Foundation Project (No. 42201477, No. 42130105), the Open Fund of Hubei Luojia Laboratory (No. 2201000054) and Open Fund of Key Laboratory of Urban Spatial Information, Ministry of Natural Resources (Grant No. 2023ZD001). (Shuhao Kang and Youqi Liao are co-first authors and contribute equally to the paper) (Corresponding author: Jianping Li) S. Kang is with the Technical University of Munich, Germany. Y. Liao is with the Wuhan University and Hubei Luojia Laboratory, China. Z. Dong, F. Liang, Y. Li, X. Zou and B. Yang are with the Wuhan University, China. J. Li is with the Nanyang Technological University, Singapore. F. Li is with the Beijing Urban Construction Exploration and Surveying Design Research Institute Co. Ltd. X. Chen is with the National University of Defense Technology, China.
Abstract

Image-to-point cloud (I2P) registration is a fundamental task for robots and autonomous vehicles to achieve cross-modality data fusion and localization. Current I2P registration methods primarily focus on estimating correspondences at the point or pixel level, often neglecting global alignment. As a result, I2P matching can easily converge to a local optimum if it lacks high-level guidance from global constraints. To improve the success rate and general robustness, this paper introduces CoFiI2P, a novel I2P registration network that extracts correspondences in a coarse-to-fine manner. First, the image and point cloud data are processed through a two-stream encoder-decoder network for hierarchical feature extraction. Second, a coarse-to-fine matching module is designed to leverage these features and establish robust feature correspondences. Specifically, in the coarse matching phase, a novel I2P transformer module is employed to capture both homogeneous and heterogeneous global information from the image and point cloud data. This enables the estimation of coarse super-point/super-pixel matching pairs with discriminative descriptors. In the fine matching module, point/pixel pairs are established with the guidance of super-point/super-pixel correspondences. Finally, based on matching pairs, the transformation matrix is estimated with the EPnP-RANSAC algorithm. Experiments conducted on the KITTI Odometry dataset demonstrate that CoFiI2P achieves impressive results, with a relative rotation error (RRE) of 1.14 degrees and a relative translation error (RTE) of 0.29 meters, while maintaining real-time speed. These results represent a significant improvement of 84% in RRE and 89% in RTE compared to the current state-of-the-art (SOTA) method. Additional experiments on the Nuscenes dataset confirm our method’s generalizability. The project page is available at https://whu-usi3dv.github.io/CoFiI2P.

Index Terms— Image-to-Point Cloud Registration, Coarse-to-Fine Correspondences, Transformer Network

Figure 1: Comparison of existing one-stage I2P registration and proposed coarse-to-fine I2P registration. (a) The existing one-stage registration pipeline. The matching pairs are directly established at the point/pixel level, leading to a significant number of mismatches. (b) Our coarse-to-fine matching pipeline. Under the guidance of super point-to-pixel pairs, point-to-pixel pairs are generated from the existing super pairs, which effectively eliminates most mismatches.

I Introduction

Estimating the six degrees of freedom (6-DoF) pose of a monocular image relative to a pre-built point cloud map is a fundamental requirement for robots and autonomous vehicles [1, 2, 3]. Due to the limited onboard resources, robots are often equipped with only a monocular camera while facing challenges related to scale ambiguity in absolute localization and depth sensing. Establishing an accurate pose transformation between the image coordinate system and the pre-built point cloud coordinate system is crucial. This transformation not only precisely localizes the robot but also effectively reduces the scale uncertainty inherent in the monocular data [4].

However, cross-modality registration has its inherent challenges. Some existing methods use hand-crafted detectors and descriptors for I2P registration [5, 6]. These approaches rely on structured features like edge features [6], which are limited by specific environmental conditions. With the rapid development of deep learning (DL), learning-based I2P registration approaches [7, 8, 9] have been proposed to extract representative keypoints and descriptors. 2D3D-MatchNet [7] proposed a novel Siamese network to learn the cross-modality descriptor, but the manually designed detectors matched poorly. To alleviate the difficulty of constructing correspondences and improve the registration success rate, DeepI2P [8] converts the registration problem into a classification problem. A novel binary classification network is designed to distinguish whether the projected points are within or beyond the camera frustum. The classification results are passed into an inverse camera projection solver to estimate the transformation between the camera and laser scanners. As a large number of points on the boundary are misclassified, the accuracy of the camera pose is still limited. CorrI2P [9] proposed an overlap region detector for both image and point cloud, then pixels and points in the overlap region are matched to obtain I2P correspondences. A feature fusion module is exploited to fuse the point cloud and image information. Although CorrI2P [9] improves significantly over DeepI2P [8], matching at a single stage, namely the pixel-point level, without global alignment guidance can lead to local minima and instability.

Inspired by recent coarse-to-fine matching schedules and transformers in image-to-image (I2I) registration [10, 11] and point cloud-to-point cloud (P2P) registration approaches [12], this paper proposes the Coarse-to-Fine Image-to-Point cloud (CoFiI2P) network for I2P registration. The I2P transformer with self- and cross-attention modules is embedded into the network for global alignment. Overall, the main contributions of this work are as follows.

  1. A novel coarse-to-fine I2P registration network is proposed to align image and point cloud in a progressive way. The coarse matching step provides rough but robust super-point/super-pixel correspondences for the following fine matching step, which filters out most mismatched pairs and reduces the computation burden. The fine matching step achieves accurate and reliable point/pixel correspondences with the global guidance.

  2. A novel I2P transformer that incorporates both self-attention and cross-attention modules is proposed to enhance its global-aware capabilities in homogeneous and heterogeneous data. The self-attention module enables the capture of spatial context within the same modality data, while the cross-attention module facilitates the extraction of hybrid features from both the image and point cloud data.

II Related work

II-A Same-modality Registration

II-A1 I2I Registration

Before the age of deep learning, hand-crafted detectors and descriptors (e.g., SIFT [13] and ORB [14]) were widely used to extract correspondences for matching. Compared to traditional methods, learning-based methods improve the robustness and accuracy of image matching with large viewpoint differences and illumination changes. SuperPoint [15] proposed a self-learning training method through homography adaptation. SuperGlue [16] proposed an attention-based graph neural network (GNN) for feature matching. Patch2Pix [17] is the first work to obtain patch matches and regress pixel-wise matches in a coarse-to-fine manner.

II-A2 P2P Registration

Point cloud registration aims to estimate the optimal rigid transformation between two point clouds. Correspondence-based methods [18, 19] estimate the correspondences first and recover the transformation with robust estimation methods [20]. RoReg [21] embeds orientation information of point cloud to estimate local orientation and refine coarse rotation through residual regression to achieve fine registration. In the transformer era, point-based transformers [22, 12] have emerged and shown great performance. CoFiNet [22] followed LoFTR’s [10] design and proposed coarse-to-fine correspondences for registration, computing coarse matches with descriptors strengthened by the transformer and refining them through a density-adaptive matching module. GeoTransformer [12] uses the Transformer to fully exploit the 3D properties of point clouds. However, I2I and P2P registration methods cannot be directly used for I2P registration, which requires extracting heterogeneous features from cross-modality data.

II-B Cross-modality Registration

To address the cross-modality registration problem, a variety of I2P registration methods have been proposed, which could be roughly divided into two categories: I2P fine registration (initial transformation dependent) and I2P coarse registration (initial transformation free). The I2P fine registration methods [23, 5, 24] rely on initial transform parameters and are widely applied in sensor calibration. Although this paper focuses on the second category, namely coarse I2P registration without any initial transformation knowledge, we provide a comprehensive review of registration methods in both categories.

II-B1 Fine Registration Methods

Fine registration methods have been thoroughly studied for several decades. Early-stage works [25, 26] utilize various artificial targets as calibration constraints. In recent years, some approaches have argued that structured features shared in images and point clouds could be used for target-less I2P registration, e.g., edge features [23] and line features [5]. [23] employs edge information and optimizes the transformation according to the response value. [5] extracts lines from images and point clouds for registration. Recently, research on 2D and 3D semantic segmentation has motivated semantic feature-based I2P registration. Se-calib [24] further studies semantic edges in I2P registration to employ more common semantic information instead of a certain type of object. Overall, fine registration methods achieve high accuracy but depend on the quality of the initial values.

II-B2 Coarse Registration Methods

2D3D-MatchNet [7] is the pioneering method to regress the relative transform parameters with a CNN. It extracts SIFT [13] and ISS [27] keypoints from image and point cloud and learns descriptors with a Siamese network. Experimental results show that hand-crafted detectors from different modalities match poorly. HAS-Net [28] proposed a novel network for learning cross-domain descriptors from 2D image patches and 3D point cloud volumes. However, both 2D3D-MatchNet and HAS-Net split the image and point cloud into patches and volumes, and then match with the descriptors of patches and volumes, resulting in the loss of long-range context and high-level information. DeepI2P [8] proposed a feature fusion module to merge image and point cloud information and classified points in/beyond the camera frustum. CorrI2P [9] predicted pixels and points in overlapping areas and matched with dense per-pixel/per-point features directly to get I2P correspondences. Although overlapping region detectors significantly reduce the number of false candidates, I2P registration using only low-level features without global guidance leads to serious mismatches. Inspired by the coarse-to-fine strategy in CoFiNet [22], we propose the CoFiI2P network, a coarse-to-fine I2P registration approach that integrates high-level correspondence information into low-level matching to effectively reject mismatches.

III Methodology

Figure 2: Workflow of CoFiI2P. The proposed method consists of feature extraction, coarse matching, fine matching and pose estimation modules. Image and point cloud are sent to the feature extraction module to obtain coarse-level features and fine-level features (rendered in red for image and green for point cloud, respectively). The coarse-level features are strengthened by I2P transformer module and then matched with the cosine similarity. Fine features are gathered from the last layer of the decoder. In each super-point/super-pixel pair, the node point is set as the candidate and the corresponding pixel is selected from the super-pixel area, a $w\times w$ window. The generated fine-level matching pairs are utilized to estimate the pose with the EPnP-RANSAC [29, 20] algorithm.

For convenient description, a pair of partially overlapped image and point cloud are defined as $\mathrm{I}\in\mathbb{R}^{W\times H\times 3}$ and $\mathrm{P}\in\mathbb{R}^{N\times 3}$, where $W$ and $H$ are the width and height, and $N$ is the number of points. The purpose of I2P registration is to estimate the relative transformation between the image $\mathrm{I}$ and point cloud $\mathrm{P}$, defined by a rotation matrix $\mathbf{R}\in\mathbb{R}^{3\times 3}$ and a translation vector $\mathbf{t}\in\mathbb{R}^{3}$.
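To make the role of the rotation and translation concrete, the following minimal sketch projects points into the image under a pinhole camera model and flags those that fall inside the camera frustum. The intrinsic matrix `K`, the function name `project_points` and the argument layout are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def project_points(points, R, t, K, width, height):
    """Project N x 3 points into the image; return pixel coords and an in-frustum mask.

    K is an assumed 3 x 3 pinhole intrinsic matrix (hypothetical input).
    """
    cam = points @ R.T + t                          # point cloud frame -> camera frame
    depth = cam[:, 2]
    safe_depth = np.where(depth > 0, depth, 1.0)    # avoid dividing by non-positive depth
    uv = (cam @ K.T)[:, :2] / safe_depth[:, None]   # perspective division
    in_frustum = (
        (depth > 0)
        & (uv[:, 0] >= 0) & (uv[:, 0] < width)
        & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    )
    return uv, in_frustum
```

Points behind the camera receive meaningless pixel coordinates in this sketch, so only entries where the mask is true should be used.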

Our method adopts a coarse-to-fine strategy to find the correct correspondence set $\Omega(\mathrm{I},\mathrm{P})$ and calculate the relative pose using EPnP-RANSAC [29]. CoFiI2P mainly consists of four modules: feature extraction (FE), coarse matching (CM), fine matching (FM), and pose estimation (PE). FE is an encoder-decoder network that encodes raw inputs from different modalities into a shared feature space and finds in-frustum super-points. CM and FM are cascaded two-stage matching modules. CM constructs coarse matching pairs at the super-pixel/super-point level, and then FM constructs fine matching pairs at the pixel/point level under the guidance of the super-pixel/super-point correspondences. Lastly, the PE module exploits point-pixel matching pairs to estimate the relative pose with the EPnP-RANSAC algorithm. The workflow of the proposed method is shown in Fig. 2.

III-A Feature Extraction

We utilize ResNet-34 [30] and KPConv-FPN [31] as the backbones for image and point cloud to extract multi-level features. The encoder progressively embeds raw inputs into high-dimensional features, and the decoder propagates high-level features to low-level features with skip-connection for per-pixel/point feature generation. Specifically, points and pixels at the coarsest resolution ($\frac{1}{8}$ of original resolution) $\tilde{P}$ and $\tilde{I}$ are treated as superpoints and superpixels for coarse matching, and $P$ and $I$ at the finest resolution ($\frac{1}{2}$ of original resolution) are used for fine matching. We use $\boldsymbol{F}_{\tilde{P}}$ and $\boldsymbol{F}_{\tilde{I}}$ to denote coarse-level features, and $\boldsymbol{F}_{P}$ and $\boldsymbol{F}_{I}$ for fine-level features. For each superpoint, we construct the local points group $G_{\tilde{p}}$ with point-to-node strategy in the geometric space:

$$G_{\tilde{p}}=\{p\in P \mid \|p-\tilde{p}\|<r_{g}\}, \tag{1}$$

where $r_{g}$ is the chosen radius. For the superpixels, we first locate their positions on the fine-level feature map, and then crop sets of local pixel patches with window size $w\times w$.
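The grouping step can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: the function names are hypothetical, distances are computed by brute force, and the pixel window is assumed to start at the super-pixel's top-left corner with the fine map being `w` times denser than the coarse map (the paper does not spell out the window placement).

```python
import numpy as np

def group_points(points, superpoints, radius):
    """Eq. (1): for each super-point, indices of fine points closer than `radius`."""
    # Pairwise Euclidean distances, shape (num_superpoints, num_points).
    dist = np.linalg.norm(superpoints[:, None, :] - points[None, :, :], axis=-1)
    return [np.flatnonzero(row < radius) for row in dist]

def crop_patch(fine_map, sp_row, sp_col, w):
    """Crop the w x w fine-level window covered by super-pixel (sp_row, sp_col).

    Assumes (hypothetically) that each super-pixel covers a w x w block of fine pixels.
    """
    return fine_map[sp_row * w:(sp_row + 1) * w, sp_col * w:(sp_col + 1) * w]
```

For large point clouds a spatial index such as a k-d tree would replace the dense distance matrix; the brute-force form is kept here only for readability.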

III-B I2P Coarse Matching

In the CM module, the I2P transformer is utilized to capture the geometric and spatial consistency between image and point cloud. Each stage of the I2P transformer consists of a self-attention block for intra-modality long-range context and a cross-attention module for inter-modality feature exchange. The self-attention and cross-attention modules are repeated several times to extract well-mixed features for super-point/super-pixel correspondence matching.

III-B1 I2P Transformer

Prior works [32, 33] have shown that the vision transformer (ViT) outperforms traditional CNN-based methods by a large margin in classification, detection, segmentation and other downstream tasks. Furthermore, recent approaches [11, 10, 34] have introduced transformer modules for I2I and P2P registration tasks. Therefore, we introduce the I2P transformer module customized for the cross-modality registration task to enhance the representability and robustness of descriptors. Different from ViT used in the same-modality registration tasks, our I2P transformer contains both self-attention modules for spatial context capturing in homogeneous data and cross-attention modules for hybrid feature integration among heterogeneous data.

For the self-attention module, given a coarse-level feature map $\mathbf{F}\in\mathbb{R}^{N\times C}$ of image or point cloud, the query, key and value vectors $\mathbf{q},\mathbf{k},\mathbf{v}$ are generated as:

$$\mathbf{q}=\mathbf{W}_{q}\mathbf{F},\quad\mathbf{k}=\mathbf{W}_{k}\mathbf{F},\quad\mathbf{v}=\mathbf{W}_{v}\mathbf{F}, \tag{2}$$

where $\mathbf{W}_{q},\mathbf{W}_{k}\in\mathbb{R}^{C_{qk}^{sa}\times C}$ and $\mathbf{W}_{v}\in\mathbb{R}^{C_{v}^{sa}\times C}$ are learnable weight matrices. Then, the global attention enhanced feature map $\mathbf{F}^{sa}$ is calculated as:

$$\mathbf{F}^{sa}=\mathrm{softmax}\left(\frac{\mathbf{q}\mathbf{k}^{\top}}{\sqrt{C}}\right)\mathbf{v}. \tag{3}$$

Softmax operates row-wise on the attention matrix $\mathbf{q}\mathbf{k}^{\top}$ to obtain the weights on the values. Extracted global-aware features $\mathbf{F}^{sa}$ are fed into the feed-forward network (FFN) to fuse the spatial relation information in channel dimension. Given a feature map $\mathbf{F}$, the relative positions are encoded with a multi-layer perceptron (MLP) [35].
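As a rough illustration of the self-attention computation in Eqs. (2) and (3), the sketch below uses plain NumPy with a single head. It assumes features are stored row-wise, so the projections are written as `F @ W.T`; the function names are hypothetical, and the FFN and positional encoding are omitted.

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(F, W_q, W_k, W_v):
    """Single-head self-attention for an N x C feature map F (Eqs. (2)-(3))."""
    C = F.shape[1]
    q, k, v = F @ W_q.T, F @ W_k.T, F @ W_v.T    # row-wise form of q = W_q F, etc.
    attn = softmax_rows(q @ k.T / np.sqrt(C))    # N x N weights, each row sums to 1
    return attn @ v                              # N x C_v attention-enhanced features
```

When the query projection is all zeros, every attention row becomes uniform and each output row reduces to the mean of the value vectors, which is a convenient sanity check.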

Cross-attention is designed for fusing image and point cloud features in the I2P registration task. Given the feature maps $\mathbf{F}_{\tilde{P}},\mathbf{F}_{\tilde{I}}$ for super-points set $\tilde{P}$ and super-pixels set $\tilde{I}$, cross-attention enhanced feature maps $\mathbf{F}_{\tilde{P}}^{ca}$ of point cloud and $\mathbf{F}_{\tilde{I}}^{ca}$ of image are calculated as:

$$\begin{split}&\mathbf{F}_{\tilde{P}}^{ca}=\mathrm{softmax}\left(\frac{\mathbf{q}_{\tilde{P}}\mathbf{k}_{\tilde{I}}^{\top}}{\sqrt{C}}\right)\mathbf{v}_{\tilde{P}},\\ &\mathbf{F}_{\tilde{I}}^{ca}=\mathrm{softmax}\left(\frac{\mathbf{q}_{\tilde{I}}\mathbf{k}_{\tilde{P}}^{\top}}{\sqrt{C}}\right)\mathbf{v}_{\tilde{I}},\end{split} \tag{4}$$

where $\mathbf{q}_{\tilde{P}},\mathbf{k}_{\tilde{P}},\mathbf{v}_{\tilde{P}}$ are query, key and value vectors of point cloud feature $\mathbf{F}_{\tilde{P}}$, and $\mathbf{q}_{\tilde{I}},\mathbf{k}_{\tilde{I}},\mathbf{v}_{\tilde{I}}$ are query, key and value vectors of image feature $\mathbf{F}_{\tilde{I}}$. The softmax operation is the same as in the self-attention module.

Remark 1. While the self-attention module encodes the spatial and geometric features for each super-pixel and super-point, the cross-attention module injects the geometric structure information and texture information across image and point cloud respectively. Outputs of the I2P transformer carry powerful cross-modality information for matching.

III-B2 Super-point/-pixel Matching

The field of view (FoV) of the monocular camera is obviously smaller than the laser scans of 3D Lidar (e.g., Velodyne-H64), which sweeps 360 degrees in the horizontal direction. Therefore, only a small number of super-points are in the camera frustum. To filter out beyond-frustum super-points, we add a simple binary classification head to predict super-points in or beyond the frustum of the camera. The coarse level correspondences are estimated between in-frustum super-points set $\tilde{P}$ and super-pixels set $\tilde{I}$ by finding the nearest super-pixel $\tilde{i}$ in feature space:

$$\tilde{\Omega}_{\mathcal{M}}=\{(\tilde{p}_{x}\in\tilde{P},\tilde{i}_{y}\in\tilde{I}) \mid y=\mathop{\mathrm{argmin}}\limits_{\lvert\tilde{I}\rvert}\|\mathbf{F}_{\tilde{P}}(\tilde{p}_{x})-\mathbf{F}_{\tilde{I}}(\tilde{i}_{y})\|\}, \tag{5}$$

where $\mathbf{F}_{\tilde{P}}$ and $\mathbf{F}_{\tilde{I}}$ are corresponding feature maps.
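The nearest-neighbour search of Eq. (5) can be sketched as follows. The function names and the boolean `in_frustum` mask (standing in for the output of the classification head) are assumptions made for illustration. For L2-normalized descriptors, the Euclidean distance used here gives the same ranking as cosine similarity.

```python
import numpy as np

def match_nearest(feat_src, feat_dst):
    """For each row of feat_src, index of the nearest row of feat_dst (L2 distance)."""
    dist = np.linalg.norm(feat_src[:, None, :] - feat_dst[None, :, :], axis=-1)
    return dist.argmin(axis=1)

def coarse_match(feat_superpoints, feat_superpixels, in_frustum):
    """Eq. (5): pair every in-frustum super-point with its nearest super-pixel."""
    idx_points = np.flatnonzero(in_frustum)      # keep predicted in-frustum super-points
    idx_pixels = match_nearest(feat_superpoints[idx_points], feat_superpixels)
    return np.stack([idx_points, idx_pixels], axis=1)   # rows of (x, y) index pairs
```

Each in-frustum super-point receives exactly one super-pixel, while a super-pixel may be selected by several super-points.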

III-C I2P Fine Matching

The first-stage matching at the coarse level constructs robust super-point/super-pixel pairs $\tilde{\Omega}_{\mathcal{M}}$ but leads to poor registration accuracy. To obtain high-quality I2P correspondences, we generate fine correspondences based on the coarse matching results. In each super-point/super-pixel correspondence $(\tilde{p}_{x},\tilde{i}_{y})$, the super-point $\tilde{p}_{x}$ is expanded to a local points group $G_{\tilde{p}_{x}}$ and the super-pixel $\tilde{i}_{y}$ is expanded to a local pixels patch $G_{\tilde{i}_{y}}$. Considering the uneven geometric distribution of point cloud and computational efficiency, only the node points in local patch groups are selected to establish correspondences. For each node point $p_{n}$, we select the pixel $i_{k}\in G_{\tilde{i}_{y}}$ that lies nearest in the feature space.
Point-pixel pairs in each super-point-to-super-pixel pair are stacked together as the fine corresponding pairs $\Omega_{\mathcal{M}}$. With point feature map $\mathbf{F}_{G_{\tilde{p}}}$ and corresponding pixel feature map $\mathbf{F}_{G_{\tilde{i}}}$ of local patch, the fine matching process is defined as:

$$\Omega_{\mathcal{M}}=\big\{(p_{n}\in G_{\tilde{p}},i_{k}\in G_{\tilde{i}}) \mid k=\mathop{\mathrm{argmin}}\limits_{\lvert G_{\tilde{i}}\rvert}\|\mathbf{F}_{G_{\tilde{p}}}(p_{n})-\mathbf{F}_{G_{\tilde{i}}}(i_{k})\|\big\}. \tag{6}$$
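A minimal sketch of the windowed search in Eq. (6) is given below. The data layout is an assumption for illustration: each coarse pair is passed as a node-point index plus the top-left corner of its pixel window on the fine feature map, and the name `fine_match` is hypothetical.

```python
import numpy as np

def fine_match(coarse_pairs, node_feats, pixel_feats, w):
    """Eq. (6): match each node point to the nearest pixel inside its w x w window.

    coarse_pairs : list of (n, (row, col)), with (row, col) the window's top-left
                   corner on the fine feature map (assumed layout).
    node_feats   : M x C fine-level features of the node points.
    pixel_feats  : H x W x C fine-level image feature map.
    """
    matches = []
    for n, (row, col) in coarse_pairs:
        patch = pixel_feats[row:row + w, col:col + w]           # local pixel window
        dist = np.linalg.norm(patch - node_feats[n], axis=-1)   # distance to each pixel
        dr, dc = np.unravel_index(dist.argmin(), dist.shape)
        matches.append((n, int(row + dr), int(col + dc)))
    return matches
```

Restricting the search to the window is what lets the coarse stage prune mismatches: a pixel outside the matched super-pixel area can never be selected.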

III-D EPnP-RANSAC based Pose Estimation

With the precise point-pixel pairs $\Omega_{\mathcal{M}}$, the relative transformation can be solved with the EPnP [29] algorithm. As noted in previous approaches, mismatches may contaminate the point-pixel pairs and decrease the registration accuracy. In CoFiI2P, the EPnP-RANSAC [29, 20] algorithm is therefore used to robustly estimate the relative camera pose.
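The hypothesize-and-verify loop behind this step can be sketched in NumPy. Note the substitution: to stay self-contained, the sketch swaps EPnP for a plain linear DLT solver on six-point samples, so it illustrates the RANSAC structure and is not the solver used in the paper. In practice, OpenCV's `cv2.solvePnPRansac` with the `cv2.SOLVEPNP_EPNP` flag is a common off-the-shelf option. The intrinsics `K`, the iteration count and the pixel threshold are assumed values.

```python
import numpy as np

def dlt_pnp(pts3d, pix, K):
    """Linear (DLT) pose from >= 6 point-pixel pairs; a simple stand-in for EPnP."""
    xn = (np.linalg.inv(K) @ np.c_[pix, np.ones(len(pix))].T).T   # normalized coords
    Xh = np.c_[pts3d, np.ones(len(pts3d))]                        # homogeneous points
    zeros = np.zeros_like(Xh)
    # Two linear constraints per pair on the 12 entries of [R | t].
    A = np.vstack([np.hstack([Xh, zeros, -xn[:, :1] * Xh]),
                   np.hstack([zeros, Xh, -xn[:, 1:2] * Xh])])
    M = np.linalg.svd(A)[2][-1].reshape(3, 4)     # null vector, defined up to scale
    U, S, Vt = np.linalg.svd(M[:, :3])
    R = U @ Vt                                    # closest orthogonal matrix
    if np.linalg.det(R) < 0:                      # resolve the sign ambiguity
        R, M = -R, -M
    return R, M[:, 3] / S.mean()

def reproj_error(R, t, K, pts3d, pix):
    """Per-pair reprojection error in pixels; infinite for points behind the camera."""
    cam = pts3d @ R.T + t
    z = cam[:, 2]
    uv = (cam @ K.T)[:, :2] / np.where(z > 1e-9, z, 1.0)[:, None]
    err = np.linalg.norm(uv - pix, axis=1)
    err[z <= 1e-9] = np.inf
    return err

def ransac_pnp(pts3d, pix, K, iters=200, thresh=2.0, seed=0):
    """RANSAC over 6-point DLT hypotheses, then a refit on the best inlier set."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(pts3d), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(pts3d), size=6, replace=False)
        R, t = dlt_pnp(pts3d[idx], pix[idx], K)
        inliers = reproj_error(R, t, K, pts3d, pix) < thresh
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 6:
        raise ValueError("not enough inliers to estimate a pose")
    R, t = dlt_pnp(pts3d[best], pix[best], K)
    return R, t, best
```

The final refit on all inliers mirrors the usual practice of polishing the best hypothesis once the mismatches have been rejected.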

III-E Loss Function

To learn the coarse level descriptors, fine level descriptors and in/beyond-frustum super-points classification simultaneously, we introduce a joint loss $\mathcal{L}$ consisting of coarse level descriptor loss $\mathcal{L}_{coarse}$, fine level descriptor loss $\mathcal{L}_{fine}$ and classification loss $\mathcal{L}_{classify}$. The classification loss encourages the network to correctly label each super-point, and two descriptor losses pull positive matching pairs closer together and push negative matching pairs farther apart in the feature space.

The cosine similarity $s(p_{x},i_{y})$ of the point cloud feature vector $\mathbf{F}_{p_{x}}$ and the image feature vector $\mathbf{F}_{i_{y}}$ is defined as:

$$s(p_{x},i_{y})=\frac{\langle\mathbf{F}_{p_{x}},\mathbf{F}_{i_{y}}\rangle}{\|\mathbf{F}_{p_{x}}\|\,\|\mathbf{F}_{i_{y}}\|}, \tag{7}$$

and the distance $d(p_{x},i_{y})$ is defined as:

$$d(p_{x},i_{y})=1-s(p_{x},i_{y}). \tag{8}$$
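Eqs. (7) and (8) translate directly into code. The sketch below uses plain Python lists as descriptors and assumes non-zero feature vectors; the function names are illustrative only.

```python
import math


def cosine_similarity(f_p, f_i):
    """Eq. (7): inner product divided by the product of the two norms."""
    dot = sum(a * b for a, b in zip(f_p, f_i))
    norm_p = math.sqrt(sum(a * a for a in f_p))
    norm_i = math.sqrt(sum(b * b for b in f_i))
    return dot / (norm_p * norm_i)  # assumes both vectors are non-zero


def feature_distance(f_p, f_i):
    """Eq. (8): d = 1 - s, which lies in [0, 2]."""
    return 1.0 - cosine_similarity(f_p, f_i)


d_same = feature_distance([1.0, 2.0], [2.0, 4.0])       # parallel descriptors
d_orthogonal = feature_distance([1.0, 0.0], [0.0, 3.0])  # orthogonal descriptors
d_opposite = feature_distance([1.0, 0.0], [-2.0, 0.0])   # opposite descriptors
```

Since the similarity is scale-invariant, the distance depends only on the angle between the descriptors: it is 0 for parallel vectors, 1 for orthogonal vectors and 2 for opposite vectors.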

On the coarse level, the positive anchor $\tilde{i}_{pos}$ for each in-frustum super-point $\tilde{p}_{x}$ is sampled from the ground-truth pairs set $\tilde{\Omega}_{\mathcal{M}^{\star}}$:

$$\tilde{\Omega}_{\mathcal{M}^{\star}}=\big\{(\tilde{p}_{x}\in\tilde{P},\,\tilde{i}_{pos}\in\tilde{I})\ \big|\ \tilde{i}_{pos}=\Gamma(\mathbf{T}\tilde{p}_{x})\big\}, \tag{9}$$

where $\mathbf{T}$ is the ground-truth transform matrix from the point cloud coordinate system to the image frustum coordinate system, and $\Gamma$ represents the mapping function that converts points from the camera frustum to the image plane coordinate system. The negative anchor $\tilde{i}_{neg}$ is defined as the super-pixel with the smallest distance to $\tilde{p}_{x}$ in the feature space:

$$\tilde{i}_{neg}=\mathop{\mathrm{argmin}}\limits_{\tilde{i}_{neg}\in\tilde{I}}\|d(\mathbf{F}_{\tilde{p}_{x}},\mathbf{F}_{\tilde{i}_{neg}})\|\quad s.t.\quad\|\tilde{i}_{neg}-\tilde{i}_{pos}\|>r, \tag{10}$$

where $r$ is the safe radius used to exclude adversarial samples. Finally, with the positive margin $\Delta_{pos}$ and the negative margin $\Delta_{neg}$, the coarse level descriptor loss is defined in a triplet manner [28] as in Eq. (11):

$$\begin{split}\mathcal{L}_{coarse}=\sum_{(\tilde{p}_{x},\tilde{i}_{pos},\tilde{i}_{neg})}\Big(&\max\big(0,d(\tilde{p}_{x},\tilde{i}_{pos})-\Delta_{pos}\big)+\\&\max\big(0,\Delta_{neg}-d(\tilde{p}_{x},\tilde{i}_{neg})\big)\Big).\end{split} \tag{11}$$
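The hard-negative selection of Eq. (10) and the loss of Eq. (11) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the helper names are hypothetical, the safe-radius test is assumed to be measured between super-pixel positions on the image grid, and the default margins are the values reported in the implementation details (0.2 and 1.8).

```python
import math


def cos_dist(f_a, f_b):
    """d = 1 - cosine similarity (Eqs. 7 and 8); assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(f_a, f_b))
    norm_a = math.sqrt(sum(a * a for a in f_a))
    norm_b = math.sqrt(sum(b * b for b in f_b))
    return 1.0 - dot / (norm_a * norm_b)


def hardest_negative(f_point, pos_idx, pixel_feats, pixel_xy, r):
    """Eq. (10): the super-pixel closest to the super-point in feature space
    among those farther than r from the positive anchor (assumed to be
    measured in image-grid coordinates)."""
    candidates = [j for j, xy in enumerate(pixel_xy)
                  if math.dist(xy, pixel_xy[pos_idx]) > r]
    return min(candidates, key=lambda j: cos_dist(f_point, pixel_feats[j]))


def coarse_loss(distance_pairs, delta_pos=0.2, delta_neg=1.8):
    """Eq. (11): distance_pairs holds (d_pos, d_neg) for every triplet."""
    return sum(max(0.0, d_pos - delta_pos) + max(0.0, delta_neg - d_neg)
               for d_pos, d_neg in distance_pairs)


# Toy example: four super-pixels on a line, positive anchor at index 0.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
xy = [(0, 0), (1, 0), (2, 0), (3, 0)]
f_p = [1.0, 0.0]
neg_idx = hardest_negative(f_p, 0, feats, xy, r=1.0)
loss = coarse_loss([(cos_dist(f_p, feats[0]), cos_dist(f_p, feats[neg_idx]))])
```

In the toy example the super-pixel at index 1 is the most similar non-positive candidate, but it lies within the safe radius and is therefore skipped, which is the purpose of the constraint in Eq. (10).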

The fine level descriptor loss is defined as a modified circle loss [36]. We randomly select $n$ in-frustum points together with their positive anchor pixels and negative anchors, chosen as in Eq. (10); the descriptor loss is then defined as:

$$\begin{split}\mathcal{L}_{fine}=\log\Big(&1+\exp\big(-\gamma\alpha_{pos}(s(p_{x},i_{pos})-\Delta_{pos})\big)\\&\sum_{(p_{x},i_{pos},i_{neg})}\exp\big(\gamma\alpha_{neg}(s(p_{x},i_{neg})-\Delta_{neg})\big)\Big),\end{split} \tag{12}$$

where $\alpha_{neg}$ and $\alpha_{pos}$ are the dynamic optimizing rates towards negative and positive pairs, and $\gamma$ is the scale factor. As in [36], $\alpha_{neg}$ and $\alpha_{pos}$ are defined as:

$$\begin{split}&\alpha_{neg}=\max\big(0,s(p_{x},i_{neg})+\Delta_{neg}\big),\\&\alpha_{pos}=\max\big(0,1+\Delta_{pos}-s(p_{x},i_{pos})\big).\end{split} \tag{13}$$
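A compact sketch of a circle-loss style objective built from Eqs. (12) and (13) is given below. Several points are assumptions of this example rather than statements about the released code: the positive and negative exponentials are each summed before being multiplied, following the standard circle loss of [36]; the function name is hypothetical; and the margins in the usage lines are illustrative values, not the settings used in the experiments.

```python
import math


def circle_style_loss(s_pos, s_neg, delta_pos, delta_neg, gamma=10.0):
    """Circle-loss style objective for one anchor point.

    s_pos: cosine similarities between the point and its positive pixels.
    s_neg: cosine similarities between the point and its negative pixels.
    """
    pos_term = 0.0
    for s in s_pos:
        alpha_pos = max(0.0, 1.0 + delta_pos - s)  # Eq. (13)
        pos_term += math.exp(-gamma * alpha_pos * (s - delta_pos))
    neg_term = 0.0
    for s in s_neg:
        alpha_neg = max(0.0, s + delta_neg)  # Eq. (13)
        neg_term += math.exp(gamma * alpha_neg * (s - delta_neg))
    # Assumed aggregation: log(1 + sum over positives * sum over negatives)
    return math.log(1.0 + pos_term * neg_term)


# Illustrative margins only (not the values used in the paper).
well_separated = circle_style_loss([0.9], [-0.5], delta_pos=0.8, delta_neg=0.2)
poorly_separated = circle_style_loss([0.1], [0.9], delta_pos=0.8, delta_neg=0.2)
```

The weights $\alpha_{pos}$ and $\alpha_{neg}$ grow when a pair is far from its target similarity, so badly placed pairs dominate the gradient while pairs that are already well placed contribute little.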

The super-point classification loss is defined as a binary cross-entropy loss:

$$\mathcal{L}_{classify}=-\sum_{\tilde{p}_{x}\in\tilde{P}}\big(\hat{s}_{\tilde{p}_{x}}\log(s_{\tilde{p}_{x}})+(1-\hat{s}_{\tilde{p}_{x}})\log(1-s_{\tilde{p}_{x}})\big), \tag{14}$$

where, with a slight reuse of notation, $s_{\tilde{p}_{x}}$ denotes the predicted score (not the cosine similarity of Eq. (7)) and $\hat{s}_{\tilde{p}_{x}}$ denotes the ground-truth label. Overall, our loss function is

$$\mathcal{L}=\lambda_{1}\mathcal{L}_{coarse}+\lambda_{2}\mathcal{L}_{fine}+\lambda_{3}\mathcal{L}_{classify}, \tag{15}$$

where $\lambda_{1},\lambda_{2},\lambda_{3}$ are hyperparameters that control the weights between losses.
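The classification term of Eq. (14) and the weighted sum of Eq. (15) can be sketched as follows. The function names are hypothetical, and the clamping constant `eps` is an assumption of this example, added only to keep the logarithm finite for scores of exactly 0 or 1.

```python
import math


def classification_loss(scores, labels, eps=1e-7):
    """Eq. (14): binary cross-entropy summed over all super-points.

    scores: predicted in-frustum scores in [0, 1].
    labels: ground-truth labels (1 for in-frustum, 0 otherwise).
    """
    total = 0.0
    for s, s_hat in zip(scores, labels):
        s = min(max(s, eps), 1.0 - eps)  # clamp (assumption, not in Eq. 14)
        total -= s_hat * math.log(s) + (1.0 - s_hat) * math.log(1.0 - s)
    return total


def joint_loss(l_coarse, l_fine, l_classify, lambdas=(1.0, 1.0, 1.0)):
    """Eq. (15): weighted sum of the three loss terms."""
    lam1, lam2, lam3 = lambdas
    return lam1 * l_coarse + lam2 * l_fine + lam3 * l_classify


l_cls = classification_loss([0.9, 0.1], [1, 0])
l_total = joint_loss(0.8, 0.3, l_cls)
```

The default weights of 1 correspond to the setting $\lambda_{1}=\lambda_{2}=\lambda_{3}=1$ reported in the implementation details.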

IV Experiments

IV-A Experiment Setup

IV-A1 Dataset

We evaluate our method on two public datasets, KITTI Odometry [37] and Nuscenes [38].

  • KITTI Odometry [37]: It consists of 11 sequences of images and point cloud data in urban environments with ground-truth calibration files. The camera intrinsic function $\Gamma$ extracted from the calibration files is assumed to be unbiased during experiments. Sequences 0-8 are used for training and sequences 9-10 for testing, following previous approaches [8, 9]. The images are resized to 160×512 and the point clouds are downsampled and randomly sampled to 20480 points.

  • Nuscenes [38]: The image and point cloud pairs are generated with the official SDK, where the point cloud is accumulated from nearby frames and the image is taken from the current frame. We follow the official nuScenes data split and use 850 scenes for training and 150 scenes for testing. Similarly, we downsample the images to 160×320 and the point clouds to 20480 points.
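Bringing every point cloud to a fixed size of 20480 points could look like the sketch below. Only the target size comes from the text; the function name is hypothetical, the preceding downsampling step is omitted, and padding small clouds by repeating randomly chosen points is an assumption of this example.

```python
import random


def sample_fixed_size(points, num_points=20480, seed=None):
    """Return exactly num_points points from a point cloud.

    points: list of (x, y, z) tuples.
    """
    rng = random.Random(seed)
    if len(points) >= num_points:
        # Enough points: draw without replacement.
        return rng.sample(points, num_points)
    # Too few points: keep all of them and pad with random repeats (assumption).
    return list(points) + rng.choices(points, k=num_points - len(points))


dense_cloud = [(float(i), 0.0, 0.0) for i in range(30000)]
sparse_cloud = [(float(i), 1.0, 0.0) for i in range(100)]
dense_sample = sample_fixed_size(dense_cloud, seed=0)
sparse_sample = sample_fixed_size(sparse_cloud, seed=0)
```

A fixed input size keeps the batch shapes constant, which simplifies batching in the point cloud branch.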

IV-A2 Baseline Methods

We compare the proposed CoFiI2P with two open-sourced I2P methods as follows.

  • DeepI2P [8]: It uses frustum classification and inverse camera projection to estimate the camera pose. We use the officially released code and evaluate both the 2D and 3D inverse camera projection variants for optimization, denoted DeepI2P (2D) and DeepI2P (3D).

  • CorrI2P [9]: It is the SOTA I2P method. CorrI2P predicts the overlapping area and establishes correspondences densely for pose estimation. We use officially released code for reimplementation.

IV-A3 Evaluation Metrics

We calculate relative rotation error (RRE), relative translation error (RTE) and registration recall (RR) to evaluate the registration accuracy. The inlier ratio (IR) used in previous approaches [22, 39] is introduced to evaluate the quality of correspondences. For efficiency analysis, we report the frames per second (FPS) on the KITTI Odometry dataset, evaluated with a batch size of 1 on an Intel i9-13900K CPU and an NVIDIA RTX 4090 GPU. RRE and RTE are defined as:

$$\mathrm{RRE}=\sum_{i=1}^{3}\lvert\mathbf{r}(i)\rvert,\quad\mathrm{RTE}=\|\mathbf{t}_{gt}-\mathbf{t}_{e}\|, \tag{16}$$

where $\mathbf{r}$ is the Euler angle vector of $\mathbf{R}_{gt}^{-1}\mathbf{R}_{e}$, $\mathbf{R}_{gt}$ and $\mathbf{t}_{gt}$ are the ground-truth rotation matrix and translation vector, and $\mathbf{R}_{e}$ and $\mathbf{t}_{e}$ are the estimated rotation matrix and translation vector.
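Eq. (16) can be evaluated with a few lines of standard-library Python. The sketch below is illustrative: the function names are hypothetical, rotations are 3×3 nested lists, and the Euler convention $\mathbf{R}=\mathbf{R}_z\mathbf{R}_y\mathbf{R}_x$ is an assumption, since the text does not state which convention is used.

```python
import math


def matmul3(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose3(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]


def euler_angles_deg(r):
    """Roll, pitch, yaw in degrees, assuming R = Rz(yaw) Ry(pitch) Rx(roll)."""
    pitch = math.asin(max(-1.0, min(1.0, -r[2][0])))
    roll = math.atan2(r[2][1], r[2][2])
    yaw = math.atan2(r[1][0], r[0][0])
    return [math.degrees(a) for a in (roll, pitch, yaw)]


def rre(r_gt, r_e):
    """Sum of absolute Euler angles of R_gt^{-1} R_e (Eq. 16)."""
    delta = matmul3(transpose3(r_gt), r_e)  # inverse of a rotation = transpose
    return sum(abs(a) for a in euler_angles_deg(delta))


def rte(t_gt, t_e):
    """Euclidean distance between the two translation vectors (Eq. 16)."""
    return math.dist(t_gt, t_e)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
c, s = math.cos(math.radians(10.0)), math.sin(math.radians(10.0))
yaw_10 = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
example_rre = rre(identity, yaw_10)
example_rte = rte([0.0, 0.0, 0.0], [3.0, 4.0, 0.0])
```

In the example, the estimate differs from the ground truth by a pure 10° rotation about the z-axis and by a translation offset of length 5, so the two errors can be checked by hand.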

The RR metric estimates the percentage of correctly matched I2P pairs, indicating the descriptor learning ability of the network. RR is defined as:

$$\mathrm{RR}=\mathds{1}(\mathrm{RRE}<\delta_{r}\quad\textit{and}\quad\mathrm{RTE}<\delta_{t}), \tag{17}$$

where $(\delta_{r},\delta_{t})$ is the pair of thresholds used to reject false registration results, e.g., $(10^{\circ},5m)$. IR denotes the inlier ratio of matching pairs, which measures the accuracy of correspondences. The IR for the point/pixel correspondence set $\Omega_{\mathcal{M}}$ is defined as:

$$\mathrm{IR}_{\mathcal{M}}=\frac{1}{\lvert\Omega_{\mathcal{M}}\rvert}\sum_{(p_{x},i_{y})\in\Omega_{\mathcal{M}}}\mathds{1}\big(\|\Gamma(\mathbf{T}p_{x})-i_{y}\|<\tau\big), \tag{18}$$

in which $\tau$ controls the reprojection error tolerance.
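Both metrics reduce to simple counting once the per-frame errors and the projected points are available. In the sketch below the function names are hypothetical, the recall is the per-frame indicator of Eq. (17) averaged over the test frames, and the projected pixel coordinates $\Gamma(\mathbf{T}p_{x})$ are assumed to be precomputed.

```python
import math


def registration_recall(rres, rtes, delta_r=10.0, delta_t=5.0):
    """Percentage of frames with RRE < delta_r and RTE < delta_t (Eq. 17)."""
    ok = sum(1 for r, t in zip(rres, rtes) if r < delta_r and t < delta_t)
    return 100.0 * ok / len(rres)


def inlier_ratio(projected_px, matched_px, tau):
    """Eq. (18): fraction of pairs whose reprojection error is below tau.

    projected_px: ground-truth projections Gamma(T p_x) of the matched points.
    matched_px:   pixels i_y selected by the matcher, in the same order.
    """
    inliers = sum(1 for u, v in zip(projected_px, matched_px)
                  if math.dist(u, v) < tau)
    return inliers / len(matched_px)


# Toy values: frames 0 and 3 pass both thresholds.
rr = registration_recall([1.0, 12.0, 3.0, 2.0], [0.3, 0.2, 6.0, 1.0])
ir = inlier_ratio([(10, 10), (20, 20), (30, 30)],
                  [(10, 11), (25, 20), (30, 30)], tau=2.0)
```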

IV-A4 Implementation Details

With correct transform parameters provided by the calibration files during the training process, the ground-truth correspondences $\Omega_{\mathcal{M}^{\star}}$ are established to supervise the network. We train the whole network for 25 epochs on the KITTI Odometry dataset and for 10 epochs on the Nuscenes dataset. We use the Adam optimizer [40] with an initial learning rate of 0.001, which is multiplied by 0.25 after every 5 epochs. For our joint loss, we set $\lambda_{1}=\lambda_{2}=\lambda_{3}=1$. The safe radius $r$, positive margin $\Delta_{pos}$ and negative margin $\Delta_{neg}$ in the loss function are set to 1, 0.2 and 1.8, respectively. The scale factor $\gamma$ is set to 10. Model configurations and more implementation details are available in our open-source code.
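The step decay described above can be written as a one-line function. The sketch assumes zero-based epoch indices and a hypothetical function name; it only illustrates the schedule and is not tied to any particular training framework.

```python
def learning_rate(epoch, base_lr=0.001, decay=0.25, step=5):
    """Learning rate at a zero-based epoch: multiplied by `decay` every `step` epochs."""
    return base_lr * decay ** (epoch // step)


# Learning rate at the first epoch of each 5-epoch stage of a 25-epoch run.
schedule = [learning_rate(e) for e in range(0, 25, 5)]
```

Over the 25 KITTI epochs the rate is decayed four times, so the last five epochs run at 0.001 × 0.25⁴ ≈ 3.9 × 10⁻⁶.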

TABLE I: Registration accuracy on the KITTI Odometry [37] and Nuscenes [38] datasets. The results are presented in the format of "mean ± standard deviation" among the test samples. ↑ means higher is better and ↓ means lower is better. The best results are highlighted in bold. Results of DeepI2P on the KITTI Odometry dataset are taken from [8] with identical settings.
| Method | Threshold (°/m) | KITTI Odometry RRE(°)↓ | KITTI Odometry RTE(m)↓ | KITTI Odometry RR(%)↑ | Nuscenes RRE(°)↓ | Nuscenes RTE(m)↓ | Nuscenes RR(%)↑ | FPS↑ |
|---|---|---|---|---|---|---|---|---|
| DeepI2P (2D) [8] | 10/5 | 7.56 ± 7.63 | 3.28 ± 3.09 | - | 2.81 ± 2.23 | 1.48 ± 0.88 | 95.13 | 1.41 |
| DeepI2P (3D) [8] | 10/5 | 15.52 ± 12.73 | 3.17 ± 3.22 | - | 6.60 ± 19.33 | 1.31 ± 1.57 | 38.64 | 0.71 |
| CorrI2P [9] | none/none | 6.93 ± 29.03 | 2.68 ± 10.51 | - | 8.47 ± 24.02 | 5.30 ± 8.10 | - | 7.95 |
| | 45/10 | 3.41 ± 3.64 | 1.48 ± 1.35 | 97.07 | 5.42 ± 4.30 | 3.78 ± 2.24 | 91.07 | |
| | 10/5 | 2.70 ± 1.97 | 1.24 ± 0.87 | 90.66 | 3.90 ± 2.23 | 2.61 ± 1.23 | 61.67 | |
| CoFiI2P | none/none | **1.14 ± 0.78** | **0.29 ± 0.19** | - | **2.04 ± 5.54** | **0.95 ± 5.04** | - | 15.43 |
| | 45/10 | **1.14 ± 0.78** | **0.29 ± 0.19** | 100.00 | **1.88 ± 1.77** | **0.81 ± 0.59** | 99.84 | |
| | 10/5 | **1.14 ± 0.78** | **0.29 ± 0.19** | 100.00 | **1.79 ± 1.22** | **0.79 ± 0.53** | 99.18 | |

IV-B Registration Accuracy

We report the RRE and RTE as evaluation metrics under three different settings in Table I, where $none/none$ means that no thresholds are applied to filter out false registration frames, while $10^{\circ}/5m$ and $45^{\circ}/10m$ mean that any frames with registration errors exceeding the specified thresholds are ignored during the evaluation process. As shown in Table I, the proposed method outperforms all baseline methods on the RRE and RTE registration metrics. Notably, our method achieves 100% RR and 99.18% RR under the strictest pair of thresholds $10^{\circ}/5m$ on the KITTI Odometry and Nuscenes datasets respectively, which indicates that our method constructs robust correspondences and achieves accurate registration in most evaluation scenes. In addition, we project the points onto the image plane with the transform parameters provided by the ground-truth files, CorrI2P [9] and CoFiI2P respectively, and display the qualitative registration results in Fig. 3. They show that CoFiI2P remains stable and provides more accurate results, which confirms our claim. Although our method is slightly slower during the model inference step, the pose estimation is extremely fast due to the small quantity and high quality of matching pairs, so the overall pipeline still runs at online speed.
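As a quick check on the numbers in Table I, the relative error reduction of CoFiI2P over CorrI2P on the KITTI Odometry dataset can be worked out from the $none/none$ rows. The choice of these rows for the comparison is an assumption of this example:

$$\frac{6.93-1.14}{6.93}\approx 83.5\%,\qquad\frac{2.68-0.29}{2.68}\approx 89.2\%.$$

These values are consistent with the improvements of about 84% in RRE and 89% in RTE quoted in the abstract.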

Figure 3: Qualitative registration results on the KITTI Odometry dataset. The colors are rendered based on depth, ranging from blue in the foreground to red in the distance.
Figure 4: Quantitative results of correspondences. (a) and (b) show the inlier ratio of our method (blue line) and CorrI2P (orange line) on the KITTI Odometry and Nuscenes datasets respectively.
Figure 5: Qualitative results of correspondences on the KITTI Odometry [37] dataset. The first column shows the input, and the second and third columns show the correspondences of CorrI2P and our CoFiI2P respectively. The green lines represent correct matches, while the red lines represent incorrect matches.

For the super-point frustum classification, we introduce the IR metric to evaluate the quality of the correspondences, as in the P2P registration approach [22]. The IR curves in Fig. 4 indicate that our method achieves a higher proportion of correct matches among the established correspondences, which leads to better registration results. Fig. 5 intuitively illustrates that our CoFiI2P provides a much cleaner correspondence set than the baseline method.

IV-C Ablation Studies and Analysis

In this part, we analyze three crucial factors in our CoFiI2P: I2P transformer module, coarse-to-fine matching scheme, and point cloud density. We conduct ablation studies on the KITTI Odometry [37] dataset to prove the effectiveness of each module and review the influence of point cloud density. We train the ablation models for 25 epochs as in the experimental part, and all other settings remain the same. We report global RRE and RTE as evaluation metrics and no thresholds are used to reject false registration scenes.

IV-C1 Analysis of the I2P Transformer

The I2P transformer [41], with its self-attention and cross-attention modules, is crucial to image-to-point cloud alignment at the global level. In this part, we conduct ablation studies to assess its effectiveness. We train CoFiI2P without any attention module as the baseline. Then, the self-attention and cross-attention modules are added on the coarse level separately. As shown in Table II, the performance of the baseline drops significantly without any attention module. The self-attention module alone reduces the RRE to $1.78^{\circ}$ and the RTE to $0.41m$, and the cross-attention module alone reduces the RRE to $1.74^{\circ}$ and the RTE to $0.43m$. With both the self-attention and cross-attention modules, CoFiI2P achieves the smallest registration error. Moreover, with both modules, the variance is reduced by a large margin, which indicates that the I2P transformer block enhances both accuracy and robustness.

IV-C2 Analysis of the Coarse-to-fine Matching Scheme

Our CoFiI2P first estimates coarse correspondences at the super-point/super-pixel level and then generates fine correspondences at the point-pixel level. We conduct ablation experiments on the coarse-to-fine matching scheme to demonstrate that the progressive two-stage registration performs better than the one-stage registration used in previous I2P approaches. This ablation study employs the backbone with full I2P transformer blocks as the baseline and evaluates the registration accuracy with only the coarse or only the fine matching scheme. For coarse matching only, matching pairs are established on the coarse level and remapped to the fine resolution for pose estimation. By contrast, for fine matching only, matching pairs are established on the fine level directly, without the guidance of coarse level correspondences. The results in Table III show that removing either the coarse matching stage or the fine matching stage leads to higher registration error and variance. This indicates that coarse-level registration provides robust correspondences and fine-level registration provides accurate matching pairs. Combining the coarse-level and fine-level registration sequentially makes it easier to reach the globally optimal solution in I2P registration.

IV-C3 Analysis of the Point Cloud Density

Given the significant impact of point cloud density on representation learning, we conduct ablation studies to examine this influence. Table IV presents the registration accuracy and computational complexity for various point cloud densities. The results show that the registration metrics deteriorate as the point cloud density decreases, while higher point cloud densities lead to a sharp increase in computational complexity. Intuitively, low-density point clouds lose local structural information, and high-density point clouds carry a heavy computational burden. As a compromise, we select 20480 points to balance efficiency and accuracy.

TABLE II: Ablation study on the I2P transformer blocks. SA denotes the self-attention module and CA denotes the cross-attention module.
| Baseline | SA | CA | RRE(°) | RTE(m) |
|---|---|---|---|---|
| ✓ | | | 2.65 ± 4.79 | 0.87 ± 2.23 |
| ✓ | ✓ | | 1.78 ± 2.28 | 0.41 ± 0.79 |
| ✓ | | ✓ | 1.74 ± 1.50 | 0.43 ± 0.35 |
| ✓ | ✓ | ✓ | **1.14 ± 0.78** | **0.29 ± 0.19** |
TABLE III: Ablation study on the coarse-to-fine matching scheme. CM denotes the coarse matching and FM denotes the fine matching.
Baseline CM FM RRE(°) RTE(m)
1.35 ± 1.17 0.34 ± 0.26
1.47 ± 1.75 0.41 ± 0.63
**1.14 ± 0.78** **0.29 ± 0.19**
TABLE IV: Ablation study on the point cloud density.
| #Points | RRE(°) | RTE(m) | FLOPs |
|---|---|---|---|
| 5120 | 2.75 ± 4.28 | 0.69 ± 1.60 | 37.42G |
| 10240 | 1.69 ± 1.43 | 0.39 ± 0.30 | 56.00G |
| 20480 | 1.14 ± 0.78 | 0.29 ± 0.19 | 93.80G |
| 40960 | 1.00 ± 0.70 | 0.26 ± 0.17 | 171.91G |

V Conclusion

This letter introduces CoFiI2P, a novel network designed for image-to-point cloud (I2P) registration. The proposed coarse-to-fine matching strategy first establishes robust global correspondences and then progressively refines precise local correspondences. Furthermore, the I2P transformer with self- and cross-attention modules is introduced to enhance global awareness in homogeneous and heterogeneous data. Compared with existing one-stage dense prediction and matching approaches, CoFiI2P filters out a large number of false correspondences. Extensive experiments on the KITTI Odometry [37] and Nuscenes [38] datasets demonstrate the superior accuracy, efficiency, and robustness of CoFiI2P in various environments. We hope that the open-source CoFiI2P will benefit the relevant communities. In the near future, we will extend CoFiI2P to unsupervised I2P registration.

References

  • [1] L. Wang, X. Zhang, W. Qin, X. Li, J. Gao, L. Yang, Z. Li, J. Li, L. Zhu, H. Wang et al., “Camo-mot: Combined appearance-motion optimization for 3d multi-object tracking with camera-lidar fusion,” IEEE Trans. on Intelligent Transportation Systems (T-ITS), 2023.
  • [2] J. Li, S. Yuan, M. Cao, T.-M. Nguyen, K. Cao, and L. Xie, “Hcto: Optimality-aware lidar inertial odometry with hybrid continuous time optimization for compact wearable mapping system,” ISPRS Journal of Photogrammetry and Remote Sensing (ISPRS), vol. 211, pp. 228–243, 2024.
  • [3] J. Li, B. Yang, C. Chen, R. Huang, Z. Dong, and W. Xiao, “Automatic registration of panoramic image sequence and mobile laser scanning data using semantic features,” ISPRS Journal of Photogrammetry and Remote Sensing (ISPRS), vol. 136, pp. 41–57, 2018.
  • [4] X. Yan, J. Gao, C. Zheng, C. Zheng, R. Zhang, S. Cui, and Z. Li, “2dpass: 2d priors assisted semantic segmentation on lidar point clouds,” in Proc. of the Europ. Conf. on Computer Vision (ECCV). Springer, 2022, pp. 677–695.
  • [5] X. Zhang, S. Zhu, S. Guo, J. Li, and H. Liu, “Line-based automatic extrinsic calibration of lidar and camera,” in Proc. of the IEEE Intl. Conf. on Robotics & Automation (ICRA), 2021, pp. 9347–9353.
  • [6] C. Yuan, X. Liu, X. Hong, and F. Zhang, “Pixel-level extrinsic self calibration of high resolution lidar and camera in targetless environments,” IEEE Robotics and Automation Letters (RA-L), vol. 6, no. 4, pp. 7517–7524, 2021.
  • [7] M. Feng, S. Hu, M. H. Ang, and G. H. Lee, “2d3d-matchnet: Learning to match keypoints across 2d image and 3d point cloud,” in Proc. of the IEEE Intl. Conf. on Robotics & Automation (ICRA), 2019, pp. 4790–4796.
  • [8] J. Li and G. H. Lee, “Deepi2p: Image-to-point cloud registration via deep classification,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 15 960–15 969.
  • [9] S. Ren, Y. Zeng, J. Hou, and X. Chen, “Corri2p: Deep image-to-point cloud registration via dense correspondence,” IEEE Trans. on Circuits and Systems for Video Technology (TCSVT), vol. 33, no. 3, pp. 1198–1208, 2022.
  • [10] J. Sun, Z. Shen, Y. Wang, H. Bao, and X. Zhou, “Loftr: Detector-free local feature matching with transformers,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 8922–8931.
  • [11] W. Jiang, E. Trulls, J. Hosang, A. Tagliasacchi, and K. M. Yi, “Cotr: Correspondence transformer for matching across images,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 6207–6217.
  • [12] Z. Qin, H. Yu, C. Wang, Y. Guo, Y. Peng, and K. Xu, “Geometric transformer for fast and robust point cloud registration,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 11 143–11 152.
  • [13] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Intl. Journal of Computer Vision (IJCV), vol. 60, pp. 91–110, 2004.
  • [14] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in Proc. of the IEEE/CVF Intl. Conf. on Computer Vision (ICCV), 2011, pp. 2564–2571.
  • [15] D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superpoint: Self-supervised interest point detection and description,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018, pp. 224–236.
  • [16] P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superglue: Learning feature matching with graph neural networks,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 4938–4947.
  • [17] Q. Zhou, T. Sattler, and L. Leal-Taixe, “Patch2pix: Epipolar-guided pixel-level correspondences,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 4669–4678.
  • [18] H. Deng, T. Birdal, and S. Ilic, “Ppf-foldnet: Unsupervised learning of rotation invariant 3d local descriptors,” in Proc. of the Europ. Conf. on Computer Vision (ECCV), 2018, pp. 602–618.
  • [19] Z. Gojcic, C. Zhou, J. D. Wegner, and A. Wieser, “The perfect match: 3d point cloud matching with smoothed densities,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5545–5554.
  • [20] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
  • [21] H. Wang, Y. Liu, Q. Hu, B. Wang, J. Chen, Z. Dong, Y. Guo, W. Wang, and B. Yang, “Roreg: Pairwise point cloud registration with oriented descriptors and local rotations,” IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 2023.
  • [22] H. Yu, F. Li, M. Saleh, B. Busam, and S. Ilic, “Cofinet: Reliable coarse-to-fine correspondences for robust pointcloud registration,” Proc. of the Advances in Neural Information Processing Systems (NIPS), vol. 34, pp. 23 872–23 884, 2021.
  • [23] J. Levinson and S. Thrun, “Automatic online calibration of cameras and lasers,” in Proc. of Robotics: Science and Systems (RSS), 2013.
  • [24] Y. Liao, J. Li, S. Kang, Q. Li, G. Zhu, S. Yuan, Z. Dong, and B. Yang, “Se-calib: Semantic edges based lidar-camera boresight online calibration in urban scenes,” IEEE Trans. on Geoscience and Remote Sensing (TGRS), 2023.
  • [25] Q. Zhang and R. Pless, “Extrinsic calibration of a camera and laser range finder (improves camera calibration),” in Proc. of the IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), vol. 3, 2004, pp. 2301–2306.
  • [26] D. Scaramuzza, A. Harati, and R. Siegwart, “Extrinsic self calibration of a camera and a 3d laser range finder from natural scenes,” in Proc. of the IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2007, pp. 4164–4169.
  • [27] Y. Zhong, “Intrinsic shape signatures: A shape descriptor for 3d object recognition,” in Proc. of the Int. Conf. on Computer Vision Workshops (ICCVW), 2009, pp. 689–696.
  • [28] B. Lai, W. Liu, C. Wang, X. Bian, Y. Su, X. Lin, Z. Yuan, S. Shen, and M. Cheng, “Learning cross-domain descriptors for 2d-3d matching with hard triplet loss and spatial transformer network,” in Image and Graphics: 11th International Conference, ICIG 2021, Haikou, China, August 6–8, 2021, Proceedings, Part III 11. Springer, 2021, pp. 15–27.
  • [29] V. Lepetit, F. Moreno-Noguer, and P. Fua, “Epnp: An accurate o(n) solution to the pnp problem,” Intl. Journal of Computer Vision (IJCV), vol. 81, pp. 155–166, 2009.
  • [30] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [31] H. Thomas, C. R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas, “Kpconv: Flexible and deformable convolution for point clouds,” in Proc. of the IEEE/CVF Intl. Conf. on Computer Vision (ICCV), 2019, pp. 6411–6420.
  • [32] W. Zhang, Z. Huang, G. Luo, T. Chen, X. Wang, W. Liu, G. Yu, and C. Shen, “Topformer: Token pyramid transformer for mobile semantic segmentation,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 12 083–12 093.
  • [33] Q. Wan, Z. Huang, J. Lu, G. Yu, and L. Zhang, “Seaformer: Squeeze-enhanced axial transformer for mobile semantic segmentation,” in Proc. of the Int. Conf. on Learning Representations (ICLR), 2023.
  • [34] H. Yu, Z. Qin, J. Hou, M. Saleh, D. Li, B. Busam, and S. Ilic, “Rotation-invariant transformer for point cloud matching,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 5384–5393.
  • [35] K. Wu, H. Peng, M. Chen, J. Fu, and H. Chao, “Rethinking and improving relative position encoding for vision transformer,” in Proc. of the IEEE/CVF Intl. Conf. on Computer Vision (ICCV), 2021, pp. 10 033–10 041.
  • [36] Y. Sun, C. Cheng, Y. Zhang, C. Zhang, L. Zheng, Z. Wang, and Y. Wei, “Circle loss: A unified perspective of pair similarity optimization,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 6398–6407.
  • [37] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 3354–3361.
  • [38] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11 621–11 631.
  • [39] B. Wang, C. Chen, Z. Cui, J. Qin, C. X. Lu, Z. Yu, P. Zhao, Z. Dong, F. Zhu, N. Trigoni et al., “P2-net: Joint description and detection of local features for pixel and point matching,” in Proc. of the IEEE/CVF Intl. Conf. on Computer Vision (ICCV), 2021, pp. 16 004–16 013.
  • [40] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. of the Int. Conf. on Learning Representations (ICLR), 2015.
  • [41] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in Proc. of the Int. Conf. on Learning Representations (ICLR), 2021.