CoFiI2P: Coarse-to-Fine Correspondences-Based Image to Point Cloud Registration

Shuhao Kang^*, Youqi Liao^*, Jianping Li^†, Fuxun Liang, Yuhao Li, Xianghong Zou, Fangning Li, Xieyuanli Chen, Zhen Dong, Bisheng Yang

Manuscript received: May 14, 2024; Revised: Aug. 8, 2024; Accepted: Sep. 9, 2024. This paper was recommended for publication by Editor Cesar Cadena Lerma upon evaluation of the Associate Editor and Reviewers’ comments. This study was jointly supported by the National Natural Science Foundation Project (No. 42201477, No. 42130105), the Open Fund of Hubei Luojia Laboratory (No. 2201000054) and the Open Fund of Key Laboratory of Urban Spatial Information, Ministry of Natural Resources (Grant No. 2023ZD001). (Shuhao Kang and Youqi Liao are co-first authors and contribute equally to the paper.) (Corresponding author: Jianping Li.) S. Kang is with the Technical University of Munich, Germany. Y. Liao is with the Wuhan University and Hubei Luojia Laboratory, China. Z. Dong, F. Liang, Y. Li, X. Zou and B. Yang are with the Wuhan University, China. J. Li is with the Nanyang Technological University, Singapore. F. Li is with the Beijing Urban Construction Exploration and Surveying Design Research Institute Co. Ltd. X. Chen is with the National University of Defense Technology, China.

Abstract

Image-to-point cloud (I2P) registration is a fundamental task for robots and autonomous vehicles to achieve cross-modality data fusion and localization. Current I2P registration methods primarily focus on estimating correspondences at the point or pixel level, often neglecting global alignment. As a result, I2P matching can easily converge to a local optimum if it lacks high-level guidance from global constraints. To improve the success rate and general robustness, this paper introduces CoFiI2P, a novel I2P registration network that extracts correspondences in a coarse-to-fine manner. First, the image and point cloud data are processed through a two-stream encoder-decoder network for hierarchical feature extraction. Second, a coarse-to-fine matching module is designed to leverage these features and establish robust feature correspondences. Specifically, in the coarse matching phase, a novel I2P transformer module is employed to capture both homogeneous and heterogeneous global information from the image and point cloud data. This enables the estimation of coarse super-point/super-pixel matching pairs with discriminative descriptors. In the fine matching module, point/pixel pairs are established with the guidance of super-point/super-pixel correspondences. Finally, based on matching pairs, the transformation matrix is estimated with the EPnP-RANSAC algorithm. Experiments conducted on the KITTI Odometry dataset demonstrate that CoFiI2P achieves impressive results, with a relative rotation error (RRE) of 1.14 degrees and a relative translation error (RTE) of 0.29 meters, while maintaining real-time speed. These results represent a significant improvement of 84% in RRE and 89% in RTE compared to the current state-of-the-art (SOTA) method. Additional experiments on the Nuscenes dataset confirm our method’s generalizability. The project page is available at https://whu-usi3dv.github.io/CoFiI2P.

Index Terms— Image-to-Point Cloud Registration, Coarse-to-Fine Correspondences, Transformer Network

Figure 1: Comparison of existing one-stage I2P registration and proposed coarse-to-fine I2P registration. (a) The existing one-stage registration pipeline. The matching pairs are directly established at the point/pixel level, leading to a significant number of mismatches. (b) Our coarse-to-fine matching pipeline. Under the guidance of super point-to-pixel pairs, point-to-pixel pairs are generated from the existing super pairs, which effectively eliminates most mismatches.

I Introduction

Estimating the six degrees of freedom (6-DoF) pose of a monocular image relative to a pre-built point cloud map is a fundamental requirement for robots and autonomous vehicles [1, 2, 3]. Due to the limited onboard resources, robots are often equipped with only a monocular camera while facing challenges related to scale ambiguity in absolute localization and depth sensing. Establishing an accurate pose transformation between the image coordinate system and the pre-built point cloud coordinate system is crucial. This transformation not only precisely localizes the robot but also effectively reduces the scale uncertainty inherent in the monocular data [4].

However, cross-modality registration has inherent challenges. Some existing methods use hand-crafted detectors and descriptors for I2P registration [5, 6]. These approaches rely on structured features such as edges [6], which are available only under specific environmental conditions. With the rapid development of deep learning (DL), learning-based I2P registration approaches [7, 8, 9] have been proposed to extract representative keypoints and descriptors. 2D3D-MatchNet [7] proposed a Siamese network to learn cross-modality descriptors, but the keypoints from its manually designed detectors matched poorly. To alleviate the difficulty of constructing correspondences and improve the registration success rate, DeepI2P [8] converts the registration problem into a classification problem. A binary classification network is designed to distinguish whether the projected points are within or beyond the camera frustum. The classification results are passed into an inverse camera projection solver to estimate the transformation between the camera and the laser scanner. As a large number of points on the frustum boundary are misclassified, the accuracy of the camera pose is still limited. CorrI2P [9] proposed an overlap region detector for both image and point cloud; pixels and points in the overlap region are then matched to obtain I2P correspondences. A feature fusion module is exploited to fuse the point cloud and image information. Although CorrI2P [9] improves significantly over DeepI2P [8], matching at a single stage, namely the pixel-point level, without global alignment guidance can lead to local minima and instability.

Inspired by recent coarse-to-fine matching schemes and transformers in image-to-image (I2I) registration [10, 11] and point cloud-to-point cloud (P2P) registration approaches [12], this paper proposes the Coarse-to-Fine Image-to-Point cloud (CoFiI2P) network for I2P registration. The I2P transformer with self- and cross-attention modules is embedded into the network for global alignment. Overall, the main contributions of this work are as follows.

1. A novel coarse-to-fine I2P registration network is proposed to align image and point cloud in a progressive way. The coarse matching step provides rough but robust super-point/super-pixel correspondences for the following fine matching step, which filters out most mismatched pairs and reduces the computation burden. The fine matching step achieves accurate and reliable point/pixel correspondences with the global guidance.
2. A novel I2P transformer that incorporates both self-attention and cross-attention modules is proposed to enhance its global-aware capabilities in homogeneous and heterogeneous data. The self-attention module enables the capture of spatial context within the same modality data, while the cross-attention module facilitates the extraction of hybrid features from both the image and point cloud data.

II Related work

II-A Same-modality Registration

II-A1 I2I Registration

Before the age of deep learning, hand-crafted detectors and descriptors (e.g., SIFT [13] and ORB [14]) were widely used to extract correspondences for matching. Compared to traditional methods, learning-based methods improve the robustness and accuracy of image matching under large viewpoint differences and illumination changes. SuperPoint [15] proposed a self-supervised training method through homographic adaptation. SuperGlue [16] proposed an attention-based graph neural network (GNN) for feature matching. Patch2Pix [17] is the first work to obtain patch matches and regress pixel-wise matches in a coarse-to-fine manner.

II-A2 P2P Registration

Point cloud registration aims to estimate the optimal rigid transformation between two point clouds. Correspondence-based methods [18, 19] estimate the correspondences first and recover the transformation with robust estimation methods [20]. RoReg [21] embeds orientation information of the point cloud to estimate local orientation and refines the coarse rotation through residual regression to achieve fine registration. In the transformer era, point-based transformers [22, 12] have emerged and shown great performance. CoFiNet [22] followed the design of LoFTR [10] and proposed coarse-to-fine correspondences for registration: it computes coarse matches with descriptors strengthened by the transformer and refines them through a density-adaptive matching module. GeoTransformer [12] uses the transformer to fully exploit the 3D properties of point clouds. However, I2I and P2P registration methods cannot be directly used for I2P registration, which involves heterogeneous features extracted from cross-modality data.

II-B Cross-modality Registration

To address the cross-modality registration problem, a variety of I2P registration methods have been proposed, which could be roughly divided into two categories: I2P fine registration (initial transformation dependent) and I2P coarse registration (initial transformation free). The I2P fine registration methods [23, 5, 24] rely on initial transform parameters and are widely applied in sensor calibration. Although this paper focuses on the second category, namely coarse I2P registration without any initial transformation knowledge, we provide a comprehensive review of registration methods in both categories.

II-B1 Fine Registration Methods

Fine registration methods have been thoroughly studied for several decades. Early-stage works [25, 26] utilize various artificial targets as calibration constraints. In recent years, some approaches have argued that structured features shared by images and point clouds can be used for target-less I2P registration, e.g., edge features [23] and line features [5]. The method in [23] employs edge information and optimizes the transformation according to the response value, while [5] extracts lines from images and point clouds for registration. Recently, research on 2D and 3D semantic segmentation has motivated semantic feature-based I2P registration. Se-calib [24] further studies semantic edges in I2P registration to employ more common semantic information instead of a certain type of object. Overall, fine registration methods achieve high accuracy but depend on the quality of the initial values.

II-B2 Coarse Registration Methods

2D3D-MatchNet [7] is the pioneering method to regress the relative transform parameters with a CNN. It extracts SIFT [13] and ISS [27] keypoints from the image and point cloud and learns descriptors with a Siamese network. Experimental results show that hand-crafted detectors from different modalities match poorly. HAS-Net [28] proposed a network for learning cross-domain descriptors from 2D image patches and 3D point cloud volumes. However, both 2D3D-MatchNet and HAS-Net split the image and point cloud into patches and volumes and then match them with patch and volume descriptors, resulting in the loss of long-range context and high-level information. DeepI2P [8] proposed a feature fusion module to merge image and point cloud information and classified points as in or beyond the camera frustum. CorrI2P [9] predicted pixels and points in overlapping areas and matched dense per-pixel/per-point features directly to obtain I2P correspondences. Although overlapping region detectors significantly reduce the number of false candidates, I2P registration that uses only low-level features without global guidance leads to serious mismatches. Inspired by the coarse-to-fine strategy in CoFiNet [22], we propose the CoFiI2P network, a coarse-to-fine I2P registration approach that integrates high-level correspondence information into low-level matching to effectively reject mismatches.

III Methodology

For convenient description, a pair of partially overlapped image and point cloud are defined as $\mathrm{I}\in\mathbb{R}^{W\times H\times 3}$ and $\mathrm{P}\in\mathbb{R}^{N\times 3}$ , where $W$ and $H$ are the width and height, and $N$ is the number of points. The purpose of I2P registration is to estimate the relative transformation between the image $\mathrm{I}$ and point cloud $\mathrm{P}$ , defined by a rotation matrix $\mathbf{R}\in\mathbb{R}^{3\times 3}$ and a translation vector $\mathbf{t}\in\mathbb{R}^{3}$ .

Our method adopts a coarse-to-fine manner to find the correct correspondence set $\Omega(\mathrm{I},\mathrm{P})$ and calculates the relative pose using EPnP-RANSAC [29]. CoFiI2P mainly consists of four modules: feature extraction (FE), coarse matching (CM), fine matching (FM), and pose estimation (PE). FE is an encoder-decoder network that encodes raw inputs from different modalities into a shared feature space and finds in-frustum super-points. CM and FM are cascaded two-stage matching modules. CM constructs coarse matching pairs at the super-pixel/super-point level, and then FM constructs fine matching pairs at the pixel/point level under the guidance of the super-pixel/super-point correspondences. Lastly, the PE module exploits the point-pixel matching pairs to estimate the relative pose with the EPnP-RANSAC algorithm. The workflow of the proposed method is shown in Fig. 2.

III-A Feature Extraction

We utilize ResNet-34 [30] and KPConv-FPN [31] as the backbones for image and point cloud to extract multi-level features. The encoder progressively embeds raw inputs into high-dimensional features, and the decoder propagates high-level features to low-level features with skip connections for per-pixel/point feature generation. Specifically, points and pixels at the coarsest resolution ($\frac{1}{8}$ of the original resolution), ${\tilde{P}}$ and ${\tilde{I}}$, are treated as super-points and super-pixels for coarse matching, and ${P}$ and ${I}$ at the finest resolution ($\frac{1}{2}$ of the original resolution) are used for fine matching. We use $\mathbf{F}_{\tilde{P}}$ and $\mathbf{F}_{\tilde{I}}$ to denote coarse-level features, and $\mathbf{F}_{P}$ and $\mathbf{F}_{I}$ for fine-level features. For each super-point, we construct the local points group $G_{\tilde{p}}$ with the point-to-node strategy in the geometric space:

$$G_{\tilde{p}}=\{p\in P \mid \|p-\tilde{p}\|<r_{g}\}, \tag{1}$$

where $r_{g}$ is the chosen radius. For the superpixels, we first locate their positions on the fine-level feature map, and then crop sets of local pixel patches with window size $w\times w$ .
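The grouping step can be illustrated with a minimal NumPy sketch. It follows Eq. (1) literally (all points within radius $r_{g}$ of a super-point) and crops a $w\times w$ window around a super-pixel location. The function names `group_points` and `crop_patch` are hypothetical, and shifting the window inward at image borders is an assumption, not something stated in the text.

```python
import numpy as np

def group_points(points, superpoints, r_g):
    """For each super-point, return indices of points with ||p - p~|| < r_g (Eq. 1)."""
    # Pairwise Euclidean distances between super-points (M, 3) and points (N, 3).
    dists = np.linalg.norm(superpoints[:, None, :] - points[None, :, :], axis=-1)
    return [np.flatnonzero(row < r_g) for row in dists]

def crop_patch(feature_map, center_rc, w):
    """Crop a w x w window around (row, col) on the fine-level feature map.

    Assumption: near the border the window is shifted inward so that it
    keeps its full size whenever the map is at least w x w.
    """
    H, W = feature_map.shape[:2]
    r0 = int(np.clip(center_rc[0] - w // 2, 0, max(H - w, 0)))
    c0 = int(np.clip(center_rc[1] - w // 2, 0, max(W - w, 0)))
    return feature_map[r0:r0 + w, c0:c0 + w]
```

Computing the dense distance matrix is adequate for a sketch; a spatial index would be the natural choice for large point clouds.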

III-B I2P Coarse Matching

In the CM module, the I2P transformer is utilized to capture the geometric and spatial consistency between image and point cloud. Each stage of the I2P transformer consists of a self-attention block for intra-modality long-range context and a cross-attention module for inter-modality feature exchange. The self-attention and cross-attention modules are repeated several times to extract well-mixed features for super-point/super-pixel correspondence matching.

III-B1 I2P Transformer

Previous works [32, 33] have shown that the vision transformer (ViT) outperforms traditional CNN-based methods by a large margin in classification, detection, segmentation and other downstream tasks. Furthermore, recent approaches [11, 10, 34] have introduced transformer modules for I2I and P2P registration tasks. Therefore, we introduce an I2P transformer module customized for the cross-modality registration task to enhance the representability and robustness of descriptors. Different from the ViT used in same-modality registration tasks, our I2P transformer contains both self-attention modules for capturing spatial context in homogeneous data and cross-attention modules for hybrid feature integration among heterogeneous data.

For the self-attention module, given a coarse-level feature map $\mathbf{F}\in\mathbb{R}^{N\times C}$ of image or point cloud, the query, key and value vectors $\mathbf{q},\mathbf{k},\mathbf{v}$ are generated as:

$$\mathbf{q}=\mathbf{W}_{q}\mathbf{F},\quad\mathbf{k}=\mathbf{W}_{k}\mathbf{F},\quad\mathbf{v}=\mathbf{W}_{v}\mathbf{F}, \tag{2}$$

where $\mathbf{W}_{q},\mathbf{W}_{k}\in\mathbb{R}^{C_{qk}^{sa}\times C}$ and $\mathbf{W}_{v}\in\mathbb{R}^{C_{v}^{sa}\times C}$ are learnable weight matrices. Then, the global attention enhanced feature map $\mathbf{F}^{sa}$ is calculated as:

$$\mathbf{F}^{sa}=\mathrm{softmax}\left(\frac{\mathbf{q}\mathbf{k}^{\top}}{\sqrt{C}}\right)\mathbf{v}. \tag{3}$$

Softmax operates row-wise on the attention matrix $\mathbf{q}\mathbf{k}^{\top}$ to obtain the weights on the values. The extracted global-aware features $\mathbf{F}^{sa}$ are fed into a feed-forward network (FFN) to fuse the spatial relation information along the channel dimension. Given a feature map $\mathbf{F}$, the relative positions are encoded with a multi-layer perceptron (MLP) [35].

Cross-attention is designed for fusing image and point cloud features in the I2P registration task. Given the feature maps $\mathbf{F}_{\tilde{P}},\mathbf{F}_{\tilde{I}}$ for super-points set $\tilde{P}$ and super-pixels set $\tilde{I}$ , cross-attention enhanced feature maps $\mathbf{F}_{\tilde{P}}^{ca}$ of point cloud and $\mathbf{F}_{\tilde{I}}^{ca}$ of image are calculated as:

$$\begin{split}&\mathbf{F}_{\tilde{P}}^{ca}=\mathrm{softmax}\left(\frac{\mathbf{q}_{\tilde{P}}\mathbf{k}_{\tilde{I}}^{\top}}{\sqrt{C}}\right)\mathbf{v}_{\tilde{I}},\\ &\mathbf{F}_{\tilde{I}}^{ca}=\mathrm{softmax}\left(\frac{\mathbf{q}_{\tilde{I}}\mathbf{k}_{\tilde{P}}^{\top}}{\sqrt{C}}\right)\mathbf{v}_{\tilde{P}},\end{split} \tag{4}$$

where $\mathbf{q}_{\tilde{P}},\mathbf{k}_{\tilde{P}},\mathbf{v}_{\tilde{P}}$ are query, key and value vectors of point cloud feature $\mathbf{F}_{\tilde{P}}$ , and $\mathbf{q}_{\tilde{I}},\mathbf{k}_{\tilde{I}},\mathbf{v}_{\tilde{I}}$ are query, key and value vectors of image feature $\mathbf{F}_{\tilde{I}}$ . The softmax operation is the same as in the self-attention module.

Remark 1. While the self-attention module encodes the spatial and geometric features for each super-pixel and super-point, the cross-attention module injects the geometric structure information and texture information across image and point cloud respectively. Outputs of the I2P transformer carry powerful cross-modality information for matching.
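The two attention variants can be written as one scaled dot-product routine. The sketch below is only illustrative: it uses a row-vector convention (features of shape $N\times C$ multiplied by weights on the right), scales by the square root of the query dimension, and, for cross-attention, assumes that keys and values both come from the other modality, as in standard cross-attention. The names `softmax_rows` and `attention` are hypothetical; multi-head splitting, positional encoding and the FFN are omitted.

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(f_query, f_source, w_q, w_k, w_v):
    """Scaled dot-product attention.

    f_query:  (N, C) features that issue the queries.
    f_source: (M, C) features that provide keys and values.
    w_q, w_k, w_v: (C, C_out) projection matrices (random here, learned in practice).
    Self-attention:  attention(F, F, ...).
    Cross-attention: attention(F_points, F_pixels, ...) and vice versa.
    """
    q = f_query @ w_q
    k = f_source @ w_k
    v = f_source @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (N, M) attention logits
    return softmax_rows(scores) @ v          # (N, C_out)
```

Calling `attention(f_p, f_p, ...)` gives the self-attention output for the point cloud branch, while `attention(f_p, f_i, ...)` lets each super-point aggregate information from all super-pixels.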

III-B2 Super-point/-pixel Matching

The field of view (FoV) of a monocular camera is much narrower than that of a 3D LiDAR (e.g., Velodyne HDL-64E), which sweeps 360 degrees in the horizontal direction. Therefore, only a small fraction of super-points fall inside the camera frustum. To filter out beyond-frustum super-points, we add a simple binary classification head that predicts whether each super-point is inside or beyond the camera frustum. The coarse-level correspondences are then estimated between the in-frustum super-points set $\tilde{P}$ and the super-pixels set $\tilde{I}$ by finding, for each super-point, the nearest super-pixel $\tilde{i}$ in the feature space:

$$\tilde{\Omega}_{\mathcal{M}}=\{(\tilde{p}_{x}\in\tilde{P},\tilde{i}_{y}\in\tilde{I}) \mid y=\mathop{\mathrm{argmin}}\limits_{\lvert\tilde{I}\rvert}\|\mathbf{F}_{\tilde{P}}(\tilde{p}_{x})-\mathbf{F}_{\tilde{I}}(\tilde{i}_{y})\|\}, \tag{5}$$

where $\mathbf{F}_{\tilde{P}}$ and $\mathbf{F}_{\tilde{I}}$ are corresponding feature maps.
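As an illustration of Eq. (5), the following sketch keeps the super-points that the classification head marks as in-frustum and assigns each one its nearest super-pixel in feature space. The function name `coarse_match` and the score threshold of 0.5 are assumptions made for this example.

```python
import numpy as np

def coarse_match(feat_sp, feat_spx, in_frustum_score, threshold=0.5):
    """Nearest-neighbour super-point -> super-pixel matching in feature space.

    feat_sp:  (N, C) super-point features.
    feat_spx: (M, C) super-pixel features, M >= 1.
    in_frustum_score: (N,) predicted in-frustum probabilities.
    Returns a (K, 2) integer array of (super-point index, super-pixel index).
    """
    keep = np.flatnonzero(in_frustum_score > threshold)  # in-frustum super-points
    d = np.linalg.norm(feat_sp[keep, None, :] - feat_spx[None, :, :], axis=-1)
    nearest = d.argmin(axis=1)                           # argmin over super-pixels
    return np.stack([keep, nearest], axis=1)
```

Each in-frustum super-point receives exactly one super-pixel, while a super-pixel may be selected by several super-points.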

III-C I2P Fine Matching

The first-stage matching at the coarse level constructs robust super-point/super-pixel pairs $\tilde{\Omega}_{\mathcal{M}}$, but these coarse pairs alone yield poor registration accuracy. To obtain high-quality I2P correspondences, we generate fine correspondences based on the coarse matching results. In each super-point/super-pixel correspondence $(\tilde{p}_{x},\tilde{i}_{y})$, the super-point $\tilde{p}_{x}$ is expanded to a local points group $G_{\tilde{p}_{x}}$ and the super-pixel $\tilde{i}_{y}$ is expanded to a local pixels patch $G_{\tilde{i}_{y}}$. Considering the uneven geometric distribution of the point cloud and computational efficiency, only the node points in local patch groups are selected to establish correspondences. For each node point $p_{n}$, we select the pixel $i_{k}\in G_{\tilde{i}_{y}}$ that lies nearest in the feature space. The point-pixel pairs from all super-point-to-super-pixel pairs are stacked together as the fine corresponding pairs $\Omega_{\mathcal{M}}$. With the point feature map $\mathbf{F}_{G_{\tilde{p}}}$ and the corresponding pixel feature map $\mathbf{F}_{G_{\tilde{i}}}$ of a local patch, the fine matching process is defined as:

$$\Omega_{\mathcal{M}}=\big\{(p_{n}\in G_{\tilde{p}},i_{k}\in G_{\tilde{i}}) \mid k=\mathop{\mathrm{argmin}}\limits_{\lvert G_{\tilde{i}}\rvert}\|\mathbf{F}_{G_{\tilde{p}}}(p_{n})-\mathbf{F}_{G_{\tilde{i}}}(i_{k})\|\big\}. \tag{6}$$
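A minimal sketch of the fine matching step in Eq. (6) is given below. It assumes that the local point groups and pixel patches are already available as index lists (flattened pixel indices for the patches) and that every point in a group is used as a node point; the name `fine_match` and this data layout are illustrative choices only.

```python
import numpy as np

def fine_match(coarse_pairs, point_groups, pixel_patches, feat_points, feat_pixels):
    """Match each point of a group to its nearest pixel inside the paired patch.

    coarse_pairs:  (K, 2) array of (super-point index x, super-pixel index y).
    point_groups:  point_groups[x] lists the point indices in G_p~x.
    pixel_patches: pixel_patches[y] lists the flat pixel indices in G_i~y.
    feat_points:   (N, C) fine-level point features.
    feat_pixels:   (H*W, C) fine-level pixel features.
    Returns an (L, 2) array of (point index, pixel index) pairs.
    """
    matches = []
    for x, y in coarse_pairs:
        p_idx = np.asarray(point_groups[x], dtype=int)
        i_idx = np.asarray(pixel_patches[y], dtype=int)
        if p_idx.size == 0 or i_idx.size == 0:
            continue  # nothing to match for this coarse pair
        d = np.linalg.norm(
            feat_points[p_idx][:, None, :] - feat_pixels[i_idx][None, :, :], axis=-1
        )
        nearest = i_idx[d.argmin(axis=1)]  # search is restricted to the patch
        matches.append(np.stack([p_idx, nearest], axis=1))
    if not matches:
        return np.empty((0, 2), dtype=int)
    return np.concatenate(matches, axis=0)
```

Restricting the search to the paired patch is what lets the coarse stage prune mismatches: a pixel outside the patch can never be selected, even if its descriptor happens to be closer.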

III-D EPnP-RANSAC based Pose Estimation

With the point-pixel pairs $\Omega_{\mathcal{M}}$, the relative transformation can be solved with the EPnP [29] algorithm. As noted in previous approaches, some mismatches may remain among the point-pixel pairs and decrease the registration accuracy. Therefore, CoFiI2P uses the EPnP-RANSAC [29, 20] algorithm to estimate the relative camera pose robustly.

III-E Loss Function

To learn the coarse level descriptors, fine level descriptors and in/beyond-frustum super-points classification simultaneously, we introduce a joint loss $\mathcal{L}$ consisting of coarse level descriptor loss $\mathcal{L}_{coarse}$ , fine level descriptor loss $\mathcal{L}_{fine}$ and classification loss $\mathcal{L}_{classify}$ . The classification loss encourages the network to correctly label each super-point, and two descriptor losses pull positive matching pairs closer together and push negative matching pairs farther apart in the feature space.

The cosine similarity $s(p_{x},i_{y})$ of point cloud feature vector $\mathbf{F}_{p_{x}}$ and image feature vector $\mathbf{F}_{i_{y}}$ is defined as:

$$s(p_{x},i_{y})=\frac{\langle\mathbf{F}_{p_{x}},\mathbf{F}_{i_{y}}\rangle}{\|\mathbf{F}_{p_{x}}\|\|\mathbf{F}_{i_{y}}\|}, \tag{7}$$

and the distance $d(p_{x},i_{y})$ is defined as:

$$d(p_{x},i_{y})=1-s(p_{x},i_{y}). \tag{8}$$

On the coarse level, the positive anchor $\tilde{i}_{pos}$ for each in-frustum super-point $\tilde{p}_{x}$ is sampled from the ground-truth pairs set $\tilde{\Omega}_{\mathcal{M}^{\star}}$ :

$$\tilde{\Omega}_{\mathcal{M}^{\star}}=\{(\tilde{p}_{x}\in\tilde{P},\tilde{i}_{pos}\in\tilde{I}) \mid \tilde{i}_{pos}=\Gamma(\mathbf{T}\tilde{p}_{x})\}, \tag{9}$$

where $\mathbf{T}$ is the ground-truth transform matrix from the point cloud coordinate system to the image frustum coordinate system, and $\Gamma$ represents the mapping function that converts points from the camera frustum to the image plane coordinate system. The negative anchor $\tilde{i}_{neg}$ is defined as the super-pixel with the smallest distance to the $\tilde{p}_{x}$ in the feature space:

$$\tilde{i}_{neg}=\mathop{\mathrm{argmin}}\limits_{\tilde{i}_{neg}\in\tilde{I}}\|d(\mathbf{F}_{\tilde{p}_{x}},\mathbf{F}_{\tilde{i}_{neg}})\|\quad \mathrm{s.t.}\quad\|\tilde{i}_{neg}-\tilde{i}_{pos}\|>r, \tag{10}$$

where $r$ is the safe radius to remove adversarial samples. Finally, with positive margin $\Delta_{pos}$ and negative margin $\Delta_{neg}$, the coarse-level descriptor loss is defined in a triplet way [28] as:

$$\begin{split}\mathcal{L}_{coarse}=\sum_{(\tilde{p}_{x},\tilde{i}_{pos},\tilde{i}_{neg})}\bigg(&\mathrm{max}\big(0,d(\tilde{p}_{x},\tilde{i}_{pos})-\Delta_{pos}\big)+\\ &\mathrm{max}\big(0,\Delta_{neg}-d(\tilde{p}_{x},\tilde{i}_{neg})\big)\bigg).\end{split} \tag{11}$$
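Eqs. (7)-(11) can be combined into a short NumPy sketch: compute cosine distances, pick for each super-point the hardest negative outside the safe radius, and accumulate the two hinge terms. The function names are hypothetical, super-pixel positions are assumed to be 2D grid coordinates, and the default values of $r$, $\Delta_{pos}$ and $\Delta_{neg}$ are the ones reported in the implementation details. The sketch evaluates the loss value only and assumes at least one super-pixel lies outside the safe radius of every positive.

```python
import numpy as np

def cosine_distance(f_a, f_b):
    """Pairwise d = 1 - cosine similarity (Eqs. 7-8); (N, C) x (M, C) -> (N, M)."""
    a = f_a / np.linalg.norm(f_a, axis=1, keepdims=True)
    b = f_b / np.linalg.norm(f_b, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def coarse_triplet_loss(feat_sp, feat_spx, spx_xy, pos_idx, r=1.0, d_pos=0.2, d_neg=1.8):
    """Triplet-style loss with hardest-negative mining (Eqs. 10-11).

    feat_sp:  (N, C) in-frustum super-point features.
    feat_spx: (M, C) super-pixel features.
    spx_xy:   (M, 2) super-pixel grid coordinates.
    pos_idx:  (N,) index of the ground-truth super-pixel of each super-point.
    Returns the summed loss and the mined negative indices.
    """
    pos_idx = np.asarray(pos_idx)
    d = cosine_distance(feat_sp, feat_spx)
    rows = np.arange(len(pos_idx))
    # Spatial distance of every super-pixel to each positive anchor, shape (N, M).
    spatial = np.linalg.norm(spx_xy[None, :, :] - spx_xy[pos_idx][:, None, :], axis=-1)
    # Candidates inside the safe radius are excluded from negative mining.
    neg_idx = np.where(spatial > r, d, np.inf).argmin(axis=1)
    loss_pos = np.maximum(0.0, d[rows, pos_idx] - d_pos)
    loss_neg = np.maximum(0.0, d_neg - d[rows, neg_idx])
    return float((loss_pos + loss_neg).sum()), neg_idx
```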

The fine-level descriptor loss is defined as a modified circle loss [36]. We randomly select $n$ in-frustum points together with their positive anchor pixels, and choose the negative anchors as in Eq. (10); the descriptor loss is then defined as:

$$\begin{split}\mathcal{L}_{fine}=\log\bigg(&1+\exp\big(-\gamma\alpha_{pos}(s(p_{x},i_{pos})-\Delta_{pos})\big)\cdot\\ &\sum_{(p_{x},i_{pos},i_{neg})}\exp\big(\gamma\alpha_{neg}(s(p_{x},i_{neg})-\Delta_{neg})\big)\bigg),\end{split} \tag{12}$$

where $\alpha_{neg}$ and $\alpha_{pos}$ are the dynamic optimizing rates towards negative and positive pairs, and $\gamma$ is the scale factor. As in [36], the $\alpha_{neg}$ and $\alpha_{pos}$ are defined as:

$$\begin{split}&\alpha_{neg}=\mathrm{max}\big(0,s(p_{x},i_{neg})+\Delta_{neg}\big),\\ &\alpha_{pos}=\mathrm{max}\big(0,1+\Delta_{pos}-s(p_{x},i_{pos})\big).\end{split} \tag{13}$$

Super-points classification is defined as a binary cross-entropy loss:

$$\mathcal{L}_{classify}=-\sum_{\tilde{p}_{x}\in\tilde{P}}\big(\hat{s}_{\tilde{p}_{x}}\log(s_{\tilde{p}_{x}})+(1-\hat{s}_{\tilde{p}_{x}})\log(1-s_{\tilde{p}_{x}})\big), \tag{14}$$

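where $s_{\tilde{p}_{x}}$ denotes the predicted in-frustum score of super-point $\tilde{p}_{x}$ (not to be confused with the cosine similarity $s(\cdot,\cdot)$ above) and $\hat{s}_{\tilde{p}_{x}}$ denotes its ground-truth label. Overall, our loss function is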

$$\mathcal{L}=\lambda_{1}\mathcal{L}_{coarse}+\lambda_{2}\mathcal{L}_{fine}+\lambda_{3}\mathcal{L}_{classify}, \tag{15}$$

where $\lambda_{1},\lambda_{2},\lambda_{3}$ are hyperparameters that control the weights between losses.
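The classification term and the weighted sum can be sketched as follows. The names `bce_loss` and `joint_loss` are hypothetical, and clipping the scores with a small `eps` is an added numerical safeguard rather than part of the formulation; the default weights correspond to $\lambda_{1}=\lambda_{2}=\lambda_{3}=1$ used in the experiments.

```python
import numpy as np

def bce_loss(scores, labels, eps=1e-7):
    """Binary cross-entropy summed over super-points (Eq. 14).

    scores: (N,) predicted in-frustum probabilities in [0, 1].
    labels: (N,) ground-truth labels in {0, 1}.
    """
    s = np.clip(scores, eps, 1.0 - eps)  # assumption: avoid log(0)
    return float(-(labels * np.log(s) + (1.0 - labels) * np.log(1.0 - s)).sum())

def joint_loss(l_coarse, l_fine, l_classify, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum of the three loss terms (Eq. 15)."""
    l1, l2, l3 = lambdas
    return l1 * l_coarse + l2 * l_fine + l3 * l_classify
```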

IV Experiments

IV-A Experiment Setup

IV-A1 Dataset

We evaluate our method on two public datasets, KITTI Odometry [37] and Nuscenes [38].

• KITTI Odometry [37]: It consists of 11 sequences of images and point clouds collected in urban environments, with ground-truth calibration files. The camera intrinsic function $\Gamma$ extracted from the calibration files is assumed to be unbiased in our experiments. Sequences 0-8 are used for training and sequences 9-10 for testing, as in previous approaches [8, 9]. Images are resized to 160 $\times$ 512, and each point cloud is randomly downsampled to 20480 points.
• Nuscenes [38]: The image and point cloud pairs are generated by the official SDK, where the point cloud is accumulated from nearby frames and the image is taken from the current frame. We follow the official data split and use 850 scenes for training and 150 scenes for testing. Similarly, images are downsampled to 160 $\times$ 320 and point clouds to 20480 points.

IV-A2 Baseline Methods

We compare the proposed CoFiI2P with two open-sourced I2P methods as follows.

• DeepI2P [8]: It uses frustum classification and inverse camera projection to estimate the camera pose. We use the officially released code for reimplementation and use both the 2D and 3D inverse camera projection for optimization, denoted as DeepI2P (2D) and DeepI2P (3D).
• CorrI2P [9]: It is the SOTA I2P method. CorrI2P predicts the overlapping area and establishes dense correspondences for pose estimation. We use the officially released code for reimplementation.

IV-A3 Evaluation Metrics

We calculate the relative rotation error (RRE), relative translation error (RTE) and registration recall (RR) to evaluate the registration accuracy. The inlier ratio (IR), used in previous approaches [22, 39], is introduced to evaluate the quality of correspondences. For efficiency analysis, we report the frames per second (FPS) on the KITTI Odometry dataset, evaluated with a batch size of 1 on an Intel i9-13900K CPU and an NVIDIA RTX 4090 GPU. RRE and RTE are defined as:

$$\mathrm{RRE}=\sum_{i=1}^{3}\lvert\mathbf{r}(i)\rvert,\quad\mathrm{RTE}=\|\mathbf{t}_{gt}-\mathbf{t}_{e}\|, \tag{16}$$

where $\mathbf{r}$ is the Euler angle vector of $\mathbf{R}_{gt}^{-1}\mathbf{R}_{e}$, $\mathbf{R}_{gt}$ and $\mathbf{t}_{gt}$ are the ground-truth rotation matrix and translation vector, and $\mathbf{R}_{e}$ and $\mathbf{t}_{e}$ are the estimated rotation matrix and translation vector.
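The two error metrics can be computed with a few lines of NumPy. The Euler-angle convention is not specified in the text, so this sketch assumes $\mathbf{R}=\mathbf{R}_{z}(\mathrm{yaw})\mathbf{R}_{y}(\mathrm{pitch})\mathbf{R}_{x}(\mathrm{roll})$ and reports angles in degrees; the function names are hypothetical.

```python
import numpy as np

def euler_zyx_deg(R):
    """Roll, pitch, yaw in degrees, assuming R = Rz(yaw) @ Ry(pitch) @ Rx(roll)."""
    pitch = -np.arcsin(np.clip(R[2, 0], -1.0, 1.0))
    roll = np.arctan2(R[2, 1], R[2, 2])
    yaw = np.arctan2(R[1, 0], R[0, 0])
    return np.degrees([roll, pitch, yaw])

def registration_errors(R_gt, t_gt, R_e, t_e):
    """RRE (degrees) and RTE (same unit as t) following Eq. (16)."""
    r = euler_zyx_deg(R_gt.T @ R_e)  # the inverse of a rotation is its transpose
    rre = float(np.abs(r).sum())
    rte = float(np.linalg.norm(t_gt - t_e))
    return rre, rte
```

Because RRE sums the absolute Euler angles, it is generally not equal to the geodesic rotation angle when the residual rotation involves more than one axis.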

The RR metric measures the percentage of image-point cloud pairs that are registered successfully, indicating the descriptor learning ability of the network. For a single pair, RR is defined as:

$$\mathrm{RR}=\mathds{1}(\mathrm{RRE}<\delta_{r}\quad\textit{and}\quad\mathrm{RTE}<\delta_{t}), \tag{17}$$

where $(\delta_{r},\delta_{t})$ is the pair of thresholds used to reject false registration results, e.g., $(10^{\circ},5m)$. IR denotes the inlier ratio of matching pairs, which measures the accuracy of correspondences. The IR for the point/pixel correspondence set $\Omega_{\mathcal{M}}$ is defined as:

$$\mathrm{IR}_{\mathcal{M}}=\frac{1}{\lvert\Omega_{\mathcal{M}}\rvert}\sum_{(p_{x},i_{y})\in\Omega_{\mathcal{M}}}\mathds{1}(\|\Gamma(\mathbf{T}p_{x})-i_{y}\|<\tau), \tag{18}$$

in which $\tau$ is used to control the reprojection error tolerance.
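The projection $\Gamma(\mathbf{T}p)$, the inlier ratio and the registration recall can be sketched as below. The sketch models $\Gamma$ as a pinhole projection with a $3\times 3$ intrinsic matrix `K` and $\mathbf{T}$ as a $4\times 4$ homogeneous matrix, which are assumptions about the representation; it also assumes all matched points lie in front of the camera. Function names are hypothetical, and the default thresholds are the $(10^{\circ},5m)$ pair used in the experiments.

```python
import numpy as np

def project(points, T, K):
    """Gamma(T p): move points (N, 3) into the camera frame and project to pixels."""
    cam = points @ T[:3, :3].T + T[:3, 3]
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def inlier_ratio(points, pixels, T, K, tau):
    """Fraction of matches whose reprojection error is below tau pixels (Eq. 18)."""
    err = np.linalg.norm(project(points, T, K) - pixels, axis=1)
    return float((err < tau).mean())

def registration_recall(rre, rte, delta_r=10.0, delta_t=5.0):
    """Percentage of test pairs with RRE < delta_r and RTE < delta_t (Eq. 17)."""
    ok = (np.asarray(rre) < delta_r) & (np.asarray(rte) < delta_t)
    return 100.0 * float(ok.mean())
```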

IV-A4 Implementation Details

With the correct transform parameters provided by the calibration files during training, the ground-truth correspondences $\Omega_{\mathcal{M^{\star}}}$ are established to supervise the network. We train the whole network for 25 epochs on the KITTI Odometry dataset and for 10 epochs on the Nuscenes dataset. We use the Adam optimizer [40]; the initial learning rate is 0.001 and is multiplied by 0.25 after every 5 epochs. For our joint loss, we set $\lambda_{1}=\lambda_{2}=\lambda_{3}=1$. The safe radius $r$, positive margin $\Delta_{pos}$ and negative margin $\Delta_{neg}$ in the loss function are set to 1, 0.2 and 1.8, respectively. The scale factor $\gamma$ is set to 10. Model configurations and more implementation details are available in our open-source code.
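The step-decay schedule described above amounts to a one-line formula. The helper below is a sketch with a hypothetical name, assuming zero-based epoch indices so that the first decay takes effect at epoch 5.

```python
def learning_rate(epoch, base_lr=1e-3, decay=0.25, step=5):
    """Step decay: the rate is multiplied by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)
```

Under this assumption, epochs 0-4 use 0.001, epochs 5-9 use 0.00025, and so on.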

TABLE I: Registration accuracy on the KITTI Odometry [37] and Nuscenes [38] datasets. The results are presented in the format of "mean $\pm$ standard deviation" among the test samples. $\uparrow$ means higher is better and $\downarrow$ means lower is better. The best results are highlighted in bold. Results of DeepI2P on the KITTI Odometry dataset are taken from [8] with identical settings.

Method	Threshold ($^{\circ}/m$)	KITTI Odometry RRE ($^{\circ}$) $\downarrow$	KITTI Odometry RTE (m) $\downarrow$	KITTI Odometry RR (%) $\uparrow$	Nuscenes RRE ($^{\circ}$) $\downarrow$	Nuscenes RTE (m) $\downarrow$	Nuscenes RR (%) $\uparrow$	FPS $\uparrow$
DeepI2P (2D) [8]	10/5	$7.56\pm 7.63$	$3.28\pm 3.09$	-	$2.81\pm 2.23$	$1.48\pm 0.88$	95.13	1.41
DeepI2P (3D) [8]	10/5	$15.52\pm 12.73$	$3.17\pm 3.22$	-	$6.60\pm 19.33$	$1.31\pm 1.57$	38.64	0.71
CorrI2P [9]	none/none	$6.93\pm 29.03$	$2.68\pm 10.51$	-	$8.47\pm 24.02$	$5.30\pm 8.10$	-	7.95
	45/10	$3.41\pm 3.64$	$1.48\pm 1.35$	97.07	$5.42\pm 4.30$	$3.78\pm 2.24$	91.07
	10/5	$2.70\pm 1.97$	$1.24\pm 0.87$	90.66	$3.90\pm 2.23$	$2.61\pm 1.23$	61.67
CoFiI2P	none/none	$\boldsymbol{1.14}\pm\boldsymbol{0.78}$	$\boldsymbol{0.29}\pm\boldsymbol{0.19}$	-	$\boldsymbol{2.04}\pm\boldsymbol{5.54}$	$\boldsymbol{0.95}\pm\boldsymbol{5.04}$	-	15.43
	45/10	$\boldsymbol{1.14}\pm\boldsymbol{0.78}$	$\boldsymbol{0.29}\pm\boldsymbol{0.19}$	100.00	$\boldsymbol{1.88}\pm\boldsymbol{1.77}$	$\boldsymbol{0.81}\pm\boldsymbol{0.59}$	99.84
	10/5	$\boldsymbol{1.14}\pm\boldsymbol{0.78}$	$\boldsymbol{0.29}\pm\boldsymbol{0.19}$	100.00	$\boldsymbol{1.79}\pm\boldsymbol{1.22}$	$\boldsymbol{0.79}\pm\boldsymbol{0.53}$	99.18

IV-B Registration Accuracy

We report the RRE and RTE as evaluation metrics under three different settings in Table I, where $none/none$ means that no thresholds are applied to filter out false registration frames, and $10^{\circ}/5m$ and $45^{\circ}/10m$ mean that frames with registration errors exceeding the specified thresholds are ignored during evaluation. As shown in Table I, the proposed method outperforms all baseline methods on the RRE and RTE registration metrics. Notably, our method achieves 100% RR and 99.18% RR under the hardest pair of thresholds $10^{\circ}/5m$ on the KITTI Odometry and Nuscenes datasets respectively, which indicates that our method constructs robust correspondences and achieves accurate registration in most evaluation scenes. Besides, we project the points to the image plane with the transform parameters provided by the ground-truth files, CorrI2P [9] and CoFiI2P respectively, and display the qualitative registration results in Fig. 3. It shows that CoFiI2P remains stable and provides more accurate results, which confirms our claim. Although our method is slightly slower during the model inference step, the pose estimation is extremely fast due to the small quantity and high quality of matching pairs, thereby still maintaining online speed.

For the super-point frustum classification, we introduce the IR metric to evaluate the quality of the correspondences, as in the P2P registration approach [22]. The IR curves in Fig. 5 indicate that our method achieves a higher proportion of correct matches among the established correspondences, which leads to better registration results. Fig. 5 intuitively illustrates that CoFiI2P provides a much cleaner correspondence set than the baseline method.
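As a rough illustration of how such a correspondence-quality score can be computed, the sketch below assumes that IR denotes the inlier ratio, i.e. the fraction of point-pixel pairs whose 3D point, projected with the ground-truth pose, lands within a pixel threshold of its matched pixel. The threshold `tau_px`, the zero-skew pinhole model, and the function names are assumptions made for this example only.

```python
import math


def project(point, R, t, K):
    """Project a 3D point into the image with a zero-skew pinhole camera.

    R is a 3x3 rotation (nested lists), t a length-3 translation, and K the
    3x3 intrinsic matrix. Returns None for points behind the camera.
    """
    x, y, z = (sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))
    if z <= 0.0:
        return None
    u = K[0][0] * x / z + K[0][2]
    v = K[1][1] * y / z + K[1][2]
    return (u, v)


def inlier_ratio(pairs, R_gt, t_gt, K, tau_px):
    """Fraction of (point3d, pixel) pairs with reprojection error <= tau_px."""
    if not pairs:
        return 0.0
    inliers = 0
    for point, pixel in pairs:
        uv = project(point, R_gt, t_gt, K)
        if uv is not None and math.dist(uv, pixel) <= tau_px:
            inliers += 1
    return inliers / len(pairs)
```

Sweeping `tau_px` over a range of values and plotting the resulting ratios yields a curve of the kind used to compare correspondence sets.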

IV-C Ablation Studies and Analysis

In this part, we analyze three crucial factors in our CoFiI2P: I2P transformer module, coarse-to-fine matching scheme, and point cloud density. We conduct ablation studies on the KITTI Odometry [37] dataset to prove the effectiveness of each module and review the influence of point cloud density. We train the ablation models for 25 epochs as in the experimental part, and all other settings remain the same. We report global RRE and RTE as evaluation metrics and no thresholds are used to reject false registration scenes.

IV-C1 Analysis of the I2P Transformer

The I2P transformer [41], with its self-attention and cross-attention modules, is crucial to image-to-point cloud alignment at the global level. In this part, we conduct ablation studies to assess its effectiveness. We train CoFiI2P without any attention module as the baseline. Then, the self-attention modules and the cross-attention modules are added separately at the coarse level. As shown in Table II, the performance of the baseline drops significantly without any attention module. The self-attention module alone reduces the RRE to $1.78^{\circ}$ and the RTE to $0.41m$, and the cross-attention module alone reduces the RRE to $1.74^{\circ}$ and the RTE to $0.43m$. With both the self-attention and cross-attention modules, CoFiI2P achieves the smallest registration error. Moreover, with both modules, the variance is reduced by a large margin, which indicates that the I2P transformer block enhances both accuracy and robustness.
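To make the distinction between the two ablated modules concrete, the following is a minimal sketch of scaled dot-product attention applied within one modality (self-attention) and across modalities (cross-attention). It omits the learned query/key/value projections, multi-head structure, and positional encodings of a real transformer block, so it should be read as a generic illustration rather than the I2P transformer itself; all function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on lists of feature vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output token is a weighted average of the value vectors.
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out


def self_attention(tokens):
    """Tokens attend to tokens of the same modality (homogeneous context)."""
    return attention(tokens, tokens, tokens)


def cross_attention(tokens_a, tokens_b):
    """Tokens of modality A attend to tokens of modality B (heterogeneous context)."""
    return attention(tokens_a, tokens_b, tokens_b)
```

In this toy form, `self_attention(image_tokens)` mixes information among super-pixel features only, whereas `cross_attention(image_tokens, point_tokens)` lets each super-pixel feature aggregate super-point features, which is the kind of cross-modal exchange the ablation isolates.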

IV-C2 Analysis of the Coarse-to-fine Matching Scheme

CoFiI2P first estimates coarse correspondences at the super-point/super-pixel level and then generates fine correspondences at the point/pixel level. We conduct ablation experiments on the coarse-to-fine matching scheme to demonstrate that the progressive two-stage registration performs better than the one-stage registration used in previous I2P approaches. This ablation study employs the backbone with full I2P transformer blocks as the baseline and evaluates the registration accuracy with only the coarse or only the fine matching scheme. For coarse matching only, matching pairs are established at the coarse level and remapped to the fine resolution for pose estimation. By contrast, for fine matching only, matching pairs are established at the fine level directly, without the guidance of coarse-level correspondences. The experimental results in Table III show that removing either the coarse matching stage or the fine matching stage leads to higher registration error and variance. This indicates that coarse-level registration provides robust correspondences, while fine-level registration provides accurate matching pairs. Combining the coarse-level and fine-level registration sequentially makes it easier to reach the globally optimal solution in I2P registration.
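The two-stage idea can be sketched with a generic nearest-neighbour matcher on toy descriptors. This is not the CoFiI2P network: the learned features, the frustum classification, and the actual matching rules are not reproduced, and the data layout (dictionaries with `desc`, `points`, and `pixels` keys) is an assumption made for the example.

```python
def dot(a, b):
    """Dot-product similarity between two descriptors."""
    return sum(x * y for x, y in zip(a, b))


def best_match(query, candidates):
    """Index of the candidate descriptor most similar to the query."""
    return max(range(len(candidates)), key=lambda i: dot(query, candidates[i]))


def coarse_to_fine(super_points, super_pixels):
    """Two-stage matching on toy data.

    super_points: list of {"desc": [...], "points": [(point_id, desc), ...]}
    super_pixels: list of {"desc": [...], "pixels": [(pixel_id, desc), ...]}
    Returns a list of (point_id, pixel_id) fine correspondences.
    """
    matches = []
    coarse_descs = [sp["desc"] for sp in super_pixels]
    for sp in super_points:
        # Coarse stage: pick the most similar super-pixel for this super-point.
        j = best_match(sp["desc"], coarse_descs)
        pixels = super_pixels[j]["pixels"]
        pixel_descs = [d for _, d in pixels]
        # Fine stage: search only among pixels of the matched super-pixel.
        for point_id, point_desc in sp["points"]:
            k = best_match(point_desc, pixel_descs)
            matches.append((point_id, pixels[k][0]))
    return matches
```

Restricting the fine search to the matched patch is what distinguishes this scheme from a one-stage matcher, which would compare every point against every pixel and is therefore more exposed to ambiguous matches from repeated local structures.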

IV-C3 Analysis of the Point Cloud Density

Given the significant impact of point cloud density on representation learning, we conduct ablation studies to examine this influence. Table IV presents the registration accuracy and computational complexity for various point cloud densities. The results show that registration accuracy deteriorates as the point cloud density decreases, while higher point cloud densities lead to a sharp increase in computational complexity. Intuitively, low-density point clouds lose local structural information, whereas high-density point clouds carry a heavy computational burden. As a compromise, we select 20480 points, which strikes a balance between efficiency and accuracy.
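The trade-off can be quantified with a short calculation on the mean values reported in Table IV. The snippet below only restates the table entries and computes relative changes between consecutive densities; the helper name `relative_changes` is hypothetical.

```python
# (number of points, mean RRE in degrees, mean RTE in meters, FLOPs in G),
# copied from Table IV.
TABLE_IV = [
    (5120, 2.75, 0.69, 37.42),
    (10240, 1.69, 0.39, 56.00),
    (20480, 1.14, 0.29, 93.80),
    (40960, 1.00, 0.26, 171.91),
]


def relative_changes(rows):
    """Percentage change of each metric when moving to the next density."""
    changes = []
    for prev, curr in zip(rows, rows[1:]):
        changes.append({
            "from": prev[0],
            "to": curr[0],
            "rre_pct": 100.0 * (curr[1] - prev[1]) / prev[1],
            "rte_pct": 100.0 * (curr[2] - prev[2]) / prev[2],
            "flops_pct": 100.0 * (curr[3] - prev[3]) / prev[3],
        })
    return changes


if __name__ == "__main__":
    for c in relative_changes(TABLE_IV):
        print(f'{c["from"]} -> {c["to"]}: RRE {c["rre_pct"]:+.1f}%, '
              f'RTE {c["rte_pct"]:+.1f}%, FLOPs {c["flops_pct"]:+.1f}%')
```

Going from 20480 to 40960 points lowers the mean RRE by roughly 12% while raising FLOPs by roughly 83%, whereas going from 10240 to 20480 points lowers the mean RRE by roughly a third for a FLOPs increase of about two thirds, which is consistent with choosing 20480 points as the operating point.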

TABLE II: Ablation study on the I2P transformer blocks. SA denotes the self-attention module and CA denotes the cross-attention module.

Baseline	SA	CA	RRE ($^{\circ}$)	RTE (m)
✓			$2.65\pm 4.79$	$0.87\pm 2.23$
✓	✓		$1.78\pm 2.28$	$0.41\pm 0.79$
✓		✓	$1.74\pm 1.50$	$0.43\pm 0.35$
✓	✓	✓	$\mathbf{1.14\pm 0.78}$	$\mathbf{0.29\pm 0.19}$

TABLE III: Ablation study on the coarse-to-fine matching scheme. CM denotes the coarse matching and FM denotes the fine matching.

Baseline	CM	FM	RRE ($^{\circ}$)	RTE (m)
✓	✓		$1.35\pm 1.17$	$0.34\pm 0.26$
✓		✓	$1.47\pm 1.75$	$0.41\pm 0.63$
✓	✓	✓	$\mathbf{1.14\pm 0.78}$	$\mathbf{0.29\pm 0.19}$

TABLE IV: Ablation study on the point cloud density.

#Points	RRE ($^{\circ}$)	RTE (m)	FLOPs
5120	$2.75\pm 4.28$	$0.69\pm 1.60$	37.42G
10240	$1.69\pm 1.43$	$0.39\pm 0.30$	56.00G
20480	$1.14\pm 0.78$	$0.29\pm 0.19$	93.80G
40960	$1.00\pm 0.70$	$0.26\pm 0.17$	171.91G

V Conclusion

This letter introduces CoFiI2P, a novel network designed for image-to-point cloud (I2P) registration. The proposed coarse-to-fine matching strategy first establishes robust global correspondences and then progressively refines them into precise local correspondences. Furthermore, the I2P transformer with self- and cross-attention modules is introduced to enhance global awareness across homogeneous and heterogeneous data. Compared with existing one-stage dense prediction and matching approaches, CoFiI2P filters out a large number of false correspondences. Extensive experiments on the KITTI Odometry [37] and Nuscenes [38] datasets demonstrate the superior accuracy, efficiency, and robustness of CoFiI2P in various environments. We hope the open-sourced CoFiI2P will benefit the relevant communities. In the near future, we will extend CoFiI2P to unsupervised I2P registration.

References

[1] L. Wang, X. Zhang, W. Qin, X. Li, J. Gao, L. Yang, Z. Li, J. Li, L. Zhu, H. Wang et al., “Camo-mot: Combined appearance-motion optimization for 3d multi-object tracking with camera-lidar fusion,” IEEE Trans. on Intelligent Transportation Systems (T-ITS), 2023.
[2] J. Li, S. Yuan, M. Cao, T.-M. Nguyen, K. Cao, and L. Xie, “Hcto: Optimality-aware lidar inertial odometry with hybrid continuous time optimization for compact wearable mapping system,” ISPRS Journal of Photogrammetry and Remote Sensing (ISPRS), vol. 211, pp. 228–243, 2024.
[3] J. Li, B. Yang, C. Chen, R. Huang, Z. Dong, and W. Xiao, “Automatic registration of panoramic image sequence and mobile laser scanning data using semantic features,” ISPRS Journal of Photogrammetry and Remote Sensing (ISPRS), vol. 136, pp. 41–57, 2018.
[4] X. Yan, J. Gao, C. Zheng, C. Zheng, R. Zhang, S. Cui, and Z. Li, “2dpass: 2d priors assisted semantic segmentation on lidar point clouds,” in Proc. of the Europ. Conf. on Computer Vision (ECCV). Springer, 2022, pp. 677–695.
[5] X. Zhang, S. Zhu, S. Guo, J. Li, and H. Liu, “Line-based automatic extrinsic calibration of lidar and camera,” in Proc. of the IEEE Intl. Conf. on Robotics & Automation (ICRA), 2021, pp. 9347–9353.
[6] C. Yuan, X. Liu, X. Hong, and F. Zhang, “Pixel-level extrinsic self calibration of high resolution lidar and camera in targetless environments,” IEEE Robotics and Automation Letters (RA-L), vol. 6, no. 4, pp. 7517–7524, 2021.
[7] M. Feng, S. Hu, M. H. Ang, and G. H. Lee, “2d3d-matchnet: Learning to match keypoints across 2d image and 3d point cloud,” in Proc. of the IEEE Intl. Conf. on Robotics & Automation (ICRA), 2019, pp. 4790–4796.
[8] J. Li and G. H. Lee, “Deepi2p: Image-to-point cloud registration via deep classification,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 15 960–15 969.
[9] S. Ren, Y. Zeng, J. Hou, and X. Chen, “Corri2p: Deep image-to-point cloud registration via dense correspondence,” IEEE Trans. on Circuits and Systems for Video Technology (TCSVT), vol. 33, no. 3, pp. 1198–1208, 2022.
[10] J. Sun, Z. Shen, Y. Wang, H. Bao, and X. Zhou, “Loftr: Detector-free local feature matching with transformers,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 8922–8931.
[11] W. Jiang, E. Trulls, J. Hosang, A. Tagliasacchi, and K. M. Yi, “Cotr: Correspondence transformer for matching across images,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 6207–6217.
[12] Z. Qin, H. Yu, C. Wang, Y. Guo, Y. Peng, and K. Xu, “Geometric transformer for fast and robust point cloud registration,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 11 143–11 152.
[13] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Intl. Journal of Computer Vision (IJCV), vol. 60, pp. 91–110, 2004.
[14] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in Proc. of the IEEE/CVF Intl. Conf. on Computer Vision (ICCV), 2011, pp. 2564–2571.
[15] D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superpoint: Self-supervised interest point detection and description,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018, pp. 224–236.
[16] P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superglue: Learning feature matching with graph neural networks,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 4938–4947.
[17] Q. Zhou, T. Sattler, and L. Leal-Taixe, “Patch2pix: Epipolar-guided pixel-level correspondences,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 4669–4678.
[18] H. Deng, T. Birdal, and S. Ilic, “Ppf-foldnet: Unsupervised learning of rotation invariant 3d local descriptors,” in Proc. of the Europ. Conf. on Computer Vision (ECCV), 2018, pp. 602–618.
[19] Z. Gojcic, C. Zhou, J. D. Wegner, and A. Wieser, “The perfect match: 3d point cloud matching with smoothed densities,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5545–5554.
[20] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
[21] H. Wang, Y. Liu, Q. Hu, B. Wang, J. Chen, Z. Dong, Y. Guo, W. Wang, and B. Yang, “Roreg: Pairwise point cloud registration with oriented descriptors and local rotations,” IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 2023.
[22] H. Yu, F. Li, M. Saleh, B. Busam, and S. Ilic, “Cofinet: Reliable coarse-to-fine correspondences for robust pointcloud registration,” Proc. of the Advances in Neural Information Processing Systems (NIPS), vol. 34, pp. 23 872–23 884, 2021.
[23] J. Levinson and S. Thrun, “Automatic online calibration of cameras and lasers,” in Proc. of Robotics: Science and Systems (RSS), 2013.
[24] Y. Liao, J. Li, S. Kang, Q. Li, G. Zhu, S. Yuan, Z. Dong, and B. Yang, “Se-calib: Semantic edges based lidar-camera boresight online calibration in urban scenes,” IEEE Trans. on Geoscience and Remote Sensing (TGRS), 2023.
[25] Q. Zhang and R. Pless, “Extrinsic calibration of a camera and laser range finder (improves camera calibration),” in Proc. of the IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), vol. 3, 2004, pp. 2301–2306.
[26] D. Scaramuzza, A. Harati, and R. Siegwart, “Extrinsic self calibration of a camera and a 3d laser range finder from natural scenes,” in Proc. of the IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2007, pp. 4164–4169.
[27] Y. Zhong, “Intrinsic shape signatures: A shape descriptor for 3d object recognition,” in Proc. of the Int. Conf. on Computer Vision Workshops (ICCVW), 2009, pp. 689–696.
[28] B. Lai, W. Liu, C. Wang, X. Bian, Y. Su, X. Lin, Z. Yuan, S. Shen, and M. Cheng, “Learning cross-domain descriptors for 2d-3d matching with hard triplet loss and spatial transformer network,” in Image and Graphics: 11th International Conference, ICIG 2021, Haikou, China, August 6–8, 2021, Proceedings, Part III 11. Springer, 2021, pp. 15–27.
[29] V. Lepetit, F. Moreno-Noguer, and P. Fua, “Epnp: An accurate O(n) solution to the PnP problem,” Intl. Journal of Computer Vision (IJCV), vol. 81, pp. 155–166, 2009.
[30] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[31] H. Thomas, C. R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas, “Kpconv: Flexible and deformable convolution for point clouds,” in Proc. of the IEEE/CVF Intl. Conf. on Computer Vision (ICCV), 2019, pp. 6411–6420.
[32] W. Zhang, Z. Huang, G. Luo, T. Chen, X. Wang, W. Liu, G. Yu, and C. Shen, “Topformer: Token pyramid transformer for mobile semantic segmentation,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 12 083–12 093.
[33] Q. Wan, Z. Huang, J. Lu, G. Yu, and L. Zhang, “Seaformer: Squeeze-enhanced axial transformer for mobile semantic segmentation,” in Proc. of the Int. Conf. on Learning Representations (ICLR), 2023.
[34] H. Yu, Z. Qin, J. Hou, M. Saleh, D. Li, B. Busam, and S. Ilic, “Rotation-invariant transformer for point cloud matching,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 5384–5393.
[35] K. Wu, H. Peng, M. Chen, J. Fu, and H. Chao, “Rethinking and improving relative position encoding for vision transformer,” in Proc. of the IEEE/CVF Intl. Conf. on Computer Vision (ICCV), 2021, pp. 10 033–10 041.
[36] Y. Sun, C. Cheng, Y. Zhang, C. Zhang, L. Zheng, Z. Wang, and Y. Wei, “Circle loss: A unified perspective of pair similarity optimization,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 6398–6407.
[37] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 3354–3361.
[38] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11 621–11 631.
[39] B. Wang, C. Chen, Z. Cui, J. Qin, C. X. Lu, Z. Yu, P. Zhao, Z. Dong, F. Zhu, N. Trigoni et al., “P2-net: Joint description and detection of local features for pixel and point matching,” in Proc. of the IEEE/CVF Intl. Conf. on Computer Vision (ICCV), 2021, pp. 16 004–16 013.
[40] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. of the Int. Conf. on Learning Representations (ICLR), 2015.
[41] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in Proc. of the Int. Conf. on Learning Representations (ICLR), 2021.