MonoCD: Monocular 3D Object Detection with Complementary Depths

Longfei Yan1   Pei Yan1   Shengzhou Xiong1   Xuanyu Xiang1   Yihua Tan1
1Hubei Engineering Research Center of Machine Vision and Intelligent Systems,
School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, China
{longfeiyan, yanpei}@hust.edu.cn, [email protected], {xuanyuxiang, yhtan}@hust.edu.cn
Corresponding author.
Abstract

Monocular 3D object detection has attracted widespread attention due to its potential to accurately obtain object 3D localization from a single image at a low cost. Depth estimation is an essential but challenging subtask of monocular 3D object detection because the 2D-to-3D mapping is ill-posed. Many methods explore multiple local depth clues, such as object heights and keypoints, and then formulate object depth estimation as an ensemble of multiple depth predictions to mitigate the insufficiency of single-depth information. However, the errors of existing multiple depths tend to have the same sign, which hinders them from neutralizing each other and limits the overall accuracy of the combined depth. To alleviate this problem, we propose to increase the complementarity of depths with two novel designs. First, we add a new depth prediction branch named complementary depth that utilizes global and efficient depth clues from the entire image rather than local clues to reduce the similarity of depth predictions. Second, we propose to fully exploit the geometric relations between multiple depth clues to achieve complementarity in form. Benefiting from these designs, our method achieves higher complementarity. Experiments on the KITTI benchmark demonstrate that our method achieves state-of-the-art performance without introducing extra data. In addition, complementary depth can also serve as a lightweight, plug-and-play module to boost multiple existing monocular 3D object detectors. Code is available at https://github.com/elvintanhust/MonoCD.

1 Introduction

Figure 1: (a) Comparison of coupling (coup) and complementary (comp) multi-depth with two depth branches $Z_1$ and $Z_2$, where $Z^*$ and $Z_{soft}$ represent the ground truth of the depth and the final combined depth, respectively. (b) A demonstration of complementarity between the two depth branches with the help of geometric relations when considering only the inaccurate estimation of the object 3D height $H$. Both $Z_1$, generated by the widely used local height clue, and $Z_2$, generated by our newly introduced global clue $y_{glo}$, are related to $H$. $H^*$ and $\hat{H}$ denote the ground truth of $H$ and the underestimated $H$, respectively.

As a significant research topic in both academia and industry, 3D object detection can empower non-human intelligences to perceive the 3D world. Compared with LiDAR-based [14, 30, 31, 37] and stereo-based [16, 33, 26, 15] approaches, monocular 3D object detection has attracted widespread attention due to its lower cost and simpler configuration [18, 27]. However, its 3D localization accuracy is significantly lower than that of LiDAR-based and stereo-based approaches. To advance automation technologies such as autonomous driving and robotics, it is essential to enhance the 3D localization precision of monocular 3D object detection.

Recently, many works on monocular 3D object detection have recognized that the main factor limiting 3D localization precision is inaccurate depth estimation [28, 48, 25, 43, 18]. Following the mainstream CenterNet paradigm [45], they explore multiple local depth clues and formulate depth estimation as an ensemble of multiple depth predictions to mitigate the insufficiency of single-depth information. For instance, MonoFlex [43] explores local depth clues from direct estimation and object heights, and subsequently combines them into one depth by weighted averaging. MonoDDE [18] further derives clues from the object perspective point on top of that.

However, experiments on the KITTI dataset [9] show that 95% of the existing multi-depth prediction ensembles have the same error sign, i.e., the multiple predicted depths are usually distributed on the same side of the ground truth, as shown by the coupling case in Fig. 1(a). As a result, the depth errors cannot neutralize each other, which hinders the improvement of combined depth accuracy. We attribute this coupling phenomenon to the fact that the local depth clues they use are all derived from the same local features around the object in the CenterNet paradigm.

In this paper, we propose to increase the complementarity of depths to alleviate the problem. Complementarity here means that these predictions not only aim for high accuracy but also have different error signs. To this end, we propose two novel designs. First, considering the aforementioned coupling phenomenon, we add a new depth prediction branch that utilizes global and efficient depth clues from the entire image rather than local clues to reduce the similarity of depth predictions. It relies on the global information that all objects in one image approximately lie on the same plane. Second, to further improve complementarity, we propose to fully exploit the geometric relations between multiple depth clues to achieve complementarity in form, which utilizes the fact that errors in the same geometric quantity may have opposite effects on different branches. For example, in Fig. 1(b), $Z_1$ has a negative error because the related clue, the 3D height $H$, is underestimated, whereas in this case $Z_2$ has a positive error because the effect of $H$ on $Z_2$, combined with the new clue $y_{glo}$, is opposite to its effect on $Z_1$. Therefore, the geometric relation based on $H$ provides complementarity to $Z_1$ and $Z_2$ in form.

Incorporating all the designs, we propose a novel monocular 3D detector with complementary depths, named MonoCD, which compensates for the complementarity neglected in previous multi-depth predictions. The main contributions of this paper are summarized as follows:

  • We point out the coupling of existing monocular object depth predictions, which limits the accuracy of the combined depth. We therefore propose to improve the complementarity of the depths to alleviate this problem.

  • We propose to add a new depth prediction branch, named complementary depth, that utilizes global and efficient depth clues, and to fully exploit the geometric relations between multiple depth clues to achieve complementarity in form.

  • Evaluated on the KITTI benchmark, our method achieves state-of-the-art performance without introducing extra data. Moreover, complementary depth can be a lightweight plug-and-play module to boost multiple existing detectors.

2 Related work

2.1 Center-based Monocular 3D Detector

Many recent works [39, 7, 46, 23, 19, 44] extend the popular center-based paradigm CenterNet [45], an anchor-free method initially applied to 2D object detection. It makes the detection process simpler and more efficient because all attributes of a 3D bounding box are estimated at a single center point. SMOKE [21] inherits the center-based framework and proposes that the estimation of the 2D bounding box can be omitted. MonoDLE [24] finds that the estimation of the 2D bounding box contributes to the prediction of 3D attributes and demonstrates that depth error is the main reason limiting the accuracy of monocular 3D object detection. MonoCon [20] finds that adding auxiliary learning tasks around the center can improve the generalization performance. Although the center-based framework has many benefits, it makes the prediction of all 3D attributes highly correlated with the local center. It ignores the exploitation of global information, leading to the coupling of predicted 3D attributes.

2.2 Transformer-based Monocular 3D Detector

Benefiting from the non-local encoding of the attention mechanism [35] and its development in object detection [4], multiple Transformer-based monocular 3D detectors have recently been proposed to enhance the global perception capability. MonoDTR [10] proposes to perform depth position encoding to inject global depth information into the Transformer to guide the detection, which requires LiDAR for auxiliary supervision. In contrast, MonoDETR [42] uses foreground object labels to predict foreground depth maps to achieve depth guidance. To improve the inference efficiency, MonoATT [47] proposes an adaptive token Transformer that allows finer tokens to be assigned to more significant regions in images. Although the above methods perform well, the drawbacks of high computational complexity and slow inference of Transformer-based monocular 3D detectors are still apparent. Thus, a method that combines the capability of synthesizing global information with low latency in real-world autonomous driving scenarios is still lacking.

2.3 Estimation of Multi-Depth

In addition to directly estimating object depth using deep neural networks, many recent works have broadened the depth estimation branch by indirectly predicting geometric clues associated with depth. The works in [32, 23] utilize mathematical priors and uncertainty modeling to recover depth information through the ratio of 3D to 2D height. Based on them, MonoFlex [43] further extends the geometric depths to three sets using other supporting lines of the 3D bounding box and proposes to use uncertainties as weights to combine multiple depths into a final depth. MonoGround [28] introduces a local ground plane prior and enriches the depth supervision sources by randomly sampling dense points in the bottom plane of each object. MonoDDE [18] utilizes keypoint information to expand the number of depth prediction branches to 20, highlighting the importance of depth diversity. However, the complementarity between multiple depths is hardly explored. Errors in geometric clues (such as 2D/3D height) accumulate into the corresponding depth errors. Without effective complementarity, existing depth errors cannot be neutralized.

3 Approach

Figure 2: Overview of the approach. The input image is first processed by a feature extraction network and then passed to multiple prediction heads. The prediction heads are divided into two parts. The upper orange section is used to predict the global horizon heatmap of the image, serving as a global clue to generate the prediction of complementary depths ($z_{comp}$). The lower blue section, after predicting local information for each point of interest, further generates keypoint depths ($z_{key}$) and direct depth ($z_{dir}$). Finally, the three depth prediction branches are weighted and combined using simultaneously predicted uncertainties to obtain the final depth estimation.

3.1 Problem Definition

The task of monocular 3D object detection is to recognize objects of interest from a 2D image only and predict their corresponding 3D attributes, including 3D location $(x, y, z)$, dimension $(h, w, l)$, and orientation $\theta$. The 3D location $(x, y, z)$ is usually transformed into 2.5D information $(u_c, v_c, z)$ for prediction. The recovery process of $x$ and $y$ can be formulated as:

$$x = \frac{(u_c - c_u)\,z}{f_x}, \quad y = \frac{(v_c - c_v)\,z}{f_y} \tag{1}$$

where $(u_c, v_c)$ is the projected 3D center in the image and $(c_u, c_v)$ is the camera optical center. $f_x$ and $f_y$ denote the horizontal and vertical focal lengths, respectively.
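As a minimal sketch of Eq. 1, the snippet below lifts a projected center to camera coordinates under a plain pinhole model. The `Intrinsics` container, the function name `recover_xy`, and the numeric values are hypothetical and chosen only for illustration; they are not taken from the paper or from any KITTI calibration file.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera parameters (hypothetical container, not from the paper)."""
    fx: float  # horizontal focal length in pixels
    fy: float  # vertical focal length in pixels
    cu: float  # optical center, horizontal pixel coordinate
    cv: float  # optical center, vertical pixel coordinate


def recover_xy(uc: float, vc: float, z: float, cam: Intrinsics) -> Tuple[float, float]:
    """Eq. 1: lift the projected 3D center (uc, vc) at depth z to camera-frame (x, y)."""
    x = (uc - cam.cu) * z / cam.fx
    y = (vc - cam.cv) * z / cam.fy
    return x, y


# Illustrative numbers only, not real calibration values.
cam = Intrinsics(fx=700.0, fy=700.0, cu=600.0, cv=180.0)
print(recover_xy(uc=670.0, vc=215.0, z=20.0, cam=cam))
```

Because both $x$ and $y$ scale linearly with $z$ in Eq. 1, any error in the predicted depth propagates directly into the recovered 3D location.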

As described in Sec. 1, many methods [43, 28, 18] have recognized that the depth $z$ is the main factor limiting the performance of monocular 3D detectors and utilize multi-depth to improve the accuracy of depth prediction via:

$$z_{soft} = \sum_{i=1}^{n} w_i z_i \tag{2}$$

where $\{z_i\}_{i=1}^{n}$ represents the $n$ predicted depths and $\{w_i\}_{i=1}^{n}$ represents their weights, determined by the predicted uncertainty [11, 12]. $z_{soft}$ is used as the final output depth.

3.2 The Effect of Complementary Depths

To demonstrate the effectiveness of complementary depths, we present its superiority from a mathematical perspective. Define two different depth prediction branches $\hat{z}_1$ and $\hat{z}_2$ as follows:

$$\hat{z}_1 = z^* + e_1, \quad \hat{z}_2 = z^* + e_2 \tag{3}$$

where $z^*$ represents the ground truth of depth, and $e_1$ and $e_2$ are the errors of the two depth branches in a single prediction. Note that the signs of $e_1$ and $e_2$ are the error signs of the two branches. We assume $e_1 e_2 > 0$ to model the case of multiple depth coupling, as shown in Fig. 1(a). We term the final combination error of multiple coupling depths the coupling depth error. Hence, referring to Eq. 2, the coupling depth error $E_1$ of $\hat{z}_1$ and $\hat{z}_2$ can be formulated as:

$$\begin{aligned} E_1 &= |w_1 \hat{z}_1 + w_2 \hat{z}_2 - z^*| \\ &= |w_1 e_1 + w_2 e_2| \end{aligned} \tag{4}$$

where $w_1$ and $w_2$ satisfy $w_1, w_2 > 0$ and $w_1 + w_2 = 1$. We then flip $\hat{z}_1$ symmetrically about $z^*$ without changing the accuracy of the prediction through:

$$\begin{aligned} \hat{z}_1' &= z^* - (\hat{z}_1 - z^*) \\ &= z^* - e_1 \end{aligned} \tag{5}$$

After flipping, the error signs of $\hat{z}_1'$ and $\hat{z}_2$ are opposite, and higher complementarity between them is artificially achieved. We term the final combination error of multiple complementary depths the complementary depth error. Similarly, the complementary depth error $E_2$ of $\hat{z}_1'$ and $\hat{z}_2$ can be formulated as:

$$\begin{aligned} E_2 &= |w_1 \hat{z}_1' + w_2 \hat{z}_2 - z^*| \\ &= |w_1 e_1 - w_2 e_2| \end{aligned} \tag{6}$$

By mathematical transformations we further express Eqs. 4 and 6 as:

$$\begin{aligned} E_1 &= \sqrt{(w_1 e_1 + w_2 e_2)^2} \\ &= \sqrt{(w_1 e_1)^2 + 2 w_1 w_2 e_1 e_2 + (w_2 e_2)^2} \end{aligned} \tag{7}$$

$$\begin{aligned} E_2 &= \sqrt{(w_1 e_1 - w_2 e_2)^2} \\ &= \sqrt{(w_1 e_1)^2 - 2 w_1 w_2 e_1 e_2 + (w_2 e_2)^2} \end{aligned} \tag{8}$$
Figure 3: Evaluation of complementary effect on the KITTI validation set. The metric is $AP_{40}$ for the moderate Car category at 0.7 IoU threshold. Left: Different proportions of flipped samples achieve different levels of complementarity. Right: Fixing the proportion of flipped samples to 50% and applying random disturbances of different magnitudes to the flipped depth branch.

It follows that the complementary depth error $E_2$ is always less than the coupling depth error $E_1$ under the condition $e_1 e_2 > 0$. This relationship holds regardless of variations in the weights or the error magnitudes. The same conclusion is reached by keeping $\hat{z}_1$ unchanged and flipping $\hat{z}_2$ instead. Therefore we can draw the conclusion: realizing the complementary relationship between two depth branches contributes to reducing the overall depth error, even without improving the accuracy of individual branches.
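The comparison between $E_1$ and $E_2$ can be made explicit by subtracting the squares of Eqs. 7 and 8. The short derivation below is added as a reasoning step and uses only the symbols already defined in this section:

```latex
\begin{aligned}
E_1^2 - E_2^2 &= (w_1 e_1 + w_2 e_2)^2 - (w_1 e_1 - w_2 e_2)^2 \\
              &= 4\, w_1 w_2\, e_1 e_2 \\
              &> 0 \quad \text{since } w_1, w_2 > 0 \text{ and } e_1 e_2 > 0 .
\end{aligned}
```

Since $E_1$ and $E_2$ are both non-negative, $E_1^2 > E_2^2$ implies $E_1 > E_2$. As an illustrative numeric instance (values chosen arbitrarily), $w_1 = w_2 = 0.5$, $e_1 = 1.0$ and $e_2 = 0.6$ give $E_1 = 0.8$ and $E_2 = 0.2$.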

To demonstrate the effectiveness of complementary depths in practice, we select a classical multi-depth prediction baseline [43] for evaluation on the KITTI val set. It contains 4 depth prediction branches (1 directly estimated depth and 3 geometric depths), and in our tests the coupling rate of any two branches is around 95%. As shown on the left in Fig. 3, we flip the direct depth estimation branch symmetrically about the ground truth based on Eq. 5 for a proportion of samples ranging from 0% to 100% to achieve depth complementarity at different levels. Additionally, considering the difficulty of obtaining depth predictions with opposite error signs while maintaining the same accuracy in practice, we conduct another experiment by flipping the depth branch while applying random disturbances of different magnitudes on top of it. The results are presented on the right of Fig. 3. Similar results are observed in other branches by performing the same operation as above. Based on this, we have the following three observations:

Observation 1: On the left of Fig. 3, the detection accuracy increases as the proportion of flipped samples rises. It demonstrates that increasing complementarity between multiple depth prediction branches can improve detection accuracy continuously.

Observation 2: For two independent depth prediction branches, ideally, the proportion of their predictions with opposite signs in all samples should be 50%. The situation is similar to the 50% flipped proportion on the left of Fig. 3 due to the coupling of multiple branches in the baseline. Therefore reducing the similarity of multiple depth prediction branches can also increase their complementarity.

Observation 3: In the case where the flipped proportion is fixed at 50%, as shown on the right of Fig. 3, the complementary effect does not disappear until a random disturbance with an amplitude of 2 meters is applied (which is quite significant [24] for Car in KITTI). This indicates that the complementary effect can still contribute to overall performance even at the cost of some depth estimation accuracy. Ultimately, whether the overall performance can be improved depends on both the proportion of opposite signs and the depth estimation accuracy.

Additionally, we select models with different total numbers of depth prediction branches to perform flipping and evaluation. We find that as the number of flipped branches approaches the number of unflipped branches, the overall performance improves accordingly. For more experiments and details, please refer to the supplementary materials.
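The flipping protocol described in this section can be sketched as follows. This is a toy illustration under stated assumptions, not the evaluation code used for Fig. 3: the function names are hypothetical, the disturbance is drawn uniformly from a symmetric interval (the noise model is not specified in this section), and the depth values are made up.

```python
import random


def flip_about_ground_truth(z_hat, z_star):
    """Eq. 5: mirror a prediction about the ground truth, keeping |error| unchanged."""
    return z_star - (z_hat - z_star)


def flip_fraction(preds, gts, proportion, max_disturbance=0.0, seed=0):
    """Flip a random `proportion` of predictions and optionally perturb the flipped ones.

    Assumption: the disturbance is uniform in [-max_disturbance, max_disturbance].
    """
    rng = random.Random(seed)
    n_flip = round(proportion * len(preds))
    chosen = set(rng.sample(range(len(preds)), n_flip))
    out = []
    for i, (z_hat, z_star) in enumerate(zip(preds, gts)):
        if i in chosen:
            z_hat = flip_about_ground_truth(z_hat, z_star)
            z_hat += rng.uniform(-max_disturbance, max_disturbance)
        out.append(z_hat)
    return out


def coupling_rate(branch_a, branch_b, gts):
    """Fraction of samples where both branches err on the same side of the ground truth."""
    same = sum((a - g) * (b - g) > 0 for a, b, g in zip(branch_a, branch_b, gts))
    return same / len(gts)


# Made-up depths in meters: both branches err on the same side for every sample.
gts = [10.0, 20.0, 30.0, 40.0]
z_dir = [10.5, 19.0, 31.0, 41.5]
z_geo = [10.2, 19.5, 30.4, 40.8]
print(coupling_rate(z_dir, z_geo, gts))
print(coupling_rate(flip_fraction(z_dir, gts, proportion=0.5), z_geo, gts))
```

Note that Eq. 5 requires the ground truth $z^*$, so flipping is an analysis tool for studying complementarity and cannot be applied at inference time.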

3.3 3D Detector with Complementary Depths

Framework Overview.  As shown in Fig. 2, the network we design extends from CenterNet [45], with DLA-34 [41] chosen as the backbone. The regression heads are divided into two parts: local clues and global clues. The branch of local clues is designed with reference to MonoFlex [43], which estimates dimension, keypoints, direct depth, orientation, and 2D detection for each local peak point based on the predicted Heatmap. Since the prediction of these geometric quantities is highly correlated with the position of the local peak point in the image, they are referred to as local clues. Both $z_{dir}$ and $z_{key}$ are derived from them. The branch of global clues predicts the Horizon Heatmap of the entire image based on all extracted pixel features, which is used to obtain the trend of $y_{glo}$ in scenes, and then outputs the complementary depth $z_{comp}$ embedding the global clues. How to construct a depth prediction branch with the global clues and further achieve complementarity in form will be elaborated below. Following [11, 12], we model uncertainty for all seven depth predictions (1 direct depth, 3 keypoint depths, and 3 complementary depths augmented by diagonal columns as in [43]). The final depth is obtained according to Eq. 2, with $w_i = \frac{1}{\sigma_i}$, where $\sigma_i$ is the predicted uncertainty of the $i$-th depth.
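The uncertainty-weighted combination of Eq. 2 can be sketched as below. One point is an assumption of this sketch rather than something stated in the text: the weights $1/\sigma_i$ are normalized to sum to 1, which matches the constraint $w_1 + w_2 = 1$ used in Sec. 3.2 and keeps the result on the depth scale. The function name and the numeric values are hypothetical.

```python
from typing import Sequence


def combine_depths(depths: Sequence[float], sigmas: Sequence[float]) -> float:
    """Eq. 2 with w_i proportional to 1 / sigma_i.

    Assumption: weights are normalized so that they sum to 1.
    """
    if len(depths) != len(sigmas) or not depths:
        raise ValueError("depths and sigmas must be non-empty and equally long")
    if any(s <= 0 for s in sigmas):
        raise ValueError("uncertainties must be positive")
    inv = [1.0 / s for s in sigmas]  # unnormalized weights
    total = sum(inv)
    return sum(w * z for w, z in zip(inv, depths)) / total


# Made-up example: seven depths (1 direct, 3 keypoint, 3 complementary), in meters.
depths = [20.4, 20.6, 20.3, 20.5, 19.6, 19.7, 19.5]
sigmas = [0.8, 1.0, 1.0, 1.0, 1.2, 1.2, 1.2]
print(combine_depths(depths, sigmas))
```

With equal uncertainties, two predictions that err by the same amount on opposite sides of the ground truth (for example 19.0 and 21.0 around 20.0) combine to the ground truth exactly, which is the effect the complementary branch aims for.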

Depth Prediction with Global Clues.  Inspired by [8], the neural network sees depth from a single image through:

z=fyyvbcv𝑧subscript𝑓𝑦𝑦subscript𝑣𝑏subscript𝑐𝑣z=\frac{{f}_{y}y}{v_{b}-c_{v}}italic_z = divide start_ARG italic_f start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT italic_y end_ARG start_ARG italic_v start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT - italic_c start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_ARG (9)

where y𝑦yitalic_y denotes the y𝑦yitalic_y-axis coordinates of the object in the camera coordinate system, and vbsubscript𝑣𝑏v_{b}italic_v start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT denotes the vertical coordinate of the projected bottom center in the pixel coordinate system. Considering that y𝑦yitalic_y also represents the elevation of the plane in which the objects are located and that all objects lie approximately in one plane, y𝑦yitalic_y contains such a global characteristic and can be distinguished from other depth clues. Unlike previous neural networks that implicitly utilize Eq. 9, we propose to predict y𝑦yitalic_y explicitly.

To avoid falling into the coupling, we do not utilize the center-based approach discussed in Sec. 2.1 to predict y𝑦yitalic_y. We propose to first obtain the sloping trend of y𝑦yitalic_y in the scene by the ground plane equation. The prediction of the ground plane equation is based on the Horizon Heatmap branch, similar to [38], but we omit the edge prediction and obtain prediction results as:

Ax+By+Cz+1.65=0𝐴𝑥𝐵𝑦𝐶𝑧1.650\displaystyle Ax+By+Cz+1.65=0italic_A italic_x + italic_B italic_y + italic_C italic_z + 1.65 = 0 (10)
s.t.A2+B2+C2=1\displaystyle s.t.\quad A^{2}+B^{2}+C^{2}=1italic_s . italic_t . italic_A start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_B start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_C start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 1

where A=Fkhfxfy𝐴𝐹subscript𝑘subscript𝑓𝑥subscript𝑓𝑦A=F\frac{k_{h}f_{x}}{f_{y}}italic_A = italic_F divide start_ARG italic_k start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_ARG start_ARG italic_f start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT end_ARG, B=F𝐵𝐹B=-Fitalic_B = - italic_F and C=Fkhcu+bhcvfy𝐶𝐹subscript𝑘subscript𝑐𝑢subscript𝑏subscript𝑐𝑣subscript𝑓𝑦C=F\frac{k_{h}c_{u}+b_{h}-c_{v}}{f_{y}}italic_C = italic_F divide start_ARG italic_k start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT + italic_b start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT - italic_c start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_ARG start_ARG italic_f start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT end_ARG. khsubscript𝑘k_{h}italic_k start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT and bhsubscript𝑏b_{h}italic_b start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT represent the slope and intercept of the horizon fitted by Horizon Heatmap. After it, then considering Eq. 1 and the projected bottom center (ub,vb)subscript𝑢𝑏subscript𝑣𝑏(u_{b},v_{b})( italic_u start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ) of the object, y𝑦yitalic_y with global information can be derived as:

$$y_{glo} = -\frac{1.65}{An + Cm + B} \tag{11}$$

where $n = \frac{f_y (u_b - c_u)}{f_x (v_b - c_v)}$ and $m = \frac{f_y}{v_b - c_v}$.

Inserting Eq. 11 into Eq. 9, a new depth prediction branch with the global clue is obtained:

$$z_{glo} = \frac{f_y \, y_{glo}}{v_b - c_v} \tag{12}$$

In addition, to better utilize the global features as well as to expand the receptive field, we use dilated convolution [40] to predict the Horizon Heatmap.
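The chain from the fitted horizon to $z_{glo}$ (Eqs. 10 to 12) can be sketched numerically as follows. This is a minimal illustration rather than the released implementation: the function names and the example intrinsics are hypothetical, and $F$ is assumed to be the positive scale factor that makes $A^{2} + B^{2} + C^{2} = 1$, which the text implies but does not state explicitly.

```python
import math

def ground_plane_from_horizon(k_h, b_h, fx, fy, cu, cv):
    """Return (A, B, C) of Eq. 10 from the horizon slope k_h and intercept b_h."""
    a = k_h * fx / fy
    c = (k_h * cu + b_h - cv) / fy
    # Assumption: F is the positive factor enforcing A^2 + B^2 + C^2 = 1.
    F = 1.0 / math.sqrt(a * a + 1.0 + c * c)
    return F * a, -F, F * c

def global_y_and_depth(plane, u_b, v_b, fx, fy, cu, cv, cam_height=1.65):
    """Return (y_glo, z_glo) for a projected bottom center (u_b, v_b).

    Requires v_b != cv, otherwise n and m are undefined.
    """
    A, B, C = plane
    n = fy * (u_b - cu) / (fx * (v_b - cv))
    m = fy / (v_b - cv)
    y_glo = -cam_height / (A * n + C * m + B)  # Eq. 11
    z_glo = fy * y_glo / (v_b - cv)            # Eq. 12
    return y_glo, z_glo

# Hypothetical intrinsics; a level horizon through the principal point
# corresponds to a flat ground plane 1.65 m below the camera.
fx = fy = 720.0
cu, cv = 640.0, 180.0
plane = ground_plane_from_horizon(0.0, 180.0, fx, fy, cu, cv)
y_glo, z_glo = global_y_and_depth(plane, 700.0, 240.0, fx, fy, cu, cv)
print(plane, y_glo, z_glo)
```

In the flat-ground case the plane reduces to $y = 1.65$, so $y_{glo}$ equals the camera height and $z_{glo}$ depends only on how far the bottom center lies below $c_v$.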

Figure 4: Geometric correspondence of different depths. To avoid overlap, the geometric correspondences of $z_{key}$ and $z_{comp}$ are marked with orange and blue lines, respectively.

Complementary Form in Solving.  Achieving more independent depth predictions is not enough on its own; we also exploit the geometric relations between multiple depth prediction branches to further improve complementarity. Considering the projected bottom center $(u_b, v_b)$ and top center $(u_t, v_t)$, as shown in the orange part of Fig. 4, the depth derived from keypoint and height in [32] can be rewritten as:

$$z_{key} = \frac{f_y H}{v_b - v_t} \tag{13}$$

where $H$ represents the 3D height of the object. Combining the global $y_{glo}$ information obtained by Eq. 11 and the geometric quantities used in Eq. 13, we further propose a depth prediction that is complementary to $z_{key}$ in form:

$$z_{comp} = \frac{f_y \left(y_{glo} - \frac{1}{2}H\right)}{\frac{1}{2}(v_b + v_t) - c_v} \tag{14}$$

The geometric correspondence is shown in the blue part of Fig. 4. It can be observed that the signs of $H$ and $v_t$ in the designed Eq. 14 are exactly opposite to those in Eq. 13. This means that errors in $H$ and $v_t$ have opposite effects on $z_{key}$ and $z_{comp}$ during the prediction of 3D information for each object. Although Eq. 13 and Eq. 14 are not strictly symmetrical, this further increases the probability that the errors $e_{key}$ and $e_{comp}$ of $z_{key}$ and $z_{comp}$ satisfy $e_{key} e_{comp} < 0$. As shown in Sec. 3.2, part of the depth error is then neutralized in the weighted averaging of Eq. 2.
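The opposite-sign behaviour of Eqs. 13 and 14 can be checked with a small numerical sketch. The object below is hypothetical, and its keypoints are generated with a pinhole projection $v = c_v + f_y y / z$, which is assumed here to match Eq. 1; the same perturbation of $H$ or $v_t$ is then applied to both depth forms.

```python
def z_key(fy, H, v_b, v_t):
    """Depth from 3D height and projected height (Eq. 13)."""
    return fy * H / (v_b - v_t)

def z_comp(fy, y_glo, H, v_b, v_t, cv):
    """Complementary depth from the global y and the projected center (Eq. 14)."""
    return fy * (y_glo - 0.5 * H) / (0.5 * (v_b + v_t) - cv)

# Hypothetical object: camera 1.65 m above the ground, car 1.5 m tall, 20 m away.
fy, cv = 720.0, 180.0
y_glo, H, z_true = 1.65, 1.5, 20.0
v_b = cv + fy * y_glo / z_true        # projected bottom center row
v_t = cv + fy * (y_glo - H) / z_true  # projected top center row

# Perturb H by +0.1 m, then v_t by +2 px, and compare the error signs.
for d_H, d_vt in [(0.1, 0.0), (0.0, 2.0)]:
    e_key = z_key(fy, H + d_H, v_b, v_t + d_vt) - z_true
    e_comp = z_comp(fy, y_glo, H + d_H, v_b, v_t + d_vt, cv) - z_true
    print(f"dH={d_H}, dvt={d_vt}: e_key={e_key:+.3f}, e_comp={e_comp:+.3f}")
```

With error-free inputs both forms recover the true depth; under either perturbation $z_{key}$ is overestimated while $z_{comp}$ is underestimated, so averaging the two cancels part of the error.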

4 Experiments

| Methods, Venues | Extra data | Test $AP_{3D}$ Easy | Test $AP_{3D}$ Mod. | Test $AP_{3D}$ Hard | Test $AP_{BEV}$ Easy | Test $AP_{BEV}$ Mod. | Test $AP_{BEV}$ Hard | Time (ms) |
|---|---|---|---|---|---|---|---|---|
| DDMP-3D [36], CVPR2021 | Depth | 19.71 | 12.78 | 9.80 | 28.08 | 17.89 | 13.44 | 180 |
| Kinematic3D [2], ECCV2020 | Video | 19.07 | 12.72 | 9.17 | 26.69 | 17.52 | 13.10 | 120 |
| AutoShape [22], ICCV2021 | CAD | 22.47 | 14.17 | 11.36 | 30.66 | 20.08 | 15.59 | 50 |
| DCD [17], ECCV2022 | CAD | 23.81 | 15.90 | 13.21 | 32.55 | 21.50 | 18.25 | - |
| MonoRUn [5], CVPR2021 | LiDAR | 19.65 | 12.30 | 10.58 | 27.94 | 17.34 | 15.24 | 70 |
| CaDDN [29], CVPR2021 | LiDAR | 19.17 | 13.41 | 11.46 | 27.94 | 18.91 | 17.19 | 630 |
| MonoDTR [10], CVPR2022 | LiDAR | 21.99 | 15.39 | 12.73 | 28.59 | 20.38 | 17.14 | 37 |
| SMOKE [21], CVPRW2020 | None | 14.03 | 9.76 | 7.84 | 20.83 | 14.49 | 12.75 | 30 |
| MonoDLE [24], CVPR2021 | None | 17.23 | 12.26 | 10.29 | 24.79 | 18.89 | 16.00 | 40 |
| MonoRCNN [32], ICCV2021 | None | 18.36 | 12.65 | 10.03 | 25.48 | 18.11 | 14.10 | 70 |
| MonoFlex [43], CVPR2021 | None | 19.94 | 13.89 | 12.07 | 28.23 | 19.75 | 16.89 | 35 |
| MonoGround [28], CVPR2022 | None | 21.37 | 14.36 | 12.62 | 30.07 | 20.47 | 17.74 | 30 |
| GPENet [38], - | None | 22.41 | 15.44 | 12.84 | 30.31 | 20.79 | 18.21 | - |
| MonoJSG [19], CVPR2022 | None | 24.69 | 16.14 | 13.64 | 32.59 | 21.26 | 18.18 | 42 |
| MonoCon [20], AAAI2022 | None | 22.50 | 16.46 | $\underline{13.95}$ | 31.12 | 22.10 | $\underline{19.00}$ | 25.8 |
| MonoDETR [42], ICCV2023 | None | $\underline{25.00}$ | $\underline{16.47}$ | 13.58 | $\mathbf{33.60}$ | $\underline{22.11}$ | 18.60 | 43 |
| MonoCD (Ours) | None | $\mathbf{25.53}$ | $\mathbf{16.59}$ | $\mathbf{14.53}$ | $\underline{33.41}$ | $\mathbf{22.81}$ | $\mathbf{19.57}$ | 36 |
| Improvement vs. second-best | | +0.53 | +0.12 | +0.58 | -0.19 | +0.70 | +0.57 | - |
Table 1: Comparison with current state-of-the-art methods on the Car category of the KITTI test set. Methods are grouped according to extra data. Following [9], the methods in each group are sorted by $AP_{3D}$ performance in the Moderate difficulty setting. The best results are in bold and the second-best results are underlined.

4.1 Dataset

Our experiments are conducted on the widely-adopted KITTI 3D Object [9] dataset, which contains 7481 training images and 7518 test images. Since the annotations of the test images are not publicly accessible, we follow [6] and further divide the 7481 training images into 3712 and 3769 as the training and validation sets, respectively. Each category is further refined into three difficulties: Easy, Moderate, and Hard based on 2D height, truncation, and occlusion.

4.2 Evaluation Metrics

As in previous methods, we use Average Precision $AP_{3D}$ and $AP_{BEV}$ as the overall evaluation metrics. Following [34], 40 recall positions are used for the above AP calculations. The IoU threshold is 0.7 for Car.
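As a rough illustration of the 40-recall-position AP, the sketch below averages the interpolated precision at recall levels 1/40, 2/40, ..., 1. It is a simplified stand-in, not the official KITTI evaluation code: it assumes a precision-recall curve has already been computed, and it omits the matching of detections to ground truth at the 0.7 IoU threshold as well as the difficulty filtering. The function name is hypothetical.

```python
import numpy as np

def average_precision_r40(recalls, precisions):
    """Mean interpolated precision over the 40 recall levels 1/40, ..., 40/40."""
    recalls = np.asarray(recalls, dtype=float)
    precisions = np.asarray(precisions, dtype=float)
    thresholds = np.arange(1, 41) / 40.0
    total = 0.0
    for r in thresholds:
        reached = recalls >= r
        # Interpolated precision: best precision at any recall >= r, else 0.
        total += precisions[reached].max() if reached.any() else 0.0
    return total / 40.0

# Toy curve: precision 1.0 up to recall 0.5, then 0.5 at full recall.
print(average_precision_r40([0.5, 1.0], [1.0, 0.5]))
```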

In the ablation study of Sec. 4.5, the mean absolute error (MAE) of $y$ is introduced as a metric to evaluate the accuracy of the different $y$ sources. In addition, to better measure the complementarity between different designs, we quantify the magnitude of complementarity as the Complementarity Score. As discussed in Sec. 3.2, both the error sign opposite proportion and depth estimation accuracy are crucial in achieving enhanced performance. Thus we formulate the Complementarity Score (CS) as:

$$CS = \frac{ESOP_z}{MAE_z} \tag{15}$$

where $ESOP_z$ represents the Error Sign Opposite Proportion (ESOP) of the depths between the global and local clue branches, and $MAE_z$ represents the Mean Absolute Error of $z_{comp}$. For a baseline without $z_{comp}$, ESOP is counted between $z_{key}$ and $z_{dir}$.
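A minimal sketch of Eq. 15 is given below. It assumes one local and one global depth prediction per object and expresses ESOP as a percentage, as in the ablation tables; how the proportion is aggregated over several local branches is not specified here, so the single-pair form and the function name are assumptions.

```python
import numpy as np

def complementarity_score(z_gt, z_local, z_global):
    """Return (ESOP in percent, MAE of the global depth, CS = ESOP / MAE)."""
    z_gt = np.asarray(z_gt, dtype=float)
    e_local = np.asarray(z_local, dtype=float) - z_gt
    e_global = np.asarray(z_global, dtype=float) - z_gt
    # ESOP: share of objects whose two depth errors have opposite signs.
    esop = 100.0 * np.mean(e_local * e_global < 0)
    mae = np.mean(np.abs(e_global))
    return esop, mae, esop / mae

# Toy example with four objects (depths in meters).
esop, mae, cs = complementarity_score(
    z_gt=[10.0, 20.0, 30.0, 40.0],
    z_local=[11.0, 21.0, 29.0, 41.0],
    z_global=[9.0, 22.0, 31.0, 39.0],
)
print(esop, mae, cs)
```

The same ratio reproduces the reported scores: for the Baseline+gl. row of Table 3, 36.91 / 3.29 ≈ 11.22.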

4.3 Implementation Details

To demonstrate the effectiveness of the proposed framework, we choose three recent center-based methods with excellent performance as baseline models: MonoFlex [43], MonoDLE [24], and MonoCon [20]. All experiments are performed on a single RTX 2080Ti GPU. The aforementioned baseline models all employ DLA-34 [41] as the feature extraction network. In the Global Clues branch, the prediction head of the Horizon Heatmap contains two 3×3 conv layers with BN and ReLU (where the dilation rate is set to 2) and an output conv layer. The horizon equation is obtained by taking the largest element in each column of the Horizon Heatmap and fitting a line through these points. The ground truth of the Horizon Heatmap is generated by fitting the scene ground plane through the bottom coordinate annotation of each object and then projecting it to the 2D image plane [38], so only RGB image data and camera annotations are used throughout the training process. The radius of the Gaussian kernel used for each pixel is 2 when mapping the horizon equation into the heatmap. The loss weight proportions of $z_{direct}$, $z_{key}$ and $z_{comp}$ are set to $1:0.2:0.1$. The remaining settings, such as the optimizer, batch size, and image padding size, remain consistent with the baseline.
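The horizon-fitting step described above (per-column maximum followed by a line fit) can be sketched as follows. The least-squares fit via `np.polyfit` is an assumption, since the fitting method is not specified, and the synthetic heatmap (including its size and `sigma`) is only a hypothetical test input, not the ground-truth generation procedure. The fitted slope and intercept are in heatmap pixels and would still need rescaling to image coordinates.

```python
import numpy as np

def fit_horizon(heatmap):
    """Fit v = k_h * u + b_h through the peak row of every heatmap column."""
    _, width = heatmap.shape
    u = np.arange(width)
    v = heatmap.argmax(axis=0)  # row index of the strongest response per column
    k_h, b_h = np.polyfit(u, v, deg=1)  # assumed least-squares line fit
    return k_h, b_h

def synthetic_heatmap(k, b, height=96, width=320, sigma=1.0):
    """Hypothetical heatmap: a vertical Gaussian bump centered on v = k*u + b."""
    u = np.arange(width)
    v = np.arange(height)[:, None]
    return np.exp(-((v - (k * u + b)[None, :]) ** 2) / (2.0 * sigma ** 2))

k_hat, b_hat = fit_horizon(synthetic_heatmap(0.05, 40.0))
print(k_hat, b_hat)
```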

4.4 Quantitative Results

To demonstrate the effectiveness of the proposed method, we conduct quantitative experiments on test and val sets of KITTI [9].

As shown in Tab. 1, the proposed method is compared with recent state-of-the-art methods on the widely used KITTI test set. Our method achieves the best performance in five of the six metrics without using any additional data. Compared with the previous multi-depth solving method MonoFlex [43], our $AP_{3D}$/$AP_{BEV}$ at the Moderate level improves by a relative 19.44%/15.49%, respectively. Compared with GPENet [38], which also incorporates the ground plane equation in its solution, $AP_{3D}$/$AP_{BEV}$ at the Moderate level improves from 15.44/20.79 to 16.59/22.81. Even when compared to the latest Transformer-based detector MonoDETR [42], we outperform it in most metrics while maintaining real-time operation.

| Method | Val $AP_{BEV}$ Easy | Val $AP_{BEV}$ Mod. | Val $AP_{BEV}$ Hard | Val $AP_{3D}$ Easy | Val $AP_{3D}$ Mod. | Val $AP_{3D}$ Hard |
|---|---|---|---|---|---|---|
| MonoDLE [24] | 24.97 | 19.33 | 17.01 | 17.45 | 13.66 | 11.68 |
| + Ours | 26.84 | 20.86 | 17.89 | 18.60 | 15.09 | 12.86 |
| Improvement | +1.87 | +1.53 | +0.88 | +1.15 | +1.43 | +1.18 |
| MonoFlex [43] | 30.51 | 23.16 | 19.87 | 23.64 | 17.51 | 15.14 |
| + Ours | 31.49 | 23.56 | 20.12 | 24.22 | 18.27 | 15.42 |
| Improvement | +0.98 | +0.40 | +0.25 | +0.58 | +0.76 | +0.28 |
| MonoCon [20] | 33.36 | 24.39 | 21.03 | 26.33 | 19.01 | 15.98 |
| + Ours | 34.60 | 24.96 | 21.51 | 26.45 | 19.37 | 16.38 |
| Improvement | +1.24 | +0.57 | +0.48 | +0.12 | +0.36 | +0.40 |
Table 2: In order to fully demonstrate the effectiveness of the proposed method, we extend complementary depth to three center-based monocular 3D detectors. Evaluation is performed on the KITTI val set. The increased performance is highlighted in blue.

As shown in Tab. 2, we extend the complementary depth branch to three competitive center-based monocular 3D detectors. The results on the KITTI val set demonstrate that the proposed complementary depth is flexible and achieves stable gains across multiple frameworks and metrics. It is worth noting that the gain from our design is generally larger on $AP_{BEV}$ than on $AP_{3D}$. We attribute this to the focus of our method on improving depth estimation, since $AP_{BEV}$ places more emphasis on the accuracy of localization along the Z-axis than $AP_{3D}$ [9].

4.5 Ablation Study

| Setting | Val $AP_{3D}$ Easy | Val $AP_{3D}$ Mod. | Val $AP_{3D}$ Hard | $y$ MAE | $z_{comp}$ MAE | ESOP (%) | CS $\uparrow$ |
|---|---|---|---|---|---|---|---|
| Baseline | 23.64 | 17.51 | 15.14 | - | - | 4.08 | - |
| Baseline+lo. | 18.41 | 13.49 | 10.90 | 0.127 | 4.03 | 18.63 | 4.62 |
| Baseline+fi. | 21.93 | 15.86 | 13.22 | 0.250 | 8.47 | 45.72 | 5.40 |
| Baseline+gl. | 22.97 | 17.85 | 15.11 | 0.139 | 3.29 | 36.91 | 11.22 |
| Baseline+gt. | 26.21 | 19.43 | 16.50 | 0.097 | 3.23 | 59.08 | 18.29 |
| Baseline+gl.+ed. | 21.85 | 15.97 | 13.26 | 0.242 | 6.72 | 42.51 | 6.33 |
| Baseline+gl.+di. | 24.22 | 18.27 | 15.42 | 0.131 | 3.09 | 38.19 | 12.36 |
Table 3: Ablation study of $y$ sources on the KITTI val set. "lo." means using the local clues branch to predict $y$ for each object. "fi." means using a fixed 1.65 meters as the $y$ source. "gl." means using the global clue branch to predict. "gt." means directly using the ground plane equation generated by the ground truth of the val set. "ed." means using edge detection to obtain the horizon slope in the global clues branch. "di." means using dilated convolution.

In this section, we select MonoFlex [43] as the baseline to discuss the impact of different designs.

Source of Depth Clue.  To demonstrate the effectiveness of introducing the global depth clue, we adopt different approaches to obtain the depth clue $y$, and the results are presented in rows 2, 3, 4, and 5 of Tab. 3. Comparing the ESOP metric, the 3rd, 4th, and 5th rows of Tab. 3, whose $y$ sources have a global characteristic (i.e., are not determined by a single object), have a significantly higher ESOP than the baseline and the local clue branch. This demonstrates the necessity of introducing global clues and shows that the coupling of multi-depth prediction is alleviated. In addition, the accuracy of $z_{comp}$ is largely related to the accuracy of $y$.

By comparing the $z_{comp}$ MAE and ESOP pairs under different settings, it can be seen that determining whether complementary depth leads to an overall performance gain requires evaluation from two perspectives: depth estimation accuracy and ESOP. This trend is effectively quantified by the Complementarity Score.

Figure 5: Qualitative examples on the KITTI validation set. In each row, we provide one final front view (left) and four bird's-eye view (right) visualizations. The detection results for the various bird's-eye views vary only in terms of the depth output, progressing from $z_{soft}$ to $z_{dir}$, $z_{key}$, and $z_{comp}$ from left to right. Red represents the ground truth of boxes, while Green represents the predictions. We circle some objects to highlight the differences across multiple depth prediction branches.

The results in the 6th and 7th rows of Tab. 3 justify the removal of edge detection and the use of dilated convolution when predicting the ground plane equation.

Complementary Form.

| Depth Form | Val $AP_{3D}$ Easy | Val $AP_{3D}$ Mod. | Val $AP_{3D}$ Hard | $z$ MAE | ESOP (%) | CS $\uparrow$ |
|---|---|---|---|---|---|---|
| Baseline | 23.64 | 17.51 | 15.14 | - | 4.08 | - |
| Eq. 12 | 23.16 | 17.62 | 14.73 | 2.27 | 25.69 | 11.32 |
| Eq. 16 | 21.83 | 15.97 | 13.19 | 8.65 | 45.40 | 5.25 |
| Eq. 14 | 24.22 | 18.27 | 15.42 | 3.09 | 38.19 | 12.36 |
Table 4: Ablation study of complementary forms on the KITTI val set. $z$ MAE reflects the depth estimation accuracy of each form.

To validate the effectiveness of the complementary form in enhancing detection accuracy, we present the results of different depth forms in Tab. 4. According to the 2nd and 4th rows of Tab. 4, the ESOP and CS of Eq. 14 are further enhanced compared to Eq. 12 once the complementary form is considered. Although part of the depth estimation accuracy is sacrificed, the complementarity and overall performance are ultimately improved, which is consistent with observation 3 in Sec. 3.2.

In addition to Eqs. 12 and 14 mentioned in Sec. 3.3, we also consider the following complementary form:

$$z = \frac{f_y (y_{glo} - H)}{v_t - c_v} \tag{16}$$

Although Eq. 16 appears more symmetrical and complementary to $z_{key}$ in form, its depth estimation error is significantly higher than that of Eq. 14. This is because $v_t$ and $c_v$ in the denominator are relatively close, as are $y_{glo}$ and $H$ in the numerator, which makes the depth estimate unstable. This is also why Eq. 16 has a higher ESOP: the instability of the estimate mitigates the prediction tendency, but it does not contribute to the overall performance. This demonstrates the importance of an appropriate form of complementary depth.
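The instability of Eq. 16 relative to Eq. 14 can be illustrated with a small sensitivity sketch. The scene values are hypothetical and the keypoints are generated with an assumed pinhole projection $v = c_v + f_y y / z$; the same 2-pixel error on $v_t$ is then fed to both forms.

```python
def z_eq14(fy, y_glo, H, v_b, v_t, cv):
    """Complementary depth through the projected object center (Eq. 14)."""
    return fy * (y_glo - 0.5 * H) / (0.5 * (v_b + v_t) - cv)

def z_eq16(fy, y_glo, H, v_t, cv):
    """Alternative form through the projected top center (Eq. 16)."""
    return fy * (y_glo - H) / (v_t - cv)

# Hypothetical object: camera 1.65 m above the ground, car 1.5 m tall, 20 m away.
fy, cv = 720.0, 180.0
y_glo, H, z_true = 1.65, 1.5, 20.0
v_b = cv + fy * y_glo / z_true
v_t = cv + fy * (y_glo - H) / z_true  # only a few pixels away from cv

dv = 2.0  # same keypoint error (pixels) applied to both forms
err14 = z_eq14(fy, y_glo, H, v_b, v_t + dv, cv) - z_true
err16 = z_eq16(fy, y_glo, H, v_t + dv, cv) - z_true
print(f"Eq. 14 error: {err14:+.2f} m, Eq. 16 error: {err16:+.2f} m")
```

Because the top of a car lies close to the camera height, $v_t - c_v$ is only a few pixels in this setup, so the same pixel error shifts the Eq. 16 depth by several meters while Eq. 14 moves by well under a meter.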

4.6 Qualitative Results

Based on the qualitative results shown in Fig. 5, it can be observed that $z_{comp}$ from the global clue branch is significantly different from $z_{dir}$ and $z_{key}$ from the local clue branch and has the opposite error sign. After combining $z_{comp}$, the predicted box is closer to the ground truth. This visualizes the process of error neutralization.

5 Conclusion

In this paper, we point out the coupling phenomenon in which the errors of existing multi-depth predictions tend to have the same sign, which limits the accuracy of the combined depth. We analyze how complementary depth addresses this through mathematical derivation and find that complementarity must be considered in terms of both depth estimation accuracy and error sign opposite proportion. To improve depth complementarity, we propose to add a new depth prediction branch with the global clue and achieve complementarity in form through geometric relations. Extensive experiments demonstrate the effectiveness of our method. Limitations. The performance of our framework is limited by the accuracy of the vertical position of objects, and the complementary effect may be lost when the ground plane is undulating. Future work could involve improving the understanding and prediction of global road scenarios.

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China (No.62371201), by the Basic Research Support Plan of HUST (No.6142113-JCKY2022003), and by the China Scholarship Council for funding visiting Ph.D. student (No.202106160054).

References

  • Brazil and Liu [2019] Garrick Brazil and Xiaoming Liu. M3d-rpn: Monocular 3d region proposal network for object detection. In ICCV, pages 9287–9296, 2019.
  • Brazil et al. [2020] Garrick Brazil, Gerard Pons-Moll, Xiaoming Liu, and Bernt Schiele. Kinematic 3d object detection in monocular video. In ECCV, pages 135–152. Springer, 2020.
  • Caesar et al. [2020] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, pages 11621–11631, 2020.
  • Carion et al. [2020] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, pages 213–229. Springer, 2020.
  • Chen et al. [2021] Hansheng Chen, Yuyao Huang, Wei Tian, Zhong Gao, and Lu Xiong. Monorun: Monocular 3d object detection by reconstruction and uncertainty propagation. In CVPR, pages 10379–10388, 2021.
  • Chen et al. [2015] Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew G Berneshawi, Huimin Ma, Sanja Fidler, and Raquel Urtasun. 3d object proposals for accurate object class detection. NeurIPS, 28, 2015.
  • Chen et al. [2020] Yongjian Chen, Lei Tai, Kai Sun, and Mingyang Li. Monopair: Monocular 3d object detection using pairwise spatial relationships. In CVPR, pages 12093–12102, 2020.
  • Dijk and Croon [2019] Tom van Dijk and Guido de Croon. How do neural networks see depth in single images? In ICCV, pages 2183–2191, 2019.
  • Geiger et al. [2012] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, pages 3354–3361. IEEE, 2012.
  • Huang et al. [2022] Kuan-Chih Huang, Tsung-Han Wu, Hung-Ting Su, and Winston H Hsu. Monodtr: Monocular 3d object detection with depth-aware transformer. In CVPR, pages 4012–4021, 2022.
  • Kendall and Gal [2017] Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? NeurIPS, 30, 2017.
  • Kendall et al. [2018] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, pages 7482–7491, 2018.
  • Kumar et al. [2022] Abhinav Kumar, Garrick Brazil, Enrique Corona, Armin Parchami, and Xiaoming Liu. Deviant: Depth equivariant network for monocular 3d object detection. In ECCV, pages 664–683. Springer, 2022.
  • Lang et al. [2019] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In CVPR, pages 12697–12705, 2019.
  • Li et al. [2019] Peiliang Li, Xiaozhi Chen, and Shaojie Shen. Stereo r-cnn based 3d object detection for autonomous driving. In CVPR, pages 7644–7652, 2019.
  • Li et al. [2021] Peixuan Li, Shun Su, and Huaici Zhao. Rts3d: Real-time stereo 3d detection from 4d feature-consistency embedding space for autonomous driving. In AAAI, pages 1930–1939, 2021.
  • Li et al. [2022a] Yingyan Li, Yuntao Chen, Jiawei He, and Zhaoxiang Zhang. Densely constrained depth estimator for monocular 3d object detection. In ECCV, pages 718–734. Springer, 2022a.
  • Li et al. [2022b] Zhuoling Li, Zhan Qu, Yang Zhou, Jianzhuang Liu, Haoqian Wang, and Lihui Jiang. Diversity matters: Fully exploiting depth clues for reliable monocular 3d object detection. In CVPR, pages 2791–2800, 2022b.
  • Lian et al. [2022] Qing Lian, Peiliang Li, and Xiaozhi Chen. Monojsg: Joint semantic and geometric cost volume for monocular 3d object detection. In CVPR, pages 1070–1079, 2022.
  • Liu et al. [2022] Xianpeng Liu, Nan Xue, and Tianfu Wu. Learning auxiliary monocular contexts helps monocular 3d object detection. In AAAI, pages 1810–1818, 2022.
  • Liu et al. [2020] Zechen Liu, Zizhang Wu, and Roland Tóth. Smoke: Single-stage monocular 3d object detection via keypoint estimation. In CVPRW, pages 996–997, 2020.
  • Liu et al. [2021] Zongdai Liu, Dingfu Zhou, Feixiang Lu, Jin Fang, and Liangjun Zhang. Autoshape: Real-time shape-aware monocular 3d object detection. In ICCV, pages 15641–15650, 2021.
  • Lu et al. [2021] Yan Lu, Xinzhu Ma, Lei Yang, Tianzhu Zhang, Yating Liu, Qi Chu, Junjie Yan, and Wanli Ouyang. Geometry uncertainty projection network for monocular 3d object detection. In ICCV, pages 3111–3121, 2021.
  • Ma et al. [2021] Xinzhu Ma, Yinmin Zhang, Dan Xu, Dongzhan Zhou, Shuai Yi, Haojie Li, and Wanli Ouyang. Delving into localization errors for monocular 3d object detection. In CVPR, pages 4721–4730, 2021.
  • Peng et al. [2022a] Liang Peng, Xiaopei Wu, Zheng Yang, Haifeng Liu, and Deng Cai. Did-m3d: Decoupling instance depth for monocular 3d object detection. In ECCV, pages 71–88. Springer, 2022a.
  • Peng et al. [2022b] Xidong Peng, Xinge Zhu, Tai Wang, and Yuexin Ma. Side: center-based stereo 3d detector with structure-aware instance depth estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 119–128, 2022b.
  • Qian et al. [2022] Rui Qian, Xin Lai, and Xirong Li. 3d object detection for autonomous driving: A survey. Pattern Recognition, 130:108796, 2022.
  • Qin and Li [2022] Zequn Qin and Xi Li. Monoground: Detecting monocular 3d objects from the ground. In CVPR, pages 3793–3802, 2022.
  • Reading et al. [2021] Cody Reading, Ali Harakeh, Julia Chae, and Steven L Waslander. Categorical depth distribution network for monocular 3d object detection. In CVPR, pages 8555–8564, 2021.
  • Shi et al. [2020] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In CVPR, pages 10529–10538, 2020.
  • Shi et al. [2023] Shaoshuai Shi, Li Jiang, Jiajun Deng, Zhe Wang, Chaoxu Guo, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn++: Point-voxel feature set abstraction with local vector representation for 3d object detection. International Journal of Computer Vision, 131(2):531–551, 2023.
  • Shi et al. [2021] Xuepeng Shi, Qi Ye, Xiaozhi Chen, Chuangrong Chen, Zhixiang Chen, and Tae-Kyun Kim. Geometry-based distance decomposition for monocular 3d object detection. In ICCV, pages 15172–15181, 2021.
  • Shi et al. [2022] Yuguang Shi, Yu Guo, Zhenqiang Mi, and Xinjie Li. Stereo centernet-based 3d object detection for autonomous driving. Neurocomputing, 471:219–229, 2022.
  • Simonelli et al. [2019] Andrea Simonelli, Samuel Rota Bulo, Lorenzo Porzi, Manuel López-Antequera, and Peter Kontschieder. Disentangling monocular 3d object detection. In ICCV, pages 1991–1999, 2019.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 30, 2017.
  • Wang et al. [2021] Li Wang, Liang Du, Xiaoqing Ye, Yanwei Fu, Guodong Guo, Xiangyang Xue, Jianfeng Feng, and Li Zhang. Depth-conditioned dynamic message propagation for monocular 3d object detection. In CVPR, pages 454–463, 2021.
  • Xu et al. [2022] Qiangeng Xu, Yiqi Zhong, and Ulrich Neumann. Behind the curtain: Learning occluded shapes for 3d object detection. In AAAI, pages 2893–2901, 2022.
  • Yang et al. [2022] Fan Yang, Xinhao Xu, Hui Chen, Yuchen Guo, Jungong Han, Kai Ni, and Guiguang Ding. Ground plane matters: Picking up ground plane prior in monocular 3d object detection. arXiv preprint arXiv:2211.01556, 2022.
  • Yin et al. [2021] Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center-based 3d object detection and tracking. In CVPR, pages 11784–11793, 2021.
  • Yu and Koltun [2015] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
  • Yu et al. [2018] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In CVPR, pages 2403–2412, 2018.
  • Zhang et al. [2023] Renrui Zhang, Han Qiu, Tai Wang, Ziyu Guo, Ziteng Cui, Yu Qiao, Hongsheng Li, and Peng Gao. Monodetr: Depth-guided transformer for monocular 3d object detection. In ICCV, pages 9155–9166, 2023.
  • Zhang et al. [2021] Yunpeng Zhang, Jiwen Lu, and Jie Zhou. Objects are different: Flexible monocular 3d object detection. In CVPR, pages 3289–3298, 2021.
  • Zhang et al. [2022] Yunpeng Zhang, Wenzhao Zheng, Zheng Zhu, Guan Huang, Dalong Du, Jie Zhou, and Jiwen Lu. Dimension embeddings for monocular 3d object detection. In CVPR, pages 1589–1598, 2022.
  • Zhou et al. [2019] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
  • Zhou et al. [2021] Yunsong Zhou, Yuan He, Hongzi Zhu, Cheng Wang, Hongyang Li, and Qinhong Jiang. Monoef: Extrinsic parameter free monocular 3d object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):10114–10128, 2021.
  • Zhou et al. [2023] Yunsong Zhou, Hongzi Zhu, Quan Liu, Shan Chang, and Minyi Guo. Monoatt: Online monocular 3d object detection with adaptive token transformer. In CVPR, pages 17493–17503, 2023.
  • Zhu et al. [2023] Minghan Zhu, Lingting Ge, Panqu Wang, and Huei Peng. Monoedge: Monocular 3d object detection using local perspectives. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 643–652, 2023.

Supplementary Material

Appendix A Cross-Dataset Evaluation

To demonstrate the generalizability of our proposed method, we conduct cross-dataset evaluations on the KITTI [9] and nuScenes [3] datasets. Following [32], our model is trained on the KITTI training set (3712 images) and evaluated on the KITTI (3769 images) and nuScenes frontal (6019 images) validation sets. For a fair comparison with the other methods, we also report results for MonoCon [20] retrained with the official code, but without its restriction on training with distant objects ($z > 65$ m). To fit the model trained on KITTI to the nuScenes dataset, we adjust the image resolution to 384×672 and set the preset height used in the ground plane equation prediction to 1.562 m (the ego car height in nuScenes [3]). Neither our method nor MonoCon uses normalized coordinates for the direct depth prediction branch, and the images of KITTI and nuScenes have different focal lengths, on which the direct depth prediction relies. Thus, following [13], we divide their directly predicted depth by 1.361.

The cross-dataset evaluation results are shown in Tab. 5. Our method has lower prediction errors in every object depth range, which indicates the effectiveness of the proposed complementary depths in improving overall accuracy. In addition, our method outperforms the other methods on most metrics on both datasets, which demonstrates its generalizability.
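The per-range depth error reported in this evaluation can be sketched as follows. This is a minimal illustration, not the evaluation code used in the paper: the function name `mae_by_depth_range`, the choice to bin samples by ground-truth depth, and the `depth_divisor` argument (standing in for the focal-length correction such as the division by 1.361) are all assumptions made for the example.

```python
def mae_by_depth_range(gt_depths, pred_depths,
                       edges=(0.0, 20.0, 40.0, float("inf")),
                       depth_divisor=1.0):
    """Mean absolute depth error per depth range, plus an overall "All" entry.

    Assumptions (not taken from the paper's code): samples are binned by
    ground-truth depth, and predictions are divided by `depth_divisor`
    before the error is computed.
    """
    num_bins = len(edges) - 1
    abs_err_sums = [0.0] * num_bins
    counts = [0] * num_bins

    for z_gt, z_pred in zip(gt_depths, pred_depths):
        z_pred = z_pred / depth_divisor  # hypothetical focal-length correction
        for i in range(num_bins):
            if edges[i] <= z_gt < edges[i + 1]:
                abs_err_sums[i] += abs(z_pred - z_gt)
                counts[i] += 1
                break

    result = {}
    for i in range(num_bins):
        label = f"{edges[i]:g}-{edges[i + 1]:g}"
        # None marks a range that received no samples.
        result[label] = abs_err_sums[i] / counts[i] if counts[i] else None

    total = sum(counts)
    result["All"] = sum(abs_err_sums) / total if total else None
    return result


if __name__ == "__main__":
    gt = [10.0, 30.0, 50.0]
    pred = [11.0, 28.0, 53.0]
    print(mae_by_depth_range(gt, pred))
```

Passing `depth_divisor=1.361` would apply the same kind of rescaling to every prediction before the error is accumulated; whether this applies to all branches or only the direct one depends on the model and is left to the caller.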

Appendix B Discussion on multi-depth prediction methods

Tab. 6 lists some representative multi-depth prediction methods from recent years. The coupling between their multiple branches is shown in the third column of Tab. 6 in terms of Error Sign Opposite Proportions (ESOP). MonoFlex [43] contains 4 depth prediction branches: 1 directly predicted depth and the 3 depths shown in the 2nd row of Tab. 6. MonoGround [28] and our method each add 3 depth branches on top of these. Since the results of the shared branches are similar, Tab. 6 only shows the results of the unshared branches for MonoGround and our method.

It can be observed that the error sign of the 3 depths from keypoint and height is similar to the error sign of the directly predicted depth. Benefiting from a wider range of dense depth supervision, the depths from the ground added by MonoGround [28] mitigate the coupling somewhat but do not eliminate it, because the dense supervision comes from values sampled locally around the bottom of the object. Although the code of MonoDDE [18] has not been released, a similar coupling phenomenon can be inferred from the local information it uses. In contrast, with our complementary design the coupling is significantly alleviated and the overall performance is further improved.
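As an illustration of how such a coupling measure can be computed, the sketch below takes ESOP to be the percentage of samples for which a branch's signed depth error and the directly predicted depth's signed error have opposite signs. This reading of the metric, the function name `esop`, and the treatment of zero errors are assumptions for the example; the precise definition is the one given in the main paper.

```python
def esop(gt_depths, direct_depths, branch_depths):
    """Error Sign Opposite Proportion (in percent) between two depth branches.

    Assumed definition: the share of samples whose signed errors
    (prediction minus ground truth) have strictly opposite signs.
    A sample with a zero error in either branch is counted as not opposite.
    """
    opposite = 0
    total = 0
    for z_gt, z_dir, z_branch in zip(gt_depths, direct_depths, branch_depths):
        err_dir = z_dir - z_gt
        err_branch = z_branch - z_gt
        total += 1
        if err_dir * err_branch < 0:  # one overestimates, the other underestimates
            opposite += 1
    return 100.0 * opposite / total if total else 0.0


if __name__ == "__main__":
    gt = [10.0, 20.0, 30.0, 40.0]
    direct = [11.0, 19.0, 31.0, 41.0]   # signed errors: +1, -1, +1, +1
    branch = [9.0, 18.0, 32.0, 39.0]    # signed errors: -1, -2, +2, -1
    print(esop(gt, direct, branch))
```

Under this reading, a low value means the two branches tend to err on the same side of the ground truth, so averaging them cannot cancel much of the error.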

| Dataset | Method | 0-20 | 20-40 | 40-∞ | All |
|---|---|---|---|---|---|
| KITTI | M3D-RPN [1] | 0.56 | 1.33 | 2.73 | 1.26 |
| KITTI | MonoRCNN [32] | 0.46 | 1.27 | 2.59 | 1.14 |
| KITTI | GUPNet [23] | 0.45 | 1.10 | 1.85 | 0.89 |
| KITTI | MonoCon [20] | 0.40 | 1.08 | 1.78 | 0.85 |
| KITTI | MonoCD (Ours) | 0.37 | 1.04 | 1.72 | 0.83 |
| nuScenes | M3D-RPN [1] | 0.94 | 3.06 | 10.36 | 2.67 |
| nuScenes | MonoRCNN [32] | 0.94 | 2.84 | 8.65 | 2.39 |
| nuScenes | GUPNet [23] | 0.82 | 1.70 | 6.20 | 1.45 |
| nuScenes | MonoCon [20] | 0.78 | 1.65 | 6.02 | 1.40 |
| nuScenes | MonoCD (Ours) | 0.73 | 1.59 | 5.78 | 1.33 |

Table 5: Cross-dataset evaluation on KITTI and nuScenes frontal validation with depth prediction MAE (meters, ↓), reported per object depth range and over all objects.
| Model | Branch | ESOP with dir (%) ↑ | Val, $AP_{3D}$ Mod. ↑ |
|---|---|---|---|
| MonoFlex [43] | key0 | 4.08 | 17.51 |
| | key1 | 5.22 | |
| | key2 | 6.19 | |
| MonoGround [28] | gro0 | 18.35 | 18.69 |
| | gro1 | 20.72 | |
| | gro2 | 14.73 | |
| MonoCD (Ours) | comp0 | 38.19 | 19.37 |
| | comp1 | 40.24 | |
| | comp2 | 40.05 | |

Table 6: Comparison between multiple depth prediction methods. The second column lists the branches used to calculate ESOP with the directly (dir) predicted depth of each model: depths from keypoint and height (key), depths from ground (gro), and complementary depths (comp). Suffix numbers distinguish the specific branches. The accuracy in the last column is $AP_{40}$ for the moderate Car category at 0.7 IoU threshold on KITTI.

Appendix C Additional Experiments on the Effect of Complementary Depths

This section supplements the part of Sec. 3.2 in the main paper that is not presented in detail due to space limits. With the analyses in this section, two experimental conclusions can be obtained:

(1) Existing multiple predicted depths suffer from a common problem of lacking complementarity.

(2) To maximize the complementary effect, it is beneficial to keep prediction branches symmetrical in number.

C.1 Flip on Different Branch

Proportion of flipped samples:

| Flipped Branch | 0% | 25% | 50% | 75% | 100% |
|---|---|---|---|---|---|
| dir | 17.51 | 21.02 | 25.93 | 31.69 | 36.12 |
| key0 | 17.51 | 21.06 | 25.78 | 31.26 | 35.87 |
| key1 | 17.51 | 20.92 | 25.55 | 30.87 | 35.42 |
| key2 | 17.51 | 20.85 | 25.33 | 29.76 | 34.92 |

Table 7: Results of performing the flipping operation on different depth branches with different proportions of flipped samples on the KITTI dataset.
| Model | Number of Flipped Branches | Val, $AP_{3D}$ Mod. ↑ |
|---|---|---|
| MonoFlex [43] | 0 | 17.51 |
| | 1 | 25.93 |
| | 2 | 35.79 |
| | 3 | 22.95 |
| | 4 | 15.55 |
| MonoGround [28] | 0 | 18.69 |
| | 1 | 20.59 |
| | 2 | 21.79 |
| | 3 | 24.24 |
| | 4 | 32.34 |
| | 5 | 32.75 |
| | 6 | 22.60 |
| | 7 | 17.12 |

Table 8: Evaluation results of two multi-depth prediction models with different numbers of flipped branches on the KITTI dataset, where the proportion of flipped samples is fixed at 50%.

As shown in Tab. 7, we perform flipping on different branches of MonoFlex [43] with different proportions of flipped samples. The first row of results in the table is the one presented on the left of Fig. 3 in the main paper. The results are similar regardless of which branch is flipped, which indicates that the coupling between the multiple depth branches is comparable across branches and that the lack of complementarity is a common problem.

C.2 Flip with Different Numbers of Branches

To maximize the complementary effect, we additionally conduct an analytical study on two multi-depth prediction models with different numbers of flipped branches. The results in Tab. 8 show that flipping any number of branches improves performance, except when all branches are flipped. This is because, although the accuracy of the depth prediction does not change with flipping, the depth values are all flipped to the other side of the ground truth. According to Eq. (1) in the main paper, this introduces additional error into the predicted $x$ and $y$, which decreases the accuracy of the predicted 3D bounding box.

Furthermore, it is worth noting that both models perform best when the number of flipped branches and the number of unflipped branches are close to equal. This indicates that, for multiple depth prediction branches with complementary effects, maintaining a certain level of symmetry in number is preferable to maximize their effectiveness. This is why we follow the number of $z_{key}$ and design three symmetrical $z_{comp}$ in the main paper.
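The effect of branch symmetry can be illustrated with a toy example. The sketch below assumes that flipping a depth means mirroring it about the ground truth, $z' = 2 z_{gt} - z$, and that the branches are combined by a plain average. Both are simplifying assumptions for illustration (the helper names are hypothetical, and the models discussed here do not necessarily combine depths with an unweighted mean).

```python
def flip_depth(z_pred, z_gt):
    """Mirror a predicted depth about the ground truth (assumed flip operation).

    The absolute error is unchanged; only its sign is reversed.
    """
    return 2.0 * z_gt - z_pred


def combined_abs_error(z_gt, branch_preds, num_flipped):
    """Absolute error of the averaged depth after flipping the first
    `num_flipped` branches. A plain mean is used as a stand-in for the
    model's actual depth combination."""
    flipped = [
        flip_depth(z, z_gt) if i < num_flipped else z
        for i, z in enumerate(branch_preds)
    ]
    combined = sum(flipped) / len(flipped)
    return abs(combined - z_gt)


if __name__ == "__main__":
    z_gt = 30.0
    # Four coupled branches: every one overestimates the depth.
    preds = [31.0, 31.5, 30.5, 32.0]
    for k in range(len(preds) + 1):
        print(k, combined_abs_error(z_gt, preds, k))
```

In this toy setting the combined error is smallest when about half of the branches are flipped, since errors of opposite sign cancel in the average, and it returns to its original magnitude when every branch is flipped.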

| Setting | 0-0.1 | 0.1-0.2 | 0.2-0.3 | 0.3-0.4 | 0.4-∞ | Overall |
|---|---|---|---|---|---|---|
| Proportion of samples (%) | 54.09 | 27.37 | 9.61 | 4.77 | 4.15 | |
| Baseline | 0.90 | 1.17 | 1.72 | 1.84 | 2.78 | 1.18 |
| MonoCD (Ours) | 0.85 | 1.13 | 1.66 | 1.82 | 3.02 | 1.14 |

Table 9: System robustness evaluation on the KITTI val set. Samples are divided into five levels based on the MAE of $y_{glo}$ (meters, column headers); the entries are the combined depth prediction MAE (meters, ↓). The larger the $y_{glo}$ MAE, the worse the conditions the system faces. The percentage under each level is the proportion of samples at that level.

Appendix D System Robustness Evaluation

As discussed in the limitations of the main paper, the performance of our method is affected by the estimation of the ground plane equation and the keypoints. We therefore conduct a system robustness evaluation to check the performance of our method under severe conditions, as shown in Tab. 9. For our added complementary depths, the effect of inaccuracies in ground plane estimation or keypoint detection is directly reflected in the prediction error of $y_{glo}$. Therefore, we divide the samples into five levels according to the MAE of $y_{glo}$ and compute the mean absolute error of the combined depth at each level. Our method outperforms the baseline at most levels, but its performance degrades under the most severe conditions, which cover fewer than 5% of the samples. We expect this problem to be alleviated in the future by enhancing the understanding of road scenes.