MonoCD: Monocular 3D Object Detection with Complementary Depths

Longfei Yan¹ Pei Yan¹ Shengzhou Xiong¹ Xuanyu Xiang¹ Yihua Tan¹
¹Hubei Engineering Research Center of Machine Vision and Intelligent Systems,
School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, China
{longfeiyan, yanpei}@hust.edu.cn, [email protected], {xuanyuxiang, yhtan}@hust.edu.cn Corresponding author.

Abstract

Monocular 3D object detection has attracted widespread attention due to its potential to accurately obtain object 3D localization from a single image at a low cost. Depth estimation is an essential but challenging subtask of monocular 3D object detection due to the ill-posedness of 2D to 3D mapping. Many methods explore multiple local depth clues such as object heights and keypoints and then formulate the object depth estimation as an ensemble of multiple depth predictions to mitigate the insufficiency of single-depth information. However, the errors of existing multiple depths tend to have the same sign, which hinders them from neutralizing each other and limits the overall accuracy of the combined depth. To alleviate this problem, we propose to increase the complementarity of depths with two novel designs. First, we add a new depth prediction branch named complementary depth that utilizes global and efficient depth clues from the entire image rather than the local clues to reduce the similarity of depth predictions. Second, we propose to fully exploit the geometric relations between multiple depth clues to achieve complementarity in form. Benefiting from these designs, our method achieves higher complementarity. Experiments on the KITTI benchmark demonstrate that our method achieves state-of-the-art performance without introducing extra data. In addition, complementary depth can also be a lightweight and plug-and-play module to boost multiple existing monocular 3D object detectors. Code is available at https://github.com/elvintanhust/MonoCD.

1 Introduction

Figure 1: (a) Comparison of coupling (coup) and complementary (comp) multi-depth with two depth branches $Z_{1}$ and $Z_{2}$, where $Z^{*}$ and $Z_{soft}$ represent the ground truth of the depth and the final combined depth, respectively. (b) A demonstration of the complementarity of the two depth branches with the help of geometric relations when considering only the inaccurate estimation of the object 3D height $H$. Both $Z_{1}$, generated by the widely used local height clue, and $Z_{2}$, generated by our newly introduced global clue $y_{glo}$, are related to $H$. $H^{*}$ and $\hat{H}$ denote the ground truth of $H$ and the underestimated $H$, respectively.

As a significant research topic in both academia and industry, 3D object detection can empower non-human intelligences to perceive the 3D world. Compared with LiDAR-based [14, 30, 31, 37] and stereo-based [16, 33, 26, 15] approaches, monocular 3D object detection has attracted widespread attention due to its lower price and simpler configuration [18, 27]. However, its 3D localization accuracy is significantly lower than that of LiDAR-based and stereo-based approaches. To advance and promote automation technologies, such as autonomous driving and robotics, it is essential to enhance the 3D localization precision of monocular 3D object detection.

Recently, many works on monocular 3D object detection have recognized that the main factor limiting 3D localization precision is inaccurate depth estimation [28, 48, 25, 43, 18]. Following the mainstream CenterNet paradigm [45], they explore multiple local depth clues and formulate depth estimation as an ensemble of multiple depth predictions to mitigate the insufficiency of single-depth information. For instance, MonoFlex [43] explores local depth clues from direct estimation and object heights, and subsequently combines them into one depth by weighted averaging. MonoDDE [18] further derives clues from the object perspective point on top of that.

However, experiments on the KITTI dataset [9] show that 95% of the existing multi-depth prediction ensembles have the same error sign, i.e., the multiple predicted depths are usually distributed on the same side of the ground truth, as shown by the coupling case in Fig. 1(a). As a result, the depth errors cannot neutralize each other, which hinders the improvement of combined depth accuracy. We attribute this coupling phenomenon to the fact that the local depth clues they use are all derived from the same local features around the object in the CenterNet paradigm.

In this paper, we propose to increase the complementarity of depths to alleviate the problem. Complementarity here means that these predictions not only aim for high accuracy but also have different error signs. To this end, we propose two novel designs. First, considering the aforementioned coupling phenomenon, we add a new depth prediction branch that utilizes global and efficient depth clues from the entire image rather than the local clues to reduce the similarity of depth predictions. It relies on the global information that all objects in one image approximately lie on the same plane. Second, to further improve complementarity, we propose to fully exploit the geometric relations between multiple depth clues to achieve complementarity in form, which utilizes the fact that errors in the same geometric quantity may have opposite effects on different branches. For example, in Fig. 1(b), $Z_{1}$ has a negative error because the related clue, the 3D height $H$, is underestimated, whereas in this case $Z_{2}$ has a positive error because the effect of $H$ on $Z_{2}$, combined with the new clue $y_{glo}$, is opposite to its effect on $Z_{1}$. Therefore, the geometric relation based on $H$ provides complementarity to $Z_{1}$ and $Z_{2}$ in form.

Incorporating all the designs, we propose a novel monocular 3D detector with complementary depths, named MonoCD, which compensates for the complementarity neglected in previous multi-depth predictions. The main contributions of this paper are summarized as follows:

• We point out the coupling of existing monocular object depth predictions, which limits the accuracy of the combined depths. Therefore, we propose to improve depth complementarity to alleviate this problem.

• We propose to add a new depth prediction branch named complementary depth that utilizes a global and efficient depth clue, and to fully exploit the geometric relations between multiple depth clues to achieve complementarity in form.

• Evaluated on the KITTI benchmark, our method achieves state-of-the-art performance without introducing extra data. Moreover, complementary depth can be a lightweight plug-and-play module to boost multiple existing detectors.

2 Related work

2.1 Center-based Monocular 3D Detector

Many recent works [39, 7, 46, 23, 19, 44] are extended from the popular center-based paradigm CenterNet [45], which is an anchor-free method initially applied to 2D object detection. It makes the detection process simpler and more efficient by estimating all attributes of a 3D bounding box at a single center point. SMOKE [21] inherits the center-based framework and proposes that the estimation of the 2D bounding box can be omitted. MonoDLE [24] finds that the estimation of the 2D bounding box contributes to the prediction of 3D attributes and demonstrates that depth error is the main reason limiting the accuracy of monocular 3D object detection. MonoCon [20] finds that adding auxiliary learning tasks around the center can improve the generalization performance. Although the center-based framework has many benefits, it makes the prediction of all 3D attributes highly correlated with the local center. It ignores the exploitation of global information, leading to the coupling of predicted 3D attributes.

2.2 Transformer-based Monocular 3D Detector

Benefiting from the non-local encoding of the attention mechanism [35] and its development in object detection [4], multiple Transformer-based monocular 3D detectors have recently been proposed to enhance the global perception capability. MonoDTR [10] proposes to perform depth position encoding to inject global depth information into the Transformer to guide the detection, which requires LiDAR for auxiliary supervision. In contrast, MonoDETR [42] uses foreground object labels to predict foreground depth maps to achieve depth guidance. To improve the inference efficiency, MonoATT [47] proposes an adaptive token Transformer that allows finer tokens to be assigned to more significant regions in images. Although the above methods perform well, the drawbacks of high computational complexity and slow inference of Transformer-based monocular 3D detectors are still apparent. Thus, there is currently a lack of a method that combines the capability of synthesizing global information with low latency in real-world autonomous driving scenarios.

2.3 Estimation of Multi-Depth

In addition to directly estimating object depth using deep neural networks, many recent works have broadened the depth estimation branch by indirectly predicting geometric clues associated with depth. The works in [32, 23] utilize mathematical priors and uncertainty modeling to restore depth information through the ratio of 3D to 2D height. Based on them, MonoFlex [43] further extends the geometric depths to three sets using other supporting lines of the 3D bounding box and proposes to use uncertainties as weights to combine multiple depths into a final depth. MonoGround [28] introduces a local ground plane prior and enriches the depth supervision sources by randomly sampling dense points in the bottom plane of each object. MonoDDE [18] utilizes keypoint information to expand the number of depth prediction branches to 20, highlighting the importance of depth diversity. However, the complementarity between multiple depths is hardly explored. Errors in geometric clues (such as 2D/3D height) accumulate into the corresponding depth errors. Without effective complementarity, existing depth errors cannot be neutralized.

3 Approach

3.1 Problem Definition

The task of monocular 3D object detection is to recognize objects of interest from a 2D image only and predict their corresponding 3D attributes including 3D location $(x,y,z)$ , dimension $(h,w,l)$ , and orientation $\theta$ . The 3D location $(x,y,z)$ is usually transformed into 2.5D information $(u_{c},v_{c},z)$ for prediction. The recovery process of $x$ and $y$ can be formulated as:

x=\frac{(u_{c}-c_{u})z}{f_{x}},\quad y=\frac{(v_{c}-c_{v})z}{f_{y}}

(1)

where $(u_{c},v_{c})$ is the projected 3D center in the image and $(c_{u},c_{v})$ is the camera optical center. $f_{x}$ and $f_{y}$ denote the horizontal and vertical focal lengths respectively.
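The recovery step of Eq. 1 can be sketched in a few lines. This is a minimal illustration only: the function name `backproject_center` and the intrinsics used in the example call are assumptions made for this sketch and are not taken from the paper or its code release.

```python
def backproject_center(u_c, v_c, z, f_x, f_y, c_u, c_v):
    """Recover camera-frame x and y from the projected 3D center and depth (Eq. 1)."""
    x = (u_c - c_u) * z / f_x
    y = (v_c - c_v) * z / f_y
    return x, y


# Illustrative pinhole intrinsics (assumed values, chosen only for the example).
x, y = backproject_center(
    u_c=700.0, v_c=200.0, z=20.0,
    f_x=720.0, f_y=720.0, c_u=610.0, c_v=175.0,
)
```

Because both $x$ and $y$ scale linearly with $z$, any error in the depth estimate propagates directly into the full 3D location, which is why the rest of the section concentrates on $z$.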

As described in Sec. 1, many methods [43, 28, 18] have realized that depth $z$ is the main reason limiting the performance of monocular 3D detector and utilize multi-depth to improve the accuracy of depth prediction via:

z_{soft}=\displaystyle\sum_{i=1}^{n}w_{i}z_{i}

(2)

where $\{z_{i}\}_{i=1}^{n}$ represents the $n$ predicted depths and $\{w_{i}\}_{i=1}^{n}$ represents their weights determined by the predicted uncertainty [11, 12]. $z_{soft}$ is used as the final output depth.

3.2 The Effect of Complementary Depths

To demonstrate the effectiveness of complementary depths, we present its superiority from a mathematical perspective. Define two different depth prediction branches $\hat{z}_{1}$ and $\hat{z}_{2}$ as follows:

\hat{z}_{1}=z^{*}+e_{1},\quad\hat{z}_{2}=z^{*}+e_{2}

(3)

where $z^{*}$ represents the ground truth of depth. $e_{1}$ and $e_{2}$ are the errors of the two depth branches in a single prediction, respectively. Note that $e_{1}$ and $e_{2}$ are signed: a positive value means the depth is overestimated and a negative value means it is underestimated. We assume $e_{1}e_{2}>0$ to simulate the case of multiple depth coupling, as shown in Fig. 1(a). We term the final combination error of multiple coupling depths the coupling depth error. Hence, referring to Eq. 2, the coupling depth error $E_{1}$ of $\hat{z}_{1}$ and $\hat{z}_{2}$ can be formulated as:

\begin{aligned}E_{1}&=\|w_{1}\hat{z}_{1}+w_{2}\hat{z}_{2}-z^{*}\|\\&=\|w_{1}e_{1}+w_{2}e_{2}\|\end{aligned}

(4)

where $w_{1}$ and $w_{2}$ satisfy $w_{1},w_{2}>0$ and $w_{1}+w_{2}=1$ . We then flip $\hat{z}_{1}$ symmetrically along $z^{*}$ without changing the accuracy of the prediction through:

\begin{aligned}\hat{z}_{1}^{\prime}&=z^{*}-(\hat{z}_{1}-z^{*})\\&=z^{*}-e_{1}\end{aligned}

(5)

After flipping, the error signs of $\hat{z}_{1}^{\prime}$ and $\hat{z}_{2}$ are opposite, so higher complementarity between them is artificially achieved. We term the final combination error of multiple complementary depths the complementary depth error. Similarly, the complementary depth error $E_{2}$ of $\hat{z}_{1}^{\prime}$ and $\hat{z}_{2}$ can be formulated as:

\begin{aligned}E_{2}&=\|w_{1}\hat{z}_{1}^{\prime}+w_{2}\hat{z}_{2}-z^{*}\|\\&=\|w_{1}e_{1}-w_{2}e_{2}\|\end{aligned}

(6)

By mathematical transformations we further express Eqs. 4 and 6 as:

\begin{aligned}E_{1}&=\sqrt{(w_{1}e_{1}+w_{2}e_{2})^{2}}\\&=\sqrt{(w_{1}e_{1})^{2}+2w_{1}w_{2}e_{1}e_{2}+(w_{2}e_{2})^{2}}\end{aligned}

(7)

\begin{aligned}E_{2}&=\sqrt{(w_{1}e_{1}-w_{2}e_{2})^{2}}\\&=\sqrt{(w_{1}e_{1})^{2}-2w_{1}w_{2}e_{1}e_{2}+(w_{2}e_{2})^{2}}\end{aligned}

(8)

Subtracting the squared forms of Eqs. 7 and 8 gives $E_{1}^{2}-E_{2}^{2}=4w_{1}w_{2}e_{1}e_{2}$, which is strictly positive because $w_{1},w_{2}>0$ and $e_{1}e_{2}>0$. The complementary depth error $E_{2}$ is therefore always smaller than the coupling depth error $E_{1}$, regardless of the weights or the error magnitudes. The same conclusion holds if $\hat{z}_{2}$ is flipped while $\hat{z}_{1}$ is kept unchanged. We can therefore draw the conclusion: realizing the complementary relationship between two depth branches contributes to reducing the overall depth error, even without improving the accuracy of individual branches.
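As a quick numerical sanity check of the relationship between Eqs. 4 and 6, the sketch below draws random same-sign error pairs and random weights and confirms that the complementary error is smaller in every trial. The function names, sampling ranges, and seed are assumptions made for this illustration.

```python
import random


def combined_errors(e1, e2, w1):
    """Return (E1, E2): the coupling error of Eq. 4 and the complementary error of Eq. 6."""
    w2 = 1.0 - w1
    coupling = abs(w1 * e1 + w2 * e2)
    complementary = abs(w1 * e1 - w2 * e2)
    return coupling, complementary


def complementary_always_smaller(trials=10000, seed=0):
    """Check E2 < E1 on random samples that satisfy e1 * e2 > 0 and 0 < w1 < 1."""
    rng = random.Random(seed)
    for _ in range(trials):
        sign = rng.choice((-1.0, 1.0))          # both errors share this sign
        e1 = sign * rng.uniform(0.01, 5.0)
        e2 = sign * rng.uniform(0.01, 5.0)
        w1 = rng.uniform(0.01, 0.99)
        coupling, complementary = combined_errors(e1, e2, w1)
        if not complementary < coupling:
            return False
    return True
```

For example, with $e_{1}=1$, $e_{2}=2$ and equal weights, `combined_errors(1.0, 2.0, 0.5)` gives a coupling error of 1.5 and a complementary error of 0.5.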

To demonstrate the effectiveness of complementary depths in practice, we select a classical multi-depth prediction baseline [43] for evaluation on the KITTI val set. It contains 4 depth prediction branches (1 directly estimated depth and 3 geometric depths), and in our tests the coupling rate of any two branches is around 95%. As shown on the left in Fig. 3, we flip the direct depth estimation branch symmetrically about the ground truth based on Eq. 5 for a proportion of samples ranging from 0% to 100% to achieve depth complementarity at different levels. Additionally, considering the difficulty of obtaining depth predictions with opposite error signs while maintaining the same accuracy in practice, we conduct another experiment by flipping the depth branch while applying random disturbances of different magnitudes on top of it. The results are presented on the right of Fig. 3. Similar results are observed in other branches by performing the same operation as above. Based on this, we have the following three observations:

Observation 1: On the left of Fig. 3, the detection accuracy increases as the proportion of flipped samples rises. It demonstrates that increasing complementarity between multiple depth prediction branches can improve detection accuracy continuously.

Observation 2: For two independent depth prediction branches, ideally, the proportion of their predictions with opposite signs in all samples should be 50%. The situation is similar to the 50% flipped proportion on the left of Fig. 3 due to the coupling of multiple branches in the baseline. Therefore reducing the similarity of multiple depth prediction branches can also increase their complementarity.

Observation 3: In the case where the flipped proportion is fixed at 50%, as shown on the right of Fig. 3, the complementary effect does not disappear until a random disturbance with an amplitude of 2 meters is applied (which is quite significant [24] for Car in KITTI). This indicates that the complementary effect can still contribute to overall performance even if some depth estimation accuracy is lost. Ultimately, whether the overall performance improves depends on both the proportion of opposite signs and the depth estimation accuracy.

Additionally, we select models with different total numbers of depth prediction branches to perform flipping and evaluation. We find that as the number of flipped branches approaches the number of unflipped branches, the overall performance improves accordingly. For more experiments and details, please refer to the supplementary materials.

3.3 3D Detector with Complementary Depths

Framework Overview. As shown in Fig. 2, the network we design extends from CenterNet [45]. The regression heads are divided into two parts: local clues and global clues, where DLA-34 [41] is chosen as the backbone of the network. The branch of local clues is designed with reference to MonoFlex [43], which estimates dimension, keypoints, direct depth, orientation, and 2D detection for each local peak point based on the predicted Heatmap. Since the prediction of these geometric quantities is highly correlated with the position of the local peak point in the image, they are referred to as local clues. Both $z_{dir}$ and $z_{key}$ are derived from them. The branch of global clues predicts the Horizon Heatmap of the entire image based on all extracted pixel features, which is used to obtain the trend of $y_{glo}$ in scenes, and then outputs the complementary depth $z_{comp}$ embedding the global clues. How to construct a depth prediction branch with the global clues and further achieve complementarity in form will be elaborated below. Following [11, 12], we model uncertainty for all seven depth predictions (1 direct depth, 3 keypoint depths, and 3 complementary depths augmented by diagonal columns as [43]). The final depth is obtained according to Eq. 2, with $w_{i}=\frac{1}{\sigma_{i}}$ .
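The final combination step can be sketched as follows. This is an illustration under a stated assumption: the weights $1/\sigma_{i}$ are normalized so that they sum to one, which matches the condition $w_{1}+w_{2}=1$ used in Sec. 3.2 but is not spelled out in the text above. The function name `combine_depths` is hypothetical.

```python
def combine_depths(depths, sigmas):
    """Uncertainty-weighted average of depth predictions (Eq. 2).

    Weights are proportional to 1 / sigma_i and normalized to sum to one;
    the normalization is an assumption of this sketch.
    """
    if not depths or len(depths) != len(sigmas):
        raise ValueError("depths and sigmas must be non-empty and of equal length")
    if any(s <= 0 for s in sigmas):
        raise ValueError("uncertainties must be positive")
    inverse = [1.0 / s for s in sigmas]
    total = sum(inverse)
    weights = [v / total for v in inverse]
    z_soft = sum(w * z for w, z in zip(weights, depths))
    return z_soft, weights
```

A branch with a smaller predicted uncertainty pulls the combined depth toward its own estimate, so a complementary branch only helps when its uncertainty is comparable to that of the branches it is meant to offset.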

Depth Prediction with Global Clues. Inspired by [8], the neural network sees depth from a single image through:

z=\frac{{f}_{y}y}{v_{b}-c_{v}}

(9)

where $y$ denotes the $y$ -axis coordinates of the object in the camera coordinate system, and $v_{b}$ denotes the vertical coordinate of the projected bottom center in the pixel coordinate system. Considering that $y$ also represents the elevation of the plane in which the objects are located and that all objects lie approximately in one plane, $y$ contains such a global characteristic and can be distinguished from other depth clues. Unlike previous neural networks that implicitly utilize Eq. 9, we propose to predict $y$ explicitly.

To avoid falling into the coupling, we do not utilize the center-based approach discussed in Sec. 2.1 to predict $y$ . We propose to first obtain the sloping trend of $y$ in the scene by the ground plane equation. The prediction of the ground plane equation is based on the Horizon Heatmap branch, similar to [38], but we omit the edge prediction and obtain prediction results as:

\begin{aligned}&Ax+By+Cz+1.65=0\\&\text{s.t.}\quad A^{2}+B^{2}+C^{2}=1\end{aligned}

(10)

where $A=F\frac{k_{h}f_{x}}{f_{y}}$ , $B=-F$ and $C=F\frac{k_{h}c_{u}+b_{h}-c_{v}}{f_{y}}$ . $k_{h}$ and $b_{h}$ represent the slope and intercept of the horizon fitted from the Horizon Heatmap. Then, considering Eq. 1 and the projected bottom center $(u_{b},v_{b})$ of the object, $y$ with global information can be derived as:

{y}_{glo}=-\frac{1.65}{An+Cm+B}

(11)

where $n=\frac{f_{y}(u_{b}-c_{u})}{f_{x}(v_{b}-c_{v})}$ , $m=\frac{f_{y}}{v_{b}-c_{v}}$ .

Inserting Eq. 11 into Eq. 9, a new depth prediction branch with the global clue is obtained:

z_{glo}=\frac{{f}_{y}y_{glo}}{v_{b}-c_{v}}

(12)
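Equations 10 to 12 can be chained as in the sketch below. It is a minimal illustration, not the released implementation: the function names are hypothetical, and the scale factor $F$ is computed from the unit-norm constraint of Eq. 10 under the assumption that $F>0$.

```python
import math

CAMERA_HEIGHT = 1.65  # constant term of the plane equation in Eq. 10 (meters)


def plane_from_horizon(k_h, b_h, f_x, f_y, c_u, c_v):
    """Ground plane coefficients (A, B, C) of Eq. 10 from the horizon line v = k_h * u + b_h."""
    a = k_h * f_x / f_y
    b = -1.0
    c = (k_h * c_u + b_h - c_v) / f_y
    # F enforces A^2 + B^2 + C^2 = 1; taking the positive root is an assumption.
    scale = 1.0 / math.sqrt(a * a + b * b + c * c)
    return scale * a, scale * b, scale * c


def depth_from_global_clue(plane, u_b, v_b, f_x, f_y, c_u, c_v):
    """Return (y_glo, z_glo) from Eqs. 11 and 12 for a projected bottom center (u_b, v_b).

    Requires v_b != c_v, i.e. the bottom center must not lie on the principal row.
    """
    A, B, C = plane
    n = f_y * (u_b - c_u) / (f_x * (v_b - c_v))
    m = f_y / (v_b - c_v)
    y_glo = -CAMERA_HEIGHT / (A * n + C * m + B)
    z_glo = f_y * y_glo / (v_b - c_v)
    return y_glo, z_glo
```

For a perfectly level horizon ($k_{h}=0$, $b_{h}=c_{v}$) the plane reduces to $y=1.65$, so $y_{glo}$ equals the fixed 1.65 m prior; a tilted horizon changes $y_{glo}$ per object through $n$ and $m$.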

In addition, to better utilize the global features as well as to expand the receptive field, we use dilated convolution [40] to predict the Horizon Heatmap.

Complementary Form in Solving. Simply achieving more independent depth prediction is not enough; we also aim to fully exploit the geometric relations between multiple depth prediction branches to improve complementarity further. Considering the projected bottom center $(u_{b},v_{b})$ and top center $(u_{t},v_{t})$ , as shown in the orange part of Fig. 4, the depth derived from keypoint and height in [32] can be rewritten as:

z_{key}=\frac{f_{y}H}{v_{b}-v_{t}}

(13)

where $H$ represents the 3D height of the object. Combining the global $y_{glo}$ information obtained by Eq. 11 and the geometric quantities used in Eq. 13, we further propose a depth prediction that is complementary to $z_{key}$ in form:

z_{comp}=\frac{f_{y}(y_{glo}-\frac{1}{2}H)}{\frac{1}{2}(v_{b}+v_{t})-c_{v}}

(14)
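The opposite effect of a height error on Eqs. 13 and 14 can be checked on a synthetic object. The sketch below is an illustration only: the function names, the intrinsics, and the scene values (20 m depth, 1.5 m true height, 1.4 m predicted height) are assumptions chosen for the example, and the keypoints and $y_{glo}$ are taken to be exact so that only $H$ is in error.

```python
def keypoint_depth(H, v_b, v_t, f_y):
    """Depth from the 3D height and the 2D keypoint height (Eq. 13)."""
    return f_y * H / (v_b - v_t)


def complementary_depth(H, y_glo, v_b, v_t, f_y, c_v):
    """Depth from the global clue y_glo and the object's vertical midpoint (Eq. 14)."""
    return f_y * (y_glo - 0.5 * H) / (0.5 * (v_b + v_t) - c_v)


# Synthetic scene (assumed values): exact keypoints are generated from the true geometry.
f_y, c_v = 720.0, 175.0
z_true, y_glo, H_true = 20.0, 1.65, 1.5
v_b = c_v + f_y * y_glo / z_true             # projected bottom center row
v_t = c_v + f_y * (y_glo - H_true) / z_true  # projected top center row

H_hat = 1.4  # underestimated 3D height
err_key = keypoint_depth(H_hat, v_b, v_t, f_y) - z_true
err_comp = complementary_depth(H_hat, y_glo, v_b, v_t, f_y, c_v) - z_true
```

In this setup `err_key` is negative and `err_comp` is positive, which mirrors the situation drawn in Fig. 1(b): the same underestimate of $H$ pushes the two branches to opposite sides of the ground truth.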

The geometric correspondence is shown in the blue part of Fig. 4. It can be observed that the signs of $H$ and $v_{t}$ in the designed Eq. 14 are exactly opposite to those in Eq. 13. This means that the errors of $H$ and $v_{t}$ have opposite effects on $z_{key}$ and $z_{comp}$ during the prediction of 3D information for each object. Although Eq. 13 and Eq. 14 are not strictly symmetrical, this further increases the probability that the errors $e_{key}$ and $e_{comp}$ of $z_{key}$ and $z_{comp}$ satisfy the condition of $e_{key}e_{comp}<0$ . As proved by Sec. 3.2, eventually a part of the depth error is neutralized in the weighted averaging of Eq. 2.

4 Experiments

Method, Venue	Extra data	Test $AP_{3D}$ Easy	Test $AP_{3D}$ Mod.	Test $AP_{3D}$ Hard	Test $AP_{BEV}$ Easy	Test $AP_{BEV}$ Mod.	Test $AP_{BEV}$ Hard	Time (ms)
DDMP-3D [36], CVPR2021	Depth	19.71	12.78	9.80	28.08	17.89	13.44	180
Kinematic3D [2], ECCV2020	Video	19.07	12.72	9.17	26.69	17.52	13.10	120
AutoShape [22], ICCV2021	CAD	22.47	14.17	11.36	30.66	20.08	15.59	50
DCD [17], ECCV2022	CAD	23.81	15.90	13.21	32.55	21.50	18.25	-
MonoRUn [5], CVPR2021	LiDAR	19.65	12.30	10.58	27.94	17.34	15.24	70
CaDDN [29], CVPR2021	LiDAR	19.17	13.41	11.46	27.94	18.91	17.19	630
MonoDTR [10], CVPR2022	LiDAR	21.99	15.39	12.73	28.59	20.38	17.14	37
SMOKE [21], CVPRW2020	None	14.03	9.76	7.84	20.83	14.49	12.75	30
MonoDLE [24], CVPR2021	None	17.23	12.26	10.29	24.79	18.89	16.00	40
MonoRCNN [32], ICCV2021	None	18.36	12.65	10.03	25.48	18.11	14.10	70
MonoFlex [43], CVPR2021	None	19.94	13.89	12.07	28.23	19.75	16.89	35
MonoGround [28], CVPR2022	None	21.37	14.36	12.62	30.07	20.47	17.74	30
GPENet [38], -	None	22.41	15.44	12.84	30.31	20.79	18.21	-
MonoJSG [19], CVPR2022	None	24.69	16.14	13.64	32.59	21.26	18.18	42
MonoCon [20], AAAI2022	None	22.50	16.46	\underline{13.95}	31.12	22.10	\underline{19.00}	25.8
MonoDETR [42], ICCV2023	None	\underline{25.00}	\underline{16.47}	13.58	\textbf{33.60}	\underline{22.11}	18.60	43
MonoCD (Ours)	None	\textbf{25.53}	\textbf{16.59}	\textbf{14.53}	\underline{33.41}	\textbf{22.81}	\textbf{19.57}	36
Improvement	vs. second-best	+0.53	+0.12	+0.58	-0.19	+0.70	+0.57	-

Table 1: Comparison with current state-of-the-art methods on the Car category on the KITTI test set. Methods are grouped according to extra data. Following [9], the methods in each group are sorted by ${AP}_{3D}$ performance in the Moderate difficulty setting. We bold the best results and underline the second-best results.

4.1 Dataset

Our experiments are conducted on the widely-adopted KITTI 3D Object [9] dataset, which contains 7481 training images and 7518 test images. Since the annotations of the test images are not publicly accessible, we follow [6] and further divide the 7481 training images into 3712 and 3769 as the training and validation sets, respectively. Each category is further refined into three difficulties: Easy, Moderate, and Hard based on 2D height, truncation, and occlusion.

4.2 Evaluation Metrics

As in previous methods, we use Average Precision ${AP}_{3D}$ and ${AP}_{BEV}$ as the overall evaluation metrics. Following [34], 40 recall positions are used for the above AP calculations. The IoU threshold is 0.7 for Car.

In the ablation study of Sec. 4.5, the mean absolute error (MAE) of $y$ is introduced as a metric to evaluate the accuracy of the different $y$ sources. In addition, to better measure the complementarity between different designs, we quantify the magnitude of complementarity as the Complementarity Score. As discussed in Sec. 3.2, both the error sign opposite proportion and depth estimation accuracy are crucial in achieving enhanced performance. Thus we formulate the Complementarity Score(CS) as:

CS=\frac{ESOP_{z}}{{MAE}_{z}}

(15)

where $ESOP_{z}$ represents the depth Error Sign Opposite Proportion (ESOP) between the global and local clue branches, and ${MAE}_{z}$ represents the Mean Absolute Error of $z_{comp}$ . For a baseline without $z_{comp}$ , ESOP is computed between $z_{key}$ and $z_{dir}$ .
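Eq. 15 can be computed from per-object predictions as in the sketch below. The function names and the toy numbers are assumptions made for illustration. ESOP is expressed in percent here, which is consistent with the values reported in Tab. 3 (for example, 36.91 / 3.29 ≈ 11.22).

```python
def error_sign_opposite_proportion(z_a, z_b, z_gt):
    """Percentage of samples where branches a and b err on opposite sides of the ground truth."""
    opposite = sum(1 for a, b, g in zip(z_a, z_b, z_gt) if (a - g) * (b - g) < 0)
    return 100.0 * opposite / len(z_gt)


def mean_absolute_error(z_pred, z_gt):
    """Mean absolute depth error in the same unit as the inputs (meters)."""
    return sum(abs(p - g) for p, g in zip(z_pred, z_gt)) / len(z_gt)


def complementarity_score(z_comp, z_local, z_gt):
    """CS = ESOP_z / MAE_z (Eq. 15), with MAE taken over the complementary branch."""
    esop = error_sign_opposite_proportion(z_comp, z_local, z_gt)
    return esop / mean_absolute_error(z_comp, z_gt)


# Toy example (assumed values): one of four objects has opposite error signs.
z_gt = [10.0, 20.0, 30.0, 40.0]
z_comp = [11.0, 19.0, 31.0, 42.0]
z_local = [9.0, 18.0, 32.0, 41.0]
score = complementarity_score(z_comp, z_local, z_gt)
```

The ratio rewards branches that disagree in sign with the local branches while penalizing those that achieve the disagreement only by being inaccurate.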

4.3 Implementation Details

To demonstrate the effectiveness of the proposed framework, we choose three recent center-based methods with excellent performance as baseline models: MonoFlex [43], MonoDLE [24], and MonoCon [20]. All experiments are performed on a single RTX 2080Ti GPU. The aforementioned baseline models all employ DLA-34 [41] as the feature extraction network. In the Global Clues branch, the prediction head of the Horizon Heatmap contains two 3×3 conv layers with BN and ReLU (where the dilation rate is set to 2) and an output conv layer. The horizon equation is obtained by taking the largest element in each column of the Horizon Heatmap and fitting a line to them. The ground truth of the Horizon Heatmap is generated by fitting the scene ground plane through the bottom coordinate annotation of each object and then projecting it to the 2D image plane [38], so only RGB image data and camera annotations are used throughout the training process. The radius of the Gaussian kernel used for each pixel is 2 when mapping the horizon equation into the heatmap. The loss weight proportions of $z_{dir}$ , $z_{key}$ and $z_{comp}$ are set to $1:0.2:0.1$ . The remaining settings, such as the optimizer, batch size, and image padding size, remain consistent with the baseline.
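The horizon extraction described above can be sketched as a column-wise argmax followed by a line fit. This is one possible reading, not the released implementation: the text does not state the fitting method, so ordinary least squares is an assumption, and the function name `fit_horizon` and the nested-list heatmap layout are hypothetical.

```python
def fit_horizon(heatmap):
    """Fit v = k_h * u + b_h through the per-column peaks of a heatmap.

    heatmap is a list of rows (height x width). Ordinary least squares is an
    assumed choice; ties within a column resolve to the topmost row.
    """
    height, width = len(heatmap), len(heatmap[0])
    if width < 2:
        raise ValueError("need at least two columns to fit a line")
    us = list(range(width))
    # Row index of the strongest response in each column.
    vs = [max(range(height), key=lambda v: heatmap[v][u]) for u in us]
    mean_u = sum(us) / width
    mean_v = sum(vs) / width
    var_u = sum((u - mean_u) ** 2 for u in us)
    cov_uv = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    k_h = cov_uv / var_u
    b_h = mean_v - k_h * mean_u
    return k_h, b_h
```

The resulting slope $k_{h}$ and intercept $b_{h}$ are the quantities that enter the ground plane coefficients of Eq. 10.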

4.4 Quantitative Results

To demonstrate the effectiveness of the proposed method, we conduct quantitative experiments on test and val sets of KITTI [9].

As shown in Tab. 1, the proposed method is compared with recent state-of-the-art methods on the widely used KITTI test set. Our method achieves the best performance in the majority of metrics without using any additional data. Compared with the previous multi-depth solving method MonoFlex [43], our ${AP}_{3D}/{AP}_{BEV}$ at the Moderate level improves by a relative 19.44%/15.49%, respectively. Compared to GPENet [38], which also incorporates the ground plane equation, ${AP}_{3D}/{AP}_{BEV}$ at the Moderate level improves from 15.44/20.79 to 16.59/22.81. Even when compared to the latest Transformer-based detector MonoDETR [42], we outperform it in most metrics while ensuring real-time operation.

Method	Val $AP_{BEV}$ Easy	Val $AP_{BEV}$ Mod.	Val $AP_{BEV}$ Hard	Val $AP_{3D}$ Easy	Val $AP_{3D}$ Mod.	Val $AP_{3D}$ Hard
MonoDLE [24]	24.97	19.33	17.01	17.45	13.66	11.68
+ Ours	26.84	20.86	17.89	18.60	15.09	12.86
Improvement	+1.87	+1.53	+0.88	+1.15	+1.43	+1.18
MonoFlex [43]	30.51	23.16	19.87	23.64	17.51	15.14
+ Ours	31.49	23.56	20.12	24.22	18.27	15.42
Improvement	+0.98	+0.40	+0.25	+0.58	+0.76	+0.28
MonoCon [20]	33.36	24.39	21.03	26.33	19.01	15.98
+ Ours	34.60	24.96	21.51	26.45	19.37	16.38
Improvement	+1.24	+0.57	+0.48	+0.12	+0.36	+0.40

Table 2: In order to fully demonstrate the effectiveness of the proposed method, we extend complementary depth to three center-based monocular 3D detectors. Evaluation is performed on the KITTI val set. The increased performance is highlighted in blue.

As shown in Tab. 2, we extend the complementary depth branch to three competitive center-based monocular 3D detectors. The results on the KITTI val set demonstrate that the proposed complementary depth is flexible and achieves stable improvements across multiple frameworks and metrics. It is worth noting that our design generally yields a larger boost on ${AP}_{BEV}$ than on ${AP}_{3D}$ . We attribute this to the focus of our method on improving depth estimation, since ${AP}_{BEV}$ places more emphasis on the accuracy of localization along the Z-axis than ${AP}_{3D}$ [9].

4.5 Ablation Study

Setting	Val $AP_{3D}$ Easy	Val $AP_{3D}$ Mod.	Val $AP_{3D}$ Hard	$y$ MAE	$z_{comp}$ MAE	ESOP (%)	CS $\uparrow$
Baseline	23.64	17.51	15.14	-	-	4.08	-
Baseline+lo.	18.41	13.49	10.90	0.127	4.03	18.63	4.62
Baseline+fi.	21.93	15.86	13.22	0.250	8.47	45.72	5.40
Baseline+gl.	22.97	17.85	15.11	0.139	3.29	36.91	11.22
Baseline+gt.	26.21	19.43	16.50	0.097	3.23	59.08	18.29
Baseline+gl.+ed.	21.85	15.97	13.26	0.242	6.72	42.51	6.33
Baseline+gl.+di.	24.22	18.27	15.42	0.131	3.09	38.19	12.36

Table 3: Ablation study of $y$ sources on the KITTI val set. "lo." means using the local clues branch to predict $y$ for each object. "fi." means using fixed 1.65 meters as the $y$ source. "gl." means using the global clue branch to predict. "gt." means directly using the ground plane equation generated by the ground truth of the val set. "ed." means using edge detection to obtain the horizon slope in the global clues branch. "di." means using dilated convolution.

In this section, we select MonoFlex [43] as the baseline to discuss the impact of different designs.

Source of Depth Clue. To demonstrate the effectiveness of introducing the global depth clue, we adopt different approaches to obtain the depth clue $y$ , and the results are presented in rows 2, 3, 4, and 5 of Tab. 3. Comparing the ESOP metric, the ESOP values of rows 3, 4, and 5 in Tab. 3, whose $y$ sources have a global characteristic (i.e., are not determined by a single object), are significantly higher than those of the baseline and of the local clue branch. This demonstrates the necessity of introducing global clues and shows that the coupling of multi-depth prediction is alleviated. In addition, the accuracy of $z_{comp}$ is largely related to the accuracy of $y$ .

Comparing the $z_{comp}$ MAE and ESOP pairs under different settings shows that determining whether complementary depth leads to an overall performance enhancement requires evaluation from two perspectives: depth estimation accuracy and ESOP. This trend is effectively quantified by the Complementarity Score.

The results in the 6th and 7th rows of Tab. 3 justify the removal of edge detection and the use of dilated convolution when predicting the ground plane equation.

Complementary Form.

Depth Form	Val $AP_{3D}$ Easy	Val $AP_{3D}$ Mod.	Val $AP_{3D}$ Hard	$z$ MAE	ESOP (%)	CS $\uparrow$
Baseline	23.64	17.51	15.14	-	4.08	-
Eq. 12	23.16	17.62	14.73	2.27	25.69	11.32
Eq. 16	21.83	15.97	13.19	8.65	45.40	5.25
Eq. 14	24.22	18.27	15.42	3.09	38.19	12.36

Table 4: Ablation study of complementary forms on the KITTI val set. $z$ MAE reflects the depth estimation accuracy in each form.

To validate the effectiveness of achieving complementary form in enhancing detection accuracy, we present the results of different depth forms in Tab. 4. According to the results of the 2nd and 4th row in Tab. 4, the ESOP and CS of Eq. 14 are further enhanced after considering the complementary form compared to Eq. 12. Although a part of the depth estimation accuracy is sacrificed, the complementarity and overall performance are eventually improved, which is consistent with observation 3 in Sec. 3.2.

In addition to Eqs. 12 and 14 mentioned in Sec. 3.3, we also consider the following complementary form:

z=\frac{f_{y}(y_{glo}-H)}{v_{t}-c_{v}}

(16)

Although it appears that Eq. 16 is more symmetrical and complementary to $z_{key}$ in form, its depth estimation error is significantly higher than that of Eq. 14. This is due to the fact that $v_{t}$ and $c_{v}$ in the denominator are relatively close, as well as the $y_{glo}$ and $H$ in the numerator, which causes an unstable depth estimation. This is also why Eq. 16 has a higher ESOP because the instability of the estimate mitigates the prediction tendency, but it does not contribute to the overall performance. It demonstrates the importance of an appropriate form of complementary depth.

4.6 Qualitative Results

Based on the qualitative results shown in Fig. 5, it can be observed that $z_{comp}$ from the global clue branch is significantly different from $z_{dir}$ and $z_{key}$ from the local clue branch and has the opposite error sign. After combining $z_{comp}$ , the predicted box is closer to the ground truth. This visualizes the process of error neutralization.

5 Conclusion

In this paper, we point out the coupling phenomenon that the errors of existing multi-depth predictions tend to have the same sign, which limits the accuracy of the combined depth. We analyze by mathematical derivation how complementary depth alleviates this problem and find that complementarity needs to be considered in terms of both depth estimation accuracy and error sign opposite proportion. To improve depth complementarity, we propose to add a new depth prediction branch with the global clue and achieve complementarity in form through geometric relations. Extensive experiments demonstrate the effectiveness of our method.

Limitations. The performance of our framework is limited by the accuracy of the vertical position of objects, and the complementary effect may be lost when the ground plane is undulating. Future work could involve improving the understanding and prediction of global road scenarios.

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China (No.62371201), by the Basic Research Support Plan of HUST (No.6142113-JCKY2022003), and by the China Scholarship Council for funding visiting Ph.D. student (No.202106160054).

References

Brazil and Liu [2019] Garrick Brazil and Xiaoming Liu. M3d-rpn: Monocular 3d region proposal network for object detection. In ICCV, pages 9287–9296, 2019.
Brazil et al. [2020] Garrick Brazil, Gerard Pons-Moll, Xiaoming Liu, and Bernt Schiele. Kinematic 3d object detection in monocular video. In ECCV, pages 135–152. Springer, 2020.
Caesar et al. [2020] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, pages 11621–11631, 2020.
Carion et al. [2020] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, pages 213–229. Springer, 2020.
Chen et al. [2021] Hansheng Chen, Yuyao Huang, Wei Tian, Zhong Gao, and Lu Xiong. Monorun: Monocular 3d object detection by reconstruction and uncertainty propagation. In CVPR, pages 10379–10388, 2021.
Chen et al. [2015] Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew G Berneshawi, Huimin Ma, Sanja Fidler, and Raquel Urtasun. 3d object proposals for accurate object class detection. NeurIPS, 28, 2015.
Chen et al. [2020] Yongjian Chen, Lei Tai, Kai Sun, and Mingyang Li. Monopair: Monocular 3d object detection using pairwise spatial relationships. In CVPR, pages 12093–12102, 2020.
Dijk and Croon [2019] Tom van Dijk and Guido de Croon. How do neural networks see depth in single images? In ICCV, pages 2183–2191, 2019.
Geiger et al. [2012] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, pages 3354–3361. IEEE, 2012.
Huang et al. [2022] Kuan-Chih Huang, Tsung-Han Wu, Hung-Ting Su, and Winston H Hsu. Monodtr: Monocular 3d object detection with depth-aware transformer. In CVPR, pages 4012–4021, 2022.
Kendall and Gal [2017] Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? NeurIPS, 30, 2017.
Kendall et al. [2018] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, pages 7482–7491, 2018.
Kumar et al. [2022] Abhinav Kumar, Garrick Brazil, Enrique Corona, Armin Parchami, and Xiaoming Liu. Deviant: Depth equivariant network for monocular 3d object detection. In ECCV, pages 664–683. Springer, 2022.
Lang et al. [2019] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In CVPR, pages 12697–12705, 2019.
Li et al. [2019] Peiliang Li, Xiaozhi Chen, and Shaojie Shen. Stereo r-cnn based 3d object detection for autonomous driving. In CVPR, pages 7644–7652, 2019.
Li et al. [2021] Peixuan Li, Shun Su, and Huaici Zhao. Rts3d: Real-time stereo 3d detection from 4d feature-consistency embedding space for autonomous driving. In AAAI, pages 1930–1939, 2021.
Li et al. [2022a] Yingyan Li, Yuntao Chen, Jiawei He, and Zhaoxiang Zhang. Densely constrained depth estimator for monocular 3d object detection. In ECCV, pages 718–734. Springer, 2022a.
Li et al. [2022b] Zhuoling Li, Zhan Qu, Yang Zhou, Jianzhuang Liu, Haoqian Wang, and Lihui Jiang. Diversity matters: Fully exploiting depth clues for reliable monocular 3d object detection. In CVPR, pages 2791–2800, 2022b.
Lian et al. [2022] Qing Lian, Peiliang Li, and Xiaozhi Chen. Monojsg: Joint semantic and geometric cost volume for monocular 3d object detection. In CVPR, pages 1070–1079, 2022.
Liu et al. [2022] Xianpeng Liu, Nan Xue, and Tianfu Wu. Learning auxiliary monocular contexts helps monocular 3d object detection. In AAAI, pages 1810–1818, 2022.
Liu et al. [2020] Zechen Liu, Zizhang Wu, and Roland Tóth. Smoke: Single-stage monocular 3d object detection via keypoint estimation. In CVPRW, pages 996–997, 2020.
Liu et al. [2021] Zongdai Liu, Dingfu Zhou, Feixiang Lu, Jin Fang, and Liangjun Zhang. Autoshape: Real-time shape-aware monocular 3d object detection. In ICCV, pages 15641–15650, 2021.
Lu et al. [2021] Yan Lu, Xinzhu Ma, Lei Yang, Tianzhu Zhang, Yating Liu, Qi Chu, Junjie Yan, and Wanli Ouyang. Geometry uncertainty projection network for monocular 3d object detection. In ICCV, pages 3111–3121, 2021.
Ma et al. [2021] Xinzhu Ma, Yinmin Zhang, Dan Xu, Dongzhan Zhou, Shuai Yi, Haojie Li, and Wanli Ouyang. Delving into localization errors for monocular 3d object detection. In CVPR, pages 4721–4730, 2021.
Peng et al. [2022a] Liang Peng, Xiaopei Wu, Zheng Yang, Haifeng Liu, and Deng Cai. Did-m3d: Decoupling instance depth for monocular 3d object detection. In ECCV, pages 71–88. Springer, 2022a.
Peng et al. [2022b] Xidong Peng, Xinge Zhu, Tai Wang, and Yuexin Ma. Side: center-based stereo 3d detector with structure-aware instance depth estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 119–128, 2022b.
Qian et al. [2022] Rui Qian, Xin Lai, and Xirong Li. 3d object detection for autonomous driving: A survey. Pattern Recognition, 130:108796, 2022.
Qin and Li [2022] Zequn Qin and Xi Li. Monoground: Detecting monocular 3d objects from the ground. In CVPR, pages 3793–3802, 2022.
Reading et al. [2021] Cody Reading, Ali Harakeh, Julia Chae, and Steven L Waslander. Categorical depth distribution network for monocular 3d object detection. In CVPR, pages 8555–8564, 2021.
Shi et al. [2020] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In CVPR, pages 10529–10538, 2020.
Shi et al. [2023] Shaoshuai Shi, Li Jiang, Jiajun Deng, Zhe Wang, Chaoxu Guo, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn++: Point-voxel feature set abstraction with local vector representation for 3d object detection. International Journal of Computer Vision, 131(2):531–551, 2023.
Shi et al. [2021] Xuepeng Shi, Qi Ye, Xiaozhi Chen, Chuangrong Chen, Zhixiang Chen, and Tae-Kyun Kim. Geometry-based distance decomposition for monocular 3d object detection. In ICCV, pages 15172–15181, 2021.
Shi et al. [2022] Yuguang Shi, Yu Guo, Zhenqiang Mi, and Xinjie Li. Stereo centernet-based 3d object detection for autonomous driving. Neurocomputing, 471:219–229, 2022.
Simonelli et al. [2019] Andrea Simonelli, Samuel Rota Bulo, Lorenzo Porzi, Manuel López-Antequera, and Peter Kontschieder. Disentangling monocular 3d object detection. In ICCV, pages 1991–1999, 2019.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 30, 2017.
Wang et al. [2021] Li Wang, Liang Du, Xiaoqing Ye, Yanwei Fu, Guodong Guo, Xiangyang Xue, Jianfeng Feng, and Li Zhang. Depth-conditioned dynamic message propagation for monocular 3d object detection. In CVPR, pages 454–463, 2021.
Xu et al. [2022] Qiangeng Xu, Yiqi Zhong, and Ulrich Neumann. Behind the curtain: Learning occluded shapes for 3d object detection. In AAAI, pages 2893–2901, 2022.
Yang et al. [2022] Fan Yang, Xinhao Xu, Hui Chen, Yuchen Guo, Jungong Han, Kai Ni, and Guiguang Ding. Ground plane matters: Picking up ground plane prior in monocular 3d object detection. arXiv preprint arXiv:2211.01556, 2022.
Yin et al. [2021] Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center-based 3d object detection and tracking. In CVPR, pages 11784–11793, 2021.
Yu and Koltun [2015] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
Yu et al. [2018] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In CVPR, pages 2403–2412, 2018.
Zhang et al. [2023] Renrui Zhang, Han Qiu, Tai Wang, Ziyu Guo, Ziteng Cui, Yu Qiao, Hongsheng Li, and Peng Gao. Monodetr: Depth-guided transformer for monocular 3d object detection. In ICCV, pages 9155–9166, 2023.
Zhang et al. [2021] Yunpeng Zhang, Jiwen Lu, and Jie Zhou. Objects are different: Flexible monocular 3d object detection. In CVPR, pages 3289–3298, 2021.
Zhang et al. [2022] Yunpeng Zhang, Wenzhao Zheng, Zheng Zhu, Guan Huang, Dalong Du, Jie Zhou, and Jiwen Lu. Dimension embeddings for monocular 3d object detection. In CVPR, pages 1589–1598, 2022.
Zhou et al. [2019] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
Zhou et al. [2021] Yunsong Zhou, Yuan He, Hongzi Zhu, Cheng Wang, Hongyang Li, and Qinhong Jiang. Monoef: Extrinsic parameter free monocular 3d object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):10114–10128, 2021.
Zhou et al. [2023] Yunsong Zhou, Hongzi Zhu, Quan Liu, Shan Chang, and Minyi Guo. Monoatt: Online monocular 3d object detection with adaptive token transformer. In CVPR, pages 17493–17503, 2023.
Zhu et al. [2023] Minghan Zhu, Lingting Ge, Panqu Wang, and Huei Peng. Monoedge: Monocular 3d object detection using local perspectives. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 643–652, 2023.


Supplementary Material

Appendix A Cross-Dataset Evaluation

To demonstrate the generalizability of our proposed method, we conduct cross-dataset evaluations on KITTI [9] and nuScenes [3] datasets. Following [32], our model is trained on the KITTI training set (3712 images), and evaluated on KITTI (3769 images) and nuScenes frontal (6019 images) validation sets. We also provide the results of retraining MonoCon [20] using the official code but unrestricted from training on distant objects ( $z>65m$ ) as a fair comparison with others. To fit the model trained in KITTI, for the nuScenes dataset, we adjusted the image resolution to 384×672 and the ground plane equation prediction preset height to 1.562m (the ego car height in nuScenes [3]). Neither our method nor MonoCon uses normalized coordinates for the direct depth prediction branch and the images of KITTI and nuScenes have different focal lengths which the direct depth prediction relies on. Thus, following [13], we divide their direct predicted depth by 1.361.

The cross-dataset evaluation results are shown in Tab. 5. Our method has lower prediction errors in all object depth ranges, which indicates the effectiveness of the proposed complementary depths in improving overall accuracy. In addition, our method outperforms the other methods on all reported metrics on both datasets, which demonstrates its generalizability.
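The evaluation protocol above can be sketched as follows. This is a minimal illustration, not the released evaluation code: the function names are hypothetical, and the half-open range convention $[lo, hi)$ for the depth bins is an assumption. The factor 1.361 is the one stated above for rescaling direct depth predictions on nuScenes.

```python
def rescale_direct_depth(z_dir, factor=1.361):
    """Divide directly predicted depths by a fixed factor (1.361 in the text)."""
    return [z / factor for z in z_dir]


def mae_by_depth_range(z_gt, z_pred, edges=(0.0, 20.0, 40.0, float("inf"))):
    """Mean absolute depth error per ground-truth depth range, plus overall.

    Ranges are half-open [lo, hi); this boundary convention is an assumption.
    Returns a dict keyed by (lo, hi) tuples and by "all". Empty bins map to None.
    """
    result = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        errs = [abs(p - g) for g, p in zip(z_gt, z_pred) if lo <= g < hi]
        result[(lo, hi)] = sum(errs) / len(errs) if errs else None
    all_errs = [abs(p - g) for g, p in zip(z_gt, z_pred)]
    result["all"] = sum(all_errs) / len(all_errs) if all_errs else None
    return result


# Toy data (hypothetical): one object per depth range.
gt = [10.0, 30.0, 50.0]
pred = [11.0, 28.0, 55.0]
table_row = mae_by_depth_range(gt, pred)
print(table_row)
```

Each entry of the returned dictionary corresponds to one column of a row in Tab. 5.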

Appendix B Discussion on multi-depth prediction methods

Tab. 6 lists some representative multi-depth prediction methods of recent years. The third column reports the coupling between their multiple branches in terms of Error Sign Opposite Proportion (ESOP). MonoFlex [43] contains 4 depth prediction branches: 1 directly predicted depth and the 3 depths shown in the 2nd row of Tab. 6. MonoGround [28] and our method each add 3 depth branches on top of these. Since the results of the shared branches are similar, for MonoGround and our method Tab. 6 only shows the results of the unshared branches.

It can be observed that the error signs of the 3 depths from keypoint and height are similar to the error sign of the directly predicted depth. Benefiting from a wider range of dense depth supervision, the depths from ground added by MonoGround [28] mitigate the coupling phenomenon slightly but do not eliminate it, because the dense supervision comes from locally sampled values around the bottom of the object. Although the code of MonoDDE [18] has not been released, a similar coupling phenomenon can be inferred from the local information it uses. In contrast, with our complementary design the coupling phenomenon is significantly alleviated and the overall performance is further improved.
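As a reading aid for the ESOP values in Tab. 6, the following sketch computes the percentage of samples for which two depth branches err on opposite sides of the ground truth. The precise definition is given in the main paper; this version, including the choice to count zero errors as not opposite, is an assumption, and the function name is hypothetical.

```python
def esop(z_gt, z_a, z_b):
    """Percentage of samples whose two depth errors have opposite signs.

    A sample counts as opposite when (z_a - z_gt) and (z_b - z_gt) have
    strictly different signs; zero errors are not counted (an assumption).
    """
    if not z_gt:
        raise ValueError("at least one sample is required")
    opposite = sum(
        1 for g, a, b in zip(z_gt, z_a, z_b) if (a - g) * (b - g) < 0
    )
    return 100.0 * opposite / len(z_gt)


# Toy data (hypothetical): errors of the direct branch are +1, -1, +1, +2,
# errors of the other branch are +2, +1, -1, +1.
gt = [10.0, 20.0, 30.0, 40.0]
z_dir = [11.0, 19.0, 31.0, 42.0]
z_other = [12.0, 21.0, 29.0, 41.0]
score = esop(gt, z_dir, z_other)
print(score)
```

A low value means the two branches usually err in the same direction, so averaging them cannot cancel their errors.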

Dataset	Method	Depth prediction MAE (meters) $\downarrow$
Dataset	Method	0-20	20-40	40-$\infty$	All
KITTI	M3D-RPN [1]	0.56	1.33	2.73	1.26
	MonoRCNN [32]	0.46	1.27	2.59	1.14
	GUPNet [23]	0.45	1.10	1.85	0.89
	MonoCon [20]	0.40	1.08	1.78	0.85
	MonoCD(Ours)	0.37	1.04	1.72	0.83
nuScenes	M3D-RPN [1]	0.94	3.06	10.36	2.67
	MonoRCNN [32]	0.94	2.84	8.65	2.39
	GUPNet [23]	0.82	1.70	6.20	1.45
	MonoCon [20]	0.78	1.65	6.02	1.40
	MonoCD(Ours)	0.73	1.59	5.78	1.33

Table 5: Cross-dataset evaluation on KITTI and nuScenes frontal validation with depth prediction MAE.

Model	Branch (dir&)	ESOP (%) $\uparrow$	Val, $AP_{3D}$ Mod. $\uparrow$
MonoFlex [43]	key0	4.08	17.51
	key1	5.22	
	key2	6.19	
MonoGround [28]	gro0	18.35	18.69
	gro1	20.72	
	gro2	14.73	
MonoCD (Ours)	comp0	38.19	19.37
	comp1	40.24	
	comp2	40.05	

Table 6: Comparison between multiple depth prediction methods. The second column lists the branches used to calculate ESOP with the directly (dir) predicted depth of each model, including depths from keypoint and height (key), depths from ground (gro), and depths for complementary (comp). Different suffix numbers are used to distinguish the specific branches. The accuracy in the last column is $AP_{40}$ for the moderate Car category at 0.7 IoU threshold on KITTI.

Appendix C Additional Experiments on the Effect of Complementary Depths

This section supplements the part of Sec. 3.2 in the main paper that is not presented in detail due to space limits. With the analyses in this section, two experimental conclusions can be obtained:

(1) Existing multiple predicted depths suffer from a common problem of lacking complementarity.

(2) To maximize the complementary effect, it is beneficial to keep prediction branches symmetrical in number.

C.1 Flip on Different Branch

Flipped Branch	Proportion of Flipped Samples
Flipped Branch	0%	25%	50%	75%	100%
dir	17.51	21.02	25.93	31.69	36.12
key0	17.51	21.06	25.78	31.26	35.87
key1	17.51	20.92	25.55	30.87	35.42
key2	17.51	20.85	25.33	29.76	34.92

Table 7: Results of flipping different depth branches at different proportions of flipped samples on the KITTI dataset.

Model	Number of Flipped Branches	Val, $AP_{3D}$ Mod. $\uparrow$
MonoFlex [43]	0	17.51
	1	25.93
	2	35.79
	3	22.95
	4	15.55
MonoGround [28]	0	18.69
	1	20.59
	2	21.79
	3	24.24
	4	32.34
	5	32.75
	6	22.60
	7	17.12

Table 8: Evaluation results of two multi-depth prediction models with different numbers of flipped branches on KITTI dataset, where the proportion of flipped samples is fixed at 50%.

As shown in Tab. 7, we perform flipping on different branches of MonoFlex [43] with different proportions of flipped samples. The first row of results in the table is presented on the left of Fig. 3 in the main paper. The results of flipping different branches are similar, which indicates that the coupling between the multiple depth branches is relatively uniform and that the lack of complementarity is common.

C.2 Flip with Different Numbers of Branches

To maximize the complementary effect, we additionally conduct an analytical study on two multi-depth prediction models with different numbers of flipped branches. The results in Tab. 8 show that flipping a subset of the branches improves performance, whereas flipping all branches does not. In the latter case, although the accuracy of the depth prediction does not change with flipping, all depth values are moved to the other side of the ground truth. According to Eq. (1) in the main paper, this introduces additional error into the predicted $x$ and $y$, resulting in a decrease in the accuracy of the predicted 3D bounding box.
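The effect of the number of flipped branches on the combined depth can be illustrated with a toy sketch. It assumes that flipping reflects a prediction about the ground truth, which keeps the error magnitude and reverses its sign, and it combines branches with a plain mean. The detectors' actual combination rule may differ, and the branch values are hypothetical.

```python
def flip_about_gt(z_pred, z_gt):
    """Reflect a prediction to the other side of the ground truth."""
    return 2.0 * z_gt - z_pred


def combined_abs_error(branch_preds, z_gt, flipped=()):
    """Absolute error of the plain mean of all branches after flipping
    the branches whose indices are listed in `flipped`."""
    preds = [
        flip_about_gt(z, z_gt) if i in flipped else z
        for i, z in enumerate(branch_preds)
    ]
    return abs(sum(preds) / len(preds) - z_gt)


# Toy sample (hypothetical): four coupled branches that all overestimate.
Z_GT = 30.0
BRANCHES = [31.0, 32.0, 30.5, 31.5]

# Flip the first k branches for k = 0..4.
errors = [combined_abs_error(BRANCHES, Z_GT, flipped=range(k)) for k in range(5)]
print(errors)  # [1.25, 0.75, 0.25, 0.5, 1.25]
```

In this toy case the combined depth error is smallest when half of the branches are flipped and returns to its original magnitude when all are flipped. The sketch covers the depth error only; it does not model the additional $x$ and $y$ error discussed above.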

Furthermore, it is worth noting that both models perform best when the number of flipped branches and the number of unflipped branches are close to the same. This indicates that for multiple depth prediction branches with complementary effects, maintaining a certain level of symmetry in number is preferable to maximize their effectiveness. This is why we follow the number of $z_{key}$ and design three symmetrical $z_{comp}$ in the main paper.

Setting	Combined depth prediction MAE (meters) $\downarrow$, grouped by $y_{glo}$ MAE (meters)					
Setting	0-0.1	0.1-0.2	0.2-0.3	0.3-0.4	0.4-$\infty$	Overall
Proportion of samples (%)	54.09	27.37	9.61	4.77	4.15	-
Baseline	0.90	1.17	1.72	1.84	2.78	1.18
MonoCD(Ours)	0.85	1.13	1.66	1.82	3.02	1.14

Table 9: The system robustness evaluation on the KITTI val set, which contains five levels based on the MAE of $y_{glo}$. The larger the value, the worse the conditions the system faces. The percentage under each level represents the proportion of samples.

Appendix D System Robustness Evaluation

As we discussed in the limitations of the main paper, the performance of our method is affected by the estimation of the ground plane equation and keypoints. Thus, we conduct a system robustness evaluation to check the performance of our method in severe conditions as shown in Tab. 9. For our added complementary depths, the effect of inaccuracies in ground plane estimation or keypoint detection is directly reflected in the prediction error of $y_{glo}$ . Therefore, we divide the samples into five levels according to the MAE of $y_{glo}$ and count the mean absolute error of the combined depth at each level. It can be observed that our method outperforms the baseline in most cases, and in a few severe conditions (less than 5%), the performance of our method degrades. This problem will be alleviated by enhancing the understanding of road scenes in the future.