RoboFusion: Towards Robust Multi-Modal 3D Object Detection via SAM

Ziying Song1    Guoxing Zhang2    Lin Liu1    Lei Yang3    Shaoqing Xu4    Caiyan Jia1 Corresponding author   
Feiyang Jia1
   Li Wang5 1School of Computer Science and Technology & Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University
2Hebei University of Science and Technology 3Tsinghua University 4University of Macau 5Beijing Institute of Technology {songziying, cyjia, feiyangjia}@bjtu.edu.cn
Abstract

Multi-modal 3D object detectors are dedicated to exploring secure and reliable perception systems for autonomous driving (AD). Although achieving state-of-the-art (SOTA) performance on clean benchmark datasets, they tend to overlook the complexity and harsh conditions of real-world environments. With the emergence of visual foundation models (VFMs), opportunities and challenges are presented for improving the robustness and generalization of multi-modal 3D object detection in AD. Therefore, we propose RoboFusion, a robust framework that leverages VFMs like SAM to tackle out-of-distribution (OOD) noise scenarios. We first adapt the original SAM for AD scenarios named SAM-AD. To align SAM or SAM-AD with multi-modal methods, we then introduce AD-FPN for upsampling the image features extracted by SAM. We employ wavelet decomposition to denoise the depth-guided images for further noise reduction and weather interference. At last, we employ self-attention mechanisms to adaptively reweight the fused features, enhancing informative features while suppressing excess noise. In summary, RoboFusion significantly reduces noise by leveraging the generalization and robustness of VFMs, thereby enhancing the resilience of multi-modal 3D object detection. Consequently, RoboFusion achieves SOTA performance in noisy scenarios, as demonstrated by the KITTI-C and nuScenes-C benchmarks. Code is available at https://github.com/adept-thu/RoboFusion.

1 Introduction

Refer to caption
Figure 1: (a) We employ Gaussian distributions to represent the distributional disparities among the datasets. Indeed, there exists a large gap in data distribution between an OOD noise validation set and a clean validation set. Where the X-axis represents the set of mean pixel values in a dataset, X={xi|i=1,2,,N}𝑋conditional-setsubscript𝑥𝑖𝑖12𝑁X=\{x_{i}\,|\,i=1,2,...,N\}italic_X = { italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_i = 1 , 2 , … , italic_N }, with xi=1H×W×3i=1Hj=1Wk=13(Iijk)subscript𝑥𝑖1𝐻𝑊3superscriptsubscript𝑖1𝐻superscriptsubscript𝑗1𝑊superscriptsubscript𝑘13subscript𝐼𝑖𝑗𝑘x_{i}=\frac{1}{H\times W\times 3}\sum_{i=1}^{H}\sum_{j=1}^{W}\sum_{k=1}^{3}(I_% {ijk})italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_H × italic_W × 3 end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_H end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_W end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT ( italic_I start_POSTSUBSCRIPT italic_i italic_j italic_k end_POSTSUBSCRIPT ), where N𝑁Nitalic_N is the number of the dataset, H𝐻Hitalic_H is the height, W𝑊Witalic_W is the width, and Iijksubscript𝐼𝑖𝑗𝑘I_{ijk}italic_I start_POSTSUBSCRIPT italic_i italic_j italic_k end_POSTSUBSCRIPT denotes the pixel values for each image. (b) Visual foundation models (VFMs) like SAM Kirillov et al. (2023), show robust performance in many noisy scenarios. Yet, the current methods are not robust enough to predict 3D tasks for autonomous driving perception. (c) To this end, we propose a robust framework, RoboFusion, which employs VFMs at the SOTA multi-modal 3D object detection. Empirical results reveal that our method surpasses the Top-performing LoGoNetLi et al. (2023) on the KITTI Leaderboard by a margin of 23.12% mAP (Weather) on KITTI-C Dong et al. (2023) noisy scenarios. Notably, our RoboFusion shows better performance with LoGoNet Li et al. (2023) in clean KITTI Geiger et al. (2012) dataset.

Multi-modal 3D object detection plays a pivotal role in autonomous driving (AD) Wang et al. (2023a); Song et al. (2024a). Different modalities often provide complementary information. For instance, images contain richer semantic representations, yet lack depth information. In contrast, point clouds offer geometric and depth details, but they are sparse and lack semantic information. Therefore, effectively leveraging the advantages of multi-model while mitigating their limitations contributes to enhancing the robustness and accuracy of perception systems Song et al. (2023).

With the emergence of AD datasets Geiger et al. (2012); Caesar et al. (2020); Zhang et al. (2023c), state-of-the-art (SOTA) methods Liu et al. (2023); Bai et al. (2022); Chen et al. (2022); Huang et al. (2020); Li et al. (2023); Song et al. (2024b) on ‘clean’ datasets Geiger et al. (2012); Caesar et al. (2020) have achieved record-breaking performance. However, they overlook the exploration of robustness and generalization in out-of-distribution (OOD) scenarios Dong et al. (2023). For example, the KITTI dataset Geiger et al. (2012) lacks severe weather conditions. When SOTA methods Chen et al. (2022); Li et al. (2023); Liu et al. (2023) learn from these sunny weather datasets, can they truly generalize and maintain robustness in severe weather conditions like snow and fog?

The answer is ‘No’, as shown in Fig. 1 and verified in Table 3. People often utilize domain adaptation (DA) techniques to address these challenges Wang et al. (2023b); Tsai et al. (2023); Peng et al. (2023); Hu et al. (2023). Although DA techniques improve the robustness of 3D object detection and reduce the need for annotated data, they have some profound drawbacks, including domain shift limitations, label shift issues, and overfitting risks Oza et al. (2023). For instance, DA techniques may be constrained if the differences between two domains are significant, leading to performance degradation on the target domain.

Recently, both Natural Language Processing (NLP) and Computer Vision (CV) have witnessed the appearance and the power of a series of foundation models Kirillov et al. (2023); OpenAI (2023); Zhao et al. (2023); Zhang et al. (2023a), resulting in the emergence of new paradigms in deep learning. For example, a series of novel visual foundation models (VFMs) Kirillov et al. (2023); Zhao et al. (2023); Zhang et al. (2023a) have been developed. Thanks to their extensive training on huge datasets, these models exhibit powerful generalization capabilities. These developments have inspired new ideas, leveraging the robustness and generalization abilities of VFMs to achieve generalization in OOD noisy scenarios, much like how adults generalize knowledge when encountering new situations, without relying on DA techniques Wang et al. (2023b); Tsai et al. (2023).

Inspired by the success of VFMs in CV tasks, in this work, we intend to use these models to tackle the challenges of multi-modal 3D object detectors in OOD noise scenarios. Therefore, we propose a robust framework, RoboFusion, which leverages VFMs like SAM to adapt a 3D multi-modal object detector from clean scenarios to OOD noise scenarios. In particular, the adaptation strategies for SAM are as follows. 1) We utilize features extracted from SAM rather than inference segmentation results. 2) We propose SAM-AD, which is a pre-trained SAM for AD scenarios. 3) We introduce a novel AD-FPN to address the issue of feature upsampling for aligning VFMs with multi-modal 3D object detector. 4) To further reduce noise interference and retain essential signal features, we design a Depth-Guided Wavelet Attention (DGWA) module that effectively attenuates both high-frequency and low-frequency noises. 5) After fusing point cloud features and image features, we propose Adaptive Fusion to further enhance feature robustness and noise resistance through self-attention to re-weight the fused features adaptively. We validate RoboFusion’s robustness against OOD noise scenarios in KITTI-C and nuScenes-C datasets Dong et al. (2023), achieving SOTA performance amid noise, as shown in Fig. 1.

Refer to caption
Figure 2: The framework of RoboFusion. The LiDAR branch follows the baselines  Chen et al. (2022); Bai et al. (2022) to generate LiDAR features. In the camera branch, first, we extract robust image features using a highly optimized SAM-AD and acquire multi-scale features using AD-FPN. Second, the sparse depth map S𝑆Sitalic_S is generated by the raw points and fed into a depth encoder to obtain depth features and fused with multi-scale image features Fisubscript𝐹𝑖F_{i}italic_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT to obtain depth-guided image features F^isubscript^𝐹𝑖\hat{F}_{i}over^ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Then wave attention is used to remove the mutation noise. Finally, adaptive Fusion integrates point cloud features with robust image features with depth information via self-attention mechanism.
Refer to caption
Figure 3: An illustration of the pre-training framework. We corrupt a clean image x𝑥xitalic_x by η𝜂\etaitalic_η which contains multiple weather noises and then randomly masking several patches on a noisy image x+η𝑥𝜂x+\etaitalic_x + italic_η to obtain a masked noisy image Mask(x+η)𝑀𝑎𝑠𝑘𝑥𝜂Mask(x+\eta)italic_M italic_a italic_s italic_k ( italic_x + italic_η ). The SAM-AD and DMAE decoder are trained to reconstruct the clean image x^^𝑥\hat{x}over^ start_ARG italic_x end_ARG from Mask(x+η)𝑀𝑎𝑠𝑘𝑥𝜂Mask(x+\eta)italic_M italic_a italic_s italic_k ( italic_x + italic_η ).
Refer to caption
Figure 4: The architecture of Adaptive Fusion, which involves adaptively re-weighting the fused features using self-attention.

2 Related Work

2.1 Multi-Modal 3D Object Detection

Currently, multi-modal 3D object detection has received considerable attention on popular datasets Geiger et al. (2012); Caesar et al. (2020). BEVFusion Liu et al. (2023) fuse multi-modal representations in a unified 3D or BEV space. TransFusion Bai et al. (2022) builds a two-stage pipeline where proposals are generated based on LiDAR features and further refined using query image features. DeepInteraction Yang et al. (2022) and SparseFusion Xie et al. (2023) further optimize the camera branch on top of TransFusion. Previous methods are highly optimized to achieve the best performance on clean datasets. However, they ignore common factors in the real world (e.g., bad weather and sensor noise). In this work, we consider a real-world robustness perspective and design a robust multi-modal 3D perception framework, RoboFusion.

2.2 Visual Foundation Models for 3D Object Detection

Motivated by the success of Large Language Models (LLMs) OpenAI (2023), VFMs start to be explored in CV community. SAM Kirillov et al. (2023) leverages ViT Dosovitskiy et al. (2020) to train on the huge SA-1B dataset, containing 11 million samples, which enables SAM to be generalized to many scenes. Currently, there have been a few research endeavors aiming at integrating 3D object detectors with SAM. For instance, SAM3D Zhang et al. (2023b), as a LiDAR-only method, solely transforms LiDAR’s 3D perspective into a BEV (Bird’s Eye View) 2D space to harness the generalization capabilities of SAM, yielding sub-optimal performance on ‘clean’ datasets. Another in progress work, 3D-Box-Segment-Anything 111https://github.com/dvlab-research/3D-Box-Segment-Anything, tries to utilize SAM for 3D object detection. This indicates the highly attention of SAM like foundation models in 3D scenes in the literature. Our RoboFusion, as a multi-modal method, gives clear strategies to leverage the generalization capabilities of VFMs to address the OOD noise challenges inherent in existing 3D multi-modal object detection methods.

3 RoboFusion

In this section, we present RoboFusion, a framework that harnesses the robustness and generalization capabilities of VFMs such as SAM Kirillov et al. (2023) for multi-modal 3D object detection. The overall architecture is depicted in Fig. 2 and comprises the following components: 1) SAM-AD & AD-FPN module which obtains robust multi-scale image features, 2) Depth-Guided Wavelet Attention (DGWA) module which employs wavelet decomposition to denoise depth-guided image features, 3) Adaptive Fusion module which adaptively fuses point cloud features with image features.

3.1 SAM-AD & AD-FPN

Preliminaries. SAM Kirillov et al. (2023), a VFM, achieves generalization across diverse scenes due to its extensive training on the large-scale SA-1B dataset—with over 11 million samples and 1 billion high-quality masks. Currently, SAM family Kirillov et al. (2023); Zhao et al. (2023); Zhang et al. (2023a) primarily support 2D tasks. However, directly extending VFMs like SAM to 3D tasks presents a gap. To address this, we combine SAM with multi-modal 3D models, merging 2D robust feature representations with 3D point cloud features to achieve robust fused features.

SAM-AD. To further adapt SAM with AD (autonomous driving) scenarios, we perform pre-training on SAM to obtain SAM-AD. Specifically, we curate an extensive collection of image samples from well-established datasets (i.e., KITTI Geiger et al. (2012) and nuScenes Caesar et al. (2020)), forming the foundational AD dataset. Following DMAE Wu et al. (2023), we perform pre-training on SAM to obtain SAM-AD in AD scenarios, as shown in Fig. 3. We denote x𝑥xitalic_x as a clean image from the AD dataset (i.e. KITTI Geiger et al. (2012) and nuScenes Caesar et al. (2020)) and η𝜂\etaitalic_η as a set of noise images generated by Dong et al. (2023) based on x𝑥xitalic_x. And the noise type and the severity are randomly chosen from the four weather (i.e., rain, snow, fog, and strong sunlight) and the five severities from 1 to 5, respectively. We employ the image encoder of SAM Kirillov et al. (2023) , MobileSAM Zhang et al. (2023a) as our encoder while the decoder and the reconstruction loss are the same as DMAE Wu et al. (2023). For FastSAM Zhao et al. (2023), we adopt YOLOv8 222https://github.com/ultralytics/ultralytics to pre-train FastSAM on the AD dataset. To avoid overfitting, we use random resizing and cropping as data augmentation. We also set the mask ratio as 0.75 and have trained 400 epochs on 8 NVIDIA A100 GPUs.

AD-FPN. As a promptable segmentation model, SAM has three components: image encoder, prompt encoder and mask decoder. Generally, the image encoder can provide high-quality and highly robust image embedding for downstream models, while the mask decoder is only designed to provide decoding services for semantic segmentation. Furthermore, what we require are robust image features rather than the processing of prompting information by the prompt encoder. Therefore, we employ SAM’s image encoder to extract robust image features. However, SAM utilizes the ViT series Dosovitskiy et al. (2020) as its image encoder, which excludes multi-scale features and provides only high-dimensional low-resolution features. To generate the multi-scale features required for object detection, inspired by Li et al. (2022a), we design an AD-FPN that offers ViT-based multi-scale features. Specifically, leveraging height-dimensional image embedding with stride 16 (scale=1/161161/161 / 16) provided by SAM, we produce a series of multi-scale features Fmssubscript𝐹𝑚𝑠F_{ms}italic_F start_POSTSUBSCRIPT italic_m italic_s end_POSTSUBSCRIPT with stride of {32,16,8,4}321684\left\{32,16,8,4\right\}{ 32 , 16 , 8 , 4 }. Sequentially, we acquire multi-scale feature FiH4×W4×Cisubscript𝐹𝑖superscript𝐻4𝑊4subscript𝐶𝑖F_{i}\in\mathbb{R}^{\frac{H}{4}\times\frac{W}{4}\times C_{i}}italic_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT divide start_ARG italic_H end_ARG start_ARG 4 end_ARG × divide start_ARG italic_W end_ARG start_ARG 4 end_ARG × italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT by integrate Fmssubscript𝐹𝑚𝑠F_{ms}italic_F start_POSTSUBSCRIPT italic_m italic_s end_POSTSUBSCRIPT in a bottom-up manner similar to FPN Lin et al. (2017).

3.2 Depth-Guided Wavelet Attention

Although SAM-AD or SAM has the capability to extract robust image features, the gap between 2D and 3D domains still persists and cameras lacking geometric information in a corrupted environment often amplify noise and give rise to negative migration issues. To mitigate this problem, we propose the Depth-Guided Wavelet Attention (DGWA) module, which can be split into two steps. 1) A depth-guided network is designed, that adds geometry prior to image features by combining image features and depth features from a point cloud. 2) The features of an image are decomposed into four wavelet subbands using the Haar wavelet transform Liu et al. (2020a), then attention mechanism allows to denoise informative features in the subbands.

Formally, given image features FiH4×W4×Cisubscript𝐹𝑖superscript𝐻4𝑊4subscript𝐶𝑖F_{i}\in\mathbb{R}^{\frac{H}{4}\times\frac{W}{4}\times C_{i}}italic_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT divide start_ARG italic_H end_ARG start_ARG 4 end_ARG × divide start_ARG italic_W end_ARG start_ARG 4 end_ARG × italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT and raw points PN,Cp𝑃superscript𝑁subscript𝐶𝑝P\in\mathbb{R}^{N,C_{p}}italic_P ∈ blackboard_R start_POSTSUPERSCRIPT italic_N , italic_C start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT end_POSTSUPERSCRIPT as input. We project P𝑃Pitalic_P onto the image plane to acquire a sparse depth map SH×W×2𝑆superscript𝐻𝑊2S\in\mathbb{R}^{H\times W\times 2}italic_S ∈ blackboard_R start_POSTSUPERSCRIPT italic_H × italic_W × 2 end_POSTSUPERSCRIPT. Next, we feed S𝑆Sitalic_S into the depth encoder DE()𝐷𝐸DE(\cdot)italic_D italic_E ( ⋅ ), which consists of several convolution and max pooling blocks, to acquire depth features FdH4×W4×Cisubscript𝐹𝑑superscript𝐻4𝑊4subscript𝐶𝑖F_{d}\in\mathbb{R}^{\frac{H}{4}\times\frac{W}{4}\times C_{i}}italic_F start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT divide start_ARG italic_H end_ARG start_ARG 4 end_ARG × divide start_ARG italic_W end_ARG start_ARG 4 end_ARG × italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT. Afterward, we leverage convolution encode (Fi,Fd)subscript𝐹𝑖subscript𝐹𝑑(F_{i},F_{d})( italic_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_F start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) to acquire depth-guided image features Fi^H4×W4×16^subscript𝐹𝑖superscript𝐻4𝑊416\hat{F_{i}}\in\mathbb{R}^{\frac{H}{4}\times\frac{W}{4}\times{16}}over^ start_ARG italic_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG ∈ blackboard_R start_POSTSUPERSCRIPT divide start_ARG italic_H end_ARG start_ARG 4 end_ARG × divide start_ARG italic_W end_ARG start_ARG 4 end_ARG × 16 end_POSTSUPERSCRIPT, given by

Fi^=Conv(Concat(Fi,DE(S))).^subscript𝐹𝑖𝐶𝑜𝑛𝑣𝐶𝑜𝑛𝑐𝑎𝑡subscript𝐹𝑖𝐷𝐸𝑆\hat{F_{i}}=Conv(Concat(F_{i},DE(S))).over^ start_ARG italic_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG = italic_C italic_o italic_n italic_v ( italic_C italic_o italic_n italic_c italic_a italic_t ( italic_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_D italic_E ( italic_S ) ) ) . (1)

Subsequently, we employ discrete wavelet transform (DWT), a reversible operator, to partition the input Fi^^subscript𝐹𝑖\hat{F_{i}}over^ start_ARG italic_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG into four subbands. Specifically, we encode the rows and columns of the input separately into one low-frequency band f~iLLH8×W8×4superscriptsubscript~𝑓𝑖𝐿𝐿superscript𝐻8𝑊84\widetilde{f}_{i}^{LL}\in\mathbb{R}^{\frac{H}{8}\times\frac{W}{8}\times 4}over~ start_ARG italic_f end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L italic_L end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT divide start_ARG italic_H end_ARG start_ARG 8 end_ARG × divide start_ARG italic_W end_ARG start_ARG 8 end_ARG × 4 end_POSTSUPERSCRIPT and three high-frequency bands (f~iLH,f~iHL,f~iHH)H8,W8,4superscriptsubscript~𝑓𝑖𝐿𝐻superscriptsubscript~𝑓𝑖𝐻𝐿superscriptsubscript~𝑓𝑖𝐻𝐻superscript𝐻8𝑊84(\widetilde{f}_{i}^{LH},\widetilde{f}_{i}^{HL},\widetilde{f}_{i}^{HH})\in% \mathbb{R}^{\frac{H}{8},\frac{W}{8},4}( over~ start_ARG italic_f end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L italic_H end_POSTSUPERSCRIPT , over~ start_ARG italic_f end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_H italic_L end_POSTSUPERSCRIPT , over~ start_ARG italic_f end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_H italic_H end_POSTSUPERSCRIPT ) ∈ blackboard_R start_POSTSUPERSCRIPT divide start_ARG italic_H end_ARG start_ARG 8 end_ARG , divide start_ARG italic_W end_ARG start_ARG 8 end_ARG , 4 end_POSTSUPERSCRIPT, with the low-filter fL=(12,12)subscript𝑓𝐿1212f_{L}=(\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}})italic_f start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT = ( divide start_ARG 1 end_ARG start_ARG square-root start_ARG 2 end_ARG end_ARG , divide start_ARG 1 end_ARG start_ARG square-root start_ARG 2 end_ARG end_ARG ) and the high-filter fH=(12,12)subscript𝑓𝐻1212f_{H}=(\frac{1}{\sqrt{2}},-\frac{1}{\sqrt{2}})italic_f start_POSTSUBSCRIPT italic_H end_POSTSUBSCRIPT = ( divide start_ARG 1 end_ARG start_ARG square-root start_ARG 2 end_ARG end_ARG , - divide start_ARG 1 end_ARG start_ARG square-root start_ARG 2 end_ARG end_ARG ). In this state, the low-frequency band retains coarse-grained information while the high-frequency band retains fine-grained information. In other words, it is easier to capture the mutation signal, so as to filter the noise information. We concatenate the four-subband features along channel dimension to acquire wavelet features Fi~=[f^iLL,f^iLH,f^iHL,f^iHH]H8×W8×16~subscript𝐹𝑖superscriptsubscript^𝑓𝑖𝐿𝐿superscriptsubscript^𝑓𝑖𝐿𝐻superscriptsubscript^𝑓𝑖𝐻𝐿superscriptsubscript^𝑓𝑖𝐻𝐻superscript𝐻8𝑊816\widetilde{F_{i}}=[\hat{f}_{i}^{LL},\hat{f}_{i}^{LH},\hat{f}_{i}^{HL},\hat{f}_% {i}^{HH}]\in\mathbb{R}^{\frac{H}{8}\times\frac{W}{8}\times 16}over~ start_ARG italic_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG = [ over^ start_ARG italic_f end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L italic_L end_POSTSUPERSCRIPT , over^ start_ARG italic_f end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L italic_H end_POSTSUPERSCRIPT , over^ start_ARG italic_f end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_H italic_L end_POSTSUPERSCRIPT , over^ start_ARG italic_f end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_H italic_H end_POSTSUPERSCRIPT ] ∈ blackboard_R start_POSTSUPERSCRIPT divide start_ARG italic_H end_ARG start_ARG 8 end_ARG × divide start_ARG italic_W end_ARG start_ARG 8 end_ARG × 16 end_POSTSUPERSCRIPT. Next, we perform wave-attention Attω𝐴𝑡subscript𝑡𝜔Att_{\omega}italic_A italic_t italic_t start_POSTSUBSCRIPT italic_ω end_POSTSUBSCRIPT to query informative features in the wavelet features. Concretely, we employ F^isubscript^𝐹𝑖\hat{F}_{i}over^ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT as a Query and Fi~~subscript𝐹𝑖\widetilde{F_{i}}over~ start_ARG italic_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG as a Key/Value given by

Fatt=Attω(Fi^,Fi~)=σ(Fi^Wq(Fi~Wk)TCi)Fi~Wv.subscript𝐹𝑎𝑡𝑡𝐴𝑡subscript𝑡𝜔^subscript𝐹𝑖~subscript𝐹𝑖𝜎^subscript𝐹𝑖superscript𝑊𝑞superscript~subscript𝐹𝑖superscript𝑊𝑘𝑇subscript𝐶𝑖~subscript𝐹𝑖superscript𝑊𝑣F_{att}=Att_{\omega}(\hat{F_{i}},\widetilde{F_{i}})=\sigma(\frac{\hat{F_{i}}W^% {q}(\widetilde{F_{i}}W^{k})^{T}}{\sqrt{C_{i}}})\widetilde{F_{i}}W^{v}.italic_F start_POSTSUBSCRIPT italic_a italic_t italic_t end_POSTSUBSCRIPT = italic_A italic_t italic_t start_POSTSUBSCRIPT italic_ω end_POSTSUBSCRIPT ( over^ start_ARG italic_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG , over~ start_ARG italic_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG ) = italic_σ ( divide start_ARG over^ start_ARG italic_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG italic_W start_POSTSUPERSCRIPT italic_q end_POSTSUPERSCRIPT ( over~ start_ARG italic_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG italic_W start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_ARG start_ARG square-root start_ARG italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG end_ARG ) over~ start_ARG italic_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG italic_W start_POSTSUPERSCRIPT italic_v end_POSTSUPERSCRIPT . (2)

Finally, we leverage the IDWT (inverse DWT) to convert Fi~~subscript𝐹𝑖\widetilde{F_{i}}over~ start_ARG italic_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG back to Fi^^subscript𝐹𝑖\hat{F_{i}}over^ start_ARG italic_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG and integrate this converted Fi^^subscript𝐹𝑖\hat{F_{i}}over^ start_ARG italic_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG and Fattsubscript𝐹𝑎𝑡𝑡F_{att}italic_F start_POSTSUBSCRIPT italic_a italic_t italic_t end_POSTSUBSCRIPT to obtain denoise features FoutH16×W16×16subscript𝐹𝑜𝑢𝑡superscript𝐻16𝑊1616F_{out}\in\mathbb{R}^{\frac{H}{16}\times\frac{W}{16}\times 16}italic_F start_POSTSUBSCRIPT italic_o italic_u italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT divide start_ARG italic_H end_ARG start_ARG 16 end_ARG × divide start_ARG italic_W end_ARG start_ARG 16 end_ARG × 16 end_POSTSUPERSCRIPT by

Fout=MLP(Concat(Fatt,Fi^)),subscript𝐹𝑜𝑢𝑡𝑀𝐿𝑃𝐶𝑜𝑛𝑐𝑎𝑡subscript𝐹𝑎𝑡𝑡^subscript𝐹𝑖F_{out}=MLP(Concat(F_{att},\hat{F_{i}})),italic_F start_POSTSUBSCRIPT italic_o italic_u italic_t end_POSTSUBSCRIPT = italic_M italic_L italic_P ( italic_C italic_o italic_n italic_c italic_a italic_t ( italic_F start_POSTSUBSCRIPT italic_a italic_t italic_t end_POSTSUBSCRIPT , over^ start_ARG italic_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG ) ) , (3)

where Foutsubscript𝐹𝑜𝑢𝑡F_{out}italic_F start_POSTSUBSCRIPT italic_o italic_u italic_t end_POSTSUBSCRIPT preserves informative features and restrains redundant mutation noise in the frequency domain.

3.3 Adaptive Fusion

Following the incorporation of image depth features within the DGWA module, we propose the Adaptive Fusion technique to combine point cloud attributes with robust image features enriched with depth information. Specifically, different types of noise affect LiDAR and images to different degrees, which raises a corruption imbalance problem. Therefore, considering the distinct influences of various noises on LiDAR and camera, we employ self-attention to re-weight the fused features adaptively as shown in Fig. 4. The corruption degree of modality-specificity is dynamic, and self-attention mechanism allows adaptive re-weighting features to enhance informative features and suppress redundant noise.

4 Experiments

4.1 Datasets

We perform experiments on both the clean public benchmarks (KITTI Geiger et al. (2012) and nuScenes Caesar et al. (2020)) and the noisy public benchmarks (KITTI-CDong et al. (2023) and nuScenes-C Dong et al. (2023)).

4.1.1 KITTI

The KITTI dataset provides synchronized LiDAR point clouds and front-view camera images, consists of 3,712 training samples, 3,769 validation samples and 7,518 test samples. The standard evaluation metric for object detection is the mean Average Precision (mAP), computed using recall at 40 positions (R40).

4.1.2 nuScenes

The nuScenes dataset is a large-scale 3D detection benchmark consisting of 700 training scenes, 150 validation scenes, and 150 testing scenes. The data are collected using six multi-view cameras and a 32-channel LiDAR sensor. It includes 360-degree object annotations for 10 object classes. To evaluate the detection performance, the primary metrics used are the mean Average Precision (mAP) and the nuScenes detection score (NDS).

4.1.3 KITTI-C and nuScenes-C

In terms of data robustness, Dong et al. (2023) has designed 27 types of common corruptions for both LiDAR and camera, with the aim of benchmarking the corruption robustness of existing 3D object detectors. Dong et al. (2023) has established corruption robustness benchmarks 333https://github.com/thu-ml/3D_Corruptions_AD, including KITTI-C and nuScenes-C, by synthesizing corruptions on public datasets. Specifically, we utilize KITTI-C and nuScenes-C in our work. It is worth noting that Ref. Dong et al. (2023) has only added noise to the validation dataset and kept the train and test datasets clear.

Table 1: Comparison with SOTA methods on KITTI validation and test sets for car class with AP of R40subscript𝑅40R_{40}italic_R start_POSTSUBSCRIPT 40 end_POSTSUBSCRIPT.
Method AP(%)3D{}_{3D}(\%)start_FLOATSUBSCRIPT 3 italic_D end_FLOATSUBSCRIPT ( % ) (validation set) AP(%)3D{}_{3D}(\%)start_FLOATSUBSCRIPT 3 italic_D end_FLOATSUBSCRIPT ( % ) (test set)
mAP Easy Mod. Hard mAP Easy Mod. Hard
Voxel R-CNN 86.84 92.38 85.29 82.86 83.19 90.90 81.62 77.06
VFF 86.91 92.31 85.51 82.92 83.62 89.50 82.09 79.29
CAT-Det 83.58 90.12 81.46 79.15 82.62 89.87 81.32 76.68
LoGoNet 87.13 92.04 85.04 84.31 85.87 91.80 85.06 80.74
Focals Conv-F - - - - 83.47 90.55 82.28 77.59
Baseline* 86.75 92.05 85.51 82.70 - - - -
RoboFusion-L 88.87 93.30 88.04 85.27 85.58 91.75 84.08 80.71
RoboFusion-B 88.45 93.22 87.87 84.27 85.32 91.98 83.76 80.23
RoboFusion-T 88.08 93.28 87.60 83.36 85.09 91.68 83.70 79.89
  • 1

    \ast denotes our reproduced results based on the officially released codes.

Table 2: Comparison with SOTA methods on nuScenes validation and test sets.
Method LiDAR Camera validation set test set
NDS mAP NDS mAP
FUTR3D VoxelNet ResNet-101 68.3 64.5 - -
BEVFusion-mit VoxelNet Swin-T 71.4 68.5 72.9 70.2
DeepInteraction VoxelNet ResNet-50 72.6 69.9 73.4 70.8
CMT VoxelNet ResNet-50 72.9 70.3 74.1 72.0
SparseFusion VoxelNet ResNet-50 72.8 70.4 73.8 72.0
TransFusion VoxelNet ResNet-50 71.3 67.5 71.6 68.9
Baseline* VoxelNet ResNet-50 70.8 67.3 - -
RoboFusion-L VoxelNet SAM 72.1 69.9 72.0 69.9
RoboFusion-B VoxelNet FastSAM 71.9 69.4 71.8 69.4
RoboFusion-T VoxelNet MobileSAM 71.3 69.1 71.5 69.1
  • 1

    \ast denotes our reproduced results based on the officially released codes.

Table 3: Comparison with SOTA methods on KITTI-C validation set. The results are evaluated based on the car class with AP of R40subscript𝑅40R_{40}italic_R start_POSTSUBSCRIPT 40 end_POSTSUBSCRIPT at moderate difficulty. ‘S.L.’, ‘D.’, ‘C.O.’, and ‘C.T.’ denotes Strong Sunlight, Density, Cutout, and Crosstalk, respectively.
Rai Clean Weather Sensor
mAP Snow Rain Fog S.L. D. C.O. C.T.
SECOND 81.59 64.33 52.34 52.55 74.10 78.32 80.18 73.59 80.24
PointPillars 78.41 49.80 36.47 36.18 64.28 62.28 76.49 70.28 70.85
PointRCNN 80.57 59.14 50.36 51.27 72.14 62.78 80.35 73.94 71.53
PV-RCNN 84.39 65.83 52.35 51.58 79.47 79.91 82.79 76.09 82.34
SMOKE 7.09 4.51 2.47 3.94 5.63 6.00 - - -
ImVoxelNet 11.49 3.22 0.22 1.24 1.34 10.08 - - -
EPNet 82.72 46.21 34.58 36.27 44.35 69.65 82.09 76.10 82.10
Focals Conv-F 85.88 50.40 34.77 41.30 44.55 80.97 84.95 78.06 85.82
LoGoNet* 85.04 62.58 51.45 55.80 67.53 75.54 83.68 77.17 82.00
RoboFusion-L 88.04 85.70 85.29 86.48 85.53 85.50 85.71 83.17 84.12
RoboFusion-B 87.87 84.70 84.11 85.54 84.00 85.15 84.34 81.30 82.45
RoboFusion-T 87.60 84.60 84.67 84.79 84.17 84.75 84.11 81.21 83.07
  • 1

    : Results from Ref. Dong et al. (2023).

  • 2

    * denotes re-implement result.

Table 4: Comparison with SOTA methods on nuScenes-C validation set with mAP. ‘S.L.’, ‘D.’, ‘C.O.’, and ‘C.T.’ denotes Strong Sunlight, Density, Cutout, and Crosstalk, respectively.
Method Clean Weather Sensor
mAP Snow Rain Fog S.L. D. C.O. C.T.
PointPillars 27.69 25.87 27.57 27.71 24.49 23.71 27.27 24.14 25.92
SSN 46.65 43.70 46.38 46.50 41.64 40.28 46.14 40.95 44.08
CenterPoint 59.28 52.49 55.90 56.08 43.78 54.20 58.60 56.28 56.64
FCOS3D 23.86 11.44 2.01 13.00 13.53 17.20 - - -
PGD 23.19 12.85 2.30 13.51 12.83 22.77 - - -
DETR3D 34.71 22.00 5.08 20.39 27.89 34.66 - - -
BEVFormer 41.65 26.29 5.73 24.97 32.76 41.68 - -
FUTR3D 64.17 55.50 52.73 58.40 53.19 57.70 63.72 62.25 62.66
TransFusion 66.38 58.87 63.30 63.35 53.67 55.14 65.77 63.66 64.67
BEVFusion 68.45 61.87 62.84 66.13 54.10 64.42 67.79 66.18 67.32
DeepInteraction 69.90 62.14 62.36 66.48 54.79 64.93 68.15 66.23 68.12
CMT 70.28 63.46 62.56 61.44 66.26 63.59 69.65 68.70 68.26
RoboFusion-L 69.91 67.24 67.12 67.58 67.01 67.24 69.48 69.18 68.68
RoboFusion-B 69.40 66.33 66.07 67.01 65.54 66.71 69.02 69.01 68.04
RoboFusion-T 69.09 65.82 65.96 66.45 64.34 66.54 68.58 68.20 68.17
  • 1

    : Results from Ref. Dong et al. (2023).

  • 2

    * denotes re-implement result.

Table 5: Performance of different VFMs on RoboFusion. ‘RCE’ denotes Relative Corruption Error Dong et al. (2023). ‘mAP (Weather)’ denotes the average value across four types of weather corruptions, Snow, Rain, Fog, and Strong Sunlight.
Method Model Size FPS (A100) mAP (Weather ) mAP (Clean) RCE (%)
RoboFusion-L 97.54M 3.1 67.24 69.91 0.04
RoboFusion-B 81.01M 3.5 66.33 69.40 0.04
RoboFusion-T 13.94M 6.0 65.82 69.09 0.05
DeepInteraction 57.82M 4.9 62.14 69.90 0.10
TransFusion 36.96M 6.2 58.37 66.38 0.12

4.2 Experimental Settings

4.2.1 Network Architecture.

Our RoboFusion consists of three variants: RoboFusion-L, RoboFusion-B, and RoboFusion-T, which utilize the models SAM-B Kirillov et al. (2023), FastSAM Zhao et al. (2023), and MobileSAM Zhang et al. (2023a), respectively. It is noteworthy that due to the convolutional operations of FastSAM in RoboFusion-B which is capable of generating multi-scale features, the AD-FPN module is not employed. Since KITTI and nuScenes are distinct datasets with varying evaluation metrics and characteristics, we provide a detailed description of our RoboFusion settings for each dataset.

RoboFusion in KITTI and KITTI-C. We validate our RoboFusion on the KITTI dataset using Focals Conv Chen et al. (2022) as the baseline. The input voxel size is set to (0.05m, 0.05m, 0.1m), with anchor sizes for cars at [3.9, 1.6, 1.56] and anchor rotations at [0, 1.57]. We adopt the same data augmentation solution as Focals Conv-F.

RoboFusion with nuScenes and nuScenes-C. We validate our RoboFusion on the nuScenes dataset using TransFusion Bai et al. (2022) as the baseline. The detection range for the X and Y axis is set at [-54m, 54m] and [-5m, 3m] for the Z axis. The input voxel size is set at (0.075m, 0.075m, 0.2m), and the maximum number of point clouds contained in each voxel is set to 10. It is noteworthy that the Adaptive Fusion module is applied exclusively to Focals Conv rather than TransFusion, while TransFusion uses its own fusion module.

4.2.2 Training and Testing Details.

Our RoboFusion is meticulously trained from scratch using the Adam optimizer and incorporates several foundation models as image encoders including SAM, FastSAM and MobileSAM. To enable effective training on the KITTI and nuScenes datasets, we utilize 8 NVIDIA A100 GPUs for network training. Additionally, the runtime is evaluated on an NVIDIA A100 GPU. Specifically, for KITTI, our RoboFusion based on Focals ConvChen et al. (2022) involves training for 80 epochs. For nuScenes, our RoboFusion based on TransFusion Bai et al. (2022) has 20 epochs of training. During the model inference stage, we employ a non-maximal suppression (NMS) operation in the Region Proposal Network (RPN) with an IoU threshold of 0.7. We select the top 100 region proposals to serve as inputs for the detection head. After refinement, we apply NMS again with an IoU threshold of 0.1 to eliminate redundant predictions. For additional details regarding our method, please refer to OpenPCDet 444https://github.com/open-mmlab/OpenPCDet .

4.3 Comparing with state-of-the-art

We conduct evaluations on the clean datasets KITTI and nuScenes, as well as the noisy datasets KITTI-C and nuScenes-C. While SOTA methods are primarily focused on achieving high accuracy, we place greater emphasis on the robustness and generalization of the methods. These factors are crucial for the practical deployment of 3D object detection in AD scenarios, making the evaluation on the noisy datasets more important in our perspective.

4.3.1 Results on the clean benchmark.

As shown in Table 1, we compare our RoboFusion with SOTA methods, including Voxel R-CNN Deng et al. (2021), VFF Li et al. (2022b), CAT-DetZhang et al. (2022) , Focals Conv-F Chen et al. (2022), and LoGoNet Li et al. (2023) on the KITTI validation and test sets. As shown in Table 2, we also compare our RoboFusion with SOTA methods, including FUTR3D Chen et al. (2023), TransFusion Bai et al. (2022), BEVFusion Liu et al. (2023), DeeepInteraction Yang et al. (2022), CMT Yan et al. (2023) and SparseFusion Xie et al. (2023), on the nuScenes test and validation sets. Our RoboFusion has achieved SOTA performance on the clean benchmarks (KITTI and nuScenes).

4.3.2 Results on the noisy benchmark.

In the real-world AD scenarios, the distribution of data often differs from that of training or testing data, as shown in Fig. 1 (a). Specifically, Ref. Dong et al. (2023) provides a novel noisy benchmark that includes KITTI-C and nuScenes-C, which we primarily use to evaluate the weather and sensor noise corruptions, including rain, snow, fog, and strong sunlight, density, cutout, and so on. In addition, comparisons of our RoboFusion with SOTA methods in other settings are presented in the Appendix 555https://arxiv.org/abs/2401.03907.

As shown in Table 3, SOTA methods, including SECOND Yan et al. (2018), PointPillars Lang et al. (2019), PointRCNN Shi et al. (2019), PV-RCNN Shi et al. (2020), SMOKE Liu et al. (2020b), ImVoxelNet Rukhovich et al. (2022), EpNetHuang et al. (2020), Focals Conv-FChen et al. (2022), and LoGoNetLi et al. (2023), experience a significant decrease in performance on the noisy scenarios, particularly for weather conditions such as snow and rain. It can be attributed to the fact that the ‘clean’ KITTI dataset does not include examples in snowy or rainy weather. On the other hand, VFMs like SAM-AD have been trained on a diverse range of data and exhibit robustness and generalization to OOD scenarios, leading to higher performance on our RoboFusion metric. Furthermore, multi-modal methods like LoGoNet, and Focals Conv-F demonstrate better robustness and generalization in sensor noise scenarios, while LiDAR-only methods like PV-RCNN Shi et al. (2020) are more robust in weather noise scenarios. This observation motivates our research on adaptive fusion schemes for point cloud and image features. Overall, in the KITTI-C Dong et al. (2023) dataset, our RoboFusion’s performance is nearly on par with the clean scene, indicating high level of robustness and generalization.

As shown in Table 4, SOTA methods including PointPillars Lang et al. (2019), SSN Zhu et al. (2020), CenterPoint Yin et al. (2021), FCOS3D Wang et al. (2021), PGD Wang et al. (2022a), DETR3D Wang et al. (2022b), BEVFormer Li et al. (2022c), FUTR3D Chen et al. (2023), TransFusion Bai et al. (2022), BEVFusionLiu et al. (2023), DeepInteraction Yang et al. (2022) and CMT Yan et al. (2023) in nuScenes-C show relatively higher robustness than in KITTI-C when faced with weather noise. However, BEVFusion performs well in the presence of snow, rain, and strong sunlight noise but experiences a significant performance drop in foggy scenarios. In contrast, our method exhibits strong robustness and generalization in both weather and sensor noise scenarios in nuScenes-C.

Table 6: Impacts of different SAM usages on KITTI and KITTI-C validation sets for car class with AP of R40. ‘S.L.’ denotes Strong Sunlight.
Solution AP(%)3D{}_{3D}(\%)start_FLOATSUBSCRIPT 3 italic_D end_FLOATSUBSCRIPT ( % ) AP(%)Weather{}_{Weather}(\%)start_FLOATSUBSCRIPT italic_W italic_e italic_a italic_t italic_h italic_e italic_r end_FLOATSUBSCRIPT ( % )
mAP Easy Mod. Hard Snow Rain Fog S.L.
Offline 80.41 88.76 77.38 75.11 - - - -
No optim 86.45 91.86 84.80 82.71 45.11 47.77 63.10 79.21
Optim 88.00 92.41 86.77 84.81 57.43 54.27 68.81 82.07
Table 7: Influence of pre-training on SAM at KITTI-C validation set for car class with AP of R40 at moderate difficulty. ‘S.L.’, ‘D.’, ‘C.O.’, and ‘C.T.’ denotes Strong Sunlight, Density, Cutout, and Crosstalk, respectively.
VFM Weather Sensor
Snow Rain Fog S.L. D. C.O. C.T.
SAM 57.43 54.27 68.81 82.07 84.21 83.04 84.06
SAM-AD 80.68 81.68 81.67 83.48 84.71 84.17 84.12
Table 8: Roles of SAM3DFusion-L modules on KITTI-C validation set for car class with AP of R40 at moderate difficulty. ‘A.F.’ denotes Adaptive Fusion module. ‘S.L.’ denotes strong sunlight.
Method SAM-AD AD-FPN DGWA A.F. Snow Rain Fog S.L. FPS(A100)
a) 34.77 41.30 44.55 80.97 10.8
b) 80.68 81.68 81.67 83.48 4.0
c) 82.32 83.60 82.39 83.98 3.6
d) 83.99 85.63 84.01 84.81 3.4
e) 85.29 86.48 85.53 85.50 3.1

4.4 Ablation Study

4.4.1 Performance of Different VFMs on RoboFusion.

In order to analyze the noise robustness and FPS performance of different-sized VFMs, SAM, FastSAM and MobileSAM, we conduct comparative experiments of RoboFusion-L, RoboFusion-B and RoboFusion-T with SOTA methods, DeepInteraction Yang et al. (2022) and TransFusion Bai et al. (2022), on the nuScenes-C Dong et al. (2023) validation set, as shown in Table 5. Specifically, our RoboFusion exhibits remarkable robustness to weather noise scenarios. Furthermore, our RoboFusion-T has a similar FPS to TransFusion Bai et al. (2022). Overall, we have presented a viable application of SAM in 3D object detection tasks.

4.4.2 Impacts of Different SAM usages.

As shown in Table 6, our RoboFusion-L is experimented upon. Specifically, the first row is the offline usage, which involves loading pre-saved image features during training. It implies that certain online data augmentation cannot be utilized. The second (No optim) and the third (Optim) rows are online usages, where the former omits fine-tuning and keeps the model parameters fixed, the latter follows fine-tuning and updating. Therefore, offline usage perform worse than online usages. Additionally, fine-tuning the weights of SAM has demonstrated superior performance, resulting in a performance improvement in the presence of snow, rain, and fog noise scenarios.

4.4.3 Influence of Pre-training on SAM.

As shown in Table 7, to investigate the scientific value of pre-trained VFMs like SAM, FastSAM, and MobileSAM in AD scenarios, we conduct our RoboFusion-L with SAM evaluation on SAM and SAM-AD. Through pre-training, SAM-AD has gained a better understanding of AD scenarios than the original SAM. The pre-training strategy effectively improves the performance of our RoboFusion, demonstrating a significant improvement in the snow, rain, and fog noise scenarios.

4.4.4 Roles of Different Modules in RoboFusion.

As shown in Table 8, we present ablation experiments for different modules of our RoboFusion-L, built upon SAM-AD, including AD-FPN, DGWA, and Adaptive Fusion. Leveraging the strong capabilities of SAM-AD in AD scenarios, SAM-AD has a significant improvement from baseline Focals Conv Chen et al. (2022) (34.77%, 41.30%, 44.55%, 80.97%) to (80.68%, 81.68%, 81.67%, 83.48%). Subsequently, AD-FPN, DGWA, and Adaptive Fusion achieve even higher performance on the foundation of SAM-AD. This further highlights the substantial contributions of diverse modules within our RoboFusion framework in addressing OOD noise scenarios in AD.

5 Conclusions

In this work, we propose a robust framework RoboFusion to enhance the robustness and generalization of multi-modal 3D object detectors using VFMs like SAM, FastSAM, and MobileSAM. Specifically, we pre-train SAM for AD scenarios, yielding SAM-AD. To align SAM or SAM-AD with multi-modal 3D object detectors, we introduce AD-FPN for feature upsampling. To further mitigate noise and weather interference, we apply wavelet decomposition for depth-guided image denoising. Subsequently, we utilize self-attention mechanisms to adaptively reweight fused features, enhancing informative attributes and suppressing excess noises. Extensive experiments demonstrate that our RoboFusion effectively integrates VFMs to boost feature robustness and address OOD noise challenges. We anticipate this work to lay a strong foundation for future research on building robust and dependable foundation AD models.

Limitation and Future Work. First, RoboFusion has a heavy reliance on the representation capability of VFMs. This raises the baseline models’ generalization ability, but increases their complexities. Second, the inference speed of RoboFusion-L and RoboFusion-B is relatively slow due to the limitations of SAM and FastSAM. However, the inference speed of RoboFusion-T is competitive with some SOTA methods (e.g. TransFusion) without VFMs. In the future, for improving the real-time application ability of VFMs, we will attempt to incorporate SAM only in the training phase to guide a fast-speed student model, meanwhile explore more noise scenarios.

Acknowledgments

This work was supported by the Fundamental Research Funds for the Central Universities (2023YJS019), the National Key R&D Program of China (2018AAA0100302).

Appendix A Appendix

A.1 Broader Impacts

Our work aims to develop a robust framework to address out-of-distribution (OOD) noise scenarios in autonomous driving (AD). To the best of our knowledge, RoboFusion is the first method that leverages the generalization capabilities of visual foundation models (VFMs) like SAM Kirillov et al. (2023), FastSAM Zhao et al. (2023), and MobileSAM Zhang et al. (2023a) for multi-modal 3D object detection. Although existing multi-model 3D object detection methods achieve the state-of-the-art (SOTA) performance of ‘clean’ datasets, they overlook the robustness of real-world scenariosSong et al. (2024a). Therefore, we believe it is valuable to combine VFMs and multi-modal 3D object detection to mitigate the impact of OOD noise scenarios.

A.2 More Results

A.2.1 Specific classes AP on the nuScenes-C validation set.

As shown in Table 9, we present a comparison of Specific classes AP between TransFusion and our RoboFusion-L on the nuScenes-C validation set, encompassing scenarios with snow, rain, fog, and strong sunlight noise. It is evident from the results that RoboFusion-L exhibits superior performance.

Table 9: Comparison with TransFusion on nuScenes validation ‘Snow, Rain, Fog, and Strong Sunlight’ noisy scenarios. ‘T.F.’, ‘R.F.’, ‘S.L.’, ‘C.V.’, ‘Motor.’, ‘Ped.’, and ‘T.C.’ are short for TransFusion, RoboFusion-L, Strong Sunlight, construction vehicle, motorcycle, pedestrian, and traffic cone, respectively.
mAP Car Truck C.V. Bus Trailer Barrier Motor. Bike Ped. T.C.
Snow T.F. 63.30 84.55 58.41 25.50 62.31 56.00 70.19 69.98 43.69 84.24 78.16
R.F. 67.12 87.21 60.88 29.47 67.45 58.99 75.12 71.45 48.28 86.23 86.12
+3.82 +5.14 +4.93 +4.59 +7.96
Rain T.F. 63.35 85.37 56.87 25.12 64.65 55.10 71.99 68.21 44.13 83.87 78.14
R.F. 67.58 86.79 60.44 30.21 65.41 58.12 75.47 71.39 50.87 88.91 88.23
+4.23 +5.09 +6.74 +5.04 +10.09
Fog T.F. 53.67 80.23 48.51 18.04 50.69 53.03 62.24 54.53 25.27 80.63 66.48
R.F. 67.01 87.56 59.03 29.36 66.10 57.23 74.33 72.01 49.50 87.08 87.91
+13.34 +10.52 +11.32 +15.41 +12.09 +17.48 +24.23 +21.43
S.L. T.F. 55.14 81.99 48.07 19.78 51.09 52.57 63.68 55.09 26.98 82.68 69.49
R.F. 67.24 87.67 57.74 31.00 64.29 58.94 75.23 70.23 50.82 88.70 87.82
+12.10 +11.22 +13.20 +11.55 +15.14 +23.30 +18.33

A.2.2 Roles of Different Modules in RoboFusion.

To assess the roles of different modules in RoboFusion, we conduct an ablation study on the original SAM rather than SAM-AD, as shown in Table 10, where a) is the results of the baseline Chen et al. (2022), b)-e) shows the performance of our RoboFusion-L under different modules. According to Table 10, SAM and AD-FPN modules significantly improve the performance in OOD noisy scenarios. It is worth noticing that DGWA module significantly improves the performance, especially in snow noisy scenarios. By Table 10, the impact of fog noise on point clouds is relatively minor. But, using A.F. (Adaptive Fusion) module to dynamically aggregate point cloud features and image features exhibits significant enhancements in fog-noise scenarios.

Table 10: Roles of RoboFusion modules on KITTI-C validation set for car class with AP of R40 at moderate difficulty. ‘A.F.’ denotes Adaptive Fusion module. ‘S.L.’ denotes Strong Sunlight.
Method SAM AD-FPN DGWA A.F. Snow Rain Fog S.L.
a) 34.77 41.30 44.55 80.97
b) 57.43 54.27 68.81 82.07
c) 59.81 56.59 69.68 83.20
d) 66.45 58.11 70.53 84.01
e) 68.47 59.07 74.38 84.07
Table 11: Comparison with SOTA methods on KITTI-C validation set. The results are evaluated based on the car class with AP of R40subscript𝑅40R_{40}italic_R start_POSTSUBSCRIPT 40 end_POSTSUBSCRIPT at moderate difficulty. The best one is highlighted in bold.‘S.L.’ denote Strong Sunlight. ‘RCE’ denotes Relative Corruption Error from Ref.Dong et al. (2023).
Corruptions LiDAR-Only Camera-Only LC Fusion
SECOND PointPillars PointRCNN PV-RCNN SMOKE ImVoxelNet EPNet Focals Conv LoGoNet * RoboFusion (Ours)
L B T
None(APcleansubscriptAPclean\text{AP}_{\text{clean}}AP start_POSTSUBSCRIPT clean end_POSTSUBSCRIPT) 81.59 78.41 80.57 84.39 7.09 11.49 82.72 85.88 85.04 88.04 87.87 87.60
Snow 52.34 36.47 50.36 52.35 2.47 0.22 34.58 34.77 51.45 85.29 84.70 84.60
Rain 52.55 36.18 51.27 51.58 3.94 1.24 36.27 41.30 55.80 86.48 85.54 84.79
Fog 74.10 64.28 72.14 79.47 5.63 1.34 44.35 44.55 67.53 85.53 84.00 84.17
Weather S.L. 78.32 62.28 62.78 79.91 6.00 10.08 69.65 80.97 75.54 85.50 85.15 84.75
Density 80.18 76.49 80.35 82.79 - - 82.09 84.95 83.68 85.71 84.34 84.11
Cutout 73.59 70.28 73.94 76.09 - - 76.10 78.06 77.17 83.17 81.30 81.21
Crosstalk 80.24 70.85 71.53 82.34 - - 82.10 85.82 82.00 84.12 82.45 83.07
Gaussian (L) 64.90 74.68 61.20 65.11 - - 60.88 82.14 61.85 76.56 78.32 76.52
Uniform (L) 79.18 77.31 76.39 81.16 - - 79.24 85.81 82.94 85.05 83.04 84.11
Impulse (L) 81.43 78.17 79.78 82.81 - - 81.63 85.01 84.66 85.26 85.06 85.46
Gaussian (C) - - - - 1.56 2.43 80.64 80.97 84.29 82.16 84.63 82.17
Uniform (C) - - - - 2.67 4.85 81.61 83.38 84.45 83.30 85.20 83.30
Sensor Impulse (C) - - - - 1.83 2.13 81.18 80.83 84.20 83.51 84.55 82.91
Moving Obj. 52.69 50.15 50.54 54.60 1.67 5.93 55.78 49.14 14.44 49.30 49.12 49.90
Motion Motion Blur - - - - 3.51 4.19 74.71 81.08 84.52 84.17 84.56 84.18
Local Density 75.10 69.56 74.24 77.63 - - 76.73 80.84 78.63 83.21 82.53 83.22
Local Cutout 68.29 61.80 67.94 72.29 - - 69.92 76.64 64.88 77.22 75.27 76.23
Local Gaussian 72.31 76.58 69.82 70.44 - - 75.76 82.02 55.66 79.02 78.32 78.33
Local Uniform 80.17 78.04 77.67 82.09 - - 81.71 84.69 79.94 84.69 83.70 84.37
Object Local Impulse 81.56 78.43 80.26 84.03 - - 82.21 85.78 84.29 85.26 85.08 85.06
Average(APcor)AveragesubscriptAPcor\textbf{Average}(\text{AP}_{\text{cor}})Average ( AP start_POSTSUBSCRIPT cor end_POSTSUBSCRIPT ) 71.68 66.34 68.76 73.41 3.25 3.60 70.35 74.43 71.89 81.72 81.31 81.12
RCE (%) \downarrow 12.14 15.38 14.65 13.00 54.11 68.65 14.94 13.32 15.46 7.17 7.46 7.38
  • 1

    : Results from Ref. Dong et al. (2023).

  • 2

    * denotes re-implement result.

Table 12: Comparison with SOTA methods on KITTI-C validation set. The results are evaluated based on the car class with AP of R40subscript𝑅40R_{40}italic_R start_POSTSUBSCRIPT 40 end_POSTSUBSCRIPT at easy difficulty. The best one is highlighted in bold. ‘S.L.’ denotes Strong Sunlight. ‘RCE’ denotes Relative Corruption Error from Ref.Dong et al. (2023).
Corruptions Lidar-Only Camera-Only LC Fusion
SECOND PointPillars PointRCNN PV-RCNN SMOKE ImVoxelNet EPNet Focals Conv LoGoNet * RoboFusion (Ours)
L B T
None(APcleansubscriptAPclean\text{AP}_{\text{clean}}AP start_POSTSUBSCRIPT clean end_POSTSUBSCRIPT) 90.53 87.75 91.65 92.10 10.42 17.85 92.29 92.00 92.04 93.30 93.22 93.28
Snow 73.05 55.99 71.93 73.06 3.68 0.30 48.03 53.80 74.24 88.77 88.18 88.31
Rain 73.31 55.17 70.79 72.37 5.66 1.77 50.93 61.44 75.96 88.12 88.57 87.75
Fog 85.58 74.27 85.01 89.21 8.06 2.37 64.83 68.03 86.60 88.96 88.16 88.09
Weather S.L. 88.05 67.42 64.90 87.27 8.75 15.72 81.77 90.03 80.30 89.79 89.23 90.36
Density 90.45 86.86 91.33 91.98 - - 91.89 91.14 91.85 92.90 92.08 92.12
Cutout 81.75 78.90 83.33 83.40 - - 84.17 83.84 84.20 85.94 85.75 84.75
Crosstalk 89.63 78.51 77.38 90.52 - - 91.30 92.01 88.15 91.71 91.54 92.07
Gaussian (L) 73.21 86.24 74.28 74.61 - - 66.99 88.56 64.62 80.96 84.30 83.23
Uniform (L) 89.50 87.49 89.48 90.65 - - 89.70 91.77 90.75 92.89 91.28 91.63
Impulse (L) 90.70 87.75 90.80 91.91 - - 91.44 92.10 91.66 91.90 91.95 92.30
Gaussian (C) - - - - 2.09 3.74 91.62 89.51 91.64 91.94 92.08 91.57
Uniform (C) - - - - 3.81 7.66 91.95 91.20 91.84 92.01 92.14 92.93
Sensor Impulse (C) - - - - 2.57 3.35 91.68 89.90 91.65 91.96 92.04 91.33
Moving Obj. 62.64 58.49 59.29 63.36 2.69 9.63 66.32 54.57 16.83 53.09 51.94 51.70
Motion Motion Blur - - - - 5.39 6.75 89.65 91.56 91.96 91.99 92.09 92.06
Local Density 87.74 82.90 88.37 89.60 - - 89.40 89.60 89.00 92.02 92.42 92.42
Local Cutout 81.29 75.22 83.30 84.38 - - 82.40 85.55 77.57 87.30 87.49 87.79
Local Gaussian 82.05 87.69 82.44 77.89 - - 85.72 89.78 60.03 89.56 89.41 89.62
Local Uniform 90.11 87.83 89.30 90.63 - - 91.32 91.88 88.51 91.59 91.53 91.75
Object Local Impulse 90.58 87.84 90.60 91.91 - - 91.67 92.02 91.34 92.09 91.97 90.69
Average(APcor)AveragesubscriptAPcor\textbf{Average}(\text{AP}_{\text{cor}})Average ( AP start_POSTSUBSCRIPT cor end_POSTSUBSCRIPT ) 83.10 77.41 80.78 83.92 4.74 5.69 81.63 83.91 80.93 88.27 88.20 88.12
RCE(%)\downarrow 8.20 11.78 11.85 8.87 54.46 68.07 11.54 8.78 12.07 5.39 5.39 5.53
  • 1

    : Results from Ref. Dong et al. (2023).

  • 2

    * denotes re-implement result.

Table 13: Comparison with SOTA methods on KITTI-C validation set. The results are evaluated based on the car class with AP of R40subscript𝑅40R_{40}italic_R start_POSTSUBSCRIPT 40 end_POSTSUBSCRIPT at hard difficulty. The best one is hightlighted in bold. ‘S.L.’ denotes Strong Sunlight. ‘RCE’ denotes Relative Corruption Error from Ref.Dong et al. (2023).
Corruptions Lidar-Only Camera-Only LC Fusion
SECOND PointPillars PointRCNN PV-RCNN SMOKE ImVoxelNet EPNet Focals Conv LoGoNet * RoboFusion (Ours)
L B T
None(APcleansubscriptAPclean\text{AP}_{\text{clean}}AP start_POSTSUBSCRIPT clean end_POSTSUBSCRIPT) 78.57 75.19 78.06 82.49 5.57 9.20 80.16 83.36 84.31 85.27 84.27 83.36
Snow 48.62 32.96 45.41 48.62 1.92 0.20 32.39 30.41 45.57 64.26 62.49 62.74
Rain 48.79 32.65 45.78 48.20 3.16 0.99 34.69 35.71 50.12 66.07 64.89 63.18
Fog 68.93 58.19 68.05 75.05 4.56 1.03 38.12 39.50 60.47 80.03 78.37 77.29
Weather S.L. 74.62 58.69 61.11 78.02 4.91 8.24 66.43 78.06 73.62 80.02 77.52 81.61
Density 77.04 72.85 77.58 81.15 - - 79.77 82.38 81.98 83.06 83.03 83.05
Cutout 70.79 67.32 71.57 74.60 - - 73.95 76.69 76.18 76.96 77.00 77.38
Crosstalk 76.92 67.51 69.41 80.98 - - 79.54 83.22 80.36 82.94 83.22 83.08
Gaussian (L) 61.09 71.12 56.73 62.70 - - 56.88 77.15 59.98 74.45 75.03 73.81
Uniform (L) 75.61 74.09 72.25 78.93 - - 75.92 81.62 80.68 81.74 81.79 82.44
Impulse (L) 78.33 74.65 76.88 81.79 - - 79.14 83.28 82.51 83.13 83.16 83.24
Gaussian (C) - - - - 1.18 1.96 78.20 79.01 82.22 82.86 83.05 81.32
Uniform (C) - - - - 2.19 3.90 79.14 81.39 82.37 83.22 83.03 82.06
Sensor Impulse (C) - - - - 1.52 1.71 78.51 78.87 82.16 82.75 83.00 81.59
Moving Obj. 48.02 45.47 46.23 50.75 1.40 4.63 50.97 45.34 13.66 43.56 42.62 42.89
Motion Motion Blur - - - - 2.95 3.32 72.49 77.75 82.50 83.12 83.06 82.92
Local Density 71.45 65.70 71.09 75.39 - - 74.36 77.30 76.83 81.71 81.24 81.15
Local Cutout 63.25 56.69 63.50 68.58 - - 66.53 72.40 60.62 71.95 72.07 73.78
Local Gaussian 68.16 73.11 65.65 68.03 - - 72.71 78.52 54.02 76.38 76.41 76.26
Local Uniform 76.67 74.68 74.37 80.17 - - 78.85 81.99 77.44 82.04 82.06 82.33
Object Local Impulse 78.47 75.18 77.38 82.33 - - 79.79 83.20 82.21 82.99 83.16 82.99
Average(APcor)AveragesubscriptAPcor\textbf{Average}(\text{AP}_{\text{cor}})Average ( AP start_POSTSUBSCRIPT cor end_POSTSUBSCRIPT ) 67.92 62.55 65.18 70.95 2.64 2.88 67.41 71.18 69.27 77.16 76.81 76.75
RCE(%)\downarrow 13.55 16.80 16.49 13.98 52.54 68.62 15.89 14.59 17.83 9.51 9.71 7.93
  • 1

    : Results from Ref. Dong et al. (2023).

  • 2

    * denotes re-implement result.

A.2.3 More Results on the KITTI-C validation set.

Besides the experimental results mentioned in the main text, we test our RoboFusion on KITTI-C and nuScenes-C Dong et al. (2023) to extend our work to a wider range of noise scenarios, including Gaussian, Uniform, Impulse, Moving Object, Motion Blur, Local Density, Local Cutout, Local Gaussian, Local Uniform, and Local Impulse, as shown in Tables 11, 12, and 13. From these Tables, compared with LiDAR-only methods including SECOND Yan et al. (2018), PointPillars Lang et al. (2019), PointRCNN Shi et al. (2019) and PV-RCNN Shi et al. (2020), Camera-Only methods including Smoke Liu et al. (2020b), ImVoxelNet Rukhovich et al. (2022), and multi-modal methods including EPNet Huang et al. (2020), Focals Conv Chen et al. (2022), and LoGoNet Li et al. (2023), our RoboFusion-L, RoboFusion-B, and RoboFusion-T consistently outperform across various noise scenarios and achieve the best overall performance. Overall, our RoboFusion demonstrates superior performance in weather-noisy (i.e. Snow, Rain, Fog, and Strong Sunlight) scenarios and exhibits better results across a broader range of scenarios, which shows remarkable robustness and generalizability.

A.2.4 Performance Comparison Analysis with the LoGoNet.

In addition, to provide a clearer analysis of performance across different noise scenarios, we present a more detailed comparative study of our RoboFusion-L and LoGoNet Li et al. (2023) on the KITTI-C validation dataset, as shown in Table 14. It is worth noting that LoGoNet is a SOTA multi-modal 3D detector known for its exceptional robustness and high accuracy. Dong et al. (2023) provides noise at varying levels, with the KITTI-C dataset including 5 severities. It is evident that our method demonstrates a high degree of robustness, exhibiting the most stable results with the variance of noise severities. For instance, when considering snow conditions, the performance of our RoboFusion-L shows a marginal variation from 86.69% to 83.67% across severities from 1 to 5. In contrast, LoGoNet’s performance drops from 55.07% to 45.02% over the same severity range. Furthermore, in the presence of moving object noise, our method outperforms LoGoNet. In summary, our RoboFusion exhibits remarkable robustness and generalization capabilities, making it well-suited to diverse noise scenarios.

Table 14: Performance comparison of our RoboFusion-L with LoGoNet on KITTI-C with 5 noise severities. The results are reported based on the car with AP of R40subscript𝑅40R_{40}italic_R start_POSTSUBSCRIPT 40 end_POSTSUBSCRIPT at moderate difficulty. ‘S.L.’ denotes Strong Sunlight. The better one is marked in bold.
Severity
Corruptions 1 2 3 4 5 APssubscriptAP𝑠\text{AP}_{s}AP start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT
Snow 55.07 / 86.69 52.98 / 86.55 53.08 / 85.94 51.14 / 83.61 45.02 / 83.67 51.45 / 85.29
Rain 57.29 / 87.84 56.90 / 87.75 56.76 / 86.49 55.05 / 85.24 53.01 / 85.07 55.80 / 86.48
Fog 75.93 / 87.31 69.69 / 86.58 64.77 / 84.71 64.69 / 84.56 62.58 / 84.51 67.53 / 85.53
Weather S.L. 82.03 / 87.26 80.53 / 86.53 76.75 / 84.66 71.12 / 84.61 67.31 / 84.46 75.54 / 85.50
Density 86.60 / 86.81 84.59 / 86.59 84.05 / 85.60 82.74 / 85.27 82.42 / 84.30 83.68 / 85.71
Cutout 82.18 / 87.64 80.02 / 86.21 77.41 / 83.25 74.66 / 80.81 71.59 / 77.94 77.17 / 83.17
Crosstalk 84.22 / 84.41 83.38 / 84.38 81.41 / 84.13 80.78 / 83.79 80.22 / 83.90 82.00 / 84.12
Gaussian (L) 84.69 / 85.41 82.52 / 84.66 77.43 / 81.39 47.28 / 73.58 17.31 / 57.79 61.85 / 76.56
Uniform (L) 84.77 / 85.77 84.64 / 85.42 84.39 / 85.47 82.32 / 85.00 78.59 / 83.59 82.94 / 85.05
Impulse (L) 84.45 / 84.95 84.73 / 82.88 84.92 / 82.20 84.63 / 80.51 84.56 / 80.29 84.66 / 82.16
Gaussian (C) 84.53 / 85.77 84.47 / 85.42 84.31 / 85.47 84.18 / 85.32 83.96 / 84.32 84.29 / 85.26
Uniform (C) 84.74 / 85.57 84.57 / 85.08 84.54 / 82.96 84.36 / 82.53 84.05 / 80.36 84.45 / 83.30
Sensor Impulse (C) 84.53 / 85.70 84.26 / 83.63 84.38 / 83.54 83.95 / 82.42 83.86 / 82.28 84.20 / 83.51
Moving Obj. 58.89 / 78.46 12.78 / 67.86 0.43 / 41.07 0.06 / 36.28 0.07 / 22.85 14.44 / 49.30
Motion Motion Blur 84.64 / 85.23 84.53 / 84.98 84.56 / 84.72 84.45 / 83.00 84.43 / 82.96 84.52 / 84.17
Local Density 82.31 / 85.23 81.66 / 84.87 80.15 / 82.70 76.53 / 82.08 72.52 / 81.21 78.63 / 83.21
Local Cutout 76.77 / 82.94 72.46 / 81.31 65.87 / 78.14 59.14 / 74.12 50.17 / 69.61 64.88 / 77.22
Object Local Gaussian 84.45 / 86.81 81.12 / 86.25 67.13 / 82.72 33.33 / 76.01 12.27 / 63.31 55.66 / 79.02
Local Uniform 84.51 / 85.91 84.35 / 85.65 81.95 / 85.23 79.62 / 84.66 69.25 / 81.99 79.94 / 84.68
Local Impulse 84.53 / 85.65 84.47 / 85.13 84.32 / 85.18 84.40 / 85.16 83.72 / 85.16 84.29 / 85.25
APcsubscriptAP𝑐\text{AP}_{c}AP start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT 79.35 / 85.56 75.73 / 84.38 72.93 / 81.77 68.22 / 79.92 63.34 / 76.97 71.81 / 81.72
Clean 85.04 / 88.04
Table 15: Comparison with SOTA methods on nuScenes-C validation set with mAP. ‘D.I.’ refers to DeepInteraction Yang et al. (2022). The best one is highlighted in bold. ‘S.L.’ denotes Strong Sunlight. ‘RCE’ denotes Relative Corruption Error from Ref.Dong et al. (2023).
Corruptions Lidar-Only Camera-Only LC Fusion
PointPillars CenterPoint FCOS3D DETR3D BEVFormer FUTR3D TransFusion BEVFusion D.I.* RoboFusion (Ours)
L B T
None(APcleansubscriptAPclean\text{AP}_{\text{clean}}AP start_POSTSUBSCRIPT clean end_POSTSUBSCRIPT) 27.69 59.28 23.86 34.71 41.65 64.17 66.38 68.45 69.90 69.91 69.40 69.09
Snow 27.57 55.90 2.01 5.08 5.73 52.73 63.30 62.84 62.36 67.12 66.07 65.96
Rain 27.71 56.08 13.00 20.39 24.97 58.40 65.35 66.13 66.48 67.58 67.01 66.45
Fog 24.49 43.78 13.53 27.89 32.76 53.19 53.67 54.10 54.79 67.01 65.54 64.34
Weather S.L. 23.71 54.20 17.20 34.66 41.68 57.70 55.14 64.42 64.93 67.24 66.71 66.54
Density 27.27 58.60 - - - 63.72 65.77 67.79 68.15 69.48 69.02 68.58
Cutout 24.14 56.28 - - - 62.25 63.66 66.18 66.23 69.18 69.01 68.20
Crosstalk 25.92 56.64 - - - 62.66 64.67 67.32 68.12 68.68 68.04 68.17
FOV lost 8.87 20.84 - - - 26.32 24.63 27.17 42.66 39.48 39.30 39.43
Gaussian (L) 19.41 45.79 - - - 58.94 55.10 60.64 57.46 57.77 57.07 56.00
Uniform (L) 25.60 56.12 - - - 63.21 64.72 66.81 67.42 64.57 64.25 64.99
Impulse (L) 26.44 57.67 - - - 63.43 65.51 67.54 67.41 65.64 65.45 65.44
Gaussian (C) - - 3.96 14.86 15.04 54.96 64.52 64.44 66.52 66.73 66.75 66.53
Uniform (C) - - 8.12 21.49 23.00 57.61 65.26 65.81 65.90 65.77 65.76 65.56
Sensor Impulse (C) - - 3.55 14.32 13.99 55.16 64.37 64.30 65.65 64.82 64.75 64.56
Compensation 3.85 11.02 - - - 31.87 9.01 27.57 39.95 41.88 39.54 41.28
Motion Motion Blur - - 10.19 11.06 19.79 55.99 64.39 64.74 65.45 67.21 66.52 66.42
Local Density 26.70 57.55 - - - 63.60 65.65 67.42 67.71 66.74 66.59 65.88
Local Cutout 17.97 48.36 - - - 61.85 63.33 63.41 65.19 66.82 66.53 66.76
Local Gaussian 25.93 51.13 - - - 62.94 63.76 64.34 64.75 65.08 65.17 64.77
Local Uniform 27.69 57.87 - - - 64.09 66.20 67.58 66.44 66.71 66.19 65.40
Obeject Local Impulse 27.67 58.49 - - - 64.02 66.29 67.91 67.86 66.53 66.87 66.67
Average(APcor)AveragesubscriptAPcor\textbf{Average}(\text{AP}_{\text{\text{cor}}})Average ( AP start_POSTSUBSCRIPT cor end_POSTSUBSCRIPT ) 22.99 49.78 8.94 18.71 22.12 56.88 58.77 61.35 62.92 63.90 63.43 63.23
RCE (%) \downarrow 16.95 16.01 62.51 46.07 46.89 11.34 11.45 10.36 9.97 8.58 8.59 8.47
  • 1

    : Results from Ref. Dong et al. (2023).

  • 2

    * denotes re-implement result.

A.2.5 More Results on the nuScenes-C validation set.

As depicted in Table 15, compared with LiDAR-only methods including PointPillars Lang et al. (2019), and CenterPoint Yin et al. (2021), Camera-Only methods FCOS3D Wang et al. (2021), DETR3D Wang et al. (2022b), and BEVFormer Li et al. (2022c) and multi-modal methods including FUTR3D Chen et al. (2023), TransFusion Bai et al. (2022), BEVFusion Liu et al. (2023) and DeepInteraction Yang et al. (2022), our RoboFusion demonstrates superior performance across more noise scenarios in AD on average. For instance, our RoboFusion-L excels in 10 noise scenarios, including Weather (Snow, Rain, Fog, Strong Sunlight), Sensor (Density, Cutout, Crosstalk), Motion (Compensation, Motion Blur), and Object (Local Cutout), outperforming DeepInteraction Yang et al. (2022) which achieves the best performance only in 5 of these noise scenarios. Overall, our method exhibits not only exceptional robustness in weather-induced noise scenarios, but also shows remarkable resilience across a broader noise include sensor, motion and object noise.

Refer to caption
Figure 5: Visualization Results of LoGoNet and our RoboFusion in KITTI-C dataset. We use boxes in red to represent false positives, green boxes for truth positives, and black for the ground truth. We use blue dashed ovals to highlight the pronounced improvements in predictions.

A.3 Visualization

As shown in Fig. 5, we provide visualization results between our RoboFusion-L and LoGoNet on the KITTI-C dataset. Overall, compared to SOTA methods like LoGoNet Li et al. (2023), our method enhances the robustness of multi-modal 3D object detection by leveraging the generalization capability and robustness of VFMs to mitigate OOD noisy scenarios in AD.

A.4 More Limitations

Although we have mentioned the two main limitations in the ‘Conclusions’ section of the main text, our RoboFusion still has other limitations. Our method does not achieve the best performance in all noisy scenarios. For instance, as shown in Table 11, our method does not show the best in ‘Moving Object’ noisy scenarios. Furthermore, we conduct experiments only on the corruption datasets Dong et al. (2023) rather than real-world datasets. It is valuable to construct a real-world corruption dataset, but it must be an expensive work.

References

  • Bai et al. [2022] Xuyang Bai, Zeyu Hu, Xinge Zhu, Qingqiu Huang, Yilun Chen, Hongbo Fu, and Chiew-Lan Tai. TransFusion: Robust LiDAR-camera fusion for 3D object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1090–1099, 2022.
  • Caesar et al. [2020] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.
  • Chen et al. [2022] Yukang Chen, Yanwei Li, Xiangyu Zhang, Jian Sun, and Jiaya Jia. Focal sparse convolutional networks for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5428–5437, 2022.
  • Chen et al. [2023] Xuanyao Chen, Tianyuan Zhang, Yue Wang, Yilun Wang, and Hang Zhao. FUTR3D: A unified sensor fusion framework for 3D detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 172–181, 2023.
  • Deng et al. [2021] Jiajun Deng, Shaoshuai Shi, Peiwei Li, Wengang Zhou, Yanyong Zhang, and Houqiang Li. Voxel R-CNN: Towards high performance voxel-based 3D object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 1201–1209, 2021.
  • Dong et al. [2023] Yinpeng Dong, Caixin Kang, Jinlai Zhang, Zijian Zhu, et al. Benchmarking robustness of 3D object detection to common corruptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1022–1032, 2023.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, et al. An image is worth 16x16 words: Transformers for image recognition at scale. CoRR, abs/2010.11929, 2020.
  • Geiger et al. [2012] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In IEEE conference on computer vision and pattern recognition, pages 3354–3361, 2012.
  • Hu et al. [2023] Qianjiang Hu, Daizong Liu, and Wei Hu. Density-insensitive unsupervised domain adaption on 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17556–17566, 2023.
  • Huang et al. [2020] Tengteng Huang, Zhe Liu, Xiwu Chen, and Xiang Bai. EPNet: Enhancing point features with image semantics for 3D object detection. In European Conference on Computer Vision, pages 35–52. Springer, 2020.
  • Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
  • Lang et al. [2019] Alex H Lang, Sourabh Vora, et al. PointPillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12697–12705, 2019.
  • Li et al. [2022a] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. In European Conference on Computer Vision, pages 280–296. Springer, 2022.
  • Li et al. [2022b] Yanwei Li, Xiaojuan Qi, Yukang Chen, Liwei Wang, Zeming Li, Jian Sun, and Jiaya Jia. Voxel field fusion for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1120–1129, 2022.
  • Li et al. [2022c] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, et al. BEVFormer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In European conference on computer vision, pages 1–18. Springer, 2022.
  • Li et al. [2023] Xin Li, Tao Ma, Yuenan Hou, Botian Shi, et al. LoGoNet: Towards accurate 3D object detection with local-to-global cross-modal fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17524–17534, 2023.
  • Lin et al. [2017] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
  • Liu et al. [2020a] Lin Liu, Jianzhuang Liu, Shanxin Yuan, Gregory Slabaugh, Aleš Leonardis, Wengang Zhou, and Qi Tian. Wavelet-based dual-branch network for image demoiréing. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIII 16, pages 86–102. Springer, 2020.
  • Liu et al. [2020b] Zechen Liu, Zizhang Wu, and Roland Tóth. SMOKE: Single-stage monocular 3D object detection via keypoint estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 996–997, 2020.
  • Liu et al. [2023] Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela L Rus, and Song Han. BEVFusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In 2023 IEEE international conference on robotics and automation (ICRA), pages 2774–2781. IEEE, 2023.
  • OpenAI [2023] OpenAI. Gpt-4 technical report. https://cdn.openai.com/papers/gpt-4.pdf, 2023.
  • Oza et al. [2023] Poojan Oza, Vishwanath A Sindagi, Vibashan Vishnukumar Sharmini, and Vishal M Patel. Unsupervised domain adaptation of object detectors: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • Peng et al. [2023] Xidong Peng, Xinge Zhu, and Yuexin Ma. CL3D: Unsupervised domain adaptation for cross-LiDAR 3D detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 2047–2055, 2023.
  • Rukhovich et al. [2022] Danila Rukhovich, Anna Vorontsova, and Anton Konushin. ImVoxelNet: Image to voxels projection for monocular and multi-view general-purpose 3D object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2397–2406, 2022.
  • Shi et al. [2019] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. PointRCNN: 3D object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 770–779, 2019.
  • Shi et al. [2020] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. PV-RCNN: point-voxel feature set abstraction for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10529–10538, 2020.
  • Song et al. [2023] Ziying Song, Haiyue Wei, Lin Bai, et al. GraphAlign: Enhancing accurate feature alignment by graph matching for multi-modal 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3358–3369, 2023.
  • Song et al. [2024a] Ziying Song, Lin Liu, Feiyang Jia, et al. Robustness-aware 3D object detection in autonomous driving: A review and outlook. arXiv preprint arXiv:2401.06542, 2024.
  • Song et al. [2024b] Ziying Song, Guoxin Zhang, Jun Xie, Lin Liu, et al. Voxelnextfusion: A simple, unified and effective voxel fusion framework for multi-modal 3D object detection. arXiv preprint arXiv:2401.02702, 2024.
  • Tsai et al. [2023] Darren Tsai, Julie Stephany Berrio, Mao Shan, Eduardo Nebot, and Stewart Worrall. Viewer-centred surface completion for unsupervised domain adaptation in 3D object detection. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9346–9353, 2023.
  • Wang et al. [2021] Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. FCOS3D: Fully convolutional one-stage monocular 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 913–922, 2021.
  • Wang et al. [2022a] Tai Wang, ZHU Xinge, Jiangmiao Pang, and Dahua Lin. Probabilistic and geometric depth: Detecting objects in perspective. In Conference on Robot Learning, pages 1475–1485. PMLR, 2022.
  • Wang et al. [2022b] Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, et al. DERT3D: 3D object detection from multi-view images via 3D-to-2D queries. In Conference on Robot Learning, pages 180–191. PMLR, 2022.
  • Wang et al. [2023a] Li Wang, Xinyu Zhang, Ziying Song, et al. Multi-modal 3d object detection in autonomous driving: A survey and taxonomy. IEEE Transactions on Intelligent Vehicles, 2023.
  • Wang et al. [2023b] Yan Wang, Junbo Yin, Wei Li, et al. SSDA3D: Semi-supervised domain adaptation for 3D object detection from point cloud. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 2707–2715, 2023.
  • Wu et al. [2023] QuanLin Wu, Hang Ye, Yuntian Gu, et al. Denoising masked autoencoders help robust classification. In The Eleventh International Conference on Learning Representations, 2023.
  • Xie et al. [2023] Yichen Xie, Chenfeng Xu, Marie-Julie Rakotosaona, Patrick Rim, et al. SparseFusion: Fusing multi-modal sparse representations for multi-sensor 3D object detection. arXiv preprint arXiv:2304.14340, 2023.
  • Yan et al. [2018] Yan Yan, Yuxing Mao, and Bo Li. SECOND: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
  • Yan et al. [2023] Junjie Yan, Yingfei Liu, Jianjian Sun, Fan Jia, et al. Cross modal transformer via coordinates encoding for 3D object dectection. arXiv preprint arXiv:2301.01283, 2023.
  • Yang et al. [2022] Zeyu Yang, Jiaqi Chen, Zhenwei Miao, et al. DeepInteraction: 3D object detection via modality interaction. Advances in Neural Information Processing Systems, 35:1992–2005, 2022.
  • Yin et al. [2021] Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11784–11793, 2021.
  • Zhang et al. [2022] Yanan Zhang, Jiaxin Chen, and Di Huang. CAT-Det: Contrastively augmented transformer for multi-modal 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 908–917, 2022.
  • Zhang et al. [2023a] Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, et al. Faster segment anything: Towards lightweight SAM for mobile applications. arXiv preprint arXiv:2306.14289, 2023.
  • Zhang et al. [2023b] Dingyuan Zhang, Dingkang Liang, Hongcheng Yang, Zhikang Zou, Xiaoqing Ye, Zhe Liu, and Xiang Bai. SAM3D: Zero-shot 3D object detection via segment anything model. arXiv preprint arXiv:2306.02245, 2023.
  • Zhang et al. [2023c] Xinyu Zhang, Li Wang, Jian Chen, et al. Dual radar: A multi-modal dataset with dual 4d radar for autononous driving. arXiv preprint arXiv:2310.07602, 2023.
  • Zhao et al. [2023] Xu Zhao, Wenchao Ding, Yongqi An, Yinglong Du, et al. Fast segment anything, 2023.
  • Zhu et al. [2020] Xinge Zhu, Yuexin Ma, Tai Wang, et al. SSN: Shape signature networks for multi-class object detection from point clouds. In European Conference on Computer Vision, pages 581–597. Springer, 2020.