Learn to Memorize and to Forget: A Continual Learning Perspective of Dynamic SLAM

Baicheng Li 1, Zike Yan 2, Dong Wu 1, Hanqing Jiang 3, Hongbin Zha 1

1 National Key Lab of GAI, School of IST, PKU-SenseTime Machine Vision Joint Lab, Peking University
2 AIR, Tsinghua University
3 SenseTime Research

Footnote: Corresponding authors.

Abstract

Simultaneous localization and mapping (SLAM) with implicit neural representations has received extensive attention due to its expressive representation power and its innovative paradigm of continual learning. However, deploying such a system in a dynamic environment has not been well studied. The challenge is intractable even for conventional algorithms, since observations of dynamic objects from different views break the geometric and photometric consistency, and this consistency is the foundation for jointly optimizing the camera pose and the map parameters. In this paper, we exploit the characteristics of continual learning and propose a novel SLAM framework for dynamic environments. While past efforts have sought to avoid catastrophic forgetting through an experience replay strategy, we view forgetting as a desirable characteristic. By adaptively controlling the replayed buffer, the ambiguity caused by moving objects can be alleviated through forgetting. We restrain the replay of dynamic objects by introducing a continually learned classifier for dynamic object identification. The iterative optimization of the neural map and the classifier notably improves the robustness of the SLAM system in dynamic environments. Experiments on challenging datasets verify the effectiveness of the proposed framework.

1 Introduction

Figure 1: We introduce a continual learning based SLAM framework under challenging dynamic environments (top row). The proposed method jointly learns a classifier to alleviate the effects induced by the moving objects (middle row), and a neural map to memorize past observations as a neural radiance field (bottom row). The iterative optimization of pose, map, and classifier parameters forms a robust SLAM system that learns to memorize and to forget adaptively in the changing open world.

Simultaneous localization and mapping (SLAM) estimates the instantaneous state of an agent together with the environment in which the agent operates. By constructing a consistent map on the fly from sequential observations, the agent gradually gains knowledge of the environment that can be utilized for downstream vision and robotics applications. Consistency in both the temporal and spatial domains has been the foundation of the problem since the genesis of SLAM [8], where the view-invariant photometric and geometric cues within the map are leveraged to predict and validate future measurements [3]. Nevertheless, this consistency cannot be guaranteed in real-world settings, as various environmental changes may occur due to object movements. Alleviating the effects of the inconsistency between observations and the map is of great importance for robust long-term deployment.

Intuitively, a robust SLAM system can be achieved if frame-to-model alignment is conducted on purely static/invariant features [30]. Such a system requires accurate identification of environmental changes in both the observations and the map. Conventional methods resort to removing the dynamic objects from observations through motion segmentation and updating the discretized map by heuristically deleting the corresponding areas. Recent advances [45, 34] show that an implicit neural representation can also be updated in a purely static environment through test-time optimization to serve as the map of a dense SLAM system. Most methods adopt an experience-replay based continual learning paradigm to avoid catastrophic forgetting of past observations. We argue that forgetting is also a useful property for updating the neural map under environmental changes. Only the invariant features in the observations should be retained, whereas the changing parts will be naturally forgotten under constant distribution shifts.

In this work, we introduce a dense neural SLAM framework to tackle challenging dynamic scenarios. The key idea is the continual learning of two modules: a neural map $f(\mathbf{x};\theta^{t}_{M})$ that distills past observations into a continuous neural radiance field, and a binary classifier $g(\mathbf{z};\theta^{t}_{C})$ that records the motion status (static/dynamic) of each instance given the encoded feature $\mathbf{z}$. The continual learning fashion guarantees online adaptation to the instantaneous state of scene geometry, appearance, and object motion status. Both modules accumulate knowledge from sequential observations and automatically decide what to memorize and what to forget. The areas that are inconsistent between the observation and the map are identified and do not contribute to pose estimation and map updating. Such iterative optimization of poses, map parameters, and object motion status leads to a robust framework for dynamic SLAM under changing environments. To summarize, our main contributions include:

  • We present for the first time the deployment of a dense neural SLAM framework under challenging dynamic environments. The proposed method leads to reliable motion segmentation, robust camera tracking, and convenient map updating under diverse environmental changes.

  • We propose a continual learning method for updating a classifier that records the motion status of objects within the environment. The instance-aware classifier applies to open-world scenarios and shows positive forward and backward transfer. The module can also integrate prior knowledge regarding potentially movable instances through pre-training.

  • We show that the forgetting mechanism of continual learning can be exploited to update the neural scene representation under changing environmental conditions.

2 Related Work

Figure 2: Overview of the proposed method. We integrate an instance segmentation module, a visual encoder, and a continually learned classifier to achieve accurate dynamic object identification, enabling robust localization and mapping in complex dynamic environments.

Visual SLAM in dynamic environments. This line of work removes dynamic objects and reconstructs the static environment. A number of methods [6, 32, 25] utilize warping or reprojection to identify the inconsistency in motion, appearance, or geometry. Although motion segmentation can be generalized to different environments, motion ambiguity leads to typical failures when a moving object occupies a large portion of the image. Another attempt [14] relies on semantic segmentation [11] to pre-define dynamic categories. However, this solution mainly relies on common sense and cannot adapt to the real open world. In addition, some approaches combine motion detection with semantic cues. DynaSLAM [2] and DRG-SLAM [44] filter out features that fall into a pre-defined category or violate geometric constraints. They might over-segment dynamic areas and leave insufficient information for localization. In contrast, SLAMANTIC [31] and CFP-SLAM [12] check the observations in the pre-defined category through projection and only remove the features that present inconsistency. Nonetheless, the temporal consistency of motion and its applicability to the open world are commonly ignored. The relevant methods struggle to trade off between over-segmentation and under-segmentation. Additionally, there exists a category of work dedicated to change detection [1, 26, 27, 42, 37, 38], which can identify areas of change in the environment offline, given known camera poses and a complete point cloud. However, in the context of SLAM, both the camera poses and the complete environmental model are unknown, and the process needs to be conducted online. We fulfill this requirement through a continually learned classifier.

Dense SLAM with implicit neural representations. The compact and continuous representation power of implicit neural representations [20] draws public attention in the SLAM community. The seminal works of [34] and [45] show that neural representation can be updated through test-time optimization. The follow-ups try different neural representations for better efficiency or accuracy [49, 13, 16, 46, 7, 21, 23, 29]. Recently, Co-SLAM [43] leverages both the high-frequency preserving characteristics of coordinate-based representation and the fast convergence of optimizable feature grids for an accurate and efficient dense SLAM system. Though the above-mentioned methods make great progress in reconstructing static scenes, they are prone to failure when deployed in dynamic environments. We take a step further to address the problem by continually learning what to memorize and what to forget.

NeRF construction in changing environments. Although no prior work specifically tackles the deployment of NeRF-based SLAM in dynamic environments, some research has been conducted on training a neural radiance field under changing environments. NeRF in the wild [19] pioneers the use of appearance embedding to handle environmental changes. The solution is frequently adopted in follow-up works [41, 36, 48, 35]. There are also works that target the reconstruction of the spatial-temporal 4D field of the environment [18, 40, 10, 9], where per-point radiance and motion are integrated into the neural representation. Recently, CLNeRF [4] leverages continual learning to progressively update the scene representation to adapt to changing environments. Similarly, DynaMoN [15] applies motion segmentation to DROID-SLAM [39] to achieve accurate camera pose estimation in dynamic environments. The estimated poses are then utilized for offline optimization of a 4D NeRF. In contrast, our work directly builds the neural radiance field on the fly in dynamic environments and targets a much more challenging SLAM problem. The invariant information within the environment is stored adaptively under different circumstances.

3 Preliminaries

The central idea of this work is a general framework to mitigate the discrepancy between observations and the stored map under changing environments. The framework is expected to distinguish, among past observations, the invariant features to memorize from the changing areas to forget. In this manner, only reliable features that satisfy multi-view consistency over a long period will be retained, and only errors outside the dynamic areas will be back-propagated to update the pose and map parameters. In practice, we resort to continual learning, where a neural map and a motion status classifier are trained on the fly to distill knowledge from sequential observations into compact networks. The map serves as a global memory of the scene radiance, whereas the classifier serves as a dynamic object detector. The iterative optimization of both networks defines the memorization-forgetting loop for a robust neural SLAM system.

Learn to memorize. Following the recent progress of neural SLAM, we adopt experience-replay based continual learning for test-time map optimization, where a set of keyframes is stored explicitly so that errors can be back-propagated to optimize the network parameters. The keyframes can be viewed as compressed knowledge of past experience. The map can then memorize this knowledge through gradient-based optimization given the constantly replayed keyframe buffer.

Learn to forget. As neural networks exhibit high plasticity, past knowledge can be easily forgotten under constant distribution shift [45]. We expect the framework to adaptively forget the scene dynamics while preserving the invariant information. Since the past knowledge is controlled by the replayed keyframes, we utilize a classifier to identify the areas within the stored keyframes that have changed. Forgetting in the neural map occurs naturally if the dynamic areas of each keyframe are excluded from replay. The classifier should be instance-wise and allow efficient adaptation once environmental changes occur. The updated motion status can then be passed to the stored keyframes to enforce forgetting.

4 Method

Fig. 2 shows the overview of our dynamic SLAM framework. In practice, streaming RGB-D images $\{I^{t},D^{t}\}_{t=1}^{N}$ with known camera intrinsics $K$ are taken as inputs, and a neural radiance field $f(\mathbf{x};\theta^{t}_{M})$ is updated continually to memorize the static part of the environment. The key to our robust SLAM framework in dynamic environments is a continually learned binary classifier $g(\mathbf{z};\theta^{t}_{C})$. The past knowledge to be forgotten is determined by the classifier, which identifies the inconsistency induced by object movement. Note that the optimization of the camera pose $\xi^{t}$, the neural map $\theta_{M}^{t}$, and the motion status classifier $\theta_{C}^{t}$ all rely on the discrepancy between rendered and observed RGB-D images. As illustrated in Fig. 3, the tight coupling of these three variables makes the optimization inherently ambiguous: the divergence of any one of them leads to a high discrepancy. We argue that this ambiguity can be alleviated with the neural representations of the map and the classifier, where the continuous representations exhibit promising generalization ability and enforce temporally consistent predictions.
In this section, we begin by introducing the photometric and geometric constraints through volume rendering, followed by how these constraints propagate gradients for optimizing the camera pose, map, and classifier parameters iteratively.

Figure 3: The optimization divergence of any of the camera pose, the neural map, or the motion status classifier will lead to high photometric and geometric errors.

4.1 Formulation

The objective is to guarantee a photorealistic and multi-view consistent map representation by minimizing the photometric and geometric errors in static areas between the rendered images and the observed images as:

$$L_{pho}=\frac{1}{|H|}\sum_{(u,v)\in H}\left|I[u,v]-\hat{I}(\xi,\theta_{M})[u,v]\right|, \tag{1}$$
$$L_{geo}=\frac{1}{|H|}\sum_{(u,v)\in H}\left|D[u,v]-\hat{D}(\xi,\theta_{M})[u,v]\right|, \tag{2}$$

where $H\subset\mathbbm{1}(R^{t})$ is the set of static samples given the motion segmentation mask $R^{t}$ indicated by the classifier. $\hat{I}(\xi^{t},\theta^{t}_{M})$ and $\hat{D}(\xi^{t},\theta^{t}_{M})$ are the color and depth images predicted through volume rendering.
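As an illustration, Eqs. 1 and 2 can be sketched in NumPy as masked mean absolute errors. This is a minimal sketch under stated assumptions, not the implementation used in this work: the function name `masked_l1_losses` is hypothetical, the absolute color difference is summed over the three channels, and all static pixels are used instead of a random subset $H$.

```python
import numpy as np

def masked_l1_losses(color, depth, color_hat, depth_hat, static_mask):
    """Mean absolute color/depth error over static pixels only (Eqs. 1-2).

    color, color_hat: (H, W, 3) observed and rendered color images.
    depth, depth_hat: (H, W) observed and rendered depth images.
    static_mask:      (H, W) boolean array, True where the pixel is static.
    """
    if not static_mask.any():
        # No static pixel is available, so nothing constrains the optimization.
        return 0.0, 0.0
    # Assumption: per-pixel color error is the L1 norm over the channels.
    color_err = np.abs(color - color_hat).sum(axis=-1)
    depth_err = np.abs(depth - depth_hat)
    # Boolean indexing keeps only the static pixels before averaging.
    return float(color_err[static_mask].mean()), float(depth_err[static_mask].mean())

# Toy example: the only color error sits on the pixel marked as dynamic.
color = np.zeros((2, 2, 3))
color_hat = color.copy()
color_hat[0, 0] = 1.0
depth = np.ones((2, 2))
depth_hat = depth + 0.1
static_mask = np.array([[False, True], [True, True]])
l_pho, l_geo = masked_l1_losses(color, depth, color_hat, depth_hat, static_mask)
```

In this toy example the color error on the masked-out pixel does not contribute to `l_pho`, which mirrors how dynamic areas are excluded from the objective.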

In this paper, we follow Co-SLAM [43] to model the scene as a truncated signed distance field $s$ with color values $\mathbf{c}$ as $f(\mathbf{x};\theta)\rightarrow(\mathbf{c},s)$. Nevertheless, the proposed method can also adopt the conventional NeRF representation with density-based volume rendering as in [20]. In this work, the learnable map parameters $\theta=\{\alpha,\phi,\tau\}$ denote the encoder $\mathcal{V}(\mathbf{x};\alpha)$ for grid features and the decoders for color and geometry. The volume rendering for color and depth prediction is a weighted sum of all samples $\{\mathbf{p}_{i}\}_{i=1}^{N}$ along a ray determined by the pixel coordinate and the camera origin as:

$$\hat{I}(\xi,\theta_{M})[u,v]=\frac{1}{\sum_{i=1}^{N}w_{i}}\sum_{i=1}^{N}w_{i}\mathbf{c}_{i}, \tag{3}$$
$$\hat{D}(\xi,\theta_{M})[u,v]=\frac{1}{\sum_{i=1}^{N}w_{i}}\sum_{i=1}^{N}w_{i}d_{i}, \tag{4}$$

where $w_{i}$ is the computed weight and $d_{i}$ is the distance between the sampled point $\mathbf{p}_{i}$ and the corresponding camera origin. It should be noted that in Co-SLAM [43], the weight $w$ is computed by multiplying two sigmoid functions as:

$$w=\sigma\left(\frac{s}{\lambda_{tr}}\right)\sigma\left(-\frac{s}{\lambda_{tr}}\right), \tag{5}$$

where $\lambda_{tr}$ is the truncation distance.
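To make Eqs. 3 to 5 concrete, the following NumPy sketch renders a single ray from SDF samples. It is only an illustration of the equations as written: the function names `sigmoid` and `render_ray`, the sampling range, and the truncation value are hypothetical choices, and the SDF values are supplied directly instead of being queried from a network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def render_ray(sdf, colors, dists, trunc):
    """Render color and depth for one ray from N samples.

    sdf:    (N,) signed distance values s at the samples.
    colors: (N, 3) color values c_i at the samples.
    dists:  (N,) distances d_i from the camera origin to the samples.
    trunc:  truncation distance lambda_tr.
    """
    # Eq. 5: bell-shaped weight that peaks where the SDF crosses zero.
    w = sigmoid(sdf / trunc) * sigmoid(-sdf / trunc)
    w_sum = w.sum()
    color = (w[:, None] * colors).sum(axis=0) / w_sum  # Eq. 3
    depth = (w * dists).sum() / w_sum                  # Eq. 4
    return color, depth, w

# Toy ray: a surface at distance 1.0, sampled symmetrically around it.
dists = np.linspace(0.5, 1.5, 101)
sdf = 1.0 - dists                      # positive in front of the surface
colors = np.tile([0.2, 0.4, 0.6], (101, 1))
color, depth, w = render_ray(sdf, colors, dists, trunc=0.05)
```

Because the weight in Eq. 5 is symmetric in $s$ and reaches its maximum of 0.25 at $s=0$, the rendered depth of this toy ray lies at the surface distance.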

4.2 Motion-aware Tracking and Mapping

Besides the photometric and geometric constraints in Eqs. 1 and 2, we also adopt from Co-SLAM [43] the SDF loss $L_{sdf}$ for near-surface samples, the free-space loss $L_{free}$, and the feature smoothness loss $L_{smooth}$ as further constraints.

The SDF loss is employed to enhance the quality of map reconstruction, where the depth values of pixels are utilized as SDF approximations. For points within the truncation region ($|D[u,v]-\hat{D}[u,v]|\leq tr$):

$$L_{sdf}=\frac{1}{|R_{H}|}\sum_{r\in R_{H}}\frac{1}{|S_{r}^{tr}|}\sum_{p\in S_{r}^{tr}}\big(s_{p}-(D[u,v]-\hat{D}(\xi,\theta_{M})[u,v])\big)^{2}, \tag{6}$$

where $R_{H}$ represents the set of rays corresponding to the static pixel set $H$.

For points outside the truncation region ($|D[u,v]-\hat{D}[u,v]|>tr$), we calculate the free-space loss:

$$L_{free}=\frac{1}{|R_{H}|}\sum_{r\in R_{H}}\frac{1}{|S_{r}\setminus S_{r}^{tr}|}\sum_{p\in S_{r}\setminus S_{r}^{tr}}(s_{p}-tr)^{2}. \tag{7}$$

To mitigate the noisy reconstruction of unvisited areas induced by hash collisions, a smoothness loss on the interpolated features is applied:

$$L_{smooth}=\frac{1}{|\mathcal{G}|}\sum_{p\in\mathcal{G}}\Delta_{x}^{2}+\Delta_{y}^{2}+\Delta_{z}^{2}, \tag{8}$$

where $\Delta_{x,y,z}=V_{\alpha}(p+\epsilon_{x,y,z})-V_{\alpha}(p)$ denotes the feature difference between adjacent vertices sampled on the hash grid.
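The structure of Eq. 8 can be sketched with finite differences. The sketch below is an assumption-laden stand-in: it uses a dense feature grid in place of the hash-grid encoder, takes $\Delta^{2}$ as the squared norm over feature channels, and the function name `smoothness_loss` is hypothetical.

```python
import numpy as np

def smoothness_loss(grid):
    """Mean squared forward difference of features along x, y, and z.

    grid: (X, Y, Z, C) dense feature grid, used here as a simple stand-in
          for features interpolated from a hash grid.
    """
    base = grid[:-1, :-1, :-1]
    # Forward differences between each vertex and its +x, +y, +z neighbor.
    dx = grid[1:, :-1, :-1] - base
    dy = grid[:-1, 1:, :-1] - base
    dz = grid[:-1, :-1, 1:] - base
    # Assumption: Delta^2 is the squared norm over the feature channels.
    per_vertex = (dx ** 2 + dy ** 2 + dz ** 2).sum(axis=-1)
    return float(per_vertex.mean())

# A constant grid is perfectly smooth; a unit ramp along x is penalized.
flat = np.ones((4, 4, 4, 2))
ramp = np.arange(4.0)[:, None, None, None] * np.ones((4, 4, 4, 1))
loss_flat = smoothness_loss(flat)
loss_ramp = smoothness_loss(ramp)
```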

We apply the Adam optimizer to the weighted sum of these five loss terms:

$$L=L_{pho}+\lambda_{1}L_{geo}+\lambda_{2}L_{sdf}+\lambda_{3}L_{free}+\lambda_{4}L_{smooth}. \tag{9}$$

Note that in the loss function, only errors in the static areas are back-propagated for optimizing the pose and map parameters, where the motion segmentation mask $R^{t}$ is inferred using the instance-wise classifier $g(\mathbf{z};\theta^{t}_{C})$ described in Sec. 4.3. During the tracking process, we randomly select $|H_{t}|$ pixels within the static area of the current frame and optimize the estimated camera pose by minimizing the objective function with fixed map and classifier parameters.

Similarly, the bundle adjustment for jointly optimizing camera poses and map parameters is carried out by randomly selecting $|H_{b}|$ pixels within the static areas of the past keyframes and the current frame. Stored keyframes serve as past experience to avoid catastrophic forgetting [45, 34]. As the errors of pixels in the dynamic areas are excluded from back-propagation, only knowledge of the static areas is retained in the neural map. As illustrated in Fig. 4, the moving object is gradually forgotten in the neural map given the constantly propagated errors from recent observations.
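The pixel selection for replay can be sketched as follows. This is a hedged illustration, not the sampling scheme of this work: it assumes uniform sampling without replacement over all static pixels of all stored frames, and the function name `sample_static_pixels` and the (frame, row, column) layout are hypothetical.

```python
import numpy as np

def sample_static_pixels(static_masks, n_samples, rng):
    """Sample pixels for replay from the static areas of stored frames.

    static_masks: list of (H, W) boolean arrays, one per keyframe (and the
                  current frame), True where the pixel is static.
    Returns an (n, 3) integer array with rows (frame index, row, column).
    """
    per_frame = []
    for idx, mask in enumerate(static_masks):
        rows, cols = np.nonzero(mask)
        frame_ids = np.full(rows.shape[0], idx)
        per_frame.append(np.column_stack([frame_ids, rows, cols]))
    candidates = np.concatenate(per_frame)
    if len(candidates) == 0:
        return candidates  # every pixel is currently marked as dynamic
    # Assumption: uniform sampling without replacement over static pixels.
    n = min(n_samples, len(candidates))
    picked = rng.choice(len(candidates), size=n, replace=False)
    return candidates[picked]

# Two keyframes; a 2x2 block of the second one is marked as dynamic.
masks = [np.ones((4, 5), dtype=bool), np.ones((4, 5), dtype=bool)]
masks[1][1:3, 2:4] = False
samples = sample_static_pixels(masks, 16, np.random.default_rng(0))
```

Since dynamic pixels never enter the candidate set, they produce no gradients during replay, which is the mechanism by which the map forgets them.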

Figure 4: The rendering results given the continually learned map indicate the gradual forgetting of the dynamic object.

4.3 Updating of Object Motion Status

As illustrated in Fig. 2, the motion status is inferred at the instance level. An image $I^{t}$ is decomposed into $K^{t}$ segments $\{S_{k}^{t}\}_{k=1}^{K^{t}}$ using FastSAM [47], and each segment is then fed to a visual context encoder [28, 17, 22]. This decomposition and encoding process turns the image into $K^{t}$ vectors $\{\mathbf{z}_{k}^{t}\}_{k=1}^{K^{t}}$ that represent $K^{t}$ different instances. Intuitively, we aim to maintain the motion status of each instance across time. Explicitly aggregating the information of each instance across views is non-trivial, as the dynamic objects lead to inconsistency. We turn to a simple but effective solution that implicitly records the motion status of each instance using a two-layer MLP $g(\mathbf{z};\theta_{C}^{t})$.
The network serves as a classifier that determines the motion status of the corresponding instance feature, where an instance is treated as a moving object if $g(\mathbf{z}^{t}_{k};\theta_{C}^{t})>0.5$.
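The inference step can be sketched in NumPy as below. This is a minimal sketch under assumptions that the text does not specify: a ReLU hidden layer, a sigmoid output, a segment-id image in which every pixel belongs to exactly one segment, and the hypothetical function names `classify_instances` and `static_mask_from_segments`.

```python
import numpy as np

def classify_instances(z, w1, b1, w2, b2):
    """Two-layer MLP returning a dynamic-object score per instance.

    z: (K, F) instance features; w1: (F, M); b1: (M,); w2: (M, 1); b2: (1,).
    """
    hidden = np.maximum(z @ w1 + b1, 0.0)        # assumed ReLU activation
    logits = hidden @ w2 + b2                    # (K, 1)
    return 1.0 / (1.0 + np.exp(-logits[:, 0]))   # assumed sigmoid output

def static_mask_from_segments(segment_ids, scores, threshold=0.5):
    """Turn per-instance scores into a per-pixel static mask.

    segment_ids: (H, W) integer array with values in [0, K).
    """
    is_dynamic = scores > threshold
    # Look up the status of each pixel's segment, then invert it.
    return ~is_dynamic[segment_ids]

# Hand-set weights so that instance 0 scores high and instance 1 scores low.
w1, b1 = np.eye(2), np.zeros(2)
w2, b2 = np.array([[1.0], [-1.0]]), np.zeros(1)
z = np.array([[2.0, 0.0], [0.0, 2.0]])
scores = classify_instances(z, w1, b1, w2, b2)
segment_ids = np.array([[0, 0], [1, 1]])
static_mask = static_mask_from_segments(segment_ids, scores)
```

The resulting mask plays the role of the static pixel set from which tracking and mapping samples are drawn.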

Figure 5: Concatenating the position feature with the semantic embedding allows the classifier to forget only the knowledge regarding the box at its previous position. After the box is placed at a new position, its radiance at the latest position will be memorized after a few time intervals.

The object movement leads to inconsistency between the observation and the stored knowledge. As emitted radiance is view-dependent [20], we examine only the geometric inconsistency between the rendered depth map and the observed one. The moving status $o_{k}^{t}$ of the instance $k$ will be treated as True if a certain portion of pixels within the segment $S^{t}_{k}$ exhibit large discrepancies as:

$$\frac{\sum_{(u,v)\in S_{k}^{t}}\mathbbm{1}\left(\hat{D}(\xi,\theta_{M})[u,v]-D[u,v]>t_{d}\right)}{|S_{k}^{t}|}>t_{p}, \tag{10}$$

where $\hat{D}$ and $D$ denote the rendered and the observed depth maps, and $t_{d}$ and $t_{p}$ control the sensitivity of motion segmentation.
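The check in Eq. 10 can be sketched in a few lines. This is a simplified illustration that assumes depth maps are given as 2D lists indexed as `[u][v]` and a segment as a collection of `(u, v)` pixel coordinates; the function name `moving_status` is hypothetical, the default thresholds follow the values reported in the experimental setup, and the handling of invalid depth readings is omitted.

```python
def moving_status(rendered, observed, segment, t_d=0.3, t_p=0.05):
    """Return True if the segment is flagged as moving according to Eq. 10.

    rendered, observed: 2D lists of depth values, indexed as [u][v].
    segment: iterable of (u, v) pixel coordinates belonging to one instance.
    """
    pixels = list(segment)
    if not pixels:
        return False  # an empty segment carries no evidence (assumption)
    # Count pixels where the rendered depth exceeds the observed depth by t_d,
    # i.e. the observation lies in front of the stored geometry.
    n_inconsistent = sum(
        1 for (u, v) in pixels if rendered[u][v] - observed[u][v] > t_d
    )
    return n_inconsistent / len(pixels) > t_p
```

With the reported thresholds, a segment is flagged as soon as more than 5% of its pixels deviate by more than 0.3 in depth.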

The classifier will be updated if the current moving status of any instance $o_{k}^{t}$ contradicts the classification result $\mathbbm{1}(g(\mathbf{z}^{t}_{k};\theta_{C}^{t-1})>0.5)$. The network is optimized using the binary cross-entropy loss as:

$$-\sum_{k=1}^{K^{t}}\big[o_{k}^{t}\log(g(\mathbf{z}^{t}_{k};\theta_{C}^{*}))+(1-o_{k}^{t})\log(1-g(\mathbf{z}^{t}_{k};\theta_{C}^{*}))\big]. \tag{11}$$

Once the classifier is updated, the motion segmentation mask of all keyframes will be updated. The observations regarding the moving object will no longer contribute to the pose and map optimization. As the neural map is continually trained under constant distribution shift, the knowledge regarding the moving object will be gradually forgotten without manual operations.
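As a rough illustration of the update rule, the sketch below checks whether any geometric label contradicts the current prediction and evaluates the binary cross-entropy of Eq. 11 over the instances of one frame. The function names and the clamping constant `eps` (added only to keep the logarithm finite) are assumptions of this sketch.

```python
import math


def needs_update(statuses, probs, threshold=0.5):
    """True if any geometric label o_k contradicts the classifier decision."""
    return any(o != (p > threshold) for o, p in zip(statuses, probs))


def bce_loss(statuses, probs, eps=1e-7):
    """Binary cross-entropy of Eq. 11, summed over the instances of a frame.

    statuses: booleans o_k from the geometric check.
    probs: classifier outputs g(z_k; theta_C) in (0, 1).
    """
    total = 0.0
    for o, p in zip(statuses, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0) (assumption)
        total -= math.log(p) if o else math.log(1.0 - p)
    return total
```

A training step would then minimise `bce_loss` with respect to the classifier parameters only when `needs_update` returns True.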

Continual learning of the classifier. Catastrophic forgetting is not only an issue that neural SLAM systems need to address; it also has a negative impact on the classifier. If we train the classifier using only the objects that appear in the current frame, catastrophic forgetting could occur, causing the classifier to forget previously learned dynamic objects. Therefore, we also maintain a replay buffer for the classifier. Every instance in the current frame that is marked as dynamic by the geometric inconsistency check is added to this replay buffer. When updating the classifier, training is conducted not only with instances from the current frame but also with $n_{c}$ instances randomly selected from the replay buffer. We typically set $n_{c}=5$ in our experiments.
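The replay strategy can be sketched as a small buffer class. This is only an illustration under several assumptions: the class and method names are hypothetical, the buffer is unbounded, sampling is done without replacement, and replayed instances are labeled as dynamic because only dynamic instances are stored.

```python
import random


class ClassifierReplayBuffer:
    """Stores features of instances flagged as dynamic and replays n_c of them."""

    def __init__(self, n_c=5, seed=None):
        self.n_c = n_c
        self._items = []
        self._rng = random.Random(seed)

    def add_dynamic(self, features, statuses):
        """Keep only the instances marked dynamic by the geometric check."""
        for z, o in zip(features, statuses):
            if o:
                self._items.append(z)

    def training_batch(self, features, statuses):
        """Current-frame instances plus up to n_c randomly replayed ones."""
        k = min(self.n_c, len(self._items))
        replayed = self._rng.sample(self._items, k)
        batch = list(zip(features, statuses))
        batch += [(z, True) for z in replayed]  # replayed items are dynamic
        return batch
```

A typical loop would call `add_dynamic` after each geometric check and `training_batch` whenever the classifier has to be updated.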

Figure 6: The continual learning of the classifier on the fr3_walking_static sequence of the TUM RGB-D dataset. A segment is treated as a dynamic part if a certain portion of the pixels within the segment exhibits high discrepancy.

Bidirectional check. Eq. 10 identifies a moving object only if it is observed in front of the rendered geometry. However, if an object is removed from the scene, or if the object has moved away from the camera since the first frame, the observation may fall behind the rendered one. Therefore, we apply a bidirectional check. Similar to Eq. 10, we decompose the rendered image $\hat{D}(\xi,\theta_{M})$ into $\hat{K}^{t}$ segments $\{\hat{S}_{k}^{t}\}_{k=1}^{\hat{K}^{t}}$ and apply the consistency check as:

$$\frac{\sum_{(u,v)\in\hat{S}_{k}^{t}}\mathbbm{1}\left(D[u,v]-\hat{D}(\xi,\theta_{M})[u,v]>t_{d}\right)}{|\hat{S}_{k}^{t}|}>t_{p}. \tag{12}$$
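The two directions of the check differ only in which depth map is subtracted from which and in which image provides the segments. The sketch below makes that symmetry explicit; the helper names, the list-based depth maps indexed as `[u][v]`, and the representation of segments as collections of `(u, v)` pixels are assumptions for illustration.

```python
def gap_ratio_exceeds(a, b, segment, t_d, t_p):
    """True if the fraction of pixels with a[u][v] - b[u][v] > t_d exceeds t_p."""
    pixels = list(segment)
    if not pixels:
        return False
    hits = sum(1 for (u, v) in pixels if a[u][v] - b[u][v] > t_d)
    return hits / len(pixels) > t_p


def bidirectional_check(rendered, observed, observed_segments,
                        rendered_segments, t_d=0.3, t_p=0.05):
    """Apply Eq. 10 on observed-image segments and Eq. 12 on rendered-image segments."""
    # Eq. 10: rendered depth minus observed depth (object appears in front).
    forward = [gap_ratio_exceeds(rendered, observed, s, t_d, t_p)
               for s in observed_segments]
    # Eq. 12: observed depth minus rendered depth (object removed or moved away).
    backward = [gap_ratio_exceeds(observed, rendered, s, t_d, t_p)
                for s in rendered_segments]
    return forward, backward
```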

Incorporation of prior knowledge. One promising characteristic of the open-set classifier is that we can apply prior knowledge through pre-training to pre-define the instance motion status while allowing adaptation to the current state. Specifically, we can train a classifier $\theta_{C}^{pre}$ on specific categories, e.g., human, car, animal. The object motion status is then identified as "dynamic" if either the pre-trained classifier or the online-learned one predicts it as dynamic.
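The combination rule amounts to a logical OR over the two classifiers. A minimal sketch, assuming both classifiers are available as callables that return a probability (the names `g_pre` and `g_online` are hypothetical):

```python
def is_dynamic(z, g_pre, g_online, threshold=0.5):
    """Flag an instance as dynamic if either classifier exceeds the threshold.

    g_pre: pre-trained classifier (prior knowledge), callable z -> probability.
    g_online: classifier learned online during SLAM, callable z -> probability.
    """
    return g_pre(z) > threshold or g_online(z) > threshold
```

With this rule the prior can only add dynamic labels; it never overrides an object that the online classifier has already flagged.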

Maintaining the instant object state. In the above-mentioned setting, we would label any object that has been moved as "dynamic" even if it is placed in a new area and remains static afterward. This is because the classifier merely establishes the mapping between the instance feature and the corresponding motion state. Thanks to the flexible structure of the classifier, we can concatenate an additional position feature with the semantic embedding as the input. The classifier then records the motion state of an object at a specific position (the object center $\mathbf{p}_{k}^{t}$, for instance) as $g(\mathbf{z}_{k}^{t},\mathbf{p}_{k}^{t};\theta_{C}^{t})$. As we show in the experiments, this simple strategy leads to a neural map that gradually forgets the removed object and memorizes the object at its new position.
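The effect of the concatenation is that the same semantic embedding yields different classifier inputs at different positions, so the two can carry different motion states. A toy sketch, assuming the object center is a 3D point and using made-up values (the function name and the absence of any positional encoding or normalization are assumptions):

```python
def classifier_input(z, p):
    """Concatenate a semantic embedding z with an object-center position p."""
    return list(z) + list(p)


# Hypothetical tiny embedding of a box and two object centers.
z_box = [0.12, -0.40, 0.88]
old = classifier_input(z_box, (1.0, 0.5, 2.0))  # position before the move
new = classifier_input(z_box, (2.5, 0.5, 1.0))  # position after the move
```

Here `old` and `new` share the first three entries but differ in the last three, which lets the classifier label the box as dynamic at the old position and static at the new one.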

5 Experiments

We evaluate the proposed method on multiple sequences of the TUM RGB-D dataset [33] and the Bonn RGB-D dataset [24]. The quality of the estimated poses, the neural map, and the motion status reasoning is analyzed qualitatively and quantitatively. As the first NeRF-SLAM-based method that tackles challenging dynamic environments, we compare the proposed method against the traditional methods ReFusion [25], StaticFusion [32], and DynaSLAM [2], which are designed specifically for dynamic environments. We also compare against our codebase, Co-SLAM [43], to see how the proposed strategies enable robust camera tracking with promising reconstruction quality.

5.1 Experimental Setup

The experiments are conducted on a desktop PC with an Intel i9-12900K CPU, an NVIDIA RTX 3090 GPU, and 64GB memory. The keyframe is automatically stored every 5 frames.

Regarding implementation details, we use $N=128$ sampling points along each ray, $H_{t}=1024$ and $H_{b}=2048$ sampled pixels during tracking and bundle adjustment, respectively, a truncation distance $tr=10\,\mathrm{cm}$, and thresholds $t_{d}=0.3$ and $t_{p}=0.05$ for determining the moving status. For the weights of each loss in Eq. 9, we follow the settings of Co-SLAM: $\lambda_{1}=0.1$, $\lambda_{2}=5000$, $\lambda_{3}=10$, $\lambda_{4}=10^{-8}$.
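For readers who want to reproduce the setup, the reported hyperparameters can be collected in a single configuration object. The class and field names below are hypothetical; only the values are taken from the text (the keyframe interval and the replay size $n_{c}$ are stated in Sec. 5.1 and Sec. 4.3, respectively).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DynamicSlamConfig:
    samples_per_ray: int = 128            # N
    tracking_pixels: int = 1024           # H_t
    ba_pixels: int = 2048                 # H_b
    truncation_m: float = 0.10            # tr = 10 cm, expressed in metres
    depth_threshold: float = 0.3          # t_d
    pixel_ratio_threshold: float = 0.05   # t_p
    # lambda_1 .. lambda_4 of Eq. 9, following Co-SLAM
    loss_weights: tuple = (0.1, 5000.0, 10.0, 1e-8)
    keyframe_interval: int = 5            # one keyframe every 5 frames
    replay_instances: int = 5             # n_c for the classifier replay buffer
```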

Table 1: Comparisons of ATE (RMS) against traditional SLAM algorithms designed for deployment in dynamic environments.

| Dataset | Sequence | ReFusion [25] | StaticFusion [32] | DynaSLAM [2] | Co-SLAM [43] | Ours |
|---|---|---|---|---|---|---|
| Bonn | balloon | 0.175 | 0.233 | 0.050 | 0.308 | 0.206 |
| Bonn | balloon2 | 0.254 | 0.293 | 0.142 | 0.290 | 0.136 |
| Bonn | kidnapping_box | 0.148 | 0.336 | 0.026 | 0.095 | 0.112 |
| Bonn | kidnapping_box2 | 0.161 | 0.263 | 0.033 | 0.118 | 0.104 |
| Bonn | crowd | 0.204 | 3.586 | 1.065 | fail | 0.116 |
| Bonn | crowd2 | 0.155 | 0.215 | 1.217 | fail | 0.200 |
| Bonn | crowd3 | 0.137 | 0.168 | 0.835 | fail | 0.107 |
| Bonn | person_tracking | 0.289 | 0.484 | 0.714 | fail | 0.274 |
| Bonn | synchronous | 0.441 | 0.446 | 0.977 | 0.634 | 0.130 |
| TUM | fr3_walking_static | 0.017 | 0.015 | 0.014 | fail | 0.025 |
| TUM | fr3_walking_xyz | 0.099 | 0.093 | 0.085 | fail | 0.076 |
| TUM | fr3_walking_halfsphere | 0.104 | 0.681 | 0.084 | fail | 0.079 |
| TUM | fr3_sitting_static | 0.009 | 0.014 | 0.009 | 0.011 | 0.007 |
| TUM | fr3_sitting_xyz | 0.040 | 0.039 | 0.009 | 0.020 | 0.018 |
| TUM | fr3_sitting_halfsphere | 0.110 | 0.041 | 0.017 | 0.042 | 0.039 |
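For reference, the ATE (RMS) metric reported in Table 1 is commonly computed as the root mean square of the translational differences between estimated and ground-truth camera positions. The sketch below is a simplified version that assumes the two trajectories are already time-associated and rigidly aligned; the alignment step used by standard evaluation tools is not shown, and the function name is hypothetical.

```python
import math


def ate_rmse(estimated, ground_truth):
    """Root-mean-square translational error between two aligned trajectories.

    estimated, ground_truth: equally long sequences of (x, y, z) positions.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(pe, pg))
        for pe, pg in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```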

Figure 7: The comparison regarding whether the classifier undergoes continual learning. The 2nd and 4th rows illustrate the classifier’s determination results for dynamic instances in frame 190 at different times. The 3rd and 5th rows display the rendering results to demonstrate changes in the map.

5.2 Tracking and Mapping

As illustrated in Fig. 3, the photometric and geometric errors in dynamic areas lead to inherent ambiguities, as they are back-propagated to modify the pose estimates and mapping results. As tracking and mapping are coupled during the optimization, the erroneous tracking and the fusion of dynamic objects into the map would eventually lead to system collapse. The proposed method, on the other hand, achieves robust camera tracking results even in the challenging 'crowd' sequences, on which Co-SLAM fails. As presented in Tab. 1, the proposed method achieves lower ATE than the feature-based DynaSLAM [2] and the dense SLAM systems of ReFusion [25] and StaticFusion [32] on several highly dynamic sequences (e.g., crowd, crowd3, person_tracking, and synchronous). The effects arising from the dynamic objects are well alleviated by the proposed motion status classifier.

The forgetting and memorization. The forgetting mechanism is demonstrated in Fig. 4. Once the classifier identifies the man as a moving object, the corresponding observations no longer act on the map parameters. Meanwhile, the areas occluded by the man become visible once he walks away, and the inconsistency between the rendered image and the observations in other views contributes to the map update. As described in Sec. 4.3, we can also maintain the updated object status through a simple coordinate concatenation. As illustrated in Fig. 5, the box is forgotten after being kidnapped and reappears on the map after being placed at its new position.

Figure 8: The different choices of visual encoder lead to diverse behaviors for predicting object motion status in incoming frames.

5.3 Motion Segmentation

As illustrated in Fig. 6, the continual learning of the classifier leads to dynamic updating of the object motion status. The successive changes in motion status can be well reflected by the geometric inconsistency, and the classifier can quickly identify the motion changes and learn the instant motion status. The encoded feature of masked objects can well distinguish between two people even if they fall into the same "person" category.

Continual learning of the classifier. As mentioned in Sec. 4.3, the classifier could also encounter catastrophic forgetting. Therefore, we implement continual learning for the classifier by maintaining a replay buffer. As demonstrated in Fig. 7, we select a specific frame (frame 190) for observation. The classifier with replay maintains accurate motion segmentation on the earlier frame over time. In contrast, the classifier without continual learning tends to perform poorly on previous training data after updates, segmenting the dynamic instances incorrectly. Specifically, the person in red gradually walks out of the frame (see the second column of Fig. 7) while the classifier undergoes an update. Without replay, trained solely on instances from the current frame, the classifier forgets that the person in red is also dynamic. As a result, the person erroneously appears in the map, which further affects subsequent calculations of geometric inconsistency. This experiment indicates that the continual learning strategy designed for the classifier is both effective and necessary.

The choices of visual encoders. To further understand how the motion segmentation behavior is affected by the encoded features, we look into the motion prediction given a fixed classifier. As illustrated in Fig. 8, after training the classifier with the supervisory signal from the frame $I^{40}$, we freeze the network $\theta_{C}^{40}$ and predict the motion status of the human in the subsequent observations $I^{40},\cdots,I^{280}$, sampled every 30 frames, as $g(\mathbf{z}_{k}^{40:280};\theta_{C}^{40})$. We compare three different visual encoders (BLIP [17], CLIP [28], and DinoV2 [22]) and visualize the predicted motion state and the similarity of the encoded features to those of the frame $I^{40}$. The semantic information of language-aligned visual context presents promising temporal consistency even if the object pose changes drastically. The consistent feature embedding across views makes the implicit classifier instance-aware and easy to adapt. We argue that this prediction capability leads to robust camera tracking under complex scene dynamics.
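The feature-similarity comparison can be sketched as follows. The text does not state which similarity measure is used, so cosine similarity is an assumption here, as are the function names and the dictionary layout that maps a frame index to the encoded feature of the tracked instance.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_to_reference(features_by_frame, reference_frame=40):
    """Similarity of each frame's instance feature to the reference frame."""
    ref = features_by_frame[reference_frame]
    return {t: cosine_similarity(z, ref)
            for t, z in sorted(features_by_frame.items())}
```

For the protocol described above, the frame indices would be `range(40, 281, 30)`, and an encoder with good temporal consistency would keep these similarities high as the person moves.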

Figure 9: The incorporation of prior knowledge leads to more robust camera tracking in highly dynamic environments.

Incorporating pre-defined prior knowledge. As described in Sec. 4.3, the proposed framework allows convenient incorporation of prior knowledge by pre-training a classifier on specific categories. An exemplary case is illustrated in Fig. 9. Here we experiment using a segment after frame 410 from the fr3_walking_static sequence, where individuals occupy a significant portion of the frame from the outset. By pre-defining humans as dynamic objects and pre-training a classifier using a human matting dataset [5], the ambiguity induced by human motion can be alleviated at the very beginning. In contrast, the classifier without prior knowledge has trouble figuring out the actual factor that leads to inconsistency in such a challenging case. As illustrated in the figure, an accurate static map can be reconstructed even though the moving objects occupy nearly half of the frame when the stream begins.

6 Conclusion

In this paper, we address the dynamic SLAM problem with a neural map representation. By learning an instance-aware classifier online that implicitly records the per-object motion status, the invariant information within the observations can be continually distilled to the neural map, where the interference induced by dynamic objects is best alleviated. The iterative optimization of camera pose, map, and classifier parameters forms a robust SLAM framework in challenging dynamic environments.

Acknowledgement

We gratefully acknowledge the anonymous reviewers and AC for their valuable comments and suggestions. This work is supported by NSFC (U22A2061, 62176010) and 230601GP0004.

References

  • [1] Adam, A., Sattler, T., Karantzalos, K., Pajdla, T.: Objects can move: 3d change detection by geometric transformation consistency. In: European Conference on Computer Vision. pp. 108–124. Springer (2022)
  • [2] Bescos, B., Fácil, J.M., Civera, J., Neira, J.: Dynaslam: Tracking, mapping, and inpainting in dynamic scenes. IEEE Robotics and Automation Letters 3(4), 4076–4083 (2018)
  • [3] Cadena, C., Carlone, L., Carrillo, H., Latif, Y., Scaramuzza, D., Neira, J., Reid, I., Leonard, J.J.: Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Trans. Robotics 32(6), 1309–1332 (2016)
  • [4] Cai, Z., Müller, M.: Clnerf: Continual learning meets nerf. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23185–23194 (2023)
  • [5] Chen, Q., Ge, T., Xu, Y., Zhang, Z., Yang, X., Gai, K.: Semantic human matting. In: Proceedings of the 26th ACM international conference on Multimedia (2018)
  • [6] Cheng, J., Sun, Y., Meng, M.Q.H.: Improving monocular visual slam in dynamic environments: An optical-flow-based approach. Advanced Robotics 33(12), 576–589 (2019)
  • [7] Chung, C.M., Tseng, Y.C., Hsu, Y.C., Shi, X.Q., Hua, Y.H., Yeh, J.F., Chen, W.C., Chen, Y.T., Hsu, W.H.: Orbeez-slam: A real-time monocular visual slam with orb features and nerf-realized mapping. arXiv preprint arXiv:2209.13274 (2022)
  • [8] Durrant-Whyte, H., Bailey, T.: Simultaneous localization and mapping: part i. IEEE Robotics & Automation Magazine 13(2), 99–110 (2006)
  • [9] Fang, J., Yi, T., Wang, X., Xie, L., Zhang, X., Liu, W., Nießner, M., Tian, Q.: Fast dynamic radiance fields with time-aware neural voxels. In: SIGGRAPH Asia 2022 Conference Papers. pp. 1–9 (2022)
  • [10] Gao, C., Saraf, A., Kopf, J., Huang, J.B.: Dynamic view synthesis from dynamic monocular video. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2021)
  • [11] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision. pp. 2961–2969 (2017)
  • [12] Hu, X., Zhang, Y., Cao, Z., Ma, R., Wu, Y., Deng, Z., Sun, W.: Cfp-slam: A real-time visual slam based on coarse-to-fine probability in dynamic environments. In: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 4399–4406. IEEE (2022)
  • [13] Johari, M.M., Carta, C., Fleuret, F.: Eslam: Efficient dense slam system based on hybrid representation of signed distance fields. arXiv preprint arXiv:2211.11704 (2022)
  • [14] Kaneko, M., Iwami, K., Ogawa, T., Yamasaki, T., Aizawa, K.: Mask-slam: Robust feature-based monocular slam by masking using semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. pp. 258–266 (2018)
  • [15] Karaoglu, M.A., Schieber, H., Schischka, N., Görgülü, M., Grötzner, F., Ladikos, A., Roth, D., Navab, N., Busam, B.: Dynamon: Motion-aware fast and robust camera localization for dynamic nerf. arXiv preprint arXiv:2309.08927 (2023)
  • [16] Kruzhkov, E., Savinykh, A., Karpyshev, P., Kurenkov, M., Yudin, E., Potapov, A., Tsetserukou, D.: Meslam: Memory efficient slam based on neural fields. In: 2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC) (2022)
  • [17] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)
  • [18] Li, Z., Niklaus, S., Snavely, N., Wang, O.: Neural scene flow fields for space-time view synthesis of dynamic scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6498–6508 (2021)
  • [19] Martin-Brualla, R., Radwan, N., Sajjadi, M.S., Barron, J.T., Dosovitskiy, A., Duckworth, D.: Nerf in the wild: Neural radiance fields for unconstrained photo collections. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021)
  • [20] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. In: Proceedings of the European Conference on Computer Vision (2020)
  • [21] Ming, Y., Ye, W., Calway, A.: idf-slam: End-to-end rgb-d slam with neural implicit mapping and deep feature tracking. arXiv preprint arXiv:2209.07919 (2022)
  • [22] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)
  • [23] Ortiz, J., Clegg, A., Dong, J., Sucar, E., Novotny, D., Zollhoefer, M., Mukadam, M.: isdf: Real-time neural signed distance fields for robot perception. arXiv preprint arXiv:2204.02296 (2022)
  • [24] Palazzolo, E., Behley, J., Lottes, P., Giguère, P., Stachniss, C.: ReFusion: 3D Reconstruction in Dynamic Environments for RGB-D Cameras Exploiting Residuals. In: IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS) (2019)
  • [25] Palazzolo, E., Behley, J., Lottes, P., Giguere, P., Stachniss, C.: Refusion: 3d reconstruction in dynamic environments for rgb-d cameras exploiting residuals. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 7855–7862. IEEE (2019)
  • [26] Palazzolo, E., Stachniss, C.: Change detection in 3d models based on camera images. In: 9th Workshop on Planning, Perception and Navigation for Intelligent Vehicles at the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS) (2017)
  • [27] Palazzolo, E., Stachniss, C.: Fast image-based geometric change detection given a 3d model. In: 2018 IEEE International Conference on Robotics and Automation (ICRA). pp. 6308–6315. IEEE (2018)
  • [28] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
  • [29] Sandström, E., Li, Y., Van Gool, L., Oswald, M.R.: Point-slam: Dense neural point cloud-based slam. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 18433–18444 (2023)
  • [30] Saputra, M.R.U., Markham, A., Trigoni, N.: Visual slam and structure from motion in dynamic environments: A survey. ACM Computing Surveys (CSUR) 51(2), 1–36 (2018)
  • [31] Schorghuber, M., Steininger, D., Cabon, Y., Humenberger, M., Gelautz, M.: Slamantic-leveraging semantics to improve vslam in dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. pp. 0–0 (2019)
  • [32] Scona, R., Jaimez, M., Petillot, Y.R., Fallon, M., Cremers, D.: Staticfusion: Background reconstruction for dense rgb-d slam in dynamic environments. In: 2018 IEEE international conference on robotics and automation (ICRA). pp. 3849–3856. IEEE (2018)
  • [33] Sturm, J., Engelhard, N., Endres, F., Burgard, W., Cremers, D.: A benchmark for the evaluation of rgb-d slam systems. In: Proc. of the International Conference on Intelligent Robot Systems (IROS) (2012)
  • [34] Sucar, E., Liu, S., Ortiz, J., Davison, A.J.: imap: Implicit mapping and positioning in real-time. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2021)
  • [35] Suzuki, T.: Federated learning for large-scale scene modeling with neural radiance fields. arXiv preprint arXiv:2309.06030 (2023)
  • [36] Tancik, M., Casser, V., Yan, X., Pradhan, S., Mildenhall, B., Srinivasan, P.P., Barron, J.T., Kretzschmar, H.: Block-nerf: Scalable large scene neural view synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8248–8258 (2022)
  • [37] Taneja, A., Ballan, L., Pollefeys, M.: Image based detection of geometric changes in urban environments. In: 2011 international conference on computer vision. pp. 2336–2343. IEEE (2011)
  • [38] Taneja, A., Ballan, L., Pollefeys, M.: City-scale change detection in cadastral 3d models using images. In: Proceedings of the IEEE Conference on computer Vision and Pattern Recognition. pp. 113–120 (2013)
  • [39] Teed, Z., Deng, J.: Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in neural information processing systems 34, 16558–16569 (2021)
  • [40] Tretschk, E., Tewari, A., Golyanik, V., Zollhöfer, M., Lassner, C., Theobalt, C.: Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12959–12970 (2021)
  • [41] Turki, H., Ramanan, D., Satyanarayanan, M.: Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12922–12931 (2022)
  • [42] Ulusoy, A.O., Mundy, J.L.: Image-based 4-d reconstruction using 3-d change detection. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part III 13. pp. 31–45. Springer (2014)
  • [43] Wang, H., Wang, J., Agapito, L.: Co-slam: Joint coordinate and sparse parametric encodings for neural real-time slam. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13293–13302 (2023)
  • [44] Wang, Y., Xu, K., Tian, Y., Ding, X.: Drg-slam: A semantic rgb-d slam using geometric features for indoor dynamic scene. In: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 1352–1359. IEEE (2022)
  • [45] Yan, Z., Tian, Y., Shi, X., Guo, P., Wang, P., Zha, H.: Continual neural mapping: Learning an implicit scene representation from sequential observations. In: IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 15782–15792 (2021)
  • [46] Yang, X., Li, H., Zhai, H., Ming, Y., Liu, Y., Zhang, G.: Vox-fusion: Dense tracking and mapping with voxel-based neural implicit representation. In: 2022 IEEE International Symposium on Mixed and Augmented Reality (ISMAR) (2022)
  • [47] Zhao, X., Ding, W., An, Y., Du, Y., Yu, T., Li, M., Tang, M., Wang, J.: Fast segment anything. arXiv preprint arXiv:2306.12156 (2023)
  • [48] Zhenxing, M., Xu, D.: Switch-nerf: Learning scene decomposition with mixture of experts for large-scale neural radiance fields. In: The Eleventh International Conference on Learning Representations (2022)
  • [49] Zhu, Z., Peng, S., Larsson, V., Xu, W., Bao, H., Cui, Z., Oswald, M.R., Pollefeys, M.: Nice-slam: Neural implicit scalable encoding for slam. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022)

Learn to Memorize and to Forget: A Continual Learning Perspective of Dynamic SLAM Supplementary Material

Baicheng Li Zike Yan Dong Wu Hanqing Jiang Hongbin Zha

7 Supplementary Results

7.1 Segmentation Pruning

Segmentation with FastSAM [47] leads to varying granularities. A person may be separated into different parts such as arms, legs, body, and head. In such a case, we wish to retain the instance-level segmentation instead of the part-level decomposition, thereby reducing the number of classifier updates.

Specifically, as illustrated in Fig. 10, for any two segments $R_1$ and $R_2$, we consider $R_1$ to be a part of $R_2$ and delete it if the overlapping area between $R_1$ and $R_2$, taken as a fraction of the area of $R_1$, is larger than a threshold $T_R$:

$$S_{R_1 \cap R_2} > T_R \times S_{R_1} \tag{13}$$

where $S$ denotes the area of a region and $T_R$ is set to 0.9 in our experiments.
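The pruning rule of Eq. (13) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `prune_part_masks`, the boolean-mask input format, and the tie-break for near-duplicate masks (only the smaller mask of a pair is dropped) are assumptions introduced here.

```python
import numpy as np

def prune_part_masks(masks, t_r=0.9):
    """Return the indices of masks kept after removing part-level segments.

    masks: list of boolean arrays with identical shape (hypothetical format).
    A mask R1 is treated as a part of R2 when area(R1 & R2) > t_r * area(R1).
    """
    areas = [int(m.sum()) for m in masks]
    keep = []
    for i, m1 in enumerate(masks):
        is_part = False
        for j, m2 in enumerate(masks):
            if i == j:
                continue
            overlap = int(np.logical_and(m1, m2).sum())
            if overlap > t_r * areas[i]:
                # Assumed tie-break (not in the paper): when two masks nearly
                # coincide, drop only the smaller one, or the later one if equal.
                if areas[i] < areas[j] or (areas[i] == areas[j] and i > j):
                    is_part = True
                    break
        if not is_part:
            keep.append(i)
    return keep

# Toy example: an "arm" mask lies fully inside a "body" mask.
body = np.zeros((10, 10), dtype=bool)
body[:, 2:8] = True
arm = np.zeros((10, 10), dtype=bool)
arm[2:5, 2:4] = True
chair = np.zeros((10, 10), dtype=bool)
chair[0:4, 8:10] = True
kept = prune_part_masks([body, arm, chair])
```

In this toy case the arm mask is removed because all of its pixels fall inside the body mask, while the body and the disjoint chair mask are kept.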

Figure 10: The mask selection strategy removes a significant number of unnecessary masks, thereby reducing interference with the system.

7.2 Mesh Evaluation

We follow the dense dynamic SLAM system of ReFusion [25] to evaluate the mesh results quantitatively. As shown in Fig. 11 and Fig. 12, our method achieves results comparable to ReFusion [25] and outperforms StaticFusion [32]. Although NeRF-based methods mainly focus on realistic rendering (as presented in our main paper and supp. video) rather than surface reconstruction, the quantitative results support the quality of our map in dynamic environments.

Figure 11: Additional comparison results of mapping and tracking.
Figure 12: Additional comparison results of mapping and tracking.

7.3 Ablation Study on the Replay Training of the Classifier

We compare the number of classifier updates with and without replay training. The experiment is conducted with different image encoders on two highly dynamic sequences: 'person_tracking' and 'balloon'. The results are shown in Fig. 13. Regardless of the image encoder used, the classifier trained with replay requires fewer updates than the one trained without replay. A detailed analysis can be found in the "continual learning of the classifier" parts of Sec. 4.3 and Sec. 5.3.

Figure 13: Number of classifier updates with and without replay training.

7.4 Run-time

We also measured the average run-time of each system component on the two sequences mentioned above (Table 2). In highly dynamic environments, our method balances speed and accuracy and runs at around 1 fps. In contrast, the frame rates of Co-SLAM [43], iMap [34], and NICE-SLAM [49] are approximately 3 fps, 2 fps, and less than 0.1 fps, respectively.

Table 2: Average run-time per component on the Bonn dataset.
| Instance feature extraction | Tracking | Bundle adjustment | Classifier updating |
| --- | --- | --- | --- |
| 231 ms | 149 ms | 161 ms | 397 ms |
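As a quick consistency check, the per-component times in Table 2 can be summed to estimate the frame rate. The snippet below assumes that every component runs once per frame and that the components execute sequentially; the paper does not state this, so the result is only an approximation.

```python
# Per-component averages copied from Table 2 (milliseconds).
runtime_ms = {
    "instance_feature_extraction": 231,
    "tracking": 149,
    "bundle_adjustment": 161,
    "classifier_updating": 397,
}

# Assumption: each component runs once per frame, one after another.
total_ms = sum(runtime_ms.values())
fps = 1000.0 / total_ms
print(f"total = {total_ms} ms per frame, about {fps:.2f} fps")
```

Under this assumption the total is 938 ms per frame, which agrees with the reported frame rate of around 1 fps.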

7.5 Visualization of Tracking Results

In Table 1 of the main paper, we present the quantitative results of camera tracking, showing that our method achieves higher accuracy compared to other dynamic SLAM approaches. Here, we also provide a qualitative demonstration of these results. Fig. 14 shows a comparison of our trajectories with the ground truth on three sequences.

Figure 14: Trajectory plots of three sequences in the Bonn RGB-D dataset.
Table 3: ATE (RMS) with different visual encoders.
| Sequence | BLIP-2 | CLIP | DINOv2 |
| --- | --- | --- | --- |
| balloon | 0.206 | 0.211 | 0.228 |
| synchronous | 0.130 | 0.139 | 0.146 |
| person_tracking | 0.274 | 0.271 | 0.278 |

Table 4: ATE (RMS) with and without prior knowledge.
| Sequence | w/o prior | with prior |
| --- | --- | --- |
| balloon | 0.206 | 0.212 |
| synchronous | 0.130 | 0.134 |
| person_tracking | 0.274 | 0.259 |

7.6 Ablation Study on the Visual Encoders and Prior Knowledge

We tested the impact of different experimental settings on the accuracy of camera tracking. Table 3 and Table 4 present the ablation study results on the Bonn RGB-D dataset. As shown in Table 3, using different visual encoders causes slight variations in the tracking results, with BLIP-2 [17] performing best on two of the three sequences and DINOv2 [22] performing worst on all three. Table 4 shows that the presence of prior knowledge has minimal impact on tracking accuracy. In most cases, our method achieves precise camera tracking without prior knowledge. Only in extremely challenging scenarios (such as Fig. 9 in the main paper) do we need to incorporate prior knowledge.
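For readers unfamiliar with the metric, the ATE (RMS) values in Tables 3 and 4 are commonly obtained by rigidly aligning the estimated trajectory to the ground truth and taking the root-mean-square of the remaining translational errors. The sketch below follows that common definition with a Kabsch-style alignment; it assumes the two trajectories are already time-associated, the function name `ate_rmse` is hypothetical, and it is not the evaluation script used in the paper.

```python
import numpy as np

def ate_rmse(est, gt):
    """RMS absolute trajectory error after rigid (rotation + translation) alignment.

    est, gt: (N, 3) arrays of time-associated camera positions.
    """
    est = np.asarray(est, dtype=float)
    gt = np.asarray(gt, dtype=float)
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    # Cross-covariance between the centered trajectories.
    H = (est - mu_e).T @ (gt - mu_g)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_g - R @ mu_e
    aligned = est @ R.T + t
    err = np.linalg.norm(aligned - gt, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))

# Toy check: a rigidly transformed copy of the ground truth has near-zero error.
gt_xyz = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0],
                   [0.0, 1.0, 0.5], [0.5, 0.5, 1.0]])
a = np.deg2rad(30.0)
Rz = np.array([[np.cos(a), -np.sin(a), 0.0],
               [np.sin(a), np.cos(a), 0.0],
               [0.0, 0.0, 1.0]])
est_xyz = gt_xyz @ Rz.T + np.array([1.0, 2.0, 3.0])
rigid_error = ate_rmse(est_xyz, gt_xyz)
```

Because the alignment removes any global rotation and translation, only the residual drift of the estimated trajectory contributes to the reported value.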

8 The Forgetting Issue of Implicit Neural Representations

The global representation of iMap [34] suffers severe catastrophic forgetting if keyframe-based replay is not deployed, because the data distribution shifts constantly during sequential capture. Subsequent NeRF-based SLAM systems such as Co-SLAM [43] and Point-SLAM [29] introduce local neural representations: they store local scene features on grids or points, which alleviates this negative impact to some extent. However, they still employ a global decoder to interpret the local features, and this decoder forgets catastrophically if keyframe replay is not performed, which affects the operation of the entire system. The keyframe buffer therefore lays the foundation for neural SLAM systems by allowing all past knowledge to be distilled jointly. We control the replayed buffer through a continually learned classifier so that the effects of dynamic objects in past observations can be instantly eliminated. As a result, these dynamic objects are forgotten while an invariant map is maintained and updated. The methodology applies not only to global neural representations but also to representations with discretely stored local features.
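The idea of controlling the replayed buffer can be illustrated with a small sampling routine that draws replay pixels only from segments not flagged as dynamic. This is a minimal sketch under stated assumptions: the keyframe layout (a dict with a `segment_ids` image), the callable `is_dynamic` standing in for the continually learned classifier, and the function name `sample_replay_pixels` are all hypothetical and not taken from the paper.

```python
import numpy as np

def sample_replay_pixels(keyframes, is_dynamic, n_samples, rng):
    """Sample (frame, row, col) indices for replay, skipping dynamic segments.

    keyframes: list of dicts with 'segment_ids', an (H, W) integer array
               (hypothetical buffer layout).
    is_dynamic: callable (frame_idx, segment_id) -> bool, a stand-in for
                the classifier's current prediction.
    """
    candidates = []
    for k, kf in enumerate(keyframes):
        seg = kf["segment_ids"]
        dynamic_ids = [int(s) for s in np.unique(seg) if is_dynamic(k, int(s))]
        # Pixels of segments currently judged dynamic are excluded from replay.
        static_mask = ~np.isin(seg, dynamic_ids)
        ys, xs = np.nonzero(static_mask)
        candidates.append(np.stack([np.full_like(ys, k), ys, xs], axis=1))
    candidates = np.concatenate(candidates, axis=0)
    n = min(n_samples, len(candidates))
    idx = rng.choice(len(candidates), size=n, replace=False)
    return candidates[idx]

# Toy buffer: segment 1 in frame 0 plays the role of a moving person.
seg0 = np.zeros((4, 4), dtype=int)
seg0[1:3, 1:3] = 1
seg1 = np.zeros((4, 4), dtype=int)
buffer = [{"segment_ids": seg0}, {"segment_ids": seg1}]
rng = np.random.default_rng(0)
picked = sample_replay_pixels(buffer, lambda k, s: s == 1, 20, rng)
```

Since the mask is recomputed from the classifier's current output each time, a segment that is re-labeled as dynamic stops contributing to the replay loss for all stored keyframes, which is the sense in which the map can forget it.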