† Corresponding authors.

Learn to Memorize and to Forget: A Continual Learning Perspective of Dynamic SLAM

Baicheng Li¹, Zike Yan²†, Dong Wu¹, Hanqing Jiang³, Hongbin Zha¹†

¹ National Key Lab of GAI, School of IST, PKU-SenseTime Machine Vision Joint Lab, Peking University
² AIR, Tsinghua University
³ SenseTime Research

Abstract

Simultaneous localization and mapping (SLAM) with implicit neural representations has received extensive attention due to the expressive representation power and the innovative paradigm of continual learning. However, deploying such a system within a dynamic environment has not been well studied. Such challenges are intractable even for conventional algorithms, since observations from different views with dynamic objects involved break the geometric and photometric consistency, whereas this consistency lays the foundation for jointly optimizing the camera pose and the map parameters. In this paper, we exploit the characteristics of continual learning and propose a novel SLAM framework for dynamic environments. While past efforts have been made to avoid catastrophic forgetting by exploiting an experience replay strategy, we view forgetting as a desirable characteristic. By adaptively controlling the replayed buffer, the ambiguity caused by moving objects can be alleviated through forgetting. We restrain the replay of the dynamic objects by introducing a continually-learned classifier for dynamic object identification. The iterative optimization of the neural map and the classifier notably improves the robustness of the SLAM system under a dynamic environment. Experiments on challenging datasets verify the effectiveness of the proposed framework.

1 Introduction

Figure 1: We introduce a continual learning based SLAM framework under challenging dynamic environments (top row). The proposed method jointly learns a classifier to alleviate the effects induced by the moving objects (middle row), and a neural map to memorize past observations as a neural radiance field (bottom row). The iterative optimization of pose, map, and classifier parameters forms a robust SLAM system that learns to memorize and to forget adaptively in the changing open world.

Simultaneous localization and mapping (SLAM) describes the instant agent state and the environment where the agent operates. By constructing a consistent map on the fly given sequential observations, the agent gradually gains knowledge of the environment that can be utilized for downstream vision and robotics applications. Consistency in both the temporal and spatial domains has been the foundation of the problem since the genesis of SLAM [8], where the view-invariant photometric and geometric cues within the map are leveraged to predict and validate future measurements [3]. Nevertheless, this consistency cannot be guaranteed in dynamic settings, as various environmental changes may occur due to object movement. Alleviating the effects induced by the inconsistency between observations and the map is of great importance for robust long-term deployment.

Intuitively, a robust SLAM system can be achieved if frame-to-model alignment is conducted based on purely static/invariant features [30]. Such a system requires accurate identification of environmental changes in both observations and the map. Conventional methods resort to removing the dynamic objects in observations through motion segmentation, and update the discretized map by heuristically deleting the corresponding areas. Recent advances [45, 34] show that an implicit neural representation can also be updated in a purely static environment through test-time optimization to serve as the map of a dense SLAM system. Most methods adopt an experience-replay based continual learning paradigm to avoid catastrophic forgetting of past observations; we argue that forgetting is also a useful property for updating the neural map given environmental changes. Only invariant features should be retained from the observations, whereas the changing part will be naturally forgotten under constant distribution shifts.

In this work, we introduce a dense neural SLAM framework to tackle challenging dynamic scenarios. The key idea is the continual learning of two modules, a neural map $f(\mathbf{x};\theta^{t}_{M})$ that distills past observations into a continuous neural radiance field, and a binary classifier $g(\mathbf{z};\theta^{t}_{C})$ that records the motion status (static/dynamic) of each instance given the encoded feature $\mathbf{z}$ . The continual learning fashion guarantees online adaption to the instant state of scene geometry, appearance, and object motion status. Both modules accumulate knowledge from sequential observations and automatically decide what to memorize and what to forget. The inconsistent areas between the observation and the map will be identified and not contribute to the pose estimation and map updating. Such iterative optimization of poses, map parameters, and object motion status leads to a robust framework for dynamic SLAM under changing environments. To summarize, our main contributions include:

• We present for the first time the deployment of a dense neural SLAM framework under challenging dynamic environments. The proposed method leads to reliable motion segmentation, robust camera tracking, and convenient map updating under diverse environmental changes.
• We propose a continual learning method for updating a classifier that records the motion status of objects within the environment. The instance-aware classifier applies to open-world scenarios and shows positive forward and backward transfer. The module can also integrate prior knowledge regarding potentially movable instances through pre-training.
• We show that the forgetting mechanism of continual learning can be exploited to update the neural scene representation under changing environmental conditions.

2 Related Work

Visual SLAM in dynamic environments. This line of work removes dynamic objects and reconstructs the static environment. A number of methods [6, 32, 25] utilize warping or reprojection to identify the inconsistency in motion, appearance, or geometry. Although motion segmentation can be generalized to different environments, motion ambiguity leads to typical failures when a moving object occupies a large portion of the image. Another approach [14] relies on semantic segmentation [11] to pre-define dynamic categories. However, this solution mainly relies on common sense and cannot adapt to the real open world. In addition, some approaches combine motion detection with semantic cues. DynaSLAM [2] and DRG-SLAM [44] filter out features that fall into a pre-defined category or violate geometric constraints. They might over-segment dynamic areas and leave insufficient information for localization. In contrast, SLAMANTIC [31] and CFP-SLAM [12] check the observations in the pre-defined category through projection and only remove the features that present inconsistency. Nonetheless, the temporal consistency of motion and its applicability to the open world are commonly ignored. The relevant methods struggle to trade off between over-segmentation and under-segmentation. Additionally, there exists a category of work dedicated to change detection [1, 26, 27, 42, 37, 38], which can identify areas of change in the environment offline, given known camera poses and a complete point cloud. However, in the context of SLAM, both the camera poses and the complete environmental model are unknown, and the process needs to be conducted online. We fulfill this requirement through a continually-learned classifier.

Dense SLAM with implicit neural representations. The compact and continuous representation power of implicit neural representations [20] draws public attention in the SLAM community. The seminal works of [34] and [45] show that neural representation can be updated through test-time optimization. The follow-ups try different neural representations for better efficiency or accuracy [49, 13, 16, 46, 7, 21, 23, 29]. Recently, Co-SLAM [43] leverages both the high-frequency preserving characteristics of coordinate-based representation and the fast convergence of optimizable feature grids for an accurate and efficient dense SLAM system. Though the above-mentioned methods make great progress in reconstructing static scenes, they are prone to failure when deployed in dynamic environments. We take a step further to address the problem by continually learning what to memorize and what to forget.

NeRF construction in changing environments. Although there is no prior work specifically targeting NeRF-based SLAM deployment in dynamic environments, some research has been conducted on training a neural radiance field under changing environments. NeRF in the wild [19] pioneers the use of appearance embedding to handle environmental changes. The solution is frequently adopted in follow-up works [41, 36, 48, 35]. There are also works that target the reconstruction of the spatial-temporal 4D field of the environment [18, 40, 10, 9], where per-point radiance and motion are integrated into the neural representation. Recently, CLNeRF [4] leverages continual learning to progressively update the scene representation to adapt to changing environments. Similarly, DynaMoN [15] applies motion segmentation to DROID-SLAM [39] to achieve accurate camera pose estimation in dynamic environments. The estimated poses are then utilized for offline optimization of a 4D NeRF. In contrast, our work directly builds the neural radiance field on the fly in dynamic environments and targets a much more challenging SLAM problem. The invariant information within the environment is stored adaptively under different circumstances.

3 Preliminaries

The central idea of this work is a general framework to mitigate the discrepancy between observations and the stored map under changing environments. The framework is expected to distinguish, among past observations, the invariant features to memorize from the changing areas to forget. In this manner, only reliable features that remain multi-view consistent over a long period will be preserved, and only errors outside the dynamic areas will be back-propagated to update the pose and map parameters. In practice, we resort to continual learning, where a neural map and a motion status classifier are trained on the fly to distill knowledge from sequential observations into compact networks. The map will serve as a global memory of the scene radiance, whereas the classifier will serve as a dynamic object detector. The iterative optimization of both networks defines the memorization-forgetting loop for a robust neural SLAM system.

Learn to memorize. Following the recent progress of neural SLAM, we adopt experience-replay based continual learning for test-time map optimization, where a set of keyframes is stored explicitly so that their errors can be back-propagated to optimize the network parameters. The keyframes can be viewed as compressed knowledge of past experience. The map can then memorize this knowledge through gradient-based optimization given the constantly replayed keyframe buffer.

Learn to forget. As neural networks exhibit high plasticity, past knowledge can be easily forgotten under constant distribution shift [45]. We expect the framework to adaptively forget the scene dynamics while preserving the invariant information. Since the past knowledge is controlled by the replayed keyframes, we utilize a classifier to identify the areas within stored keyframes that have changed. Forgetting in the neural map occurs naturally if the dynamic areas on each keyframe are excluded from replay. The classifier should be instance-wise and allow efficient adaptation once environmental changes occur. The updated motion status can then be passed to the stored keyframes to enforce forgetting.

4 Method

Fig. 2 shows the overview of our dynamic SLAM framework. In practice, streaming RGB-D images $\{I^{t},D^{t}\}_{t=1}^{N}$ with known camera intrinsics $K$ are taken as inputs, and a neural radiance field $f(\mathbf{x};\theta^{t}_{M})$ is updated continually to memorize the static part of the environment. The key to our robust SLAM framework in dynamic environments is a continually learned binary classifier $g(\mathbf{z};\theta^{t}_{C})$ . The past knowledge to be forgotten will be determined through the classifier by identifying inconsistency induced by object movement. Note that the optimization of camera pose $\xi^{t}$ , neural map $\theta_{M}^{t}$ , and motion status classifier $\theta_{C}^{t}$ all rely on the discrepancy between rendered and observed RGB-D images. As illustrated in Fig. 3, the tight coupling of these three variables makes the optimization inherently ambiguous: the divergence of any one variable leads to a high discrepancy. We argue that this ambiguity can be alleviated with the neural representations of the map and the classifier, where the continuous representations exhibit promising generalization ability and enforce temporally consistent predictions. In this section, we begin by introducing the photometric and geometric constraints through volume rendering, followed by how these constraints propagate gradients for optimizing the camera pose, map, and classifier parameters iteratively.

4.1 Formulation

The objective is to guarantee a photorealistic and multi-view consistent map representation by minimizing the photometric and geometric errors in static areas between the rendered images and observed images as:

$$L_{pho}=\frac{1}{|H|}\sum_{(u,v)\in H}\left|I\left[u,v\right]-\hat{I}(\xi,\theta_{M})\left[u,v\right]\right|, \tag{1}$$

$$L_{geo}=\frac{1}{|H|}\sum_{(u,v)\in H}\left|D\left[u,v\right]-\hat{D}(\xi,\theta_{M})\left[u,v\right]\right|, \tag{2}$$

where $H\subset\mathbbm{1}(R^{t})$ is the set of static samples given motion segmentation mask $R^{t}$ indicated by the classifier. $\hat{I}(\xi^{t},\theta^{t}_{M})$ and $\hat{D}(\xi^{t},\theta^{t}_{M})$ are predicted color and depth images through volume rendering.
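As a minimal sketch of Eqs. 1 and 2, the masked error can be written as follows. It assumes single-channel images stored as nested lists and a boolean mask that is True on dynamic pixels; the helper names `static_pixel_set` and `masked_l1` are illustrative and not taken from the authors' implementation.

```python
def static_pixel_set(dynamic_mask):
    """Collect (u, v) for every pixel whose dynamic flag is False."""
    return [(u, v)
            for v, row in enumerate(dynamic_mask)
            for u, is_dynamic in enumerate(row)
            if not is_dynamic]


def masked_l1(observed, rendered, static_pixels):
    """Mean absolute error over the static pixel set H (Eq. 1 or Eq. 2).

    observed, rendered: 2D lists indexed as image[v][u], one scalar per pixel.
    """
    if not static_pixels:
        return 0.0  # assumption: an empty static set contributes no loss
    total = sum(abs(observed[v][u] - rendered[v][u]) for u, v in static_pixels)
    return total / len(static_pixels)


depth_obs = [[1.0, 1.0], [2.0, 0.5]]    # the 0.5 pixel belongs to a moving object
depth_ren = [[1.1, 1.0], [2.0, 2.0]]
mask = [[False, False], [False, True]]  # True = dynamic
H = static_pixel_set(mask)
print(masked_l1(depth_obs, depth_ren, H))
```

Excluding the dynamic pixel keeps its large residual out of the loss, so it cannot pull the pose or the map toward the moving object.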

In this paper, we follow Co-SLAM [43] to model the scene as a truncated signed distance field $s$ with color values $\mathbf{c}$ as $f(\mathbf{x};\theta)\rightarrow(\mathbf{c},s)$ . Nevertheless, the proposed method can also adopt the conventional NeRF representation with density-based volume rendering as [20]. In this work, the learnable map parameters $\theta=\{\alpha,\phi,\tau\}$ denote the encoder $\mathcal{V}(\mathbf{x};\alpha)$ for grid features and the decoders for color and geometry. The volume rendering for color and depth prediction is a weighted sum of all samples $\{\mathbf{p}_{i}\}_{i=1}^{N}$ along a ray determined by the pixel coordinate and the camera origin as:

$$\hat{I}(\xi,\theta_{M})[u,v]=\frac{1}{\sum_{i=1}^{N}w_{i}}\sum_{i=1}^{N}w_{i}\mathbf{c}_{i}, \tag{3}$$

$$\hat{D}(\xi,\theta_{M})[u,v]=\frac{1}{\sum_{i=1}^{N}w_{i}}\sum_{i=1}^{N}w_{i}d_{i}, \tag{4}$$

where $w_{i}$ is the computed weight; $d_{i}$ is the distance between the sampled point $\mathbf{p}_{i}$ and the corresponding camera origin. It should be noted that in Co-SLAM [43], the weight $w$ is computed by multiplying two Sigmoid functions as:

$$w=\sigma\left(\frac{s}{\lambda_{tr}}\right)\sigma\left(-\frac{s}{\lambda_{tr}}\right), \tag{5}$$

where $\lambda_{tr}$ is the truncation distance.
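The rendering of Eqs. 3 to 5 can be illustrated for a single ray with the short sketch below. The toy SDF values (positive in front of a surface at depth 1.0), the constant gray color, and the function names are assumptions made for the example.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def render_ray(sdf, depths, colors, trunc):
    """Render one ray: Eq. 5 weights, then the normalized sums of Eqs. 3-4."""
    weights = [sigmoid(s / trunc) * sigmoid(-s / trunc) for s in sdf]
    norm = sum(weights)
    depth = sum(w * d for w, d in zip(weights, depths)) / norm
    color = [sum(w * c[k] for w, c in zip(weights, colors)) / norm
             for k in range(3)]
    return color, depth, weights


# Toy ray crossing a surface at depth 1.0; the SDF is positive in front of it.
depths = [0.8, 0.9, 1.0, 1.1, 1.2]
sdf = [1.0 - d for d in depths]
colors = [[0.5, 0.5, 0.5]] * len(depths)
color, depth, weights = render_ray(sdf, depths, colors, trunc=0.1)
print(round(depth, 6), weights.index(max(weights)))
```

The product of the two sigmoids peaks where the SDF crosses zero, so the sample on the surface receives the largest weight and the rendered depth lands at the surface.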

4.2 Motion-aware Tracking and Mapping

Besides the photometric and geometric constraints in Eqs. 1 and 2, we also adopt from Co-SLAM [43] the SDF loss $L_{sdf}$ for near-surface samples, the free-space loss $L_{free}$ , and the feature smoothness loss $L_{smooth}$ as further constraints.

The SDF loss is employed to enhance the quality of map reconstruction, where the depth values of pixels are utilized as SDF approximations. For points within the truncated region $(|D[u,v]-\hat{D}[u,v]|\leq tr)$ :

$$L_{sdf}=\frac{1}{|R_{H}|}\sum_{r\in R_{H}}\frac{1}{|S_{r}^{tr}|}\sum_{p\in S_{r}^{tr}}\big(s_{p}-(D[u,v]-\hat{D}(\xi,\theta_{M})[u,v])\big)^{2}, \tag{6}$$

where $R_{H}$ represents the set of rays corresponding to the static pixel set $H$ .

For points outside of the truncation region $(|D[u,v]-\hat{D}[u,v]|>tr)$ , we calculate the free space loss:

$$L_{free}=\frac{1}{|R_{H}|}\sum_{r\in R_{H}}\frac{1}{|S_{r}\setminus S_{r}^{tr}|}\sum_{p\in S_{r}\setminus S_{r}^{tr}}(s_{p}-tr)^{2}. \tag{7}$$

To mitigate the noisy reconstruction of unvisited areas induced by hash collision, a smoothness loss for the interpolated features is applied:

$$L_{smooth}=\frac{1}{|\mathcal{G}|}\sum_{p\in\mathcal{G}}\Delta_{x}^{2}+\Delta_{y}^{2}+\Delta_{z}^{2}, \tag{8}$$

where $\Delta_{x,y,z}=V_{\alpha}(p+\epsilon_{x,y,z})-V_{\alpha}(p)$ denotes the difference in feature values between adjacent vertices sampled on the hash grid along each of the three axes.

We minimize the weighted sum of these five loss terms with the Adam optimizer:

$$L=L_{pho}+\lambda_{1}L_{geo}+\lambda_{2}L_{sdf}+\lambda_{3}L_{free}+\lambda_{4}L_{smooth}. \tag{9}$$
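A small sketch of how the five terms of Eq. 9 combine, using the loss weights reported in Sec. 5.1. The individual loss values in `example` are made-up numbers chosen only to show the scale that each weight induces.

```python
# Loss weights reported in Sec. 5.1 (following Co-SLAM).
LAMBDAS = {"geo": 0.1, "sdf": 5000.0, "free": 10.0, "smooth": 1e-8}


def total_loss(terms, lambdas=LAMBDAS):
    """Eq. 9: the photometric term has unit weight; the others are scaled."""
    return terms["pho"] + sum(lambdas[name] * terms[name] for name in lambdas)


# Hypothetical per-term values, not measurements from the paper.
example = {"pho": 0.02, "geo": 0.05, "sdf": 1e-5, "free": 1e-3, "smooth": 0.5}
print(total_loss(example))
```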

Note that in the loss function, only errors in the static areas will be back-propagated for optimizing the pose and map parameters, where the motion segmentation mask $R^{t}$ is inferred using the instance-wise classifier $g(\mathbf{z};\theta^{t}_{C})$ in Sec. 4.3. During the tracking process, we randomly select $|H_{t}|$ pixels within the static area of the current frame and optimize the estimated camera pose by minimizing the objective function with fixed map and classifier parameters.

Similarly, the bundle adjustment for jointly optimizing camera poses and map parameters is carried out by randomly selecting $|H_{b}|$ pixels within the static area of the past keyframes and the current frame. Stored keyframes serve as past experience to avoid catastrophic forgetting [45, 34]. As the error of pixels in the dynamic areas is inhibited from back-propagation, only knowledge in the static areas will be reserved in the neural map. As illustrated in Fig. 4, the moving object will be gradually forgotten in the neural map given constantly propagated errors from recent observations.
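The masked replay described above can be sketched as a sampler that only ever draws static pixels from the keyframe buffer. This is a toy illustration under the assumption that each keyframe carries a boolean mask (True on dynamic pixels); a practical system would not enumerate every pixel, and `sample_static_pixels` is a hypothetical name.

```python
import random


def sample_static_pixels(keyframe_masks, num_samples, rng):
    """Draw (frame, u, v) samples uniformly from the static pixels of all frames.

    keyframe_masks: list of 2D boolean lists, True where a pixel is dynamic.
    """
    candidates = [(f, u, v)
                  for f, mask in enumerate(keyframe_masks)
                  for v, row in enumerate(mask)
                  for u, is_dynamic in enumerate(row)
                  if not is_dynamic]
    return rng.sample(candidates, min(num_samples, len(candidates)))


masks = [
    [[False, True], [False, True]],    # keyframe 0: right column is dynamic
    [[False, False], [False, False]],  # current frame: fully static
]
batch = sample_static_pixels(masks, num_samples=4, rng=random.Random(0))
print(batch)
```

Because dynamic pixels never enter a batch, the map receives no gradient that would reinforce the moving object, which is what lets it fade from the representation.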

4.3 Updating of Object Motion Status

As illustrated in Fig. 2, the motion status is inferred at the instance level. An image $I^{t}$ is decomposed into $K^{t}$ segments $\{S_{k}^{t}\}_{k=1}^{K^{t}}$ using FastSAM [47], where each segment is then fed to a visual context encoder [28, 17, 22]. Such decomposition and encoding process turns the image into $K^{t}$ vectors $\{\mathbf{z}_{k}^{t}\}_{k=1}^{K^{t}}$ to represent $K^{t}$ different instances. Intuitively, we aim to maintain the motion status of each instance across time. Explicitly aggregating information of each instance across views is non-trivial as the dynamic objects lead to inconsistency. We turn to a simple but effective solution that implicitly records the motion status of each instance using a two-layer MLP as $g(\mathbf{z};\theta_{C}^{t})$ . The network serves as a classifier that determines the motion status of the corresponding instance feature, where an instance will be treated as a moving object if $g(\mathbf{z}^{t}_{k};\theta_{C}^{t})>0.5$ .

The object movement leads to inconsistency between the observation and the stored knowledge. As emitted radiance is view-dependent [20], we merely examine the geometry inconsistency between the rendered depth map and the observed one. The moving status $o_{k}^{t}$ of the instance $k$ will be treated as True if a certain portion of pixels within the segment $S^{t}_{k}$ exhibit large discrepancies as:

$$\frac{\sum_{(u,v)\in S_{k}^{t}}\mathbbm{1}\left(\hat{D}(\xi,\theta_{M})[u,v]-D[u,v]>t_{d}\right)}{|S_{k}^{t}|}>t_{p}, \tag{10}$$

where $t_{d}$ and $t_{p}$ control the sensitivity of motion segmentation.
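The per-segment test of Eq. 10 can be sketched as below, assuming depth maps stored as nested lists and a segment given as a list of pixel coordinates. The function name and the toy depth values are illustrative; the thresholds match those reported in Sec. 5.1.

```python
def is_moving(rendered_depth, observed_depth, segment, t_d, t_p):
    """Eq. 10: flag a segment when enough pixels lie in front of the map.

    segment: list of (u, v) pixels belonging to one instance mask.
    """
    violations = sum(
        1 for u, v in segment
        if rendered_depth[v][u] - observed_depth[v][u] > t_d
    )
    return violations / len(segment) > t_p


rendered = [[2.0, 2.0, 2.0, 2.0]]   # the map remembers a wall at depth 2.0
observed = [[2.0, 1.2, 1.2, 2.0]]   # something now stands in front of it
person = [(1, 0), (2, 0)]
wall = [(0, 0), (3, 0)]
print(is_moving(rendered, observed, person, t_d=0.3, t_p=0.05))
print(is_moving(rendered, observed, wall, t_d=0.3, t_p=0.05))
```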

The classifier will be updated if the current moving status of any instance $o_{k}^{t}$ contradicts the classification result $\mathbbm{1}(g(\mathbf{z}^{t}_{k};\theta_{C}^{t-1})>0.5)$ . The network is optimized using the binary cross-entropy loss as:

$$-\sum_{k=1}^{K^{t}}\big[o_{k}^{t}\log(g(\mathbf{z}^{t}_{k};\theta_{C}^{*}))+(1-o_{k}^{t})\log(1-g(\mathbf{z}^{t}_{k};\theta_{C}^{*}))\big]. \tag{11}$$

Once the classifier is updated, the motion segmentation mask of all keyframes will be updated. The observations regarding the moving object will no longer contribute to the pose and map optimization. As the neural map is continually trained under constant distribution shift, the knowledge regarding the moving object will be gradually forgotten without manual operations.
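The update trigger and the loss of Eq. 11 can be sketched as follows. The sketch only evaluates the loss for given classifier outputs and does not implement the two-layer MLP or its gradient step; the clamping constant `eps` and the function names are assumptions.

```python
import math


def bce_loss(labels, probs, eps=1e-7):
    """Eq. 11 summed over the instances of one frame."""
    total = 0.0
    for o, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= o * math.log(p) + (1 - o) * math.log(1.0 - p)
    return total


def needs_update(labels, probs):
    """True when any geometric label contradicts the thresholded prediction."""
    return any(bool(o) != (p > 0.5) for o, p in zip(labels, probs))


observed_status = [1, 0, 0]      # o_k^t from the geometric consistency check
predicted = [0.2, 0.1, 0.4]      # g(z_k^t; theta_C^{t-1}) for the same instances
print(needs_update(observed_status, predicted))
print(round(bce_loss(observed_status, predicted), 4))
```

Here the first instance is observed as moving while the previous classifier still considers it static, so an update is triggered and that instance dominates the loss.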

Continual learning of the classifier. Catastrophic forgetting is not only an issue that neural SLAM systems need to address; it also has a negative impact on the classifier. If we train the classifier using only the objects that appear in the current frame, catastrophic forgetting could occur, causing the classifier to forget previously learned dynamic objects. Therefore, we also maintain a replay buffer for the classifier to avoid this issue. For all instances in the current frame, if an instance is marked as dynamic by the geometric inconsistency check, it will be added to this replay buffer. When updating the classifier, training is conducted not only with instances from the current frame but also with $n_{c}$ instances randomly selected from the replay buffer. $n_{c}$ is typically set to 5 in our experiments.
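A minimal sketch of such a replay buffer is given below. It assumes that instance features are plain lists of floats and that every replayed instance carries the label 1 (dynamic), since only instances that failed the geometric check are stored; the class and method names are hypothetical.

```python
import random


class DynamicInstanceBuffer:
    """Stores features of instances that failed the geometric check."""

    def __init__(self, seed=0):
        self.features = []
        self.rng = random.Random(seed)

    def add(self, feature):
        self.features.append(feature)

    def training_batch(self, current, n_c=5):
        """Current-frame (feature, label) pairs plus up to n_c replayed ones."""
        k = min(n_c, len(self.features))
        replayed = [(z, 1) for z in self.rng.sample(self.features, k)]
        return list(current) + replayed


buffer = DynamicInstanceBuffer()
for i in range(8):
    buffer.add([float(i), 0.0])               # eight past dynamic instances
current = [([9.0, 1.0], 0), ([7.5, 0.5], 1)]  # (feature, o_k^t) of this frame
batch = buffer.training_batch(current)
print(len(batch))
```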

Bidirectional check. Eq. 10 identifies a moving object if it is observed in front of the rendered place. However, if an object is removed from the scene or if the object moves away from the camera since the first frame, the observation may fall behind the rendered one. Therefore, we apply a bidirectional check. Similar to Eq. 10, we decompose the rendered image $\hat{D}(\xi,\theta_{M})$ into $\hat{K}^{t}$ segments $\{\hat{S}_{k}^{t}\}_{k=1}^{\hat{K}^{t}}$ and apply the consistency check as:

$$\frac{\sum_{(u,v)\in\hat{S}_{k}^{t}}\mathbbm{1}\left(D[u,v]-\hat{D}(\xi,\theta_{M})[u,v]>t_{d}\right)}{|\hat{S}_{k}^{t}|}>t_{p}. \tag{12}$$
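The case that motivates the second direction can be reproduced with a toy example: a box that the map still remembers has been removed, so the observation falls behind the rendering. For brevity the sketch evaluates both directions on the same pixel set, whereas Eq. 10 uses segments of the observed image and Eq. 12 uses segments of the rendered one; the function name is illustrative.

```python
def fraction_check(near, far, segment, t_d, t_p):
    """True if the share of pixels with far - near > t_d exceeds t_p."""
    hits = sum(1 for u, v in segment if far[v][u] - near[v][u] > t_d)
    return hits / len(segment) > t_p


rendered = [[2.0, 1.0, 1.0, 2.0]]   # the map still contains a box at depth 1.0
observed = [[2.0, 2.0, 2.0, 2.0]]   # the box is gone; only the wall is seen
box_pixels = [(1, 0), (2, 0)]       # segment taken from the rendered depth

forward = fraction_check(observed, rendered, box_pixels, 0.3, 0.05)   # Eq. 10
backward = fraction_check(rendered, observed, box_pixels, 0.3, 0.05)  # Eq. 12
print(forward, backward)
```

Only the backward check fires in this situation, which is why the forward test alone would leave the removed box in the map.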

Incorporation of prior knowledge. One promising characteristic of the open-set classifier is that we can apply prior knowledge through pre-training to pre-define the instance motion status while allowing adaptation to the current state. Specifically, we can train a classifier $\theta_{C}^{pre}$ on specific categories, e.g., human, car, animal. The object motion status can then be identified as "dynamic" if either the pre-trained one or the online-learned one agrees.
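The "either one agrees" rule amounts to a logical OR over the two thresholded outputs. In the sketch below the two classifiers are stand-in functions operating on toy dictionary features, purely to show the combination; they are not the MLPs used in the paper.

```python
def is_dynamic(feature, online, pretrained, threshold=0.5):
    """Label an instance dynamic if either classifier exceeds the threshold."""
    return online(feature) > threshold or pretrained(feature) > threshold


def pretrained_prior(feature):
    """Stand-in for a classifier pre-trained on movable categories."""
    return 0.9 if feature["category"] == "human" else 0.1


def online_classifier(feature):
    """Stand-in for the online classifier, which has seen one box move."""
    return 0.8 if feature["id"] == "box_3" else 0.2


instances = [
    {"id": "person_1", "category": "human"},
    {"id": "box_3", "category": "box"},
    {"id": "table_2", "category": "table"},
]
labels = [is_dynamic(z, online_classifier, pretrained_prior) for z in instances]
print(labels)
```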

Maintaining the instant object state. In the above-mentioned setting, we would label any object that has been moved as "dynamic" even if it is placed in a new area and remains static afterward. This is because the classifier merely establishes the mapping between the instance feature and the corresponding motion state. Thanks to the flexible structure of the classifier, we can concatenate an additional position feature with the semantic embedding as the input. The classifier then records the motion state of an object at a specific position (for instance, the object center $\mathbf{p}_{k}^{t}$ ) as $g(\mathbf{z}_{k}^{t},\mathbf{p}_{k}^{t};\theta_{C}^{t})$ . As we show in the experiments, this simple strategy leads to a neural map that gradually forgets the removed object and memorizes the object at the new position.

5 Experiments

We evaluate the proposed method on multiple sequences of the TUM RGB-D dataset [33] and the Bonn RGB-D dataset [24]. The quality of the estimated poses, the neural map, and the motion status reasoning is analyzed qualitatively and quantitatively. As ours is the first NeRF-SLAM-based method that tackles challenging dynamic environments, we compare it against the traditional methods ReFusion [25], StaticFusion [32], and DynaSLAM [2], which are designed specifically for dynamic environments. We also compare against our codebase, Co-SLAM [43], to see how the proposed strategies enable robust camera tracking with promising reconstruction quality.

5.1 Experimental Setup

The experiments are conducted on a desktop PC with an Intel i9-12900K CPU, an NVIDIA RTX 3090 GPU, and 64GB memory. The keyframe is automatically stored every 5 frames.

In our implementation, we sample $N=128$ points along each ray and $|H_{t}|=1024$ and $|H_{b}|=2048$ pixels during tracking and bundle adjustment, respectively. The truncation distance is $tr=10\,\mathrm{cm}$ , and the thresholds used for determining the moving status are $t_{d}=0.3$ and $t_{p}=0.05$ . For the weights of each loss in Eq. 9, we follow the settings of Co-SLAM: $\lambda_{1}=0.1$ , $\lambda_{2}=5000$ , $\lambda_{3}=10$ , and $\lambda_{4}=10^{-8}$ .

Table 1: Comparisons of ATE (RMS) against traditional SLAM algorithms designed for deployment in dynamic environments.

| Dataset | Sequence | ReFusion [25] | StaticFusion [32] | DynaSLAM [2] | Co-SLAM [43] | Ours |
|---|---|---|---|---|---|---|
| Bonn | balloon | 0.175 | 0.233 | 0.050 | 0.308 | 0.206 |
| | balloon2 | 0.254 | 0.293 | 0.142 | 0.290 | 0.136 |
| | kidnapping_box | 0.148 | 0.336 | 0.026 | 0.095 | 0.112 |
| | kidnapping_box2 | 0.161 | 0.263 | 0.033 | 0.118 | 0.104 |
| | crowd | 0.204 | 3.586 | 1.065 | fail | 0.116 |
| | crowd2 | 0.155 | 0.215 | 1.217 | fail | 0.200 |
| | crowd3 | 0.137 | 0.168 | 0.835 | fail | 0.107 |
| | person_tracking | 0.289 | 0.484 | 0.714 | fail | 0.274 |
| | synchronous | 0.441 | 0.446 | 0.977 | 0.634 | 0.130 |
| TUM | fr3_walking_static | 0.017 | 0.015 | 0.014 | fail | 0.025 |
| | fr3_walking_xyz | 0.099 | 0.093 | 0.085 | fail | 0.076 |
| | fr3_walking_halfsphere | 0.104 | 0.681 | 0.084 | fail | 0.079 |
| | fr3_sitting_static | 0.009 | 0.014 | 0.009 | 0.011 | 0.007 |
| | fr3_sitting_xyz | 0.040 | 0.039 | 0.009 | 0.020 | 0.018 |
| | fr3_sitting_halfsphere | 0.110 | 0.041 | 0.017 | 0.042 | 0.039 |
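For reference, the RMS of the absolute trajectory error reported in Table 1 can be computed as in the sketch below. It assumes the estimated and ground-truth positions are already associated by timestamp and rigidly aligned, which the standard evaluation does before measuring the error; the trajectories here are made-up values.

```python
import math


def ate_rmse(estimated, ground_truth):
    """Root-mean-square translational error between two aligned trajectories."""
    squared = [sum((e - g) ** 2 for e, g in zip(p, q))
               for p, q in zip(estimated, ground_truth)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical camera positions, not data from the benchmark sequences.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
est = [(0.0, 0.0, 0.0), (1.0, 0.3, 0.0), (2.0, 0.0, 0.4)]
print(round(ate_rmse(est, gt), 4))
```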

5.2 Tracking and Mapping

As illustrated in Fig. 3, the photometric and geometric errors in dynamic areas lead to inherent ambiguities, as they will be back-propagated to modify the pose estimates and mapping results. As tracking and mapping are coupled during the optimization, erroneous tracking and the fusion of dynamic objects into the map would eventually lead to system collapse. The proposed method, on the other hand, achieves robust camera tracking results even in the challenging 'crowd' sequences. As presented in Tab. 1, the proposed method attains the lowest ATE on highly dynamic sequences such as crowd, crowd3, person_tracking, and synchronous when compared against the feature-based DynaSLAM [2] and the dense SLAM systems of ReFusion [25] and StaticFusion [32], while DynaSLAM remains more accurate on some other sequences (e.g., balloon and kidnapping_box). The effects arising from the dynamic objects are well alleviated by the proposed motion status classifier.

The forgetting and memorization. The forgetting mechanism is demonstrated in Fig. 4. Once the classifier identifies the man as a moving object, observations of him no longer contribute to updating the map parameters. Meanwhile, the areas occluded by the man become visible once he walks away, and the inconsistency between the rendered image and the observation in other views contributes to the map updating. As described in Sec. 4.3, we can also maintain the updated object status through a simple coordinate concatenation. As illustrated in Fig. 5, the box is forgotten after being kidnapped and reappears on the map after being placed.

5.3 Motion Segmentation

As illustrated in Fig. 6, the continual learning of the classifier leads to dynamic updating of the object motion status. The successive changes in motion status can be well reflected by the geometric inconsistency, and the classifier can quickly identify the motion changes and learn the instant motion status. The encoded feature of masked objects can well distinguish between two people even if they fall into the same "person" category.

Continual learning of the classifier. As mentioned in Sec. 4.3, the classifier could also encounter catastrophic forgetting. Therefore, we implement continual learning for the classifier by maintaining a replay buffer. As demonstrated in Fig. 7, we select a specific frame (frame 190) for observation. The classifier with replay maintains its ability for accurate motion segmentation on the earlier frame over time. In contrast, the classifier without continual learning is likely to perform poorly on the previous training data after updates, segmenting the dynamic instances incorrectly. Specifically, the person in red gradually walks out of the frame (see the second column of Fig. 7). Meanwhile, the classifier undergoes an update. Without replay, when training solely on instances from the current frame, the classifier forgets that the person in red is also dynamic. This oversight leads to the person erroneously appearing in the map, further affecting subsequent calculations of geometric inconsistency. This experiment indicates that the continual learning strategy designed for the classifier is both effective and necessary.

The choices of visual encoders. To further understand how the motion segmentation behavior is affected by encoded features, we look into the motion prediction given a fixed classifier. As illustrated in Fig. 8, after training the classifier given supervisory signal from the frame $I^{40}$ , we freeze the network $\theta_{C}^{40}$ and predict the motion status of the human with the following observations $I^{40},\cdots,I^{280}$ every 30 frames as $g(\mathbf{z}_{k}^{40:280};\theta_{C}^{40})$ . We compare three different visual encoders (BLIP [17], CLIP [28], and DinoV2 [22]) and visualize the predicted motion state and the encoded feature similarity compared to the frame $I^{40}$ . Apparently, the powerful semantic information of language-aligned visual context presents promising temporal consistency even if the object poses change drastically. The consistent feature embedding across views makes the implicit classifier instance-aware and easy to adapt. We argue that the promising prediction capability leads to robust camera tracking under complex scene dynamics.

Incorporating pre-defined prior knowledge. As demonstrated in Sec. 4.3, the proposed framework allows convenient incorporation of prior knowledge by pre-training a classifier on specific categories. An exemplary case is illustrated in Fig. 9. Here we experiment using a segment after frame 410 from the freiburg3 walking static sequence, where individuals occupy a significant portion of the frame from the outset. By pre-defining humans as dynamic objects and pre-training a classifier using a human matting dataset [5], the ambiguity induced by human motion can be alleviated at the very beginning. On the contrary, the classifier without prior knowledge will have trouble figuring out the actual factor that leads to inconsistency in such a challenging case. As illustrated in the figure, the accurate static map can be reconstructed even though the moving objects occupy nearly half of the frame when the stream begins.

6 Conclusion

In this paper, we address the dynamic SLAM problem with a neural map representation. By learning an instance-aware classifier online that implicitly records the per-object motion status, the invariant information within the observations can be continually distilled to the neural map, where the interference induced by dynamic objects is best alleviated. The iterative optimization of camera pose, map, and classifier parameters forms a robust SLAM framework in challenging dynamic environments.

Acknowledgement

We gratefully acknowledge the anonymous reviewers and AC for their valuable comments and suggestions. This work is supported by NSFC (U22A2061, 62176010) and 230601GP0004.

References

[1] Adam, A., Sattler, T., Karantzalos, K., Pajdla, T.: Objects can move: 3d change detection by geometric transformation consistency. In: European Conference on Computer Vision. pp. 108–124. Springer (2022)
[2] Bescos, B., Fácil, J.M., Civera, J., Neira, J.: Dynaslam: Tracking, mapping, and inpainting in dynamic scenes. IEEE Robotics and Automation Letters 3(4), 4076–4083 (2018)
[3] Cadena, C., Carlone, L., Carrillo, H., Latif, Y., Scaramuzza, D., Neira, J., Reid, I., Leonard, J.J.: Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Trans. Robotics 32(6), 1309–1332 (2016)
[4] Cai, Z., Müller, M.: Clnerf: Continual learning meets nerf. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23185–23194 (2023)
[5] Chen, Q., Ge, T., Xu, Y., Zhang, Z., Yang, X., Gai, K.: Semantic human matting. In: Proceedings of the 26th ACM international conference on Multimedia (2018)
[6] Cheng, J., Sun, Y., Meng, M.Q.H.: Improving monocular visual slam in dynamic environments: An optical-flow-based approach. Advanced Robotics 33(12), 576–589 (2019)
[7] Chung, C.M., Tseng, Y.C., Hsu, Y.C., Shi, X.Q., Hua, Y.H., Yeh, J.F., Chen, W.C., Chen, Y.T., Hsu, W.H.: Orbeez-slam: A real-time monocular visual slam with orb features and nerf-realized mapping. arXiv preprint arXiv:2209.13274 (2022)
[8] Durrant-Whyte, H., Bailey, T.: Simultaneous localization and mapping: part i. IEEE Robotics & Automation Magazine 13(2), 99–110 (2006)
[9] Fang, J., Yi, T., Wang, X., Xie, L., Zhang, X., Liu, W., Nießner, M., Tian, Q.: Fast dynamic radiance fields with time-aware neural voxels. In: SIGGRAPH Asia 2022 Conference Papers. pp. 1–9 (2022)
[10] Gao, C., Saraf, A., Kopf, J., Huang, J.B.: Dynamic view synthesis from dynamic monocular video. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2021)
[11] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision. pp. 2961–2969 (2017)
[12] Hu, X., Zhang, Y., Cao, Z., Ma, R., Wu, Y., Deng, Z., Sun, W.: Cfp-slam: A real-time visual slam based on coarse-to-fine probability in dynamic environments. In: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 4399–4406. IEEE (2022)
[13] Johari, M.M., Carta, C., Fleuret, F.: Eslam: Efficient dense slam system based on hybrid representation of signed distance fields. arXiv preprint arXiv:2211.11704 (2022)
[14] Kaneko, M., Iwami, K., Ogawa, T., Yamasaki, T., Aizawa, K.: Mask-slam: Robust feature-based monocular slam by masking using semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. pp. 258–266 (2018)
[15] Karaoglu, M.A., Schieber, H., Schischka, N., Görgülü, M., Grötzner, F., Ladikos, A., Roth, D., Navab, N., Busam, B.: Dynamon: Motion-aware fast and robust camera localization for dynamic nerf. arXiv preprint arXiv:2309.08927 (2023)
[16] Kruzhkov, E., Savinykh, A., Karpyshev, P., Kurenkov, M., Yudin, E., Potapov, A., Tsetserukou, D.: Meslam: Memory efficient slam based on neural fields. In: 2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC) (2022)
[17] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)
[18] Li, Z., Niklaus, S., Snavely, N., Wang, O.: Neural scene flow fields for space-time view synthesis of dynamic scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6498–6508 (2021)
[19] Martin-Brualla, R., Radwan, N., Sajjadi, M.S., Barron, J.T., Dosovitskiy, A., Duckworth, D.: Nerf in the wild: Neural radiance fields for unconstrained photo collections. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021)
[20] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. In: Proceedings of the European Conference on Computer Vision (2020)
[21] Ming, Y., Ye, W., Calway, A.: idf-slam: End-to-end rgb-d slam with neural implicit mapping and deep feature tracking. arXiv preprint arXiv:2209.07919 (2022)
[22] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)
[23] Ortiz, J., Clegg, A., Dong, J., Sucar, E., Novotny, D., Zollhoefer, M., Mukadam, M.: isdf: Real-time neural signed distance fields for robot perception. arXiv preprint arXiv:2204.02296 (2022)
[24] Palazzolo, E., Behley, J., Lottes, P., Giguère, P., Stachniss, C.: ReFusion: 3D Reconstruction in Dynamic Environments for RGB-D Cameras Exploiting Residuals. In: IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS) (2019)
[25] Palazzolo, E., Behley, J., Lottes, P., Giguere, P., Stachniss, C.: Refusion: 3d reconstruction in dynamic environments for rgb-d cameras exploiting residuals. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 7855–7862. IEEE (2019)
[26] Palazzolo, E., Stachniss, C.: Change detection in 3d models based on camera images. In: 9th Workshop on Planning, Perception and Navigation for Intelligent Vehicles at the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS) (2017)
[27] Palazzolo, E., Stachniss, C.: Fast image-based geometric change detection given a 3d model. In: 2018 IEEE International Conference on Robotics and Automation (ICRA). pp. 6308–6315. IEEE (2018)
[28] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
[29] Sandström, E., Li, Y., Van Gool, L., Oswald, M.R.: Point-slam: Dense neural point cloud-based slam. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 18433–18444 (2023)
[30] Saputra, M.R.U., Markham, A., Trigoni, N.: Visual slam and structure from motion in dynamic environments: A survey. ACM Computing Surveys (CSUR) 51(2), 1–36 (2018)
[31] Schorghuber, M., Steininger, D., Cabon, Y., Humenberger, M., Gelautz, M.: Slamantic-leveraging semantics to improve vslam in dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (2019)
[32] Scona, R., Jaimez, M., Petillot, Y.R., Fallon, M., Cremers, D.: Staticfusion: Background reconstruction for dense rgb-d slam in dynamic environments. In: 2018 IEEE international conference on robotics and automation (ICRA). pp. 3849–3856. IEEE (2018)
[33] Sturm, J., Engelhard, N., Endres, F., Burgard, W., Cremers, D.: A benchmark for the evaluation of rgb-d slam systems. In: Proc. of the International Conference on Intelligent Robot Systems (IROS) (2012)
[34] Sucar, E., Liu, S., Ortiz, J., Davison, A.J.: imap: Implicit mapping and positioning in real-time. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2021)
[35] Suzuki, T.: Federated learning for large-scale scene modeling with neural radiance fields. arXiv preprint arXiv:2309.06030 (2023)
[36] Tancik, M., Casser, V., Yan, X., Pradhan, S., Mildenhall, B., Srinivasan, P.P., Barron, J.T., Kretzschmar, H.: Block-nerf: Scalable large scene neural view synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8248–8258 (2022)
[37] Taneja, A., Ballan, L., Pollefeys, M.: Image based detection of geometric changes in urban environments. In: 2011 international conference on computer vision. pp. 2336–2343. IEEE (2011)
[38] Taneja, A., Ballan, L., Pollefeys, M.: City-scale change detection in cadastral 3d models using images. In: Proceedings of the IEEE Conference on computer Vision and Pattern Recognition. pp. 113–120 (2013)
[39] Teed, Z., Deng, J.: Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in neural information processing systems 34, 16558–16569 (2021)
[40] Tretschk, E., Tewari, A., Golyanik, V., Zollhöfer, M., Lassner, C., Theobalt, C.: Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12959–12970 (2021)
[41] Turki, H., Ramanan, D., Satyanarayanan, M.: Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12922–12931 (2022)
[42] Ulusoy, A.O., Mundy, J.L.: Image-based 4-d reconstruction using 3-d change detection. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part III 13. pp. 31–45. Springer (2014)
[43] Wang, H., Wang, J., Agapito, L.: Co-slam: Joint coordinate and sparse parametric encodings for neural real-time slam. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13293–13302 (2023)
[44] Wang, Y., Xu, K., Tian, Y., Ding, X.: Drg-slam: A semantic rgb-d slam using geometric features for indoor dynamic scene. In: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 1352–1359. IEEE (2022)
[45] Yan, Z., Tian, Y., Shi, X., Guo, P., Wang, P., Zha, H.: Continual neural mapping: Learning an implicit scene representation from sequential observations. In: IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 15782–15792 (2021)
[46] Yang, X., Li, H., Zhai, H., Ming, Y., Liu, Y., Zhang, G.: Vox-fusion: Dense tracking and mapping with voxel-based neural implicit representation. In: 2022 IEEE International Symposium on Mixed and Augmented Reality (ISMAR) (2022)
[47] Zhao, X., Ding, W., An, Y., Du, Y., Yu, T., Li, M., Tang, M., Wang, J.: Fast segment anything. arXiv preprint arXiv:2306.12156 (2023)
[48] Zhenxing, M., Xu, D.: Switch-nerf: Learning scene decomposition with mixture of experts for large-scale neural radiance fields. In: The Eleventh International Conference on Learning Representations (2022)
[49] Zhu, Z., Peng, S., Larsson, V., Xu, W., Bao, H., Cui, Z., Oswald, M.R., Pollefeys, M.: Nice-slam: Neural implicit scalable encoding for slam. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022)

Learn to Memorize and to Forget: A Continual Learning Perspective of Dynamic SLAM Supplementary Material

Baicheng Li Zike Yan^† Dong Wu Hanqing Jiang Hongbin Zha^†

7 Supplementary Results

7.1 Segmentation Pruning

Segmentation with FastSAM [47] produces segments of varying granularity. A person may be split into separate parts such as arms, legs, body, and head. In such cases, we wish to retain the instance-level segmentation rather than the part-level decomposition, thereby reducing the number of classifier updates.

Specifically, as illustrated in Fig. 10, for any two segments $R_{1}$ and $R_{2}$, we consider $R_{1}$ to be a part of $R_{2}$ and delete it if the fraction of $R_{1}$ that overlaps with $R_{2}$ exceeds a threshold $T_{R}$:

$$S_{R_{1}\cap R_{2}}>T_{R}\times S_{R_{1}} \tag{13}$$

where $S$ denotes the area of a region and $T_{R}$ is set to 0.9 in our experiments.
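
A minimal sketch of this pruning rule is given below, assuming each segment is stored as a set of pixel coordinates. The helper `rect`, the function name `prune_part_segments`, and the tie-break for near-identical masks (only an equally large or larger segment may absorb another one) are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]


def rect(x0: int, y0: int, x1: int, y1: int) -> Set[Pixel]:
    """Hypothetical helper: pixel set of the half-open box [x0, x1) x [y0, y1)."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


def prune_part_segments(segments: List[Set[Pixel]], t_r: float = 0.9) -> List[int]:
    """Return the indices of segments kept after removing part-level segments.

    Segment R1 is dropped when another segment R2 satisfies
    |R1 ∩ R2| > t_r * |R1|, i.e. the condition of Eq. (13).
    """
    keep = []
    for i, r1 in enumerate(segments):
        is_part = False
        for j, r2 in enumerate(segments):
            if i == j:
                continue
            # Assumed tie-break: R1 can only be absorbed by a larger segment,
            # or by an equally large one with a lower index, so that two
            # near-identical masks do not delete each other.
            if len(r2) < len(r1) or (len(r2) == len(r1) and j > i):
                continue
            if len(r1 & r2) > t_r * len(r1):
                is_part = True
                break
        if not is_part:
            keep.append(i)
    return keep


person = rect(0, 0, 10, 20)
arm = rect(0, 5, 3, 12)     # lies fully inside the person mask
chair = rect(8, 0, 16, 10)  # overlaps the person mask only partially
kept = prune_part_segments([person, arm, chair])
```

In this toy example the arm is removed because it lies entirely inside the person mask, while the chair is kept because only a quarter of its area overlaps the person.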

7.2 Mesh Evaluation

We follow the dense dynamic SLAM system ReFusion [25] to evaluate the mesh results quantitatively. As shown in Figs. 11 and 12, our method achieves results comparable to ReFusion [25] and outperforms StaticFusion [32]. Although NeRF-based methods mainly focus on realistic rendering (as presented in our main paper and supp. video) rather than surface reconstruction, the quantitative results verify our map quality in dynamic environments.

7.3 Ablation Study on the Replay Training of the Classifier

We compare the number of classifier updates with and without replay training. The experiment is conducted with different image encoders on two highly dynamic sequences, 'person_tracking' and 'balloon'. The results are illustrated in Fig. 13. The classifier trained with replay requires fewer updates than the one trained without replay, regardless of the image encoder used. A detailed analysis can be found in the "continual learning of the classifier" parts of Sec. 4.3 and Sec. 5.3.

7.4 Run-time

We also measure the average run-time of each component of the system on the two sequences mentioned above (Table 2). In these highly dynamic environments, our method runs at around 1 fps, which we consider a reasonable balance between speed and accuracy. For comparison, the frame rates of Co-SLAM [43], iMap [34], and NICE-SLAM [49] are approximately 3 fps, 2 fps, and less than 0.1 fps, respectively.

Table 2: Average run-time of each component on the Bonn dataset.

Instance feature extraction	Tracking	Bundle adjustment	Classifier updating
231ms	149ms	161ms	397ms
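
As a rough consistency check, the component times in Table 2 can be summed to estimate the frame rate. The sketch below assumes that all four components run sequentially once per frame, which the text does not state explicitly.

```python
# Per-component average run-times from Table 2, in milliseconds.
component_ms = {
    "instance feature extraction": 231,
    "tracking": 149,
    "bundle adjustment": 161,
    "classifier updating": 397,
}

# Assumption: the components run one after another on every frame.
total_ms = sum(component_ms.values())
fps = 1000.0 / total_ms

print(f"total: {total_ms} ms per frame, about {fps:.2f} fps")
```

Under this assumption the total is 938 ms per frame, in line with the reported frame rate of around 1 fps.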

7.5 Visualization of Tracking Results

In Table 1 of the main paper, we present the quantitative results of camera tracking, showing that our method achieves higher accuracy compared to other dynamic SLAM approaches. Here, we also provide a qualitative demonstration of these results. Fig. 14 shows a comparison of our trajectories with the ground truth on three sequences.

Table 3: ATE (RMS) with different visual encoders.

Sequence	BLIP-2	CLIP	DINOv2
balloon	0.206	0.211	0.228
synchronous	0.130	0.139	0.146
person_tracking	0.274	0.271	0.278

Table 4: ATE (RMS) with and without prior knowledge.

Sequence	w/o prior	with prior
balloon	0.206	0.212
synchronous	0.130	0.134
person_tracking	0.274	0.259

7.6 Ablation Study on the Visual Encoders and Prior Knowledge

We test the impact of different experimental settings on the accuracy of camera tracking. Table 3 and Table 4 present the ablation results on the Bonn RGB-D dataset. As shown in Table 3, different visual encoders cause only slight variations in the tracking results: BLIP-2 [17] performs best on two of the three sequences, and DINOv2 [22] performs worst on all three. Table 4 shows that prior knowledge has minimal impact on tracking accuracy. In most cases, our method achieves precise camera tracking without prior knowledge; only in extremely challenging scenarios (such as Fig. 9 in the main paper) do we need to incorporate it.

8 The Forgetting Issue of Implicit Neural Representations

The global representation of iMap [34] suffers from severe catastrophic forgetting if keyframe-based replay is not deployed, as distribution shift occurs constantly during sequential data capture. Subsequent NeRF-based SLAM systems such as Co-SLAM [43] and Point-SLAM [29] introduce local neural representations. They store local features of the scene on grids or points, which alleviates this negative impact to some extent. However, they also employ a global decoder to interpret the local features, and this decoder is itself subject to catastrophic forgetting if keyframe replay is not performed, which affects the operation of the entire system. The keyframe buffer therefore lays the foundation for neural SLAM systems to distill all past knowledge jointly. We control the replayed buffer through a continually learned classifier, so that the effects of dynamic objects in past observations can be eliminated immediately. The dynamic objects are thereby forgotten, and an invariant map is maintained and updated. The methodology is applicable not only to global neural representations but also to representations with discretely stored local features.
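
The idea of controlling the replayed buffer can be sketched as follows. This is a simplified illustration, not the actual implementation: the `Segment` and `Keyframe` containers, the `is_dynamic` callback standing in for the continually learned classifier, and the uniform pixel sampling are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Pixel = Tuple[int, int]


@dataclass
class Segment:
    feature: Sequence[float]  # instance embedding from a visual encoder
    pixels: List[Pixel]       # pixel coordinates covered by the segment


@dataclass
class Keyframe:
    frame_id: int
    segments: List[Segment]


def sample_replay_pixels(
    buffer: List[Keyframe],
    is_dynamic: Callable[[Sequence[float]], bool],
    n_samples: int,
    rng: random.Random,
) -> List[Tuple[int, Pixel]]:
    """Sample (frame_id, pixel) pairs for replay from static segments only.

    The classifier is queried at replay time, so once it labels an object as
    dynamic, that object's pixels are excluded from every stored keyframe.
    """
    candidates = [
        (kf.frame_id, px)
        for kf in buffer
        for seg in kf.segments
        if not is_dynamic(seg.feature)
        for px in seg.pixels
    ]
    if len(candidates) <= n_samples:
        return candidates
    return rng.sample(candidates, n_samples)


# Toy buffer: the first feature entry acts as a hypothetical "dynamic" score.
buffer = [
    Keyframe(0, [Segment([0.1], [(0, 0), (0, 1)]), Segment([0.9], [(5, 5)])]),
    Keyframe(1, [Segment([0.2], [(1, 1)]), Segment([0.8], [(6, 6), (6, 7)])]),
]
samples = sample_replay_pixels(buffer, lambda z: z[0] > 0.5, 10, random.Random(0))
```

Because filtering happens when pixels are drawn rather than when keyframes are stored, an update to the classifier changes what is replayed from all past keyframes without modifying the buffer itself.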