DN-Splatter: Depth and Normal Priors for
Gaussian Splatting and Meshing

Matias Turkulainen1,*    Xuqian Ren2,*    Iaroslav Melekhov3    Otto Seiskari4    Esa Rahtu2    Juho Kannala3,4
1 ETH Zurich, 2 Tampere University, 3 Aalto University, 4 Spectacular AI
Corresponding author: [email protected]
Abstract

High-fidelity 3D reconstruction of common indoor scenes is crucial for VR and AR applications. 3D Gaussian splatting, a novel differentiable rendering technique, has achieved state-of-the-art novel view synthesis results with high rendering speeds and relatively low training times. However, its performance on scenes commonly seen in indoor datasets is poor due to the lack of geometric constraints during optimization. We extend 3D Gaussian splatting with depth and normal cues to tackle challenging indoor datasets and showcase techniques for efficient mesh extraction. Specifically, we regularize the optimization procedure with depth information, enforce local smoothness of nearby Gaussians, and use off-the-shelf monocular networks to achieve better alignment with the true scene geometry. We propose an adaptive depth loss based on the gradient of color images, improving depth estimation and novel view synthesis results over various baselines. Our simple yet effective regularization technique enables direct mesh extraction from the Gaussian representation, yielding more physically accurate reconstructions of indoor scenes. Our code will be released at https://github.com/maturk/dn-splatter.

1 Introduction

Figure 1: Overview: We use depth and normal priors obtained from common handheld devices and general-purpose networks to enhance Gaussian splatting reconstruction quality. By regularizing Gaussian positions, local smoothness, and orientations, we demonstrate improvements in novel view synthesis and achieve more accurate mesh reconstructions on a variety of challenging indoor room datasets.

The demand for high-fidelity 3D reconstruction of typical environments is increasing due to VR and AR applications. However, photorealistic and accurate 3D reconstruction of common indoor scenes from casually captured sensor data remains a persistent problem in 3D computer vision. Textureless and less-observed regions cause ambiguities in reconstructions and do not provide enough constraints for valid geometric solutions. Recently, neural implicit representations have achieved success in high-fidelity 3D reconstruction by representing scenes as a continuous volume with fully differentiable properties [31, 50, 25, 2]. However, the reconstruction of everyday indoor scenes still poses challenges, even for state-of-the-art methods. These methods rarely achieve good results in both photorealism and geometry reconstruction and often suffer from long training and rendering times, making them inaccessible for general use and VR/AR applications.

3D Gaussian splatting [19] introduces a novel method for inverse rendering by representing a scene by many differentiable 3D Gaussian primitives with optimizable properties. This explicit representation enables real-time rendering of large, complex scenes–a capability that most neural implicit models lack. It results in a more interoperable scene representation since the scene appearance and geometry are directly expressed by the location, shape, and color attributes of Gaussians. However, due to the lack of 3D cues and surface constraints during optimization, artifacts and ambiguities are likely to occur, resulting in floaters and poor surface reconstruction. Scenes often contain millions of Gaussians, and their properties are directly modified by gradient descent based on photometric losses only. Little focus has been given to exploring better regularization techniques that result in visually and geometrically smoother and more plausible 3D reconstructions that can be converted into meshes, an important downstream application.

Although many modern smartphones are equipped with low-resolution depth sensors, these are rarely used for novel view synthesis tasks. Motivated by this and advances in depth and normal estimation networks [3, 53, 7, 1], we explore the regularization of 3D Gaussian splatting with these geometric priors. Our goal is to enhance both photorealism and surface reconstruction in challenging indoor scenes. By designing an optimization strategy for 3D Gaussian splatting with depth and normal priors, we improve novel view synthesis results over baselines while respecting the captured scene geometry. We regularize the position of Gaussians with an edge-aware depth constraint and estimate normals from Gaussians to align them with the real surface boundaries estimated via monocular networks. We show how this simple regularization strategy, illustrated in Fig. 1, enables the extraction of meshes from the Gaussian scene representation, resulting in smoother and more geometrically accurate reconstructions. In summary, we make the following contributions:

  • We design an edge-aware depth loss for Gaussian splatting depth regularization to improve reconstruction on indoor scenes with imperfect depth estimates.

  • We use monocular normal priors to align Gaussians with the scene geometry and demonstrate how this aids reconstruction compared to alternative approaches.

  • We showcase how this regularization strategy enables efficient mesh extraction directly from the Gaussian scene.

2 Related work

Here we give an introduction to image-based rendering (IBR) methods for scene reconstruction and an overview of prior regularization strategies.

2.1 Traditional IBR

Reconstructing 3D geometry from images is a long-standing problem in computer vision. Traditional pipelines based on Structure-from-Motion (SfM) [43, 40] and multi-view stereo (MVS) [9] focused on reconstructing geometry as a sparse set of 3D points from images [21, 39]. The corresponding learning-based approaches [13, 30, 28, 32] usually replace some parts of the pipeline with differentiable modules, leading to a more accurate scene representation. Other work focuses on dense point reconstruction, normal estimation [41], or constructing triangle meshes from point sets [18, 27] to render novel views. However, traditional methods yield poor reconstruction results in large-scale, textureless scenes. In this paper, we demonstrate that incorporating geometric cues can significantly enhance scene reconstructions.

2.2 Neural implicit IBR

The greatest success has been achieved with neural inverse rendering methods, most notably NeRF [31], which applies volume rendering [17] to represent scenes as continuous volumes with attributes encoded within a neural network. However, the 3D geometry extracted from these scenes is often ill-defined and suffers from artifacts and floaters. Subsequent work has focused on improving the rendering quality and scene reconstruction through regularization techniques and by adopting other scene representations such as signed distance functions [58] or occupancy grids.

Prior regularization. Prior regularization of neural implicit models has been an active area of research. Previous NeRF-based approaches add depth regularization to explicitly supervise ray termination [6, 51] or impose smoothness constraints [34] on rendered depth maps. Other works explore regularizing with multi-view consistency [10, 6, 24] in sparse view settings. For SDF-based models, Manhattan-SDF [11] uses planar constraints on walls and flat surfaces to improve indoor reconstruction, and MonoSDF [58] uses depth and normal monocular estimates for scene geometry regularization. In this work, we investigate the regularization of 3D Gaussian splatting optimization with depth and normal priors to enhance photometric and geometric reconstruction.

Meshable implicit representations. Surface extraction as triangle meshes is an important problem since most computer graphics pipelines still rely on triangle rasterization. Watertight meshes also provide a good approximation of scene geometry and surface quality, leading to the development of various metrics for mesh quality. Prior work has focused on extracting meshes from NeRF representations [37, 46, 45] with some success, but these methods often rely on expensive post-refinement stages. Most state-of-the-art techniques use SDF or occupancy representations [58, 54, 50, 25] combined with marching cubes [27] to achieve finer details. These methods involve querying and evaluating dense 3D volumes, often at multiple levels of detail, and are generally slow to train. In this work, we investigate extracting meshable surfaces directly from the explicit Gaussian scene representation.

Meshable 3D Gaussians. Extracting meshable surfaces from Gaussian primitives is a relatively new topic. Keselman et al. [20] propose generating an oriented point set from a trained Gaussian scene to be meshed with Poisson reconstruction [18], using back-projected depth maps and analytically estimated normals. However, without regularization, this approach results in noisy point clouds that are difficult to mesh. NeuSG [4] addresses this by jointly training a dense SDF neural implicit model with the Gaussian scene, aligning Gaussians with SDF-estimated normals. While this method produces good reconstructions, it has long training times — over 16 hours on high-end GPUs — diminishing the appeal of 3DGS.

SuGaR [12] proposes treating the positions of Gaussians and their densities as intersections of a level set and optimizes a signed-distance loss to converge Gaussians to the surface. Normals are estimated from the derivative of the signed distance, similar to Keselman et al. However, due to the lack of geometric priors, the reconstructions remain noisy. SuGaR refines the coarse mesh through differentiable optimization, making the process computationally costly. In contrast, the recent concurrent method 2DGS [16] proposes to use a Gaussian surfel representation to explicitly model planar surfaces. However, as shown in Table 1 and Table 2, without prior regularization, the results on outward-facing indoor datasets are poor. We demonstrate how to improve mesh extraction and novel view synthesis using depth and normal priors during optimization.

3 Preliminaries

Our work builds on 3D Gaussian splatting (3DGS, [19]) and we briefly describe the rasterization algorithm. 3DGS represents a scene with differentiable 3D Gaussian primitives parameterized by their mean $\bm{\mu} \in \mathbb{R}^{3}$ and covariance matrix $\bm{\Sigma} \in \mathbb{R}^{3 \times 3}$, which is decomposed into a scaling vector $\bm{s} \in \mathbb{R}^{3}$ and a rotation quaternion $\bm{q} \in \mathbb{R}^{4}$. Other parameters include opacity $o \in \mathbb{R}$ and color $\mathbf{c} \in \mathbb{R}^{3}$, represented via spherical harmonics. Rendering a new view involves projecting 3D Gaussians into camera space as 2D Gaussians. These 2D Gaussians are sorted by z-depth in a single global sort and alpha-composited using the discrete volume rendering equation to produce pixel colors $\hat{\mathbf{C}}$:

$$\hat{\mathbf{C}} = \sum_{i \in N} \mathbf{c}_i \alpha_i T_i, \quad \text{where } T_i = \prod_{j=1}^{i-1} (1 - \alpha_j) \tag{1}$$

where $T_i$ is the accumulated transmittance at pixel location $\bm{p}$ and $\alpha_i$ is the blending coefficient for a Gaussian with center $\bm{\mu}_i$ in screen space:

$$\alpha_i = o_i \cdot \exp\left(-\frac{1}{2} (\bm{p} - \bm{\mu}_i)^{\intercal} \bm{\Sigma}_i^{-1} (\bm{p} - \bm{\mu}_i)\right). \tag{2}$$

Gaussian projection and blending operations are parallelized, allowing for real-time rendering performance. The scene is initialized with Gaussian means and colors obtained from SfM [40, 41]. During optimization, the parameters of the Gaussians are updated via gradient descent through many rendering iterations to best fit the training dataset images. The algorithm progressively culls, splits, and duplicates Gaussians in the scene at fixed intervals based on Gaussian opacity, screen-space size, and the magnitude of the gradient of Gaussian means, respectively.
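As an illustration of Eqs. 1 and 2, the following NumPy sketch composites a short list of Gaussians for a single pixel. It assumes the Gaussians have already been projected to 2D and sorted front to back; the function names `gaussian_alpha` and `composite_color` are hypothetical and are not part of 3DGS or gsplat, whose rasterizers run as tiled CUDA kernels.

```python
import numpy as np

def gaussian_alpha(p, mu, cov2d, opacity):
    """Eq. 2: blending coefficient of one projected 2D Gaussian at pixel p."""
    d = np.asarray(p, dtype=float) - np.asarray(mu, dtype=float)
    return opacity * np.exp(-0.5 * d @ np.linalg.inv(cov2d) @ d)

def composite_color(colors, alphas):
    """Eq. 1: front-to-back alpha compositing of depth-sorted Gaussians.

    colors: (N, 3) per-Gaussian colors, alphas: (N,) blending coefficients.
    Returns the pixel color and the per-Gaussian weights alpha_i * T_i.
    """
    colors = np.asarray(colors, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    # T_i = prod_{j<i} (1 - alpha_j); the nearest Gaussian has T_1 = 1.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = alphas * transmittance
    return weights @ colors, weights
```

For two half-transparent Gaussians, the first contributes with weight 0.5 and the second with weight 0.25, since only half of the light passes the first one.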

4 Method

In Section 4.1, we use sensor and monocular depth priors to regularize Gaussian positions with an edge-aware loss. In Section 4.2, we extract normal directions from Gaussians, supervise them with monocular normal cues, and add a smoothing prior on rendered normal maps to better align nearby Gaussians. Section 4.3 combines these terms into the final optimization objective. Lastly, in Section 4.4, we use the optimized Gaussian scene to directly extract meshes using Poisson surface reconstruction.

4.1 Leveraging depth cues

Depth prediction. Per-pixel z-depth estimates $\hat{D}$ are rendered using the discrete volume rendering approximation, analogously to color values:

$$\hat{D} = \sum_{i \in N} d_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j) \tag{3}$$

where $d_i$ is the z-depth coordinate of the $i^{\text{th}}$ Gaussian in view space. Since 3DGS does not sort Gaussians individually per pixel along a viewing ray and instead relies on a single global sort for efficiency, this is only an approximation of per-pixel depth. The z-depth ordering for nearby Gaussians and specific orientations is not guaranteed to be correct after 2D projection and a global sort, as explained in StopThePop [36]. However, sorting by depth for each pixel would be too computationally costly since scenes can contain millions of Gaussians. Nevertheless, this approximation remains effective, particularly for the more regular geometries typically encountered in indoor datasets. We normalize depth estimates with the final accumulated transmittance $T_i$ per pixel, ensuring that we correctly estimate a depth value for pixels where the accumulated transmittance does not equal 1. We rasterize color and depths simultaneously per pixel in a single CUDA kernel forward pass, improving inference and training speed compared to separate rendering steps.
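A minimal per-pixel sketch of Eq. 3 with normalization is given below. It assumes that the normalization divides the composited depth by the accumulated blending weight $\sum_i \alpha_i T_i$, which is one reading of the normalization described above; the function name `render_depth` and the `eps` guard are hypothetical.

```python
import numpy as np

def render_depth(z_depths, alphas, eps=1e-10):
    """Composite per-Gaussian z-depths for one pixel (Eq. 3), then normalize.

    z_depths: (N,) view-space z-depths, sorted front to back.
    alphas:   (N,) blending coefficients of the same Gaussians.
    """
    z = np.asarray(z_depths, dtype=float)
    a = np.asarray(alphas, dtype=float)
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - a)[:-1]))
    weights = a * transmittance            # alpha_i * T_i
    depth = np.sum(weights * z)            # Eq. 3
    accumulated = np.sum(weights)          # total blending weight along the ray
    # Assumed normalization: without it, semi-transparent pixels would
    # report depths that are biased towards zero.
    return depth / max(accumulated, eps)
```

With depths 2 and 4 and both coefficients equal to 0.5, Eq. 3 alone gives 2.0, whereas the normalized value is 8/3, which lies between the two Gaussian depths as expected.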

Sensor depth regularization. We directly apply depth regularization on predicted depth maps for datasets containing lidar or sensor depth measurements [38, 56, 44]. Common commercial depth sensors, especially low-resolution variants found in consumer devices like iPhones, often produce non-smooth edges at object boundaries and provide inaccurate readings. Based on this observation and inspired by [5, 22], we propose a gradient-aware depth loss for adaptive depth regularization based on the current RGB image. The depth loss is lowered in regions with large image gradients, signifying edges, ensuring that regularization is more enforced on smoother texture-less regions that typically pose challenges for photometric loss alone. Additionally, our experiments (cf. Table 5) show that using a logarithmic penalty results in smoother reconstructions compared to linear or quadratic penalties. This insight drives our formulation of the gradient-aware depth loss, which effectively balances the regularization across different regions of the image, adapting to the scene’s geometry and texture complexity. Specifically, the gradient-aware depth loss is defined as follows:

$$\mathcal{L}_{\hat{D}} = g_{\text{rgb}} \frac{1}{|\hat{D}|} \sum \log\left(1 + \|\hat{D} - D\|_1\right) \tag{4}$$

where $g_{\text{rgb}} = \exp(-\nabla I)$ and $\nabla I$ is the gradient of the current aligned RGB image. $|\hat{D}|$ indicates the total number of pixels in $\hat{D}$.
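The loss in Eq. 4 can be sketched in NumPy as follows. Several details are assumptions made for illustration only: the image gradient is taken as the sum of absolute forward differences of the grayscale image, the weight $g_{\text{rgb}}$ is applied per pixel inside the mean, and the function name `gradient_aware_depth_loss` is hypothetical.

```python
import numpy as np

def gradient_aware_depth_loss(pred_depth, sensor_depth, rgb):
    """Edge-aware logarithmic depth loss in the spirit of Eq. 4.

    pred_depth, sensor_depth: (H, W) depth maps; rgb: (H, W, 3) image in [0, 1].
    """
    gray = rgb.mean(axis=-1)
    # Forward differences; the last column/row keeps a zero gradient (assumed).
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    gx[:, :-1] = np.abs(gray[:, 1:] - gray[:, :-1])
    gy[:-1, :] = np.abs(gray[1:, :] - gray[:-1, :])
    g_rgb = np.exp(-(gx + gy))             # close to 1 in flat regions, smaller at edges
    log_err = np.log1p(np.abs(pred_depth - sensor_depth))
    return float(np.mean(g_rgb * log_err))
```

On a uniform image the weight equals 1 everywhere, so a constant depth error of 1 yields a loss of log 2; introducing an image edge lowers the loss at the edge pixels.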

Monocular depth regularization. For datasets containing no depth data, we rely on scale-aligned monocular depth estimation networks for regularization. We use off-the-shelf monocular depth networks, such as ZoeDepth [3] and DepthAnything [53], for dense per-pixel depth priors. We address the scale ambiguity between estimated depths and the scene by comparing them with sparse SfM points, similar to prior work [58, 5]. Specifically, for each monocular depth estimate $D_{\text{mono}}$, we align the scale to match that of the sparse depth map $D_{\text{sparse}}$ obtained by projecting SfM points to the camera view. We solve for a per-image scale $a$ and shift $b$ parameter using the closed-form linear regression solution to:

$$\hat{a}, \hat{b} = \operatorname*{arg\,min}_{a,b} \sum_{ij} \left\| (a \cdot D_{\text{mono},ij} + b) - D_{\text{sparse},ij} \right\|_2^2 \tag{5}$$

where we denote $D_{\text{sparse},ij}$ and $D_{\text{mono},ij}$ as per-pixel correspondences between the two depth maps. We then apply the same loss as in Eq. 4 for regularization.
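The alignment in Eq. 5 is an ordinary least-squares fit of a line, which the sketch below solves with `numpy.linalg.lstsq`. The function name `align_scale_shift` and the convention that pixels without an SfM point hold a depth of zero are assumptions for illustration.

```python
import numpy as np

def align_scale_shift(d_mono, d_sparse, valid=None):
    """Solve Eq. 5 for a per-image scale a and shift b.

    d_mono:   (H, W) monocular depth estimate.
    d_sparse: (H, W) sparse depth from projected SfM points.
    valid:    (H, W) boolean mask of pixels that have an SfM depth;
              defaults to d_sparse > 0 (assumed convention).
    """
    if valid is None:
        valid = d_sparse > 0
    x = d_mono[valid]
    y = d_sparse[valid]
    # Design matrix [D_mono, 1] so that A @ [a, b] approximates D_sparse.
    A = np.stack([x, np.ones_like(x)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(a), float(b)
```

The aligned prior is then `a * d_mono + b`, which can be used as $D$ in Eq. 4.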

4.2 Leveraging normal cues

Normal prediction. During optimization, we expect Gaussians to become flat and disc-like, with one scaling axis much smaller than the other two. This smaller scaling axis serves as an approximation of the normal direction. Specifically, we define a geometric normal for a Gaussian using a rotation matrix $R \in SO(3)$, obtained from its quaternion $\bm{q}$, and scaling coefficients $\bm{s} \in \mathbb{R}^{3}$:

$$\hat{\bm{n}}_i = R \cdot \text{OneHot}(\operatorname*{arg\,min}(s_1, s_2, s_3)) \tag{6}$$

where $\text{OneHot}(\cdot) \in \mathbb{R}^{3}$ returns a unit vector with zeros everywhere except at the position where the scaling $\bm{s}_i = (s_1, s_2, s_3)$ is minimum. We minimize one of the scaling axes during training to force Gaussians to become disc-like surfels:

$$\mathcal{L}_{\text{scale}} = \sum_i \left\| \operatorname*{arg\,min}(\bm{s}_i) \right\|_1. \tag{7}$$

To ensure correct orientations, we flip the direction of the normals at the beginning of training if the dot product between the current camera viewing direction and the Gaussian normal is negative. Normals are transformed into camera space using the current camera transform and alpha-composited according to the rendering equation to provide a single per-pixel normal estimate:

$$\hat{\bm{N}} = \sum_{i \in N} \hat{\bm{n}}_i \alpha_i T_i. \tag{8}$$

In contrast to prior research [26, 8], which appends an additional learnable parameter per Gaussian for normal prediction, our approach derives normals directly from the geometry. Consequently, during back-propagation, adjustments to Gaussian scale and rotation parameters, i.e., covariance matrices, directly lead to updates in the normal estimates. Therefore, no additional learnable parameters are needed. Intuitively, this results in Gaussians better conforming to the scene’s geometry, as their orientations and scales are compelled to align with the surface normal.
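The geometric normal of Eq. 6 and the scale penalty of Eq. 7 can be sketched as below. The quaternion order (w, x, y, z), the use of already activated (positive) scales, and the reading of Eq. 7 as penalizing the value of the smallest scale are assumptions; the function names are hypothetical.

```python
import numpy as np

def quat_to_rotmat(q):
    """Rotation matrix from a quaternion (w, x, y, z); q is normalized first."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def gaussian_normal(q, s):
    """Eq. 6: rotate the one-hot vector of the shortest scaling axis."""
    one_hot = np.zeros(3)
    one_hot[np.argmin(s)] = 1.0
    return quat_to_rotmat(q) @ one_hot

def orient_normal(normal, view_dir):
    """Flip the normal if its dot product with the viewing direction is negative."""
    return -normal if np.dot(normal, view_dir) < 0 else normal

def scale_loss(scales):
    """Eq. 7, read as the sum of each Gaussian's smallest scale (assumed)."""
    return float(np.sum(np.min(np.abs(scales), axis=1)))
```

Because the normal is a column of the rotation matrix selected by the scales, gradients on rendered normals flow into the rotation quaternion without any extra per-Gaussian parameter.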

Monocular normals regularization. Gao et al. [8] propose using pseudo-ground truth normal maps estimated from the gradient of rendered depths for supervision, referred to as $\nabla \hat{D}$. However, due to noise in rendered depth maps, especially in complex scenes, this method results in artifacts. Instead, we supervise predicted normals using monocular cues obtained from Omnidata [7], which provide much smoother normal estimates. Fig. 2 highlights this difference. We regularize with an L1 loss:

$$\mathcal{L}_{\hat{N}} = \frac{1}{|\hat{N}|} \sum \|\hat{\bm{N}} - \bm{N}\|_1. \tag{9}$$

We further apply a prior on the total variation of predicted normals, encouraging smooth normal predictions at neighboring pixels with:

$$\mathcal{L}_{\text{smooth}} = \frac{1}{|\hat{N}|} \sum_{i,j} \left( |\hat{\bm{N}}_{i+1,j} - \hat{\bm{N}}_{i,j}| + |\hat{\bm{N}}_{i,j+1} - \hat{\bm{N}}_{i,j}| \right) \tag{10}$$

where $\hat{\bm{N}}_{i,j}$ represents estimated normal values at pixel position $(i, j)$. Thus, our normal regularization loss is defined as follows: $\mathcal{L}_{\text{normal}} = \mathcal{L}_{\hat{N}} + \mathcal{L}_{\text{smooth}}$.
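Both normal terms (Eqs. 9 and 10) operate on rendered normal maps and can be sketched as below. The absolute value of a vector difference is interpreted as the L1 norm over the three channels, and differences that would reach beyond the image border are skipped; both choices, as well as the function names, are assumptions.

```python
import numpy as np

def normal_l1_loss(pred_normals, prior_normals):
    """Eq. 9: mean per-pixel L1 distance between rendered and prior normals.

    Both inputs have shape (H, W, 3).
    """
    return float(np.abs(pred_normals - prior_normals).sum(axis=-1).mean())

def normal_smooth_loss(pred_normals):
    """Eq. 10: total variation of the rendered normal map."""
    h, w, _ = pred_normals.shape
    dv = np.abs(pred_normals[1:, :, :] - pred_normals[:-1, :, :]).sum()  # i+1 vs i
    dh = np.abs(pred_normals[:, 1:, :] - pred_normals[:, :-1, :]).sum()  # j+1 vs j
    return float((dv + dh) / (h * w))
```

A constant normal map has zero smoothness loss, whereas a map that switches from one axis-aligned normal to another between adjacent columns is penalized only along that boundary.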

Normal initialization. We initialize Gaussian orientations by estimating normal directions from the initial SfM point cloud using [60], aligning the Gaussian orientations $\bm{q}$ based on these estimated normals, and setting one of the scaling axes to be smaller than the others. This initialization helps to speed up convergence.

(a) Grad: $\nabla \hat{D}$    (b) Mono: $\mathcal{L}_{\hat{N}}$    (c) Omnidata GT    (d) iPhone RGB
Figure 2: Depth gradient vs. monocular normal supervision strategy. (a) We observe that using pseudo normal maps derived from the gradient of rendered depths [8] for supervision leads to noisy predicted normals compared to (b) normal supervision by estimates from a pretrained (c) Omnidata model [7].

4.3 Optimization

The final loss we use for optimization is defined as follows:

$$\mathcal{L} = \mathcal{L}_{\text{rgb}} + \lambda_{\text{depth}} \mathcal{L}_{\hat{D}} + \mathcal{L}_{\text{scale}} + \Big( \underbrace{\lambda_{\text{normal}} \mathcal{L}_{\hat{N}} + \lambda_{\text{smooth}} \mathcal{L}_{\text{smooth}}}_{\mathcal{L}_{\text{normal}}} \Big) \tag{11}$$

where $\mathcal{L}_{\text{rgb}}$ is the original photometric loss proposed in [19]. We set $\lambda_{\text{depth}} = 0.2$, $\lambda_{\text{normal}} = 0.1$, and $\lambda_{\text{smooth}} = 0.5$ in our experiments.
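For reference, the weighting of Eq. 11 amounts to the following combination of already computed scalar loss terms. The function name `total_loss` and its argument names are hypothetical; only the default weights are taken from the text.

```python
def total_loss(l_rgb, l_depth, l_scale, l_normal, l_smooth,
               lambda_depth=0.2, lambda_normal=0.1, lambda_smooth=0.5):
    """Eq. 11: weighted sum of photometric, depth, scale, and normal terms."""
    normal_term = lambda_normal * l_normal + lambda_smooth * l_smooth
    return l_rgb + lambda_depth * l_depth + l_scale + normal_term
```

Note that the photometric and scale terms enter with unit weight, so the depth and normal weights are relative to them.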

4.4 Meshing

After optimizing with our depth and normal regularization using Eq. 11, we apply Poisson surface reconstruction [18] to extract a mesh. SuGaR [12] proposes to extract a coarse mesh from the scene by projecting rays from camera views and linearly interpolating intersection points based on the value of the local density (cf. Eq. 2) derived from nearby Gaussians along a ray. However, since scenes contain thousands of Gaussians, the local density along nearby points is non-smooth, resulting in noisy surfaces (cf. Table 7). In the context of depth regularization, we have already ensured that the positions of the Gaussians are well-distributed and aligned along the surface of the scene. Therefore, we directly back-project rendered depth and normal maps from training views to create an oriented point set for meshing. We show qualitative and quantitative differences between various Poisson mesh extraction methods in Table 7.
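The back-projection step can be sketched as follows for a single training view. The sketch assumes a pinhole camera with intrinsics `K`, a camera frame with the z-axis pointing forward, pixel centers at half-integer coordinates, and a 4x4 camera-to-world matrix; these conventions and the function name `backproject_oriented_points` are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def backproject_oriented_points(depth, normals_cam, K, cam_to_world):
    """Lift a rendered z-depth map and camera-space normal map to world space.

    depth: (H, W), normals_cam: (H, W, 3), K: (3, 3), cam_to_world: (4, 4).
    Returns (M, 3) points and (M, 3) normals for pixels with positive depth.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pixels @ np.linalg.inv(K).T         # camera-space rays with z = 1
    points_cam = rays * depth.reshape(-1, 1)   # z-depth scales the z = 1 ray
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    points_world = points_cam @ R.T + t
    normals_world = normals_cam.reshape(-1, 3) @ R.T  # rotate only, no translation
    valid = depth.reshape(-1) > 0
    return points_world[valid], normals_world[valid]
```

Concatenating the outputs over all training views gives an oriented point set that can be passed to a Poisson surface reconstruction implementation, for example the one provided by Open3D.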

5 Experiments

In this section, we demonstrate the proposed regularization strategy on mesh extraction and novel view synthesis results using indoor datasets.

Datasets. We focus on indoor datasets and consider the following: a) MuSHRoom [38]: a real-world indoor dataset containing separate training and evaluation camera trajectories; b) ScanNet++ [56]: a real-world indoor dataset with high fidelity 3D geometry and RGB data; and c) Replica [44]: smaller real-world indoor scans.

Baselines. We consider a range of baseline methods for comparison: a) the state-of-the-art NeRF-based method Nerfacto [45]; b) its depth-regularized version Depth-Nerfacto, which applies a direct loss on the ray termination distribution for depth supervision, similar to DS-NeRF [6]; c) Neusfacto [57] and MonoSDF [58] for SDF-based implicit surface reconstruction; d) the baseline 3DGS method Splatfacto, based on Nerfstudio v1.1.3 [45]; e) SuGaR [12], a 3DGS variant for mesh reconstruction; and f) the recent 2DGS [16] method.

Evaluation metrics. We follow standard practice and use PSNR, SSIM and LPIPS metrics for color images and report common depth metrics, similar to [51, 23, 29, 42, 33, 47, 61], to analyze depth quality for datasets containing sensor or ground truth depth data. For mesh evaluation, we follow the evaluation protocol from [38, 49] and report Accuracy (Acc.), Completion (Comp.), Chamfer-$L_1$ distance (C-$L_1$), Normal Consistency (NC), and F-scores (F1) with a threshold of 5 cm.
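To make the mesh metrics concrete, the sketch below computes Accuracy, Completion, Chamfer-$L_1$, and F-score from two point sets sampled on the predicted and ground-truth meshes. It uses commonly adopted definitions and a brute-force nearest-neighbor search; the exact protocol of [38, 49] (sampling density, culling, Normal Consistency) may differ and is not reproduced here, and the function names are hypothetical.

```python
import numpy as np

def nearest_distances(src, dst):
    """Distance from every point in src (N, 3) to its nearest point in dst (M, 3)."""
    diff = src[:, None, :] - dst[None, :, :]
    return np.linalg.norm(diff, axis=-1).min(axis=1)

def mesh_metrics(pred_pts, gt_pts, threshold=0.05):
    """Point-set metrics in meters; threshold defaults to 5 cm."""
    d_pred = nearest_distances(pred_pts, gt_pts)   # prediction -> ground truth
    d_gt = nearest_distances(gt_pts, pred_pts)     # ground truth -> prediction
    accuracy, completion = d_pred.mean(), d_gt.mean()
    precision = (d_pred < threshold).mean()
    recall = (d_gt < threshold).mean()
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom > 0 else 0.0
    return {
        "accuracy": float(accuracy),
        "completion": float(completion),
        "chamfer_l1": float(0.5 * (accuracy + completion)),
        "f_score": float(f_score),
    }
```

The pairwise distance matrix grows with the product of the two point counts, so a spatial index such as a k-d tree would be preferable for point sets of realistic size.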

Implementation details. The proposed approach is implemented in PyTorch [35] and gsplat [55] (v1.0.0). We train all models for 30k iterations. To obtain monocular normal cues, we pass RGB images through the pre-trained Omnidata model [7]. For Poisson reconstruction, we extract a total of 2 million points and use a depth level of 9 for all methods, unless otherwise stated. All meshes are extracted by back-projecting depth and/or normal maps, except for Neusfacto and MonoSDF (marching cubes) and 2DGS (TSDF). Further settings are given in the supplementary material.

[Figure 3 panels. Scenes: ScanNet++ 8b5caf3398 and MuSHRoom vr_room, each shown as normal and mesh renders. Methods: Nerfacto, Depth-Nerfacto, MonoSDF, Splatfacto, SuGaR, 2DGS, DN-Splatter, GT.]
Figure 3: Mesh reconstruction results. NeRF variants, even with depth supervision, suffer from artefacts and floaters in reconstruction. The Gaussian-based methods Splatfacto, SuGaR, and 2DGS are trained with only photometric losses and thus severely struggle to capture the scene geometry in low-texture environments. However, adding depth and normal supervision with DN-Splatter greatly aids reconstruction quality.
| | Method | Sensor Depth | Algorithm | Accuracy ↓ | Completion ↓ | Chamfer-$L_1$ ↓ | Normal Consistency ↑ | F-score ↑ |
|---|---|---|---|---|---|---|---|---|
| NeRF | Nerfacto [45] | - | Poisson | .0430 | .0578 | .0504 | .7822 | .7212 |
| NeRF | Depth-Nerfacto [45] | ✓ | Poisson | .0447 | .0557 | .0502 | .7614 | .6966 |
| SDF | MonoSDF [58] | ✓ | Marching-Cubes | .0310 | .0190 | .0250 | .8846 | .9211 |
| Gaussian | Splatfacto [45] | - | Poisson | .0749 | .0555 | .0652 | .7727 | .5835 |
| Gaussian | SuGaR [12] | - | Poisson+IBR | .0656 | .0583 | .0620 | .8031 | .6378 |
| Gaussian | 2DGS [16] | - | TSDF | .0731 | .0642 | .0687 | .8008 | .6039 |
| Gaussian | DN-Splatter (Ours) | ✓ | Poisson | .0239 | .0194 | .0216 | .8822 | .9243 |
Table 1: Mesh evaluation: MuSHRoom. The results are averaged over 6 scenes: "coffee_room", "honka", "kokko", "sauna", "computer", and "vr_room". Incorporating depth and normal priors into 3DGS optimization significantly improves mesh reconstruction on challenging real-world indoor datasets. The best and second best results are marked with bold and underline.
| | Method | Sensor Depth | Algorithm | Accuracy ↓ | Completion ↓ | Chamfer-$L_1$ ↓ | Normal Consistency ↑ | F-score ↑ |
|---|---|---|---|---|---|---|---|---|
| NeRF | Nerfacto [45] | - | Poisson | .1305 | .1484 | .1394 | .7153 | .4698 |
| NeRF | Depth-Nerfacto [45] | ✓ | Poisson | .0731 | .1647 | .1189 | .6848 | .5018 |
| SDF | Neusfacto [57] | - | Marching-Cubes | .0736 | .1945 | .1340 | .7159 | .4605 |
| SDF | MonoSDF [58] | ✓ | Marching-Cubes | .0303 | .0573 | .0438 | .8881 | .8577 |
| Gaussian | Splatfacto [45] | - | Poisson | .1934 | .1503 | .1719 | .6741 | .1790 |
| Gaussian | SuGaR [12] | - | Poisson+IBR | .0940 | .1011 | .0975 | .7241 | .4367 |
| Gaussian | 2DGS [16] | - | TSDF | .1272 | .0798 | .1035 | .7799 | .4196 |
| Gaussian | DN-Splatter (Ours) | ✓ | Poisson | .0487 | .0873 | .0680 | .8404 | .6941 |
Table 2: Mesh evaluation: ScanNet++. The results are averaged over the "b20a261fdf" and "8b5caf3398" scenes of the dataset. The best and second best results are marked with bold and underline.

5.1 Mesh evaluation

We demonstrate the effectiveness of the proposed depth and normal regularization strategy on scene geometry by extracting meshes directly after optimization, without any additional refinement steps. In Table 1 and Table 2 we show how incorporating geometric cues enables competitive mesh extraction on scenes from the MuSHRoom [38] and ScanNet++ [56] datasets. We first observe that without the use of monocular depth cues, neither NeRF-based [45] nor Gaussian-based methods [16] work well for indoor reconstruction. Improving depth and normal estimation over the baseline Splatfacto [45] makes our method competitive even against the more computationally expensive SDF approaches [58, 57]. NeRF-based methods fail to consistently achieve similar results, with depth supervision sometimes hindering mesh performance. Fig. 3 provides a qualitative comparison of our mesh extraction approach with other NeRF and 3DGS baselines on scenes from the ScanNet++ and MuSHRoom datasets.

5.2 Novel view synthesis and depth estimation

[Figure 4 panels, depth and RGB renders: (a) Nerfacto, PSNR: 20.96; (b) Depth-Nerfacto, PSNR: 20.86; (c) MonoSDF, PSNR: 20.95; (d) 2DGS, PSNR: 21.94; (e) Splatfacto, PSNR: 22.66; (f) Ours, PSNR: 23.01; (g) iPhone GT.]
Figure 4: Qualitative comparison of depth and RGB renders against a variety of baselines. DN-Splatter achieves the highest novel view synthesis results compared to NeRF, SDF, and Gaussian based methods.
| Method | Sensor Depth | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | $\delta < 1.25$ ↑ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|---|---|---|---|
| Nerfacto [45] | - | .0862 / .0747 | .0293 / .0141 | .0794 / .0667 | .9335 / .9428 | 20.86 / 20.66 | .7859 / .7633 | .2321 / .2702 |
| Depth-Nerfacto [45] | ✓ | .0727 / .0563 | .0155 / .0283 | .0583 / .2840 | .9469 / .9389 | 21.24 / 18.93 | .7832 / .7023 | .2414 / .3978 |
| MonoSDF [58] | ✓ | .0555 / .0911 | .0263 / .0804 | .2493 / .4535 | .9353 / .8825 | 20.68 / 20.16 | .7357 / .7653 | .3590 / .2261 |
| 2DGS [16] | - | .0864 / .0923 | .0612 / .0583 | .3799 / .3132 | .8820 / .8927 | 22.52 / 21.73 | .8185 / .7898 | .1773 / .1911 |
| Splatfacto [45] | - | .0787 / .0817 | .0364 / .0521 | .2407 / .2941 | .9072 / .9061 | 24.44 / 21.33 | .8486 / .7821 | .1387 / .2240 |
| Splatfacto + $\mathcal{L}_{\hat{D}}$ | ✓ | .0234 / .0364 | .0092 / .0145 | .1293 / .1486 | .9849 / .9745 | 24.77 / 21.95 | .8538 / .7948 | .1238 / .1852 |
| Splatfacto + $\mathcal{L}_{\hat{D}}$ + $\mathcal{L}_{\text{normal}}$ | ✓ | .0241 / .0340 | .0094 / .0123 | .1308 / .1472 | .9848 / .9777 | 24.67 / 21.99 | .8517 / .7941 | .1275 / .1864 |
| DN-Splatter (ours) | ✓ | .0228 / .0354 | .0089 / .0214 | .1280 / .2032 | .9854 / .9683 | 24.58 / 21.89 | .8558 / .7984 | .1293 / .1797 |
Table 3: Depth estimation and novel view synthesis: MuSHRoom. Results are reported as left / right for two distinct evaluation sets: left is a test set obtained by uniformly sampling every 10 frames within the training sequence, and right is a test split obtained from a different camera trajectory with no overlap with the training sequence. Results are averaged over 5 scenes.

We show an extensive study on depth and normal supervision and its effect on novel view synthesis and depth metrics on challenging real-world scenes from MuSHRoom and ScanNet++ datasets. Table 3 demonstrates that incorporating sensor depth supervision into 3DGS enhances depth and RGB metrics in comparison to scenarios without geometric supervision. We qualitatively highlight the improvements in novel view synthesis and depth quality with the reduction of floaters and artefacts in Fig. 4.
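For reference, the depth metrics reported in Table 3 are commonly computed as in the sketch below. It uses the standard definitions from the monocular depth estimation literature; the paper's exact definitions are given in its supplementary material, so this is an illustrative assumption rather than the evaluation code. The function name `depth_metrics` is hypothetical, and inputs are flat lists of valid (positive) depth values in metres.

```python
import math


def depth_metrics(pred, gt):
    """Standard depth metrics over paired predicted / reference depths."""
    n = len(gt)
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    # Fraction of pixels whose ratio to the reference is within a factor of 1.25.
    delta_125 = sum(max(p / g, g / p) < 1.25 for p, g in zip(pred, gt)) / n
    return {
        "abs_rel": abs_rel,
        "sq_rel": sq_rel,
        "rmse": rmse,
        "delta_1.25": delta_125,
    }


# Three pixels: one exact, one 50% too far, one 100% too far.
result = depth_metrics(pred=[1.0, 3.0, 4.0], gt=[1.0, 2.0, 2.0])
```

In this toy case only the first pixel falls inside the 1.25 ratio band, so the threshold accuracy is one third even though the absolute errors are of similar magnitude to the depths themselves.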

| Method | C-$L_1$ ↓ | NC ↑ | F1 ↑ | PSNR ↑ | SSIM ↑ | Time (min) |
|---|---|---|---|---|---|---|
| No Cues ($\mathcal{L}_{\text{rgb}}$) [19, 45] | 6.99 | 78.0 | 56.3 | 26.6 | .870 | 12.9 |
| + Normal ($\mathcal{L}_{\hat{N}}$) | 7.7 | 81.3 | 60.2 | 26.7 | .874 | 17.2 |
| + Normal ($\mathcal{L}_{\hat{N}}$ + $\mathcal{L}_{\text{smooth}}$) | 6.7 | 82.6 | 58.9 | 26.4 | .866 | 17.3 |
| + Depth ($\mathcal{L}_{\hat{D}}$) | 2.6 | 86.8 | 86.6 | 27.1 | .885 | 13.1 |
| + Both ($\mathcal{L}_{\hat{D}}$ + $\mathcal{L}_{\hat{N}}$) | 2.4 | 90.0 | 87.7 | 26.9 | .883 | 17.5 |
| + Both ($\mathcal{L}_{\hat{D}}$ + $\mathcal{L}_{\hat{N}}$ + $\mathcal{L}_{\text{smooth}}$) | 2.5 | 89.6 | 87.1 | 26.8 | .879 | 17.9 |
Table 4: Ablation of geometric supervision. Geometric cues significantly improve reconstruction quality. We report mesh, novel view, and training time metrics (Nvidia 4090 GPU, 30k iterations) on the "VR room" sequence of the MuSHRoom dataset [38].
[Panels: (a) iPhone GT, (b) baseline, (c) $\mathcal{L}_{\text{MSE}}$, (d) $\mathcal{L}_{1}$, (e) $\mathcal{L}_{\text{LogL1}}$, (f) $\mathcal{L}_{\hat{D}}$ (Ours)]

| Method | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | $\delta < 1.25$ ↑ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|---|---|---|
| Splatfacto [45] | .1481 | .1345 | .5122 | .7602 | 23.41 | .9111 | .1316 |
| Splatfacto + $\mathcal{L}_{\text{MSE}}$ | .0551 | .0180 | .2104 | .9647 | 23.84 | .9140 | .1246 |
| Splatfacto + $\mathcal{L}_{1}$ | .0351 | .0170 | .1889 | .9758 | 23.74 | .9138 | .1235 |
| Splatfacto + $\mathcal{L}_{\text{LogL1}}$ | .0364 | .0180 | .1955 | .9744 | 23.77 | .9139 | .1234 |
| Splatfacto + $\mathcal{L}_{\hat{D}}$ | .0285 | .0162 | .1790 | .9782 | 23.61 | .9122 | .1269 |
Table 5: Ablation of depth losses: ScanNet++. We compare various depth supervision strategies. We observe that the proposed gradient-aware $\mathcal{L}_{\hat{D}}$ regularizer obtains the best qualitative results, mitigating uncertainties at edges from the raw iPhone depth captures. Zoom in to see the details.

5.3 Ablation studies

Regularization strategy. We evaluate various design choices of the proposed method in Table 4. The normal loss $\mathcal{L}_{\hat{N}}$ (Eq. 9) helps align Gaussians along the scene geometry, leading to a better scene geometry representation. The depth loss $\mathcal{L}_{\hat{D}}$ (Eq. 4) significantly improves reconstruction quality and novel-view synthesis in ambiguous, textureless regions, which are common in indoor scenes. The normal smoothing prior $\mathcal{L}_{\text{smooth}}$ (Eq. 10) enhances normal-completeness for Poisson meshing with a minimal impact on other metrics. Although the smoothing prior's effect on quantitative metrics is minimal, its importance is evident in the qualitative renders illustrated in Fig. 5.

Depth supervision and losses. We analyze the impact of the proposed depth loss on the ScanNet++ dataset, providing both qualitative and quantitative results in Table 5. We compare several common depth losses: $\mathcal{L}_{\text{MSE}}$, $\mathcal{L}_{1}$, $\mathcal{L}_{\text{LogL1}}$ [14], and our proposed edge-aware $\mathcal{L}_{\hat{D}}$ loss, examining their effect on depth and novel view synthesis metrics. Further comparisons of depth loss performance against ground truth lidar scans and definitions can be found in the supplementary material.
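To make the compared depth losses concrete, the sketch below implements per-pixel versions of the four variants on flat lists. The MSE and L1 forms are standard. The logarithmic form `log(1 + |d - d_hat|)` and the edge-aware weighting `exp(-|grad I|)` are assumptions made for illustration; the paper's own definitions are given by Eq. 4 and the supplementary material, which are not reproduced here. All function names are hypothetical.

```python
import math


def mse_loss(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def l1_loss(pred, target):
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def log_l1_loss(pred, target):
    # Logarithmic penalty: grows slowly for large residuals (assumed form).
    return sum(math.log(1.0 + abs(p - t)) for p, t in zip(pred, target)) / len(pred)


def edge_aware_log_l1_loss(pred, target, image_grad):
    # Assumed weighting: exp(-|grad I|) is near 1 in flat image regions and
    # near 0 at strong color edges, where sensor depth tends to be unreliable.
    total = sum(
        math.exp(-abs(g)) * math.log(1.0 + abs(p - t))
        for p, t, g in zip(pred, target, image_grad)
    )
    return total / len(pred)


rendered = [1.0, 2.0]
sensor = [1.0, 3.0]
flat_region = edge_aware_log_l1_loss(rendered, sensor, image_grad=[0.0, 0.0])
at_edge = edge_aware_log_l1_loss(rendered, sensor, image_grad=[0.0, 5.0])
```

With zero image gradient the edge-aware loss reduces to the plain logarithmic loss, while a large gradient at the mismatched pixel suppresses its contribution.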

[Figure 5 panels: (a) $\mathcal{L}_{\hat{N}}$; (b) $\mathcal{L}_{\hat{N}}$ + $\mathcal{L}_{\text{smooth}}$; (c) Omnidata GT; (d) iPhone RGB.]
Figure 5: Qualitative comparison of normal supervision. (a) We observe that using a direct normal loss $\mathcal{L}_{\hat{N}}$ results in non-smooth surface estimates, whereas (b) using a normal smoothing prior $\mathcal{L}_{\text{smooth}}$ significantly improves DN-Splatter's normal predictions according to (c) Omnidata [7] predictions.

We observe that $\mathcal{L}_{1}$ and $\mathcal{L}_{\text{LogL1}}$ generally perform the best on color metrics, with the gradient-based logarithmic variant providing the smoothest qualitative reconstructions.

In addition, in Table 6 we evaluate our regularization strategy against other alternatives. We demonstrate that mesh reconstruction using current state-of-the-art monocular [3, 15] or multi-view [59] depth networks is far less accurate compared to reconstruction using low-resolution iPhone sensor depths. Monocular and multi-view depth estimates are aligned using the strategy outlined in Section 4.1 and Eq. 5, and mesh metrics are computed on the ScanNet++ dataset. We also investigate the patch-based Pearson Correlation loss [52] to examine relative depth supervision with monocular depth [15] estimates. Although the method exceeds naive depth supervision with monocular estimates, the reconstruction quality remains inferior compared to using iPhone depths.
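The idea behind relative depth supervision with a Pearson correlation loss can be sketched as follows. The loss is `1 - corr` between a rendered depth patch and the corresponding monocular depth patch, so it is unaffected by any positive scale and shift of the monocular estimate. This is a simplified illustration in the spirit of [52], not that method's implementation; the function names, the flat-list patch representation, and the `eps` stabilizer are assumptions.

```python
import math


def pearson_correlation(x, y, eps=1e-8):
    """Pearson correlation of two equally sized flat patches."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # eps avoids division by zero on constant patches.
    return cov / (math.sqrt(var_x * var_y) + eps)


def patch_pearson_loss(rendered_patches, mono_patches):
    """Mean of (1 - correlation) over corresponding patches."""
    losses = [
        1.0 - pearson_correlation(r, m)
        for r, m in zip(rendered_patches, mono_patches)
    ]
    return sum(losses) / len(losses)


# The monocular patch is a scaled and shifted copy of the rendered patch.
rendered_patch = [1.0, 2.0, 3.0, 4.0]
mono_patch = [2.0 * d + 1.0 for d in rendered_patch]
aligned = patch_pearson_loss([rendered_patch], [mono_patch])
reversed_order = patch_pearson_loss([rendered_patch[::-1]], [mono_patch])
```

Because only the relative ordering and linear trend within a patch matter, such a loss sidesteps the metric alignment step, at the cost of providing no absolute scale.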

| Depth supervision | Acc. ↓ | Comp. ↓ | C-$L_1$ ↓ | NC ↑ | F-score ↑ |
|---|---|---|---|---|---|
| No supervision | .2627 | .2091 | .2359 | .6511 | .1343 |
| Monodepth: Zoe-Depth [3] | .1751 | .2084 | .1918 | .7420 | .1455 |
| Monodepth: Metric3D [15] | .1798 | .2079 | .1938 | .7358 | .1439 |
| Multi-view depth [59] | .3120 | .2375 | .2748 | .6408 | .1903 |
| Patch-based depth [52] | .1183 | .1766 | .1474 | .7975 | .2236 |
| Sensor depth (iPhone) | .0609 | .1433 | .1021 | .8130 | .5833 |
Table 6: Ablation of depth supervision. We compare monocular [3, 15], multi-view [59], relative (patch-based) [52], and sensor depth supervision strategies on the "b20a261fdf" scene of ScanNet++.

Despite the low-resolution sensor depths lacking detail and containing inaccuracies (mostly at object edges), they remain practical for use in real-world indoor scenes. Future research is needed to bridge the performance gap for monocular depth supervision.

Mesh extraction techniques. Lastly, we investigate various Poisson meshing techniques. In Table 7, we demonstrate that extracting oriented point sets from optimized depth and normal maps results in smoother and more realistic reconstructions compared to other methods, and we report mesh evaluation metrics for each technique. We compare three approaches: directly using trained Gaussian means and normals (a total of 512k Gaussians); extraction of surface density at levels 0.1 and 0.5, as proposed in SuGaR [12]; and back-projection of optimized depth and normal maps, as explained in Section 4.4. All models were trained with our depth and normal regularization. To ensure a fair comparison, we set the total number of extracted points to 500k for both the surface density and back-projection methods.

[Panels: (a) Gaussians, (b) Density 0.1, (c) Density 0.5, (d) Ours, (e) GT]

| Method | Acc. ↓ | Comp. ↓ | C-$L_1$ ↓ | NC ↑ | F-score ↑ |
|---|---|---|---|---|---|
| Gaussians | .0206 | .0412 | .0309 | .9091 | .9117 |
| SuGaR [12]: density 0.1 | .0130 | .0357 | .0243 | .9301 | .9275 |
| SuGaR [12]: density 0.5 | .0083 | .0304 | .0193 | .9309 | .9325 |
| Back-projection (ours) | .0074 | .0312 | .0194 | .9428 | .9310 |
Table 7: Ablation of Poisson mesh extraction techniques: Replica. We compare naive Gaussian-based meshing, the meshing strategy from [12], and our back-projection approach. All models were trained using the proposed depth and normal objectives.

6 Conclusion

We presented DN-Splatter, a method for depth and normal regularization of 3D Gaussian splatting. This simple yet effective strategy enhances photorealism by improving common novel view synthesis metrics and significantly improves depth estimation and the quality of surfaces extracted from the Gaussian scene. We demonstrated that prior regularization is essential for achieving more geometrically valid and consistent reconstructions in challenging indoor scenes. Although we improved over state-of-the-art methods, our focus was limited to densely captured scenes with relatively still cameras. Future work could address more challenging and sparser data captures, which often suffer from motion blur and other capture artifacts. Additionally, better meshing techniques are needed to optimize Gaussian scene parameters and mesh quality concurrently, as Poisson surface reconstruction is more sensitive to the estimated positions of the points than to their normal estimates. Finally, bridging the performance gap between monocular depth estimation and real sensor depth supervision remains an area for future work.

References

  • [1] Gwangbin Bae and Andrew J. Davison. Rethinking inductive biases for surface normal estimation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  • [2] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields, 2022.
  • [3] Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth, 2023.
  • [4] Hanlin Chen, Chen Li, and Gim Hee Lee. Neusg: Neural implicit surface reconstruction with 3d gaussian splatting guidance, 2023.
  • [5] Jaeyoung Chung, Jeongtaek Oh, and Kyoung Mu Lee. Depth-regularized optimization for 3d gaussian splatting in few-shot images. arXiv preprint arXiv:2311.13398, 2023.
  • [6] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised NeRF: Fewer views and faster training for free. In CVPR, June 2022.
  • [7] Ainaz Eftekhar, Alexander Sax, Jitendra Malik, and Amir Zamir. Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans. In ICCV, pages 10786–10796, 2021.
  • [8] Jian Gao, Chun Gu, Youtian Lin, Hao Zhu, Xun Cao, Li Zhang, and Yao Yao. Relightable 3d gaussian: Real-time point cloud relighting with brdf decomposition and ray tracing. arXiv:2311.16043, 2023.
  • [9] Michael Goesele, Noah Snavely, Brian Curless, Hugues Hoppe, and Steven M. Seitz. Multi-view stereo for community photo collections. In ICCV, pages 1–8, 2007.
  • [10] Guangcong Wang, Zhaoxi Chen, Chen Change Loy, and Ziwei Liu. Sparsenerf: Distilling depth ranking for few-shot novel view synthesis. In ICCV, 2023.
  • [11] Haoyu Guo, Sida Peng, Haotong Lin, Qianqian Wang, Guofeng Zhang, Hujun Bao, and Xiaowei Zhou. Neural 3d scene reconstruction with the manhattan-world assumption. In CVPR, 2022.
  • [12] Antoine Guédon and Vincent Lepetit. Sugar: Surface-aligned gaussian splatting for efficient 3d mesh reconstruction and high-quality mesh rendering, 2023.
  • [13] Wilfried Hartmann, Silvano Galliani, Michal Havlena, Luc Van Gool, and Konrad Schindler. Learned multi-patch similarity. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2017.
  • [14] Junjie Hu, Mete Ozay, Yan Zhang, and Takayuki Okatani. Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Dec 2018.
  • [15] Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. arXiv preprint arXiv:2404.15506, 2024.
  • [16] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields. In SIGGRAPH 2024 Conference Papers. Association for Computing Machinery, 2024.
  • [17] James T. Kajiya. The rendering equation. In Proceedings of the 13th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’86, page 143–150, New York, NY, USA, 1986. Association for Computing Machinery.
  • [18] Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson Surface Reconstruction. In Alla Sheffer and Konrad Polthier, editors, Symposium on Geometry Processing. The Eurographics Association, 2006.
  • [19] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM TOG, 42(4), 2023.
  • [20] Leonid Keselman and Martial Hebert. Flexible techniques for differentiable rendering with 3d gaussians. In ICCV, 2023.
  • [21] Georgios Kopanas, Julien Philip, Thomas Leimkühler, and George Drettakis. Point-based neural rendering with per-view optimization. Computer Graphics Forum (Proceedings of the Eurographics Symposium on Rendering), 40(4), June 2021.
  • [22] Elena Kosheleva, Sunil Jaiswal, Faranak Shamsafar, Noshaba Cheema, Klaus Illgner-Fehns, and Philipp Slusallek. Edge-aware consistent stereo video depth estimation. arXiv preprint arXiv:2305.02645, 2023.
  • [23] Uday Kusupati, Shuo Cheng, Rui Chen, and Hao Su. Normal assisted stereo depth estimation. In CVPR, pages 2189–2199, 2020.
  • [24] Yixing Lao, Xiaogang Xu, Zhipeng Cai, Xihui Liu, and Hengshuang Zhao. CorresNeRF: Image correspondence priors for neural radiance fields. In NeurIPS, 2023.
  • [25] Zhaoshuo Li, Thomas Müller, Alex Evans, Russell H Taylor, Mathias Unberath, Ming-Yu Liu, and Chen-Hsuan Lin. Neuralangelo: High-fidelity neural surface reconstruction. In CVPR, 2023.
  • [26] Zhihao Liang, Qi Zhang, Ying Feng, Ying Shan, and Kui Jia. Gs-ir: 3d gaussian splatting for inverse rendering, 2023.
  • [27] William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3d surface construction algorithm. In Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’87, page 163–169, New York, NY, USA, 1987. Association for Computing Machinery.
  • [28] Wenjie Luo, Alexander G. Schwing, and Raquel Urtasun. Efficient deep learning for stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [29] Xuan Luo, Jia-Bin Huang, Richard Szeliski, Kevin Matzen, and Johannes Kopf. Consistent video depth estimation. ACM TOG, 39(4):71–1, 2020.
  • [30] I. Melekhov, J. Kannala, and E. Rahtu. Image patch matching using convolutional descriptors with euclidean distance. In Proc. ACCVW, 2016.
  • [31] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
  • [32] Anastasiya Mishchuk, Dmytro Mishkin, Filip Radenović, and Jiři Matas. Working hard to know your neighbor’s margins: local descriptor learning loss. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 4829–4840. Curran Associates Inc., 2017.
  • [33] Zak Murez, Tarrence Van As, James Bartolozzi, Ayan Sinha, Vijay Badrinarayanan, and Andrew Rabinovich. Atlas: End-to-end 3d scene reconstruction from posed images. In ECCV, pages 414–431. Springer, 2020.
  • [34] Michael Niemeyer, Jonathan T. Barron, Ben Mildenhall, Mehdi S. M. Sajjadi, Andreas Geiger, and Noha Radwan. Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs. In CVPR, 2021.
  • [35] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, pages 8024–8035. Curran Associates, Inc., 2019.
  • [36] Lukas Radl, Michael Steiner, Mathias Parger, Alexander Weinrauch, Bernhard Kerbl, and Markus Steinberger. Stopthepop: Sorted gaussian splatting for view-consistent real-time rendering, 2024.
  • [37] Marie-Julie Rakotosaona, Fabian Manhardt, Diego Martin Arroyo, Michael Niemeyer, Abhijit Kundu, and Federico Tombari. Nerfmeshing: Distilling neural radiance fields into geometrically-accurate 3d meshes. In Proc. of the International Conf. on 3D Vision (3DV), 2024.
  • [38] Xuqian Ren, Wenjia Wang, Dingding Cai, Tuuli Tuominen, Juho Kannala, and Esa Rahtu. Mushroom: Multi-sensor hybrid room dataset for joint 3d reconstruction and novel view synthesis. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4508–4517, 2024.
  • [39] Darius Rückert, Linus Franke, and Marc Stamminger. Adop: Approximate differentiable one-pixel point rendering. ACM TOG, 41(4):1–14, 2022.
  • [40] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016.
  • [41] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In ECCV, 2016.
  • [42] Ayan Sinha, Zak Murez, James Bartolozzi, Vijay Badrinarayanan, and Andrew Rabinovich. Deltas: Depth estimation by learning triangulation and densification of sparse points. In ECCV, pages 104–121. Springer, 2020.
  • [43] Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo tourism: exploring photo collections in 3d. In ACM siggraph 2006 papers, pages 835–846. Association for Computing Machinery (ACM), 2006.
  • [44] Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, Anton Clarkson, Mingfei Yan, Brian Budge, Yajie Yan, Xiaqing Pan, June Yon, Yuyang Zou, Kimberly Leon, Nigel Carter, Jesus Briales, Tyler Gillingham, Elias Mueggler, Luis Pesqueira, Manolis Savva, Dhruv Batra, Hauke M. Strasdat, Renzo De Nardi, Michael Goesele, Steven Lovegrove, and Richard Newcombe. The Replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019.
  • [45] Matthew Tancik, Ethan Weber, Evonne Ng, Ruilong Li, Brent Yi, Justin Kerr, Terrance Wang, Alexander Kristoffersen, Jake Austin, Kamyar Salahi, Abhik Ahuja, David McAllister, and Angjoo Kanazawa. Nerfstudio: A modular framework for neural radiance field development. In ACM SIGGRAPH 2023 Conference Proceedings, SIGGRAPH ’23, 2023.
  • [46] Jiaxiang Tang, Hang Zhou, Xiaokang Chen, Tianshu Hu, Errui Ding, Jingdong Wang, and Gang Zeng. Delicate textured mesh recovery from nerf via adaptive surface refinement. arXiv preprint arXiv:2303.02091, 2022.
  • [47] Zachary Teed and Jia Deng. Deepv2d: Video to depth with differentiable structure from motion. arXiv preprint arXiv:1812.04605, 2018.
  • [48] Dor Verbin, Peter Hedman, Ben Mildenhall, Todd Zickler, Jonathan T. Barron, and Pratul P. Srinivasan. Ref-NeRF: Structured view-dependent appearance for neural radiance fields. CVPR, 2022.
  • [49] Jingwen Wang, Tymoteusz Bleja, and Lourdes Agapito. Go-surf: Neural feature grid optimization for fast, high-fidelity rgb-d surface reconstruction. In 2022 International Conference on 3D Vision (3DV), pages 433–442. IEEE, 2022.
  • [50] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021.
  • [51] Yi Wei, Shaohui Liu, Yongming Rao, Wang Zhao, Jiwen Lu, and Jie Zhou. Nerfingmvs: Guided optimization of neural radiance fields for indoor multi-view stereo. In ICCV, pages 5610–5619, 2021.
  • [52] Haolin Xiong, Sairisheek Muttukuru, Rishi Upadhyay, Pradyumna Chari, and Achuta Kadambi. Sparsegs: Real-time 360° sparse view synthesis using gaussian splatting. Arxiv, 2023.
  • [53] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. arXiv:2401.10891, 2024.
  • [54] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. In NeurIPS, 2021.
  • [55] Vickie Ye, Matias Turkulainen, and the Nerfstudio team. gsplat.
  • [56] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In ICCV, 2023.
  • [57] Zehao Yu, Anpei Chen, Bozidar Antic, Songyou Peng, Apratim Bhattacharyya, Michael Niemeyer, Siyu Tang, Torsten Sattler, and Andreas Geiger. Sdfstudio: A unified framework for surface reconstruction, 2022.
  • [58] Zehao Yu, Songyou Peng, Michael Niemeyer, Torsten Sattler, and Andreas Geiger. Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction. NeurIPS, 2022.
  • [59] Jingyang Zhang, Yao Yao, Shiwei Li, Zixin Luo, and Tian Fang. Visibility-aware multi-view stereo network. British Machine Vision Conference (BMVC), 2020.
  • [60] Qian-Yi Zhou, Jaesik Park, and Vladlen Koltun. Open3D: A modern library for 3D data processing. arXiv:1801.09847, 2018.
  • [61] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, pages 1851–1858, 2017.

Supplementary Material

In this supplementary material, we provide further details regarding our baseline methods and datasets in Appendix A, definitions for our evaluation metrics and losses in Appendix B, and further quantitative and qualitative results in Appendix C and Appendix D respectively.

Appendix A Implementation details

A.1 Baselines

We compare a variety of baseline methods for novel view synthesis, depth estimation, and mesh reconstruction.

Nerfacto. We use the Nerfacto model from Nerfstudio [45] version 1.0.2 in our experiments. We use default settings, disable pose optimization, and predict normals using the method proposed in Ref-NeRF [48]. We use rendered normal and depth maps for Poisson surface reconstruction.

Depth-Nerfacto. We use the depth-supervised variant of Nerfacto with a direct loss on the ray termination distribution for sensor depth supervision, as described in DS-NeRF [6]. Otherwise, we use the same settings as for Nerfacto.

Neusfacto. We use default settings provided by Neusfacto from SDFStudio [57] and use the default marching cubes algorithm for meshing.

MonoSDF. We use the recommended settings from MonoSDF [58] with sensor depth and monocular normal supervision. We set the sensor depth loss multiplier to 0.1 and the normal loss multiplier to 0.05. Normal predictions are obtained from Omnidata [7].

Splatfacto. The Splatfacto model from Nerfstudio version 1.1.3 and gsplat [55] version 1.0.0 serves as our baseline 3DGS model. This is a faithful re-implementation of the original 3DGS work [19]. We keep all the default settings for the baseline comparison.

SuGaR. We use the official SuGaR [12] source code. The original code base, written as an extension to the original 3DGS work [19], supports only COLMAP-based datasets (that is, datasets containing a COLMAP database file). We made slight modifications to the original source code to support non-COLMAP formats, importing camera information and poses directly from pre-made .json files. We use default settings for training as described in [12]. We use the SDF-trained variant in all experiments. We extract both the coarse and refined meshes for evaluation, although the differences in geometry metrics between them are small. We found a small inconsistency in SuGaR's normal directions for outward-facing indoor datasets, which we corrected in our experiments.

2DGS. We use the official 2DGS [16] source code. As with our SuGaR implementation, we made slight modifications to the original source code to support non-COLMAP formats, importing camera information and poses directly from pre-made .json files. We use default settings for training as described in [16] and the default meshing strategy using TSDF fusion.

A.2 Datasets

MuSHRoom. We use the official train and evaluation splits from the MuSHRoom [38] dataset. We report evaluation metrics on a) images obtained by uniformly sampling every 10th frame from the training camera trajectory and b) images obtained from a different camera trajectory. We use the globally optimized COLMAP [40] poses for both evaluation sequences. We use a total of 5 million points for mesh extraction with Poisson surface reconstruction.

ScanNet++. We use the "b20a261fdf" and "8b5caf3398" scenes in our experiments. We use the iPhone sequences with COLMAP registered poses. The sequences contain 358 and 705 registered images, respectively. We uniformly load every 5th frame from each sequence and, from these loaded frames, reserve every 10th frame for evaluation.
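The frame selection for ScanNet++ can be illustrated with a short sketch. It assumes zero-based frame indices, that sampling starts at the first frame, and that "every 10th frame" is counted over the already subsampled list; none of these details are stated in the text, and the function name `split_frames` is hypothetical.

```python
def split_frames(num_frames, load_stride=5, eval_stride=10):
    """Return (train_ids, eval_ids) as lists of frame indices.

    Assumptions (not specified in the text): indices are zero-based,
    sampling starts at frame 0, and the evaluation stride is applied
    to the loaded subset rather than to the full sequence.
    """
    # Uniformly load every `load_stride`-th frame of the sequence.
    loaded = list(range(0, num_frames, load_stride))
    # Reserve every `eval_stride`-th loaded frame for evaluation.
    eval_ids = loaded[::eval_stride]
    held_out = set(eval_ids)
    # All remaining loaded frames are used for training.
    train_ids = [i for i in loaded if i not in held_out]
    return train_ids, eval_ids


if __name__ == "__main__":
    for n in (358, 705):  # registered image counts of the two scenes
        train_ids, eval_ids = split_frames(n)
        print(n, len(train_ids), len(eval_ids))
```

Under these assumptions the two splits are disjoint by construction, so no evaluation frame is ever seen during training.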

Appendix B Definitions for metrics and losses

B.1 Depth evaluation metrics

For the ScanNet++ and MuSHRoom datasets, we follow [51, 23, 29, 42, 33, 47, 61] and report the depth evaluation metrics defined in Table 8. We use the Absolute Relative Distance (Abs Rel), Squared Relative Distance (Sq Rel), Root Mean Squared Error (RMSE) and its logarithmic variant (RMSE log), and the Threshold Accuracy ($\delta < t$) metrics. The Abs Rel metric measures the average magnitude of the relative error between the predicted and ground-truth depth values. Unlike the Abs Rel metric, the Sq Rel metric considers the squared relative error between the predicted and ground-truth depth values. The RMSE metric calculates the square root of the average of the squared differences between the predicted and ground-truth values, giving a measure of the magnitude of the prediction error. The RMSE log metric is similar to RMSE but applied in the logarithmic domain, which can be particularly useful for very large depth values. The Threshold Accuracy measures the percentage of predicted depth values within a certain threshold factor $\delta$ of the ground-truth depth values.

| Metric | Definition |
| --- | --- |
| Abs Rel | $\frac{1}{N}\sum_{i=1}^{N}\frac{\lvert d_i^{\text{pred}}-d_i^{\text{gt}}\rvert}{d_i^{\text{gt}}}$ |
| Sq Rel | $\frac{1}{N}\sum_{i=1}^{N}\frac{\left(d_i^{\text{pred}}-d_i^{\text{gt}}\right)^{2}}{d_i^{\text{gt}}}$ |
| RMSE | $\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(d_i^{\text{pred}}-d_i^{\text{gt}}\right)^{2}}$ |
| RMSE log | $\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\log d_i^{\text{pred}}-\log d_i^{\text{gt}}\right)^{2}}$ |
| Threshold accuracy, $\delta$ | $\frac{1}{N}\sum_{i=1}^{N}\left[\max\left(\frac{d_i^{\text{pred}}}{d_i^{\text{gt}}},\frac{d_i^{\text{gt}}}{d_i^{\text{pred}}}\right)<\delta\right]$ |

Table 8: Depth Evaluation Metrics. We show definitions for our depth evaluation metrics. $d_i^{\text{pred}}$ and $d_i^{\text{gt}}$ are the predicted and ground-truth depths for the $i$-th pixel. $\delta$ is the threshold factor (e.g., $\delta<1.25$, $\delta<1.25^{2}$, $\delta<1.25^{3}$).
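The definitions in Table 8 translate directly into a few lines of NumPy. The following is a minimal sketch rather than our evaluation code: the function name `depth_metrics` and the masking of non-positive depths are assumptions made for illustration, since the handling of invalid pixels is not specified above.

```python
import numpy as np


def depth_metrics(pred, gt, threshold=1.25):
    """Compute the Table 8 metrics for predicted and ground-truth depths.

    `pred` and `gt` are arrays of the same shape. Pixels with non-positive
    depth in either array are ignored (an assumption of this sketch).
    """
    pred = np.asarray(pred, dtype=np.float64).ravel()
    gt = np.asarray(gt, dtype=np.float64).ravel()
    valid = (pred > 0) & (gt > 0)
    pred, gt = pred[valid], gt[valid]

    diff = pred - gt
    # Symmetric ratio used by the threshold accuracy.
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "abs_rel": float(np.mean(np.abs(diff) / gt)),
        "sq_rel": float(np.mean(diff ** 2 / gt)),
        "rmse": float(np.sqrt(np.mean(diff ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
        "delta": float(np.mean(ratio < threshold)),
    }


if __name__ == "__main__":
    metrics = depth_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 2.0])
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")
```

Passing `threshold=1.25 ** 2` or `threshold=1.25 ** 3` yields the looser threshold accuracies mentioned in the caption of Table 8.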

B.2 Mesh evaluation metrics

In Table 9 we provide the definitions for the mesh evaluation metrics used throughout the text for comparing predicted and ground truth meshes. We use a threshold of 5 cm for precision, recall, and F-scores. Furthermore, we evaluate mesh quality only within the visibility of the training camera views.

| Metric | Definition |
| --- | --- |
| Accuracy | $\frac{1}{\lvert P\rvert}\sum_{\mathbf{p}\in P}\left(\min_{\mathbf{p}^{*}\in P^{*}}\lVert\mathbf{p}-\mathbf{p}^{*}\rVert_{1}\right)$ |
| Completion | $\frac{1}{\lvert P^{*}\rvert}\sum_{\mathbf{p}^{*}\in P^{*}}\left(\min_{\mathbf{p}\in P}\lVert\mathbf{p}-\mathbf{p}^{*}\rVert_{1}\right)$ |
| Chamfer-$L_{1}$ | $\frac{\text{Accuracy}+\text{Completion}}{2}$ |
| Normal Completion | $\frac{1}{\lvert P^{*}\rvert}\sum_{\mathbf{p}^{*}\in P^{*}}\left(\mathbf{n}_{\mathbf{p}}^{T}\mathbf{n}_{\mathbf{p}^{*}}\right)$ s.t. $\mathbf{p}=\underset{\mathbf{p}\in P}{\operatorname{argmin}}\lVert\mathbf{p}-\mathbf{p}^{*}\rVert_{1}$ |
| Normal-Consistency | $\frac{\text{Normal-Acc}+\text{Normal-Comp}}{2}$ |
| Precision | $\frac{1}{\lvert P\rvert}\sum_{\mathbf{p}\in P}\left(\min_{\mathbf{p}^{*}\in P^{*}}\lVert\mathbf{p}-\mathbf{p}^{*}\rVert_{1}<5\,\text{cm}\right)$ |
| Recall | $\frac{1}{\lvert P^{*}\rvert}\sum_{\mathbf{p}^{*}\in P^{*}}\left(\min_{\mathbf{p}\in P}\lVert\mathbf{p}-\mathbf{p}^{*}\rVert_{1}<5\,\text{cm}\right)$ |
| F-score | $\frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}$ |

Table 9: Mesh Evaluation Metrics. $P$ and $P^{*}$ are the point clouds sampled from the predicted and the ground truth mesh. $\mathbf{n}_{\mathbf{p}}$ is the normal vector at point $\mathbf{p}$.
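As an illustration of Table 9, the sketch below evaluates the point-cloud metrics with a brute-force nearest-neighbour search under the $L_1$ norm. It is not our evaluation pipeline: the function names are hypothetical, point coordinates are assumed to be in metres (so the 5 cm threshold becomes 0.05), and the dense pairwise distance matrix is only practical for small point sets. For clouds with millions of points a spatial index such as a KD-tree would be needed.

```python
import numpy as np


def nearest_l1(src, dst):
    """L1 distance from each point in `src` to its nearest point in `dst`,
    together with the index of that nearest point."""
    dists = np.abs(src[:, None, :] - dst[None, :, :]).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return dists[np.arange(len(src)), idx], idx


def mesh_metrics(pred_pts, gt_pts, pred_normals=None, gt_normals=None,
                 threshold=0.05):
    """Table 9 metrics for points sampled from predicted and GT meshes."""
    pred_pts = np.asarray(pred_pts, dtype=np.float64)
    gt_pts = np.asarray(gt_pts, dtype=np.float64)

    d_pred, _ = nearest_l1(pred_pts, gt_pts)           # predicted -> GT
    d_gt, nearest_pred = nearest_l1(gt_pts, pred_pts)  # GT -> predicted

    accuracy = float(d_pred.mean())
    completion = float(d_gt.mean())
    precision = float(np.mean(d_pred < threshold))
    recall = float(np.mean(d_gt < threshold))
    denom = precision + recall
    out = {
        "accuracy": accuracy,
        "completion": completion,
        "chamfer_l1": 0.5 * (accuracy + completion),
        "precision": precision,
        "recall": recall,
        "f_score": 2.0 * precision * recall / denom if denom > 0 else 0.0,
    }
    if pred_normals is not None and gt_normals is not None:
        # Normal completion: compare each GT normal with the normal of
        # its nearest predicted point (unit-length normals assumed).
        n_pred = np.asarray(pred_normals, dtype=np.float64)[nearest_pred]
        n_gt = np.asarray(gt_normals, dtype=np.float64)
        out["normal_completion"] = float(np.mean(np.sum(n_pred * n_gt, axis=1)))
    return out


if __name__ == "__main__":
    pred = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
    gt = [[0.0, 0.0, 0.01], [1.0, 0.0, 0.2]]
    print(mesh_metrics(pred, gt))
```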

B.3 Depth losses

For depth supervision, we compare the loss function variants defined in Table 10.

| Loss | Definition |
| --- | --- |
| $\mathcal{L}_{\text{MSE}}$ | $\frac{1}{\lvert\hat{D}\rvert}\sum(\hat{D}-D)^{2}$ |
| $\mathcal{L}_{1}$ | $\frac{1}{\lvert\hat{D}\rvert}\sum\lVert\hat{D}-D\rVert_{1}$ |
| $\mathcal{L}_{\text{LogL1}}$ | $\frac{1}{\lvert\hat{D}\rvert}\sum\log(1+\lVert\hat{D}-D\rVert_{1})$ |
| $\mathcal{L}_{\text{HuberL1}}$ | $\begin{cases}\lVert D-\hat{D}\rVert_{1}, & \text{if } \lVert D-\hat{D}\rVert_{1}\leq\delta,\\ \frac{(D-\hat{D})^{2}+\delta^{2}}{2\delta}, & \text{otherwise.}\end{cases}$ |
| $\mathcal{L}_{\text{DSSIML1}}$ | $\alpha\frac{1-\operatorname{SSIM}(I,\hat{I})}{2}+(1-\alpha)\lvert I-\hat{I}\rvert$ |
| $\mathcal{L}_{\text{EAS}}$ | $g_{\text{rgb}}\frac{1}{\lvert\hat{D}\rvert}\sum\lVert\hat{D}-D\rVert_{1}$ |
| $\mathcal{L}_{\hat{D}}$ | $g_{\text{rgb}}\frac{1}{\lvert\hat{D}\rvert}\sum\log(1+\lVert\hat{D}-D\rVert_{1})$ |

Table 10: Depth Regularization Objectives. We show the definitions for various depth objectives. Here, $\delta=0.2\max(\lVert D-\hat{D}\rVert_{1})$, $g_{\text{rgb}}=\exp(-\nabla I)$, $D$/$\hat{D}$ are the ground truth and rendered depths, and $I$/$\hat{I}$ are the ground truth and rendered RGB images.

We compare the performance of these losses as supervision in Table 13.
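To make the objectives of Table 10 concrete, the following NumPy sketch implements the logarithmic, Huber, and gradient-aware logarithmic variants. It is an illustrative sketch, not our training code, and it rests on several assumptions that Table 10 leaves open: $\nabla I$ is taken as the finite-difference gradient magnitude of the channel-averaged image, $g_{\text{rgb}}$ is applied per pixel inside the mean, and the Huber objective is averaged over pixels. All function names are hypothetical.

```python
import numpy as np


def rgb_gradient_weight(image):
    """Per-pixel weight g_rgb = exp(-|grad I|) for an (H, W, 3) image.

    Assumption: the gradient magnitude is computed with finite
    differences on the channel-averaged image.
    """
    gray = np.asarray(image, dtype=np.float64).mean(axis=-1)
    grad_y, grad_x = np.gradient(gray)
    return np.exp(-np.sqrt(grad_x ** 2 + grad_y ** 2))


def log_l1(rendered, sensor):
    """LogL1 objective: mean of log(1 + |D_hat - D|)."""
    return float(np.mean(np.log1p(np.abs(rendered - sensor))))


def gradient_aware_log_l1(rendered, sensor, image):
    """Gradient-aware LogL1: down-weights pixels near strong image edges."""
    weight = rgb_gradient_weight(image)
    return float(np.mean(weight * np.log1p(np.abs(rendered - sensor))))


def huber_l1(rendered, sensor):
    """HuberL1 objective with delta = 0.2 * max |D - D_hat| (Table 10)."""
    err = np.abs(sensor - rendered)
    delta = 0.2 * err.max()
    if delta == 0.0:
        return 0.0  # depths agree everywhere
    quadratic = (err ** 2 + delta ** 2) / (2.0 * delta)
    return float(np.mean(np.where(err <= delta, err, quadratic)))


if __name__ == "__main__":
    rendered = np.full((4, 4), 2.0)
    sensor = np.full((4, 4), 1.0)
    flat = np.full((4, 4, 3), 0.5)
    edge = flat.copy()
    edge[:, 2:, :] = 1.0  # vertical intensity edge
    print(log_l1(rendered, sensor))
    print(gradient_aware_log_l1(rendered, sensor, flat))
    print(gradient_aware_log_l1(rendered, sensor, edge))
    print(huber_l1(rendered, sensor))
```

On a textureless image the weight equals one everywhere and the gradient-aware loss reduces to the plain logarithmic loss; at color edges the weight drops below one, so depth residuals there contribute less.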

(a) Test within a sequence

| Method | Sensor Depth | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | $\delta<1.25$ ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Nerfacto [45] | - | 14.72 | 19.79 | 61.05 | 13.26 | 88.25 |
| Depth-Nerfacto [45] | ✓ | 13.90 | 11.71 | 50.21 | 12.98 | 88.46 |
| MonoSDF [58] | ✓ | 10.90 | 9.87 | 48.74 | 11.27 | 83.48 |
| Splatfacto (no cues) [19] | - | 8.32 | 5.45 | 38.47 | 10.23 | 89.75 |
| Splatfacto + $\mathcal{L}_{\hat{D}}$ (Ours) | ✓ | 3.71 | 3.08 | 30.80 | 4.27 | 95.52 |
| Splatfacto + $\mathcal{L}_{\hat{D}}$ + $\mathcal{L}_{\hat{N}}$ (Ours) | ✓ | 3.64 | 3.02 | 30.33 | 4.17 | 95.60 |

(b) Test with a different sequence

| Method | Sensor Depth | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | $\delta<1.25$ ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Nerfacto [45] | - | 14.52 | 18.32 | 63.85 | 13.13 | 88.41 |
| Depth-Nerfacto [45] | ✓ | 13.49 | 10.76 | 51.63 | 12.62 | 89.23 |
| MonoSDF [58] | ✓ | 11.00 | 10.98 | 50.92 | 11.37 | 82.62 |
| Splatfacto (no cues) [19] | - | 8.06 | 5.39 | 38.61 | 10.05 | 90.51 |
| Splatfacto + $\mathcal{L}_{\hat{D}}$ (Ours) | ✓ | 3.78 | 3.08 | 31.35 | 4.26 | 95.47 |
| Splatfacto + $\mathcal{L}_{\hat{D}}$ + $\mathcal{L}_{\hat{N}}$ (Ours) | ✓ | 3.69 | 2.97 | 30.57 | 4.15 | 95.64 |

Table 11: Depth evaluation metrics compared to ground truth Faro scanner data for the MuSHRoom dataset. Instead of evaluating on noisy captured iPhone depth maps, we rely on more accurate depth maps reconstructed from a Faro lidar scanner. We show that our depth regularization strategy, utilizing low-resolution iPhone depths, greatly outperforms other baselines. Results are averaged over 10 scenes.
(a) We load every 3rd/5th/8th/12th view from the whole training sequence (around 260 views). Results are evaluated on "Courtroom" from Tanks & Temples.

| Method | every 3: PSNR ↑ | every 3: SSIM ↑ | every 3: LPIPS ↓ | every 5: PSNR ↑ | every 5: SSIM ↑ | every 5: LPIPS ↓ | every 8: PSNR ↑ | every 8: SSIM ↑ | every 8: LPIPS ↓ | every 12: PSNR ↑ | every 12: SSIM ↑ | every 12: LPIPS ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Splatfacto | 20.68 | .7445 | .1921 | 18.50 | .6991 | .2110 | 16.86 | .6459 | .2474 | 14.76 | .5580 | .3332 |
| Ours + Zoe-Depth [3] | 20.88 | .7518 | .1833 | 19.58 | .7118 | .2007 | 17.60 | .6568 | .2433 | 15.90 | .5835 | .2971 |
| Ours + DepthAnything [53] | 20.91 | .7528 | .1830 | 19.60 | .7153 | .1997 | 17.44 | .6568 | .2456 | 16.24 | .5902 | .2924 |

(b) We load every 5th/8th/12th/20th view from the whole training sequence (around 270 views). Results are evaluated on "8b5caf3398" from the ScanNet++ DSLR sequence.

| Method | every 5: PSNR ↑ | every 5: SSIM ↑ | every 5: LPIPS ↓ | every 8: PSNR ↑ | every 8: SSIM ↑ | every 8: LPIPS ↓ | every 12: PSNR ↑ | every 12: SSIM ↑ | every 12: LPIPS ↓ | every 20: PSNR ↑ | every 20: SSIM ↑ | every 20: LPIPS ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Splatfacto | 24.68 | .8810 | .1169 | 22.81 | .8568 | .1559 | 21.08 | .8357 | .1816 | 18.90 | .8059 | .2375 |
| Ours + Zoe-Depth [3] | 24.72 | .8821 | .1163 | 23.04 | .8591 | .1521 | 21.81 | .8415 | .1755 | 19.10 | .8059 | .2332 |
| Ours + DepthAnything [53] | 24.66 | .8826 | .1194 | 23.21 | .8595 | .1507 | 21.76 | .8406 | .1751 | 19.51 | .8101 | .2321 |

Table 12: Comparison of DN-Splatter performance with monocular depth supervision. We ablate the Zoe-Depth [3] and DepthAnything [53] monocular estimators with sparse views on the "Courtroom" sequence of the Tanks & Temples advanced dataset. Monocular depth supervision aids novel view synthesis under sparse settings.

Appendix C Additional quantitative results

Here we provide additional quantitative results for DN-Splatter. In Table 11, we show the depth evaluation performance of our proposed regularization scheme on the MuSHRoom dataset, evaluated against ground truth Faro lidar scanner data instead of the low-resolution iPhone depths. This corresponds to Table 3 from the main paper, which compares depth metrics on iPhone depth captures for the same scenes and baselines. When compared to laser scanner depths, our method still outperforms other baseline methods on depth estimation.

We also consider the performance of DN-Splatter supervised by only monocular depth estimates on a large-scale Tanks & Temples scene in Table 12. We consider training with dense and sparse captures and conclude that, although monocular depth supervision provides minimal improvements for dense captures, the improvement in novel view synthesis quality under sparse settings is notable.

Lastly, in Table 13 we compare the performance of the various depth losses described in Section B.3 on depth estimation and novel view synthesis. There are several interesting observations. First, the logarithmic depth loss $\mathcal{L}_{\text{LogL1}}$ outperforms other popular variants like $\mathcal{L}_{1}$ or $\mathcal{L}_{\text{MSE}}$ on depth and RGB synthesis. Second, the gradient-aware logarithmic depth variant $\mathcal{L}_{\hat{D}}$ outperforms the simpler variant, validating our assumption that captured sensor depths, like those from iPhone cameras, tend to contain noise and inaccuracies at edges or sharp boundaries. The gradient-aware variant therefore mitigates the effect of these inaccurate sensor readings.

(a) Test split obtained by sampling uniformly every 10 frames within the training sequence. Columns Abs Rel through $\delta<1.25$ report depth estimation; PSNR, SSIM, and LPIPS report novel view synthesis.

| Loss | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | $\delta<1.25$ ↑ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $\mathcal{L}_{\text{MSE}}$ | .0587 | .0229 | .2313 | .0618 | .9534 | 22.32 | .7995 | .1653 |
| $\mathcal{L}_{1}$ | .0419 | .0233 | .2286 | .0435 | .9629 | 22.46 | .8041 | .1594 |
| $\mathcal{L}_{\text{DSSIML1}}$ | .0476 | .0331 | .2773 | .0523 | .9476 | 21.77 | .7802 | .1879 |
| $\mathcal{L}_{\text{LogL1}}$ | .0430 | .0267 | .2414 | .0444 | .9609 | 22.48 | .8053 | .1580 |
| $\mathcal{L}_{\text{HuberL1}}$ | .0536 | .0239 | .2335 | .0561 | .9579 | 22.39 | .8017 | .1625 |
| $\mathcal{L}_{\text{EAS}}$ | .0954 | .0572 | .3581 | .1103 | .8726 | 22.18 | .7951 | .1780 |
| $\mathcal{L}_{\hat{D}}$ (Ours) | .0338 | .0212 | .2170 | .0350 | .9691 | 22.49 | .8031 | .1630 |

(b) Test split obtained from a different camera trajectory with no overlap with the training sequence. Columns are grouped as in (a).

| Loss | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | $\delta<1.25$ ↑ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $\mathcal{L}_{\text{MSE}}$ | .0572 | .0282 | .2506 | .0570 | .9585 | 19.37 | .7088 | .2329 |
| $\mathcal{L}_{1}$ | .0449 | .0248 | .2364 | .0449 | .9639 | 19.45 | .7164 | .2253 |
| $\mathcal{L}_{\text{DSSIML1}}$ | .0482 | .0330 | .2775 | .0527 | .9495 | 18.98 | .7040 | .2430 |
| $\mathcal{L}_{\text{LogL1}}$ | .0451 | .0269 | .2454 | .0453 | .9629 | 19.50 | .7183 | .2228 |
| $\mathcal{L}_{\text{HuberL1}}$ | .0526 | .0267 | .2483 | .0533 | .9617 | 19.45 | .7128 | .2285 |
| $\mathcal{L}_{\text{EAS}}$ | .0724 | .0442 | .3142 | .0819 | .9329 | 19.30 | .7108 | .2351 |
| $\mathcal{L}_{\hat{D}}$ (Ours) | .0427 | .0252 | .2335 | .0420 | .9632 | 19.53 | .7187 | .2286 |

Table 13: Ablation on depth losses on the MuSHRoom dataset. We consider various depth losses as defined in Section B.3 and their impact on depth estimation and novel view synthesis. We achieve the best performance with our proposed edge-aware $\mathcal{L}_{\hat{D}}$ loss.

Appendix D Additional qualitative results

Lastly, we provide additional qualitative results for mesh reconstruction in Fig. 6 and for depth and novel view renders in Fig. 7, Fig. 8, and Fig. 9.

[Image grid. Columns, left to right: Nerfacto [45], Depth-Nerfacto [45], MonoSDF [58], Splatfacto [19, 45], Ours, iPhone GT.]
Figure 6: Qualitative comparison on mesh reconstruction. Comparison of baseline methods on sequences from the MuSHRoom dataset.
[Image grid. Columns, left to right: Nerfacto [45], Depth-Nerfacto [45], MonoSDF [58], Splatfacto [19, 45], Ours, iPhone GT. PSNR values annotated on the panels, per annotated row:]

| Row | Nerfacto [45] | Depth-Nerfacto [45] | MonoSDF [58] | Splatfacto [19, 45] | Ours |
| --- | --- | --- | --- | --- | --- |
| 1 | 23.30 | 21.02 | 24.46 | 26.56 | 27.32 |
| 2 | 18.31 | 17.22 | 19.50 | 22.35 | 23.10 |
| 3 | 19.01 | 18.60 | 19.17 | 18.86 | 21.48 |

Figure 7: Qualitative comparison of rendered depth and RGB images. Comparison of baseline methods on the "sauna" sequence from the MuSHRoom dataset.
[Image grid. Columns, left to right: Nerfacto [45], Depth-Nerfacto [45], MonoSDF [58], Splatfacto [19, 45], Ours, iPhone GT. PSNR values annotated on the panels, per annotated row:]

| Row | Nerfacto [45] | Depth-Nerfacto [45] | MonoSDF [58] | Splatfacto [19, 45] | Ours |
| --- | --- | --- | --- | --- | --- |
| 1 | 18.11 | 18.94 | 16.81 | 22.83 | 24.74 |
| 2 | 20.22 | 21.19 | 17.37 | 27.66 | 27.83 |
| 3 | 28.30 | 28.44 | 26.41 | 29.27 | 29.37 |

Figure 8: Qualitative comparison of rendered depth and RGB images. Comparison of baseline methods on the "classroom" and "coffee room" sequences from the MuSHRoom dataset.
[Image grid. Columns, left to right: Nerfacto [45], Depth-Nerfacto [45], MonoSDF [58], Splatfacto [19, 45], Ours, iPhone GT. PSNR values annotated on the panels, per annotated row:]

| Row | Nerfacto [45] | Depth-Nerfacto [45] | MonoSDF [58] | Splatfacto [19, 45] | Ours |
| --- | --- | --- | --- | --- | --- |
| 1 | 23.52 | 23.62 | 22.03 | 24.83 | 25.59 |
| 2 | 20.16 | 21.06 | 21.46 | 24.18 | 26.08 |
| 3 | 18.15 | 18.03 | 19.50 | 21.88 | 22.57 |

Figure 9: Qualitative comparison of rendered depth and RGB images. Comparison of baseline methods on the "koivu" sequence from the MuSHRoom dataset.