DN-Splatter: Depth and Normal Priors for
Gaussian Splatting and Meshing

Matias Turkulainen*¹ Xuqian Ren*² Iaroslav Melekhov³ Otto Seiskari⁴ Esa Rahtu² Juho Kannala³,⁴
¹ ETH Zurich, ² Tampere University, ³ Aalto University, ⁴ Spectacular AI
Corresponding author: [email protected]

Abstract

High-fidelity 3D reconstruction of common indoor scenes is crucial for VR and AR applications. 3D Gaussian splatting, a novel differentiable rendering technique, has achieved state-of-the-art novel view synthesis results with high rendering speeds and relatively low training times. However, its performance on scenes commonly seen in indoor datasets is poor due to the lack of geometric constraints during optimization. We extend 3D Gaussian splatting with depth and normal cues to tackle challenging indoor datasets and showcase techniques for efficient mesh extraction. Specifically, we regularize the optimization procedure with depth information, enforce local smoothness of nearby Gaussians, and use off-the-shelf monocular networks to achieve better alignment with the true scene geometry. We propose an adaptive depth loss based on the gradient of color images, improving depth estimation and novel view synthesis results over various baselines. Our simple yet effective regularization technique enables direct mesh extraction from the Gaussian representation, yielding more physically accurate reconstructions of indoor scenes. Our code will be released at https://github.com/maturk/dn-splatter.

1 Introduction

Figure 1: Overview. We use depth and normal priors obtained from common handheld devices and general-purpose networks to enhance Gaussian splatting reconstruction quality. By regularizing Gaussian positions, local smoothness, and orientations, we demonstrate improvements in novel view synthesis and achieve more accurate mesh reconstructions on a variety of challenging indoor room datasets.

The demand for high-fidelity 3D reconstruction of typical environments is increasing due to VR and AR applications. However, photorealistic and accurate 3D reconstruction of common indoor scenes from casually captured sensor data remains a persistent problem in 3D computer vision. Textureless and less-observed regions cause ambiguities in reconstructions and do not provide enough constraints for valid geometric solutions. Recently, neural implicit representations have achieved success in high-fidelity 3D reconstruction by representing scenes as a continuous volume with fully differentiable properties [31, 50, 25, 2]. However, the reconstruction of everyday indoor scenes still poses challenges, even for state-of-the-art methods. These methods rarely achieve good results in both photorealism and geometry reconstruction and often suffer from long training and rendering times, making them inaccessible for general use and VR/AR applications.

3D Gaussian splatting [19] introduces a novel method for inverse rendering by representing a scene by many differentiable 3D Gaussian primitives with optimizable properties. This explicit representation enables real-time rendering of large, complex scenes, a capability that most neural implicit models lack. It results in a more interoperable scene representation since the scene appearance and geometry are directly expressed by the location, shape, and color attributes of Gaussians. However, due to the lack of 3D cues and surface constraints during optimization, artifacts and ambiguities are likely to occur, resulting in floaters and poor surface reconstruction. Scenes often contain millions of Gaussians, and their properties are directly modified by gradient descent based on photometric losses only. Little focus has been given to exploring better regularization techniques that result in visually and geometrically smoother and more plausible 3D reconstructions that can be converted into meshes, an important downstream application.

Although many modern smartphones are equipped with low-resolution depth sensors, these are rarely used for novel view synthesis tasks. Motivated by this and advances in depth and normal estimation networks [3, 53, 7, 1], we explore the regularization of 3D Gaussian splatting with these geometric priors. Our goal is to enhance both photorealism and surface reconstruction in challenging indoor scenes. By designing an optimization strategy for 3D Gaussian splatting with depth and normal priors, we improve novel view synthesis results over baselines while respecting the captured scene geometry. We regularize the position of Gaussians with an edge-aware depth constraint and estimate normals from Gaussians to align them with the real surface boundaries estimated via monocular networks. We show how this simple regularization strategy, illustrated in Fig. 1, enables the extraction of meshes from the Gaussian scene representation, resulting in smoother and more geometrically accurate reconstructions. In summary, we make the following contributions:

• We design an edge-aware depth loss for Gaussian splatting depth regularization to improve reconstruction on indoor scenes with imperfect depth estimates.

• We use monocular normal priors to align Gaussians with the scene geometry and demonstrate how this aids reconstruction compared to alternative approaches.

• We showcase how this regularization strategy enables efficient mesh extraction directly from the Gaussian scene.

2 Related work

Here we give an introduction to image-based rendering (IBR) methods for scene reconstruction and an overview of prior regularization strategies.

2.1 Traditional IBR

Reconstructing 3D geometry from images is a long-standing problem in computer vision. Traditional approaches based on Structure-from-Motion (SfM) [43, 40] and multi-view stereo (MVS) [9] focused on reconstructing geometry as a sparse set of 3D points from images [21, 39]. The corresponding learning-based approaches [13, 30, 28, 32] usually replace some parts of the pipeline with differentiable modules, leading to a more accurate scene representation. Some work focuses on dense point reconstruction, normal estimation [41], or constructing triangle meshes from point sets [18, 27] to render novel views. However, traditional methods yield poor reconstruction results in large-scale, textureless scenes. In this paper, we demonstrate that incorporating geometric cues can significantly enhance scene reconstructions.

2.2 Neural implicit IBR

The greatest success has been achieved with neural inverse rendering methods, most notably NeRF [31], which applies volume rendering [17] to represent scenes as continuous volumes with attributes encoded within a neural network. However, the 3D geometry extracted from these scenes is often ill-defined and suffers from artifacts and floaters. Subsequent work has focused on improving the rendering quality and scene reconstruction through regularization techniques and by adapting other scene representations such as signed distance functions [58] or occupancy grids.

Prior regularization. Prior regularization of neural implicit models has been an active area of research. Previous NeRF-based approaches add depth regularization to explicitly supervise ray termination [6, 51] or impose smoothness constraints [34] on rendered depth maps. Other works explore regularizing with multi-view consistency [10, 6, 24] in sparse view settings. For SDF-based models, Manhattan-SDF [11] uses planar constraints on walls and flat surfaces to improve indoor reconstruction, and MonoSDF [58] uses depth and normal monocular estimates for scene geometry regularization. In this work, we investigate the regularization of 3D Gaussian splatting optimization with depth and normal priors to enhance photometric and geometric reconstruction.

Meshable implicit representations. Surface extraction as triangle meshes is an important problem since most computer graphics pipelines still rely on triangle rasterization. Watertight meshes also provide a good approximation of scene geometry and surface quality, leading to the development of various metrics for mesh quality. Prior work has focused on extracting meshes from NeRF representations [37, 46, 45] with some success, but these methods often rely on expensive post-refinement stages. Most state-of-the-art techniques use SDF or occupancy representations [58, 54, 50, 25] combined with marching cubes [27] to achieve finer details. These methods involve querying and evaluating dense 3D volumes, often at multiple levels of detail, and are generally slow to train. In this work, we investigate extracting meshable surfaces directly from the explicit Gaussian scene representation.

Meshable 3D Gaussians. Extracting meshable surfaces from Gaussian primitives is a relatively new topic. Keselman et al. [20] propose generating an oriented point set from a trained Gaussian scene to be meshed with Poisson reconstruction [18], using back-projected depth maps and analytically estimated normals. However, without regularization, this approach results in noisy point clouds that are difficult to mesh. NeuSG [4] addresses this by jointly training a dense SDF neural implicit model with the Gaussian scene, aligning Gaussians with SDF-estimated normals. While this method produces good reconstructions, it has long training times (over 16 hours on high-end GPUs), diminishing the appeal of 3DGS.

SuGaR [12] proposes treating the positions of Gaussians and their densities as intersections of a level set and optimizes a signed-distance loss to converge Gaussians to the surface. Normals are estimated from the derivative of the signed distance, similar to Keselman et al. However, due to the lack of geometric priors, the reconstructions remain noisy. SuGaR refines the coarse mesh through differentiable optimization, making the process computationally costly. In contrast, the recent concurrent method 2DGS [16] proposes to use a Gaussian surfel representation to explicitly model planar surfaces. However, as shown in Table 1 and Table 2, without prior regularization, the results on outward-facing indoor datasets are poor. We demonstrate how to improve mesh extraction and novel view synthesis using depth and normal priors during optimization.

3 Preliminaries

Our work builds on 3D Gaussian splatting (3DGS, [19]) and we briefly describe the rasterization algorithm. 3DGS represents a scene with differentiable 3D Gaussian primitives parameterized by their mean $\bm{\mu}\in\mathbb{R}^{3}$ and covariance matrix $\bm{\Sigma}\in\mathbb{R}^{3\times 3}$ which is decomposed into a scaling vector $\bm{s}\in\mathbb{R}^{3}$ and a rotation quaternion $\bm{q}\in\mathbb{R}^{4}$ . Other parameters include opacity $o\in\mathbb{R}$ and color $\bm{c}\in\mathbb{R}^{3}$ , represented via spherical harmonics. Rendering a new view involves projecting 3D Gaussians into camera space as 2D Gaussians. These 2D Gaussians are sorted by z-depth in a single global sort and alpha-composited using the discrete volume rendering equation to produce pixel colors $\bf\hat{C}$ :

$$\hat{\mathbf{C}}=\sum_{i\in N}\mathbf{c}_{i}\alpha_{i}T_{i},\quad\textrm{where }T_{i}=\prod_{j=1}^{i-1}(1-\alpha_{j})\tag{1}$$

where $T_{i}$ is the accumulated transmittance at pixel location $\bm{p}$ and $\alpha_{i}$ is the blending coefficient for a Gaussian with center $\bm{\mu}_{i}$ in screen space:

$$\alpha_{i}=o_{i}\cdot\exp\left(-\frac{1}{2}(\bm{p}-\bm{\mu}_{i})^{\intercal}\bm{\Sigma}_{i}^{-1}(\bm{p}-\bm{\mu}_{i})\right).\tag{2}$$

Gaussian projection and blending operations are parallelized, allowing for real-time rendering performance. The scene is initialized with Gaussian means and colors obtained from SfM [40, 41]. During optimization, the parameters of the Gaussians are updated via gradient descent through many rendering iterations to best fit the training dataset images. The algorithm progressively culls, splits, and duplicates Gaussians in the scene at fixed intervals based on Gaussian opacity, screen-space size, and the magnitude of the gradient of Gaussian means, respectively.

4 Method

In Section 4.1 we utilize sensor and monocular depth priors to regularize Gaussian positions with an edge-aware loss. Next, we extract normal directions from Gaussians and utilize these normal cues for regularization in Section 4.2. Additionally, we add a smoothing prior on rendered normal maps to better align nearby Gaussians during optimization in Section 4.3. Lastly, in Section 4.4, we use the optimized Gaussian scene to directly extract meshes using Poisson surface reconstruction.

4.1 Leveraging depth cues

Depth prediction. Per-pixel z-depth estimates $\hat{D}$ are rendered using the discrete volume rendering approximation similar to color values:

$$\hat{D}=\sum_{i\in N}d_{i}\alpha_{i}\prod_{j=1}^{i-1}(1-\alpha_{j})\tag{3}$$

where $d_{i}$ is the z-depth coordinate of the $i^{\text{th}}$ Gaussian in view space. Since 3DGS does not sort Gaussians individually per pixel along a viewing ray and instead relies on a single global sort for efficiency, this is only an approximation of per-pixel depth. The z-depth ordering for nearby Gaussians and specific orientations is not guaranteed to be correct after 2D projection and a global sort, as explained in StopThePop [36]. However, sorting by depth for each pixel would be too computationally costly since scenes can contain millions of Gaussians. Nevertheless, this approximation remains effective, particularly for more regular geometries typically encountered in indoor datasets. We normalize depth estimates with the final accumulated transmittance $T_{i}$ per pixel, ensuring that we correctly estimate a depth value for pixels where the accumulated transmittance does not equal 1. We rasterize color and depths simultaneously per pixel in a single CUDA kernel forward pass, improving inference and training speed compared to separate rendering steps.
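The per-pixel depth compositing of Eq. 3 can be illustrated with a minimal NumPy sketch for a single pixel. It is not the CUDA rasterizer: the function name `composite_depth` is hypothetical, the inputs are assumed to be already sorted front to back, and realizing the normalization as a division by the accumulated blending weight (one minus the final transmittance) is our reading of the text rather than a confirmed implementation detail.

```python
import numpy as np

def composite_depth(z_depths, alphas, eps=1e-10):
    """Alpha-composite per-Gaussian z-depths for one pixel (Eq. 3).

    z_depths, alphas: 1D sequences, assumed sorted front to back.
    Returns (normalized_depth, final_transmittance).
    """
    z_depths = np.asarray(z_depths, dtype=np.float64)
    alphas = np.asarray(alphas, dtype=np.float64)
    # T_i = prod_{j<i} (1 - alpha_j), with T_1 = 1 for the front-most Gaussian.
    transmittance = np.cumprod(np.concatenate(([1.0], 1.0 - alphas)))[:-1]
    weights = alphas * transmittance
    raw_depth = np.sum(weights * z_depths)
    accumulated = np.sum(weights)          # equals 1 - final transmittance
    final_transmittance = 1.0 - accumulated
    # Assumed normalization: divide by the accumulated weight so that
    # semi-transparent pixels still receive a meaningful depth value.
    return raw_depth / max(accumulated, eps), final_transmittance
```

For two Gaussians at the same depth that do not fully occlude the pixel, the raw sum underestimates the depth, while the normalized value recovers it.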

Sensor depth regularization. We directly apply depth regularization on predicted depth maps for datasets containing lidar or sensor depth measurements [38, 56, 44]. Common commercial depth sensors, especially low-resolution variants found in consumer devices like iPhones, often produce non-smooth edges at object boundaries and provide inaccurate readings. Based on this observation and inspired by [5, 22], we propose a gradient-aware depth loss for adaptive depth regularization based on the current RGB image. The depth loss is lowered in regions with large image gradients, which signify edges, so that regularization is enforced more strongly in smooth, textureless regions that typically pose challenges for the photometric loss alone. Additionally, our experiments (cf. Table 5) show that using a logarithmic penalty results in smoother reconstructions compared to linear or quadratic penalties. This insight drives our formulation of the gradient-aware depth loss, which balances the regularization across different regions of the image, adapting to the scene’s geometry and texture complexity. Specifically, the gradient-aware depth loss is defined as follows:

$$\mathcal{L}_{\hat{D}}=g_{\text{rgb}}\frac{1}{|\hat{D}|}\sum\log\left(1+\|\hat{D}-D\|_{1}\right)\tag{4}$$

where $g_{\text{rgb}}=\exp(-\nabla I)$, $\nabla I$ is the gradient of the current aligned RGB image, and $D$ is the sensor depth map. $|\hat{D}|$ indicates the total number of pixels in $\hat{D}$.
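A minimal NumPy sketch of Eq. 4 follows. Several details are assumptions made for illustration and are not specified in the text: the image gradient is taken as the sum of absolute forward differences of the grayscale image, the weight $g_{\text{rgb}}$ is applied per pixel inside the average, and the function name `edge_aware_log_depth_loss` is hypothetical.

```python
import numpy as np

def edge_aware_log_depth_loss(pred_depth, ref_depth, rgb):
    """Gradient-aware logarithmic depth loss in the spirit of Eq. 4.

    pred_depth, ref_depth: (H, W) arrays; rgb: (H, W, 3) array in [0, 1].
    """
    gray = rgb.mean(axis=-1)
    # Forward differences, padded with the last row/column to keep shape (H, W).
    gx = np.abs(np.diff(gray, axis=1, append=gray[:, -1:]))
    gy = np.abs(np.diff(gray, axis=0, append=gray[-1:, :]))
    grad = gx + gy                      # assumed gradient magnitude
    g_rgb = np.exp(-grad)               # close to 1 in flat regions, small at edges
    log_err = np.log1p(np.abs(pred_depth - ref_depth))
    return float(np.mean(g_rgb * log_err))
```

On a textureless image the weight is 1 everywhere and the loss reduces to a plain logarithmic L1 penalty; on a strongly textured image the same depth error is penalized less.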

Monocular depth regularization. For datasets containing no depth data, we rely on scale-aligned monocular depth estimation networks for regularization. We use off-the-shelf monocular depth networks, such as ZoeDepth [3] and DepthAnything [53], for dense per-pixel depth priors. We address the scale ambiguity between estimated depths and the scene by comparing them with sparse SfM points, similar to prior work [58, 5]. Specifically, for each monocular depth estimate $D_{\textrm{mono}}$ , we align the scale to match that of the sparse depth map $D_{\textrm{sparse}}$ obtained by projecting SfM points to the camera view. We solve for a per-image scale $a$ and shift $b$ parameter using the closed-form linear regression solution to:

$$\hat{a},\hat{b}=\operatorname*{arg\,min}_{a,b}\sum_{ij}\left\|(a\,D_{\text{mono},ij}+b)-D_{\text{sparse},ij}\right\|_{2}^{2}\tag{5}$$

where we denote $D_{\text{sparse},ij}$ and $D_{\text{mono},ij}$ as per-pixel correspondences between the two depth maps. We then apply the same loss as in Eq. 4 for regularization.
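The closed-form alignment of Eq. 5 is an ordinary least-squares fit over the pixels where a projected SfM point exists. The sketch below uses `numpy.linalg.lstsq`; the function name `align_scale_shift` and the boolean `valid` mask marking pixels with sparse depth are illustrative assumptions.

```python
import numpy as np

def align_scale_shift(d_mono, d_sparse, valid):
    """Solve for per-image scale a and shift b (Eq. 5) by least squares.

    d_mono, d_sparse: (H, W) depth maps; valid: (H, W) boolean mask of
    pixels that have a sparse SfM depth.
    """
    x = d_mono[valid].ravel()
    y = d_sparse[valid].ravel()
    # Design matrix [D_mono, 1] so that A @ [a, b] approximates D_sparse.
    A = np.stack([x, np.ones_like(x)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(a), float(b)
```

The aligned prior is then `a * d_mono + b`, which can be used in place of the sensor depth in Eq. 4.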

4.2 Leveraging normal cues

Normal prediction. During optimization, we expect Gaussians to become flat and disc-like, with one scaling axis much smaller than the other two. This smaller scaling axis serves as an approximation of the normal direction. Specifically, we define a geometric normal for a Gaussian using a rotation matrix $R\in SO(3)$, obtained from its quaternion $\bm{q}$, and scaling coefficients $\bm{s}\in\mathbb{R}^{3}$:

$$\bm{\hat{n}}_{i}=R\cdot\textrm{OneHot}\left(\operatorname*{arg\,min}(s_{1},s_{2},s_{3})\right)\tag{6}$$

where OneHot(.) $\in\mathbb{R}^{3}$ returns a unit vector with zeros everywhere except at the position where the scaling $\bm{s}_{i}=(s_{1},s_{2},s_{3})$ is minimum. We minimize one of the scaling axes during training to force Gaussians to become disc-like surfels:

$$\mathcal{L}_{\text{scale}}=\sum_{i}\left\|\operatorname*{arg\,min}(\bm{s}_{i})\right\|_{1}.\tag{7}$$
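Eqs. 6 and 7 can be sketched in NumPy as follows. The quaternion ordering $(w, x, y, z)$, the use of positive (already activated) scales, and the function names are assumptions for illustration. We also read Eq. 7 as penalizing the magnitude of the smallest scale of each Gaussian, which is our interpretation of the stated goal of shrinking one axis.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion in assumed (w, x, y, z) order to a 3x3 rotation."""
    w, x, y, z = np.asarray(q, dtype=np.float64) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def gaussian_normal(q, scales):
    """Eq. 6: rotate the one-hot vector selecting the smallest scaling axis."""
    R = quat_to_rotmat(q)
    one_hot = np.zeros(3)
    one_hot[int(np.argmin(scales))] = 1.0
    return R @ one_hot                  # equals the selected column of R

def scale_loss(all_scales):
    """Eq. 7, read as the sum of the smallest scale magnitude per Gaussian."""
    all_scales = np.asarray(all_scales, dtype=np.float64)
    return float(np.sum(np.abs(np.min(all_scales, axis=1))))
```

Because the normal is a function of the rotation and scale parameters only, gradients of any loss on the normal flow back into the covariance of the Gaussian without extra learnable parameters.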

To ensure correct orientations, we flip the direction of the normals at the beginning of training if the dot product between the current camera viewing direction and the Gaussian normal is negative. Normals are transformed into camera space using the current camera transform and alpha-composited according to the rendering equation to provide a single per-pixel normal estimate:

$$\bm{\hat{N}}=\sum_{i\in N}\bm{\hat{n}}_{i}\alpha_{i}T_{i}.\tag{8}$$

In contrast to prior research [26, 8], which appends an additional learnable parameter per Gaussian for normal prediction, our approach derives normals directly from the geometry. Consequently, during back-propagation, adjustments to Gaussian scale and rotation parameters, i.e., covariance matrices, directly lead to updates in the normal estimates. Therefore, no additional learnable parameters are needed. Intuitively, this results in Gaussians better conforming to the scene’s geometry, as their orientations and scales are compelled to align with the surface normal.

Monocular normals regularization. Gao et al. [8] propose using pseudo-ground truth normal maps estimated from the gradient of rendered depths for supervision, referred to as $\nabla\hat{D}$ . However, due to noise in rendered depth maps, especially in complex scenes, this method results in artifacts. Instead, we supervise predicted normals using monocular cues obtained from Omnidata [7], which provide much smoother normal estimates. Fig. 2 highlights this difference. We regularize with an L1 loss:

$$\mathcal{L}_{\hat{N}}=\frac{1}{|\hat{N}|}\sum\|\bm{\hat{N}}-\bm{N}\|_{1}.\tag{9}$$

We further apply a prior on the total variation of predicted normals, encouraging smooth normal predictions at neighboring pixels with:

$$\mathcal{L}_{\text{smooth}}=\frac{1}{|\hat{N}|}\sum_{i,j}\left(|\bm{\hat{N}}_{i+1,j}-\bm{\hat{N}}_{i,j}|+|\bm{\hat{N}}_{i,j+1}-\bm{\hat{N}}_{i,j}|\right)\tag{10}$$

where $\bm{\hat{N}}_{i,j}$ represents estimated normal values at pixel position $(i,j)$. Thus, our normal regularization loss is defined as follows: $\mathcal{L}_{\text{normal}}=\mathcal{L}_{\hat{N}}+\mathcal{L}_{\text{smooth}}$.
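The two normal terms of Eqs. 9 and 10 can be written compactly over $(H, W, 3)$ normal maps. In this sketch the absolute value of a vector difference is taken as the L1 norm over the three channels, and boundary pixels without a neighbor simply contribute no difference; both choices, and the function names, are assumptions.

```python
import numpy as np

def normal_l1_loss(pred, ref):
    """Eq. 9: per-pixel L1 distance between rendered and prior normals."""
    return float(np.abs(pred - ref).sum(axis=-1).mean())

def normal_tv_loss(pred):
    """Eq. 10: total variation of the rendered normal map."""
    h, w, _ = pred.shape
    d_i = np.abs(pred[1:, :, :] - pred[:-1, :, :]).sum()   # vertical neighbors
    d_j = np.abs(pred[:, 1:, :] - pred[:, :-1, :]).sum()   # horizontal neighbors
    return float((d_i + d_j) / (h * w))
```

A constant normal map gives zero total variation, so the smoothing prior only acts where neighboring pixels disagree.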

Normal initialization. We initialize Gaussian orientations by estimating normal directions from the initial SfM point cloud using [60], aligning the Gaussian orientations $q$ based on these estimated normals, and setting one of the scaling axes to be smaller than the others. This initialization helps to speed up convergence.

4.3 Optimization

The final loss we use for optimization is defined as follows:

$$\mathcal{L}=\mathcal{L}_{\text{rgb}}+\lambda_{\text{depth}}\mathcal{L}_{\hat{D}}+\mathcal{L}_{\text{scale}}+(\underbrace{\lambda_{\text{normal}}\mathcal{L}_{\hat{N}}+\lambda_{\text{smooth}}\mathcal{L}_{\text{smooth}}}_{\mathcal{L}_{\text{normal}}})\tag{11}$$

where $\mathcal{L}_{\text{rgb}}$ is the original photometric loss proposed in [19]. We set $\lambda_{\text{depth}}=0.2$ , $\lambda_{\text{normal}}=0.1$ , and $\lambda_{\text{smooth}}=0.5$ in our experiments.
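As a small worked example of Eq. 11, the weighted combination with the stated default weights can be written as below. The individual loss values are plain scalars here, and the function name `total_loss` is hypothetical.

```python
def total_loss(l_rgb, l_depth, l_scale, l_normal_l1, l_smooth,
               lambda_depth=0.2, lambda_normal=0.1, lambda_smooth=0.5):
    """Eq. 11 with the weights reported in the text as defaults."""
    # Weighted normal term: L1 supervision plus the smoothing prior.
    l_normal = lambda_normal * l_normal_l1 + lambda_smooth * l_smooth
    return l_rgb + lambda_depth * l_depth + l_scale + l_normal
```

With every individual term equal to 1, the total is $1 + 0.2 + 1 + 0.1 + 0.5 = 2.8$, which shows that the photometric and scale terms carry the largest fixed weight.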

4.4 Meshing

After optimizing with our depth and normal regularization using Eq. 11, we apply Poisson surface reconstruction [18] to extract a mesh. SuGaR [12] proposes to extract a coarse mesh from the scene by projecting rays from camera views and linearly interpolating intersection points based on the value of the local density (cf. Eq. 2) derived from nearby Gaussians along a ray. However, since scenes contain thousands of Gaussians, the local density along nearby points is non-smooth, resulting in noisy surfaces (cf. Table 7). In the context of depth regularization, we have already ensured that the positions of the Gaussians are well-distributed and aligned along the surface of the scene. Therefore, we directly back-project rendered depth and normal maps from training views to create an oriented point set for meshing. We show qualitative and quantitative differences between various Poisson mesh extraction methods in Table 7.
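The back-projection step can be sketched as follows for a single training view. The sketch assumes a pinhole camera with intrinsics `fx, fy, cx, cy`, z-depth along the optical axis, normals expressed in camera space, and a 4x4 camera-to-world matrix `c2w`; these conventions and the function name are illustrative assumptions, not the released implementation.

```python
import numpy as np

def backproject_oriented_points(depth, normals_cam, fx, fy, cx, cy, c2w):
    """Lift a rendered depth and normal map to world-space oriented points.

    depth: (H, W) z-depth; normals_cam: (H, W, 3); c2w: (4, 4) pose.
    Returns (points, normals), each of shape (K, 3) for K valid pixels.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel column and row
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts_cam = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    R, t = c2w[:3, :3], c2w[:3, 3]
    pts_world = pts_cam @ R.T + t                     # rotate, then translate
    n_world = normals_cam.reshape(-1, 3) @ R.T        # normals only rotate
    n_world = n_world / (np.linalg.norm(n_world, axis=1, keepdims=True) + 1e-12)
    valid = depth.reshape(-1) > 0                     # drop empty pixels
    return pts_world[valid], n_world[valid]
```

Concatenating the outputs over all training views and subsampling to the desired point budget yields an oriented point set that can be passed to a Poisson surface reconstruction solver.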

5 Experiments

In this section, we demonstrate the proposed regularization strategy on mesh extraction and novel view synthesis results using indoor datasets.

Datasets. We focus on indoor datasets and consider the following: a) MuSHRoom [38]: a real-world indoor dataset containing separate training and evaluation camera trajectories; b) ScanNet++ [56]: a real-world indoor dataset with high fidelity 3D geometry and RGB data; and c) Replica [44]: smaller real-world indoor scans.

Baselines. We consider a range of baseline methods for comparison: a) the state-of-the-art NeRF-based method Nerfacto [45]; b) its depth-regularized version Depth-Nerfacto, with a direct loss on the ray termination distribution for depth supervision similar to DS-NeRF [6]; c) Neusfacto [57] and MonoSDF [58] for SDF-based implicit surface reconstruction; d) the baseline 3DGS Splatfacto method based on Nerfstudio v1.1.3 [45]; e) SuGaR [12], a 3DGS variant for mesh reconstruction; and f) the recent 2DGS [16] method.

Evaluation metrics. We follow standard practice and use PSNR, SSIM, and LPIPS metrics for color images and report common depth metrics, similar to [51, 23, 29, 42, 33, 47, 61], to analyze depth quality for datasets containing sensor or ground truth depth data. For mesh evaluation, we follow the evaluation protocol from [38, 49] and report Accuracy (Acc.), Completion (Comp.), Chamfer- $L_{1}$ distance (C- $L_{1}$ ), Normal Consistency (NC), and F-scores (F1) with a threshold of 5 cm.
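For reference, the point-based mesh metrics can be sketched under their commonly used definitions, which we assume here since the protocol of [38, 49] is not restated in the text: accuracy is the mean distance from predicted points to the ground truth, completion is the reverse, Chamfer-$L_{1}$ is their average, and the F-score combines precision and recall at the distance threshold. The brute-force nearest-neighbor search is only suitable for small point sets, and the function names are hypothetical.

```python
import numpy as np

def nearest_distances(src, dst):
    """Distance from each point in src (N, 3) to its closest point in dst (M, 3)."""
    diff = src[:, None, :] - dst[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)

def mesh_metrics(pred_pts, gt_pts, threshold=0.05):
    """Accuracy, completion, Chamfer-L1, and F-score (threshold in meters)."""
    d_pred = nearest_distances(pred_pts, gt_pts)     # prediction -> ground truth
    d_gt = nearest_distances(gt_pts, pred_pts)       # ground truth -> prediction
    acc, comp = float(d_pred.mean()), float(d_gt.mean())
    precision = float((d_pred < threshold).mean())
    recall = float((d_gt < threshold).mean())
    denom = precision + recall
    f1 = 0.0 if denom == 0 else 2 * precision * recall / denom
    return {"acc": acc, "comp": comp, "chamfer_l1": 0.5 * (acc + comp), "f1": f1}
```

The averaging used for Chamfer-$L_{1}$ is consistent with the reported tables, e.g. $(.0430+.0578)/2=.0504$ for Nerfacto in Table 1.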

Implementation details. The proposed approach is implemented in PyTorch [35] and gsplat [55] (v1.0.0). We train all models for 30k iterations. To obtain monocular normal cues, we pass RGB images through the pre-trained Omnidata model [7]. For Poisson reconstruction, we extract a total of 2 million points and use a depth level of 9 for all methods, unless otherwise stated. All meshes are extracted using back-projection of depth and/or normal maps, except for Neusfacto and MonoSDF (marching cubes) and 2DGS (TSDF). Further settings are given in the supplementary material.

Type	Method	Sensor Depth	Algorithm	Accuracy $\downarrow$	Completion $\downarrow$	Chamfer- $L_{1}$ $\downarrow$	Normal Consistency $\uparrow$	F-score $\uparrow$
NeRF	Nerfacto [45]	-	Poisson	.0430	.0578	.0504	.7822	.7212
NeRF	Depth-Nerfacto [45]	$\checkmark$	Poisson	.0447	.0557	.0502	.7614	.6966
SDF	MonoSDF [58]	$\checkmark$	Marching-Cubes	.0310	.0190	.0250	.8846	.9211
Gaussian	Splatfacto [45]	-	Poisson	.0749	.0555	.0652	.7727	.5835
Gaussian	SuGaR [12]	-	Poisson+IBR	.0656	.0583	.0620	.8031	.6378
Gaussian	2DGS [16]	-	TSDF	.0731	.0642	.0687	.8008	.6039
Gaussian	DN-Splatter (Ours)	$\checkmark$	Poisson	.0239	.0194	.0216	.8822	.9243

Table 1: Mesh evaluation: MuSHRoom. The results are averaged over 6 scenes: “coffee_room”, “honka”, “kokko”, “sauna”, “computer”, and “vr_room”. Incorporating depth and normal priors into 3DGS optimization significantly improves mesh reconstruction on challenging real-world indoor datasets. The best and second best results are marked with bold and underline.

Type	Method	Sensor Depth	Algorithm	Accuracy $\downarrow$	Completion $\downarrow$	Chamfer- $L_{1}$ $\downarrow$	Normal Consistency $\uparrow$	F-score $\uparrow$
NeRF	Nerfacto [45]	-	Poisson	.1305	.1484	.1394	.7153	.4698
NeRF	Depth-Nerfacto [45]	$\checkmark$	Poisson	.0731	.1647	.1189	.6848	.5018
SDF	Neusfacto [57]	-	Marching-Cubes	.0736	.1945	.1340	.7159	.4605
SDF	MonoSDF [58]	$\checkmark$	Marching-Cubes	.0303	.0573	.0438	.8881	.8577
Gaussian	Splatfacto [45]	-	Poisson	.1934	.1503	.1719	.6741	.1790
Gaussian	SuGaR [12]	-	Poisson+IBR	.0940	.1011	.0975	.7241	.4367
Gaussian	2DGS [16]	-	TSDF	.1272	.0798	.1035	.7799	.4196
Gaussian	DN-Splatter (Ours)	$\checkmark$	Poisson	.0487	.0873	.0680	.8404	.6941

Table 2: Mesh evaluation: ScanNet++. The results are averaged over the “b20a261fdf” and “8b5caf3398” scenes of the dataset. The best and second best results are marked with bold and underline.

5.1 Mesh evaluation

We demonstrate the effectiveness of the proposed depth and normal regularization strategy on scene geometry by extracting meshes directly after optimization, without any additional refinement steps. In Table 1 and Table 2 we show how incorporating geometric cues enables competitive mesh extraction on scenes from the MuSHRoom [38] and ScanNet++ [56] datasets. We first observe that without the usage of monocular depth cues, neither NeRF- [45] nor Gaussian-based methods [16] work well for indoor reconstruction. Improving depth and normal estimation over the baseline Splatfacto [45] makes our method competitive even against the more computationally expensive SDF approaches [58, 57]. NeRF-based methods fail to consistently achieve similar results, with depth supervision sometimes hindering mesh performance. Fig. 3 provides a qualitative comparison of our mesh extraction approach with other NeRF and 3DGS baselines on the ScanNet++ dataset.

5.2 Novel view synthesis and depth estimation

Method	Sensor Depth	Abs Rel $\downarrow$	Sq Rel $\downarrow$	RMSE $\downarrow$	$\delta<1.25$ $\uparrow$	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$
Nerfacto [45]	-	.0862 / .0747	.0293 / .0141	.0794 / .0667	.9335 / .9428	20.86 / 20.66	.7859 / .7633	.2321 / .2702
Depth-Nerfacto [45]	$\checkmark$	.0727 / .0563	.0155 / .0283	.0583 / .2840	.9469 / .9389	21.24 / 18.93	.7832 / .7023	.2414 / .3978
MonoSDF [58]	$\checkmark$	.0555 / .0911	.0263 / .0804	.2493 / .4535	.9353 / .8825	20.68 / 20.16	.7357 / .7653	.3590 / .2261
2DGS [16]	-	.0864 / .0923	.0612 / .0583	.3799 / .3132	.8820 / .8927	22.52 / 21.73	.8185 / .7898	.1773 / .1911
Splatfacto [45]	-	.0787 / .0817	.0364 / .0521	.2407 / .2941	.9072 / .9061	24.44 / 21.33	.8486 / .7821	.1387 / .2240
Splatfacto + $\mathcal{L}_{\hat{D}}$	$\checkmark$	.0234 / .0364	.0092 / .0145	.1293 / .1486	.9849 / .9745	24.77 / 21.95	.8538 / .7948	.1238 / .1852
Splatfacto + $\mathcal{L}_{\hat{D}}$ + $\mathcal{L}_{\text{normal}}$	$\checkmark$	.0241 / .0340	.0094 / .0123	.1308 / .1472	.9848 / .9777	24.67 / 21.99	.8517 / .7941	.1275 / .1864
DN-Splatter (ours)	$\checkmark$	.0228 / .0354	.0089 / .0214	.1280 / .2032	.9854 / .9683	24.58 / 21.89	.8558 / .7984	.1293 / .1797

Table 3: Depth estimation and novel view synthesis: MuSHRoom. Results are reported for two distinct evaluation sets (left / right): left is a test set obtained by sampling uniformly every 10 frames within the training sequence, and right is a test split obtained from a different camera trajectory with no overlap with the training sequence. Results are averaged over 5 scenes.

We present an extensive study of depth and normal supervision and its effect on novel view synthesis and depth metrics on challenging real-world scenes from the MuSHRoom and ScanNet++ datasets. Table 3 demonstrates that incorporating sensor depth supervision into 3DGS enhances depth and RGB metrics in comparison to scenarios without geometric supervision. We qualitatively highlight the improvements in novel view synthesis and depth quality with the reduction of floaters and artifacts in Fig. 4.
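The depth metrics in Table 3 are not defined in the text; the sketch below uses their widely used definitions from the monocular depth estimation literature, which we assume match the reported columns. The function name `depth_metrics` and the convention that non-positive ground truth marks invalid pixels are illustrative assumptions.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Abs Rel, Sq Rel, RMSE, and delta < 1.25 over pixels with valid ground truth.

    pred, gt: arrays of the same shape; predicted depths are assumed positive.
    """
    valid = gt > 0
    p, g = pred[valid], gt[valid]
    abs_rel = float(np.mean(np.abs(p - g) / g))
    sq_rel = float(np.mean((p - g) ** 2 / g))
    rmse = float(np.sqrt(np.mean((p - g) ** 2)))
    # Fraction of pixels whose depth ratio is within a factor of 1.25.
    ratio = np.maximum(p / g, g / p)
    delta = float(np.mean(ratio < 1.25))
    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse, "delta_1.25": delta}
```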

Method	C- $L_{1}$ $\downarrow$	NC $\uparrow$	F1 $\uparrow$	PSNR $\uparrow$	SSIM $\uparrow$	Time (min)
No Cues ( $\mathcal{L}_{\text{rgb}}$ ) [19, 45]	6.99	78.0	56.3	26.6	.870	12.9
+ Normal ( $\mathcal{L}_{\hat{N}}$ )	7.7	81.3	60.2	26.7	.874	17.2
+ Normal ( $\mathcal{L}_{\hat{N}}$ + $\mathcal{L}_{\text{smooth}}$ )	6.7	82.6	58.9	26.4	.866	17.3
+ Depth ( $\mathcal{L}_{\hat{D}}$ )	2.6	86.8	86.6	27.1	.885	13.1
+ Both ( $\mathcal{L}_{\hat{D}}$ + $\mathcal{L}_{\hat{N}}$ )	2.4	90.0	87.7	26.9	.883	17.5
+ Both ( $\mathcal{L}_{\hat{D}}$ + $\mathcal{L}_{\hat{N}}$ + $\mathcal{L}_{\text{smooth}}$ )	2.5	89.6	87.1	26.8	.879	17.9

Table 4: Ablation of geometric supervision. Geometric cues significantly improve reconstruction quality. We report mesh, novel view, and training time metrics (Nvidia 4090 GPU, 30k iterations) on the “VR room” sequence of the MuSHRoom dataset [38].

Table 5: Ablation of depth losses: ScanNet++. We compare various depth supervision strategies. We observe that the proposed gradient-aware $\mathcal{L}_{\hat{D}}$ regularizer obtains the best qualitative results, mitigating uncertainties at edges from the raw iPhone depth captures. Zoom in to see the details.

5.3 Ablation studies

Regularization strategy. We evaluate various design choices of the proposed method in Table 4. The normal loss $\mathcal{L}_{\hat{N}}$ (Eq. 9) helps align Gaussians along the scene geometry leading to a better scene geometry representation. The depth loss $\mathcal{L}_{\hat{D}}$ (Eq. 4) significantly improves reconstruction quality and novel-view synthesis in ambiguous, textureless regions which are common in indoor scenes. The normal smoothing prior $\mathcal{L}_{\text{smooth}}$ (Eq. 10) enhances normal-completeness for Poisson meshing with a minimal impact on other metrics. Although the smoothing prior’s effect on quantitative metrics is minimal, its importance is evident in the qualitative renders illustrated in Fig. 5.

Depth supervision and losses. We analyze the impact of the proposed depth loss on the ScanNet++ dataset, providing both qualitative and quantitative results in Table 5. We compare several common depth losses: $\mathcal{L}_{\text{MSE}}$ , $\mathcal{L}_{\mathrm{1}}$ , $\mathcal{L}_{\mathrm{LogL1}}$ [14] and our proposed edge-aware $\mathcal{L}_{\hat{D}}$ loss, examining their effect on depth and novel view synthesis metrics. Further comparisons of depth loss performance against ground truth lidar scans and definitions can be found in the supplementary material.

We observe that $\mathcal{L}_{\mathrm{1}}$ and $\mathcal{L}_{\mathrm{LogL1}}$ generally perform the best on color metrics, with the gradient-based logarithmic variant providing the smoothest qualitative reconstructions.

In addition, in Table 6 we evaluate our regularization strategy against other alternatives. We demonstrate that mesh reconstruction using current state-of-the-art monocular [3, 15] or multi-view [59] depth networks is far less accurate compared to reconstruction using low-resolution iPhone sensor depths. Monocular and multi-view depth estimates are aligned using the strategy outlined in Section 4.1 and Eq. 5, and mesh metrics are computed on the ScanNet++ dataset. We also investigate the patch-based Pearson Correlation loss [52] to examine relative depth supervision with monocular depth [15] estimates. Although the method exceeds naive depth supervision with monocular estimates, the reconstruction quality remains inferior compared to using iPhone depths.

	Acc. $\downarrow$	Comp. $\downarrow$	C- $L_{1}$ $\downarrow$	NC $\uparrow$	F-score $\uparrow$
No supervision	.2627	.2091	.2359	.6511	.1343
Monodepth: Zoe-Depth [3]	.1751	.2084	.1918	.7420	.1455
Monodepth: Metric3D [15]	.1798	.2079	.1938	.7358	.1439
Multi-view depth [59]	.3120	.2375	.2748	.6408	.1903
Patch-based depth [52]	.1183	.1766	.1474	.7975	.2236
Sensor depth (iPhone)	.0609	.1433	.1021	.8130	.5833

Table 6: Ablation of depth supervision. We compare monocular [3, 15], multi-view [59], relative (patch-based) [52], and sensor depth supervision strategies on the “b20a261fdf” scene of ScanNet++.

Despite the low-resolution sensor depths lacking detail and containing inaccuracies (mostly at object edges), they remain practical for use in real-world indoor scenes. Future research is needed to bridge the performance gap for monocular depth supervision.

Mesh extraction techniques. Lastly, we investigate various Poisson meshing techniques. In Table 7, we demonstrate that extracting oriented point sets from optimized depth and normal maps results in smoother and more realistic reconstructions compared to other methods. We report mesh evaluation metrics for these different techniques. We compare several approaches: directly using trained Gaussian means and normals (total of 512k Gaussians); extraction of surface density at levels 0.1 and 0.5, as proposed in SuGaR [12]; back-projection of optimized depth and normal maps, as explained in Section 4.4. All models were trained with our depth and normal regularization. To ensure a fair comparison, we set the total number of extracted points to 500k for both the surface density and back-projection methods.
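The back-projection step mentioned above can be illustrated with a short NumPy sketch. It is not the released implementation: the function names `backproject` and `subsample` are hypothetical, and the code assumes a pinhole camera with intrinsics `K`, z-depth (not ray distance), normals expressed in the camera frame, and pixel centers at half-integer coordinates.

```python
import numpy as np

def backproject(depth, normals, K, cam_to_world):
    """Lift a depth map and a normal map to an oriented point set in world space.

    depth:        (H, W) z-depth per pixel (assumption: 0 marks invalid pixels)
    normals:      (H, W, 3) normals in the camera frame (assumption)
    K:            (3, 3) pinhole intrinsics
    cam_to_world: (4, 4) camera-to-world transform
    """
    h, w = depth.shape
    # Pixel centers; u runs along the width, v along the height.
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts_cam = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    valid = pts_cam[:, 2] > 0

    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    pts_world = pts_cam[valid] @ R.T + t
    # Normals are directions, so only the rotation is applied.
    n_world = normals.reshape(-1, 3)[valid] @ R.T
    n_world = n_world / np.linalg.norm(n_world, axis=1, keepdims=True)
    return pts_world, n_world

def subsample(points, normals, n, seed=0):
    """Randomly keep at most n oriented points (e.g. a fixed point budget)."""
    if len(points) <= n:
        return points, normals
    keep = np.random.default_rng(seed).choice(len(points), size=n, replace=False)
    return points[keep], normals[keep]
```

The points and normals collected over all training views could then be passed to any Poisson surface reconstruction implementation [18], for example the one provided by Open3D [60].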

6 Conclusion

We presented DN-Splatter, a method for depth and normal regularization of 3D Gaussian splatting. This simple yet effective strategy enhances photorealism, as measured by common novel view synthesis metrics, and significantly improves depth estimation and the quality of surfaces extracted from the Gaussian scene. We demonstrated that prior regularization is essential for achieving more geometrically valid and consistent reconstructions in challenging indoor scenes. Although we improved over the state-of-the-art methods, our focus was limited to densely captured scenes with relatively still cameras. Future work could address more challenging and sparser data captures, which often suffer from motion blur and other capture artifacts. Additionally, better meshing techniques are needed to optimize Gaussian scene parameters and mesh quality concurrently, as Poisson surface reconstruction is more sensitive to the estimated positions of the points than to their normal estimates. Finally, more research is needed to bridge the performance gap between monocular depth estimation and real sensor depth supervision.

References

[1] Gwangbin Bae and Andrew J. Davison. Rethinking inductive biases for surface normal estimation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[2] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields, 2022.
[3] Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth, 2023.
[4] Hanlin Chen, Chen Li, and Gim Hee Lee. Neusg: Neural implicit surface reconstruction with 3d gaussian splatting guidance, 2023.
[5] Jaeyoung Chung, Jeongtaek Oh, and Kyoung Mu Lee. Depth-regularized optimization for 3d gaussian splatting in few-shot images. arXiv preprint arXiv:2311.13398, 2023.
[6] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised NeRF: Fewer views and faster training for free. In CVPR, June 2022.
[7] Ainaz Eftekhar, Alexander Sax, Jitendra Malik, and Amir Zamir. Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans. In ICCV, pages 10786–10796, 2021.
[8] Jian Gao, Chun Gu, Youtian Lin, Hao Zhu, Xun Cao, Li Zhang, and Yao Yao. Relightable 3d gaussian: Real-time point cloud relighting with brdf decomposition and ray tracing. arXiv:2311.16043, 2023.
[9] Michael Goesele, Noah Snavely, Brian Curless, Hugues Hoppe, and Steven M. Seitz. Multi-view stereo for community photo collections. In ICCV, pages 1–8, 2007.
[10] Guangcong Wang, Zhaoxi Chen, Chen Change Loy, and Ziwei Liu. Sparsenerf: Distilling depth ranking for few-shot novel view synthesis. In ICCV, 2023.
[11] Haoyu Guo, Sida Peng, Haotong Lin, Qianqian Wang, Guofeng Zhang, Hujun Bao, and Xiaowei Zhou. Neural 3d scene reconstruction with the manhattan-world assumption. In CVPR, 2022.
[12] Antoine Guédon and Vincent Lepetit. Sugar: Surface-aligned gaussian splatting for efficient 3d mesh reconstruction and high-quality mesh rendering, 2023.
[13] Wilfried Hartmann, Silvano Galliani, Michal Havlena, Luc Van Gool, and Konrad Schindler. Learned multi-patch similarity. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2017.
[14] Junjie Hu, Mete Ozay, Yan Zhang, and Takayuki Okatani. Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Dec 2018.
[15] Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. arXiv preprint arXiv:2404.15506, 2024.
[16] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields. In SIGGRAPH 2024 Conference Papers. Association for Computing Machinery, 2024.
[17] James T. Kajiya. The rendering equation. In Proceedings of the 13th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’86, page 143–150, New York, NY, USA, 1986. Association for Computing Machinery.
[18] Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson Surface Reconstruction. In Alla Sheffer and Konrad Polthier, editors, Symposium on Geometry Processing. The Eurographics Association, 2006.
[19] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM TOG, 42(4), 2023.
[20] Leonid Keselman and Martial Hebert. Flexible techniques for differentiable rendering with 3d gaussians. In ICCV, 2023.
[21] Georgios Kopanas, Julien Philip, Thomas Leimkühler, and George Drettakis. Point-based neural rendering with per-view optimization. Computer Graphics Forum (Proceedings of the Eurographics Symposium on Rendering), 40(4), June 2021.
[22] Elena Kosheleva, Sunil Jaiswal, Faranak Shamsafar, Noshaba Cheema, Klaus Illgner-Fehns, and Philipp Slusallek. Edge-aware consistent stereo video depth estimation. arXiv preprint arXiv:2305.02645, 2023.
[23] Uday Kusupati, Shuo Cheng, Rui Chen, and Hao Su. Normal assisted stereo depth estimation. In CVPR, pages 2189–2199, 2020.
[24] Yixing Lao, Xiaogang Xu, Zhipeng Cai, Xihui Liu, and Hengshuang Zhao. CorresNeRF: Image correspondence priors for neural radiance fields. In NeurIPS, 2023.
[25] Zhaoshuo Li, Thomas Müller, Alex Evans, Russell H Taylor, Mathias Unberath, Ming-Yu Liu, and Chen-Hsuan Lin. Neuralangelo: High-fidelity neural surface reconstruction. In CVPR, 2023.
[26] Zhihao Liang, Qi Zhang, Ying Feng, Ying Shan, and Kui Jia. Gs-ir: 3d gaussian splatting for inverse rendering, 2023.
[27] William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3d surface construction algorithm. In Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’87, page 163–169, New York, NY, USA, 1987. Association for Computing Machinery.
[28] Wenjie Luo, Alexander G. Schwing, and Raquel Urtasun. Efficient deep learning for stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[29] Xuan Luo, Jia-Bin Huang, Richard Szeliski, Kevin Matzen, and Johannes Kopf. Consistent video depth estimation. ACM TOG, 39(4):71–1, 2020.
[30] I. Melekhov, J. Kannala, and E. Rahtu. Image patch matching using convolutional descriptors with euclidean distance. In Proc. ACCVW, 2016.
[31] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
[32] Anastasiya Mishchuk, Dmytro Mishkin, Filip Radenović, and Jiří Matas. Working hard to know your neighbor’s margins: local descriptor learning loss. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 4829–4840. Curran Associates Inc., 2017.
[33] Zak Murez, Tarrence Van As, James Bartolozzi, Ayan Sinha, Vijay Badrinarayanan, and Andrew Rabinovich. Atlas: End-to-end 3d scene reconstruction from posed images. In ECCV, pages 414–431. Springer, 2020.
[34] Michael Niemeyer, Jonathan T. Barron, Ben Mildenhall, Mehdi S. M. Sajjadi, Andreas Geiger, and Noha Radwan. Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs. In CVPR, 2021.
[35] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, pages 8024–8035. Curran Associates, Inc., 2019.
[36] Lukas Radl, Michael Steiner, Mathias Parger, Alexander Weinrauch, Bernhard Kerbl, and Markus Steinberger. Stopthepop: Sorted gaussian splatting for view-consistent real-time rendering, 2024.
[37] Marie-Julie Rakotosaona, Fabian Manhardt, Diego Martin Arroyo, Michael Niemeyer, Abhijit Kundu, and Federico Tombari. Nerfmeshing: Distilling neural radiance fields into geometrically-accurate 3d meshes. In Proc. of the International Conf. on 3D Vision (3DV), 2024.
[38] Xuqian Ren, Wenjia Wang, Dingding Cai, Tuuli Tuominen, Juho Kannala, and Esa Rahtu. Mushroom: Multi-sensor hybrid room dataset for joint 3d reconstruction and novel view synthesis. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4508–4517, 2024.
[39] Darius Rückert, Linus Franke, and Marc Stamminger. Adop: Approximate differentiable one-pixel point rendering. ACM TOG, 41(4):1–14, 2022.
[40] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016.
[41] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In ECCV, 2016.
[42] Ayan Sinha, Zak Murez, James Bartolozzi, Vijay Badrinarayanan, and Andrew Rabinovich. Deltas: Depth estimation by learning triangulation and densification of sparse points. In ECCV, pages 104–121. Springer, 2020.
[43] Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo tourism: exploring photo collections in 3d. In ACM siggraph 2006 papers, pages 835–846. Association for Computing Machinery (ACM), 2006.
[44] Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, Anton Clarkson, Mingfei Yan, Brian Budge, Yajie Yan, Xiaqing Pan, June Yon, Yuyang Zou, Kimberly Leon, Nigel Carter, Jesus Briales, Tyler Gillingham, Elias Mueggler, Luis Pesqueira, Manolis Savva, Dhruv Batra, Hauke M. Strasdat, Renzo De Nardi, Michael Goesele, Steven Lovegrove, and Richard Newcombe. The Replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019.
[45] Matthew Tancik, Ethan Weber, Evonne Ng, Ruilong Li, Brent Yi, Justin Kerr, Terrance Wang, Alexander Kristoffersen, Jake Austin, Kamyar Salahi, Abhik Ahuja, David McAllister, and Angjoo Kanazawa. Nerfstudio: A modular framework for neural radiance field development. In ACM SIGGRAPH 2023 Conference Proceedings, SIGGRAPH ’23, 2023.
[46] Jiaxiang Tang, Hang Zhou, Xiaokang Chen, Tianshu Hu, Errui Ding, Jingdong Wang, and Gang Zeng. Delicate textured mesh recovery from nerf via adaptive surface refinement. arXiv preprint arXiv:2303.02091, 2022.
[47] Zachary Teed and Jia Deng. Deepv2d: Video to depth with differentiable structure from motion. arXiv preprint arXiv:1812.04605, 2018.
[48] Dor Verbin, Peter Hedman, Ben Mildenhall, Todd Zickler, Jonathan T. Barron, and Pratul P. Srinivasan. Ref-NeRF: Structured view-dependent appearance for neural radiance fields. CVPR, 2022.
[49] Jingwen Wang, Tymoteusz Bleja, and Lourdes Agapito. Go-surf: Neural feature grid optimization for fast, high-fidelity rgb-d surface reconstruction. In 2022 International Conference on 3D Vision (3DV), pages 433–442. IEEE, 2022.
[50] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021.
[51] Yi Wei, Shaohui Liu, Yongming Rao, Wang Zhao, Jiwen Lu, and Jie Zhou. Nerfingmvs: Guided optimization of neural radiance fields for indoor multi-view stereo. In ICCV, pages 5610–5619, 2021.
[52] Haolin Xiong, Sairisheek Muttukuru, Rishi Upadhyay, Pradyumna Chari, and Achuta Kadambi. Sparsegs: Real-time 360° sparse view synthesis using gaussian splatting. Arxiv, 2023.
[53] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. arXiv:2401.10891, 2024.
[54] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. In NeurIPS, 2021.
[55] Vickie Ye, Matias Turkulainen, and the Nerfstudio team. gsplat.
[56] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In ICCV, 2023.
[57] Zehao Yu, Anpei Chen, Bozidar Antic, Songyou Peng, Apratim Bhattacharyya, Michael Niemeyer, Siyu Tang, Torsten Sattler, and Andreas Geiger. Sdfstudio: A unified framework for surface reconstruction, 2022.
[58] Zehao Yu, Songyou Peng, Michael Niemeyer, Torsten Sattler, and Andreas Geiger. Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction. NeurIPS, 2022.
[59] Jingyang Zhang, Yao Yao, Shiwei Li, Zixin Luo, and Tian Fang. Visibility-aware multi-view stereo network. British Machine Vision Conference (BMVC), 2020.
[60] Qian-Yi Zhou, Jaesik Park, and Vladlen Koltun. Open3D: A modern library for 3D data processing. arXiv:1801.09847, 2018.
[61] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, pages 1851–1858, 2017.

Supplementary Material

Matias Turkulainen*¹ Xuqian Ren*² Iaroslav Melekhov³ Otto Seiskari⁴ Esa Rahtu² Juho Kannala³,⁴
¹ ETH Zurich, ² Tampere University, ³ Aalto University, ⁴ Spectacular AI
Corresponding author: [email protected]

In this supplementary material, we provide further details regarding our baseline methods and datasets in Appendix A, definitions for our evaluation metrics and losses in Appendix B, and further quantitative and qualitative results in Appendix C and Appendix D respectively.

Appendix A Implementation details

A.1 Baselines

We compare a variety of baseline methods for novel view synthesis, depth estimation, and mesh reconstruction.

Nerfacto. We use the Nerfacto model from Nerfstudio [45] version 1.0.2 in our experiments. We use default settings, disable pose optimization, and predict normals using the proposed method from Ref-NeRF [48]. We use rendered normal and depth maps for Poisson surface reconstruction.

Depth-Nerfacto. We use the depth supervised variant of Nerfacto with a direct loss on ray termination distribution for sensor depth supervision as described in DS-NeRF [6]. Besides this, we use the same settings as for Nerfacto.

Neusfacto. We use default settings provided by Neusfacto from SDFStudio [57] and use the default marching cubes algorithm for meshing.

MonoSDF. We use the recommended settings from MonoSDF [58] with sensor depth and monocular normal supervision. We set the sensor depth loss multiplier to 0.1 and the normal loss multiplier to 0.05. Normal predictions are obtained from Omnidata [7].

Splatfacto. The Splatfacto model from Nerfstudio version 1.1.3 and gsplat [55] version 1.0.0 serves as our baseline 3DGS model. This is a faithful re-implementation of the original 3DGS work [19]. We keep all the default settings for the baseline comparison.

SuGaR. We use the official SuGaR [12] source code. The original code base, written as an extension to the original 3DGS work [19], supports only COLMAP-based datasets (that is, datasets containing a COLMAP database file). We made slight modifications to the original source code to support non-COLMAP formats, importing camera information and poses directly from pre-made .json files. We use default settings for training as described in [12]. We use the SDF-trained variant in all experiments. We extract both the coarse and refined meshes for evaluation, although the differences in geometry metrics between them are small. We found a small inconsistency in SuGaR’s normal directions for outward-facing indoor datasets, which we corrected in our experiments.

2DGS. We use the official 2DGS [16] source code. As with SuGaR, we made slight modifications to the original source code to support non-COLMAP formats, importing camera information and poses directly from pre-made .json files. We use default settings for training as described in [16] and the default meshing strategy using TSDF fusion.

A.2 Datasets

MuSHRoom. We use the official train and evaluation splits from the MuSHRoom [38] dataset. We report evaluation metrics on a) images obtained by uniformly sampling every 10th frame from the training camera trajectory and b) images obtained from a different camera trajectory. We use the globally optimized COLMAP [40] poses for both evaluation sequences. We use a total of 5 million points for mesh extraction with Poisson surface reconstruction.

ScanNet++. We use the “b20a261fdf” and “8b5caf3398” scenes in our experiments. We use the iPhone sequences with COLMAP registered poses. The sequences contain 358 and 705 registered images respectively. We uniformly load every 5th frame from the sequences from which we reserve every 10th frame for evaluation.
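As an illustration of the ScanNet++ frame selection described above, the following sketch builds train and evaluation index lists. The helper `split_frames` is hypothetical, and two details are assumptions rather than statements from the text: sampling starts at frame 0, and "every 10th frame" is counted among the loaded frames rather than in the original sequence.

```python
def split_frames(num_frames, load_every=5, eval_every=10):
    """Return (train_indices, eval_indices) into the original image sequence."""
    # Uniformly load every `load_every`-th registered frame.
    loaded = list(range(0, num_frames, load_every))
    # Reserve every `eval_every`-th loaded frame for evaluation (assumption).
    eval_idx = [f for k, f in enumerate(loaded) if k % eval_every == 0]
    eval_set = set(eval_idx)
    train_idx = [f for f in loaded if f not in eval_set]
    return train_idx, eval_idx

# Example with the 358 registered images of the first scene.
train_idx, eval_idx = split_frames(358)
print(len(train_idx), len(eval_idx))
```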

Appendix B Definitions for metrics and losses

B.1 Depth evaluation metrics

For the ScanNet++ and MuSHRoom datasets, we follow [51, 23, 29, 42, 33, 47, 61] and report the depth evaluation metrics defined in Table 8. We use the Absolute Relative Distance (Abs Rel), Squared Relative Distance (Sq Rel), Root Mean Squared Error (RMSE) and its logarithmic variant (RMSE log), and the Threshold Accuracy ($\delta<t$) metrics. The Abs Rel metric measures the average magnitude of the relative error between the predicted and ground-truth depth values. Unlike Abs Rel, Sq Rel uses the squared error relative to the ground-truth depth, which penalizes large errors more heavily. RMSE is the square root of the mean squared difference between the predicted and ground-truth values, giving a measure of the magnitude of the prediction error. RMSE log is the same quantity computed in the logarithmic domain, which can be particularly useful for very large depth values. Threshold accuracy measures the fraction of predicted depth values that are within a factor $\delta$ of the ground-truth depth values.

Metric	Definition
Abs Rel	$\frac{1}{N}\sum_{i=1}^{N}\frac{\left|d_{i}^{\text{pred}}-d_{i}^{\text{gt}}\right|}{d_{i}^{\text{gt}}}$
Sq Rel	$\frac{1}{N}\sum_{i=1}^{N}\frac{\left(d_{i}^{\text{pred}}-d_{i}^{\text{gt}}\right)^{2}}{d_{i}^{\text{gt}}}$
RMSE	$\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(d_{i}^{\text{pred}}-d_{i}^{\text{gt}}\right)^{2}}$
RMSE log	$\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\log d_{i}^{\text{pred}}-\log d_{i}^{\text{gt}}\right)^{2}}$
Threshold accuracy, $\delta$	$\frac{1}{N}\sum_{i=1}^{N}\left[\max\left(\frac{d_{i}^{\text{pred}}}{d_{i}^{\text{gt}}},\frac{d_{i}^{\text{gt}}}{d_{i}^{\text{pred}}}\right)<\delta\right]$

Table 8: Depth Evaluation Metrics. We show definitions for our depth evaluation metrics. $d_{i}^{\text{pred}}$ and $d_{i}^{\text{gt}}$ are predicted and ground-truth depths for the $i$-th pixel. $\delta$ is the threshold factor (e.g., $\delta<1.25$, $\delta<1.25^{2}$, $\delta<1.25^{3}$).
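The definitions in Table 8 translate directly into a few lines of NumPy. The sketch below is illustrative only: the function name `depth_metrics` and the choice to mask out non-positive depths as invalid are assumptions, not details taken from the evaluation code.

```python
import numpy as np

def depth_metrics(pred, gt, thresholds=(1.25, 1.25**2, 1.25**3)):
    """Compute the Table 8 depth metrics over all valid pixels."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    # Assumption: non-positive values mark missing depth and are ignored.
    valid = (gt > 0) & (pred > 0)
    pred, gt = pred[valid], gt[valid]

    diff = pred - gt
    ratio = np.maximum(pred / gt, gt / pred)
    metrics = {
        "abs_rel": float(np.mean(np.abs(diff) / gt)),
        "sq_rel": float(np.mean(diff**2 / gt)),
        "rmse": float(np.sqrt(np.mean(diff**2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
    }
    # Threshold accuracy: fraction of pixels whose ratio is below each delta.
    for k, t in enumerate(thresholds, start=1):
        metrics[f"delta{k}"] = float(np.mean(ratio < t))
    return metrics

print(depth_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 2.0]))
```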

B.2 Mesh evaluation metrics

In Table 9 we provide the definitions of the mesh evaluation metrics used throughout the text for comparing predicted and ground truth meshes. We use a threshold of 5 cm for precision, recall, and F-scores. Furthermore, we evaluate mesh quality only within the visibility of the training camera views.

Metric	Definition
Accuracy	$\frac{1}{|P|}\sum_{\mathbf{p}\in P}\left(\min_{\mathbf{p}^{*}\in P^{*}}\left\|\mathbf{p}-\mathbf{p}^{*}\right\|_{1}\right)$
Completion	$\frac{1}{|P^{*}|}\sum_{\mathbf{p}^{*}\in P^{*}}\left(\min_{\mathbf{p}\in P}\left\|\mathbf{p}-\mathbf{p}^{*}\right\|_{1}\right)$
Chamfer-$L_{1}$	$\frac{\text{Accuracy}+\text{Completion}}{2}$
Normal Completion	$\frac{1}{|P^{*}|}\sum_{\mathbf{p}^{*}\in P^{*}}\left(\mathbf{n}_{\mathbf{p}}^{T}\mathbf{n}_{\mathbf{p}^{*}}\right)$ s.t. $\mathbf{p}=\underset{\mathbf{p}\in P}{\operatorname{argmin}}\left\|\mathbf{p}-\mathbf{p}^{*}\right\|_{1}$
Normal-Consistency	$\frac{\text{Normal-Acc}+\text{Normal-Comp}}{2}$
Precision	$\frac{1}{|P|}\sum_{\mathbf{p}\in P}\left(\min_{\mathbf{p}^{*}\in P^{*}}\left\|\mathbf{p}-\mathbf{p}^{*}\right\|_{1}<5\,\text{cm}\right)$
Recall	$\frac{1}{|P^{*}|}\sum_{\mathbf{p}^{*}\in P^{*}}\left(\min_{\mathbf{p}\in P}\left\|\mathbf{p}-\mathbf{p}^{*}\right\|_{1}<5\,\text{cm}\right)$
F-score	$\frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}$

Table 9: Mesh Evaluation Metrics. $P$ and $P^{*}$ are the point clouds sampled from the predicted and the ground truth mesh. $\mathbf{n}_{\mathbf{p}}$ is the normal vector at point $\mathbf{p}$.
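A small NumPy sketch of the metrics in Table 9 is given below for illustration. It uses a brute-force nearest-neighbor search, which is only practical for small point sets, and the names `nearest` and `mesh_metrics` are hypothetical. Normal accuracy is not spelled out in the table, so the sketch assumes it mirrors Normal Completion with the roles of $P$ and $P^{*}$ swapped. Coordinates are assumed to be in meters, so the 5 cm threshold is `tau=0.05`.

```python
import numpy as np

def nearest(src, dst, ord=1):
    """Distance and index of the nearest point in dst for each point in src."""
    d = np.linalg.norm(src[:, None, :] - dst[None, :, :], ord=ord, axis=-1)
    idx = d.argmin(axis=1)
    return d[np.arange(len(src)), idx], idx

def mesh_metrics(P, P_star, n_P, n_star, tau=0.05, ord=1):
    """P, P_star: (N, 3) and (M, 3) sampled points; n_P, n_star: unit normals."""
    d_acc, idx_acc = nearest(P, P_star, ord)    # predicted -> ground truth
    d_comp, idx_comp = nearest(P_star, P, ord)  # ground truth -> predicted

    accuracy, completion = d_acc.mean(), d_comp.mean()
    # Assumption: normal accuracy mirrors normal completion with roles swapped.
    normal_acc = np.sum(n_P * n_star[idx_acc], axis=1).mean()
    normal_comp = np.sum(n_P[idx_comp] * n_star, axis=1).mean()
    precision = (d_acc < tau).mean()
    recall = (d_comp < tau).mean()
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom > 0 else 0.0

    return {
        "accuracy": float(accuracy),
        "completion": float(completion),
        "chamfer_l1": float((accuracy + completion) / 2),
        "normal_consistency": float((normal_acc + normal_comp) / 2),
        "precision": float(precision),
        "recall": float(recall),
        "f_score": float(f_score),
    }
```

For point clouds with millions of samples, a spatial index such as a k-d tree would replace the dense distance matrix.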

B.3 Depth losses

For depth supervision, we compare the loss function variants defined in Table 10.

Loss	Definition
$\mathcal{L}_{\text{MSE}}$	$\frac{1}{|\hat{D}|}\sum(\hat{D}-D)^{2}$
$\mathcal{L}_{\mathrm{1}}$	$\frac{1}{|\hat{D}|}\sum\|\hat{D}-D\|_{1}$
$\mathcal{L}_{\text{LogL1}}$	$\frac{1}{|\hat{D}|}\sum\log(1+\|\hat{D}-D\|_{1})$
$\mathcal{L}_{\text{HuberL1}}$	$\begin{cases}\|D-\hat{D}\|_{1},&\text{if }\|D-\hat{D}\|_{1}\leq\delta,\\ \frac{(D-\hat{D})^{2}+\delta^{2}}{2\delta},&\text{otherwise.}\end{cases}$
$\mathcal{L}_{\text{DSSIML1}}$	$\alpha\frac{1-\operatorname{SSIM}(I,\hat{I})}{2}+(1-\alpha)|I-\hat{I}|$
$\mathcal{L}_{\text{EAS}}$	$g_{\text{rgb}}\frac{1}{|\hat{D}|}\sum\|\hat{D}-D\|_{1}$
$\mathcal{L}_{\hat{D}}$	$g_{\text{rgb}}\frac{1}{|\hat{D}|}\sum\log(1+\|\hat{D}-D\|_{1})$

Table 10: Depth Regularization Objectives. We show the definitions for various depth objectives. Here, $\delta=0.2\max(\|D-\hat{D}\|_{1})$, $g_{\text{rgb}}=\exp(-\nabla I)$, $D$ and $\hat{D}$ are the ground truth and rendered depths, and $I$ and $\hat{I}$ are the ground truth and rendered RGB images.
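To make the difference between the plain and edge-aware objectives in Table 10 concrete, here is a minimal NumPy sketch. It is not the training code: the function names are hypothetical, and it assumes that $g_{\text{rgb}}$ is applied per pixel inside the mean, that $\nabla I$ is the finite-difference gradient magnitude of the grayscale image, and that image values lie in $[0, 1]$.

```python
import numpy as np

def rgb_gradient_weight(image):
    """g_rgb = exp(-|grad I|), per pixel, for an (H, W, 3) image in [0, 1]."""
    gray = image.mean(axis=-1)
    gy, gx = np.gradient(gray)  # finite differences along rows and columns
    return np.exp(-np.sqrt(gx**2 + gy**2))

def l1_loss(d_hat, d):
    return float(np.mean(np.abs(d_hat - d)))

def logl1_loss(d_hat, d):
    return float(np.mean(np.log1p(np.abs(d_hat - d))))

def edge_aware_l1_loss(d_hat, d, image):
    """L_EAS: L1 depth error down-weighted at image edges."""
    return float(np.mean(rgb_gradient_weight(image) * np.abs(d_hat - d)))

def edge_aware_logl1_loss(d_hat, d, image):
    """L_D-hat: logarithmic depth error down-weighted at image edges."""
    g = rgb_gradient_weight(image)
    return float(np.mean(g * np.log1p(np.abs(d_hat - d))))

# A constant depth error is penalized less where the image has a strong edge.
d, d_hat = np.ones((4, 4)), 2.0 * np.ones((4, 4))
flat = np.zeros((4, 4, 3))
edge = np.zeros((4, 4, 3))
edge[:, 2:, :] = 1.0
print(edge_aware_logl1_loss(d_hat, d, flat), edge_aware_logl1_loss(d_hat, d, edge))
```

On a textureless image the weight is 1 everywhere and the edge-aware loss reduces to $\mathcal{L}_{\text{LogL1}}$. Near color edges, where sensor depth tends to be unreliable, the residual contributes less.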

We compare the performance of these losses as supervision in Table 13.

		(a) Test within a sequence					(b) Test with a different sequence
	Sensor Depth	Abs Rel $\downarrow$	Sq Rel $\downarrow$	RMSE $\downarrow$	RMSE log $\downarrow$	$\delta<1.25$ $\uparrow$	Abs Rel $\downarrow$	Sq Rel $\downarrow$	RMSE $\downarrow$	RMSE log $\downarrow$	$\delta<1.25$ $\uparrow$
Nerfacto [45]	-	14.72	19.79	61.05	13.26	88.25	14.52	18.32	63.85	13.13	88.41
Depth-Nerfacto [45]	$\checkmark$	13.90	11.71	50.21	12.98	88.46	13.49	10.76	51.63	12.62	89.23
MonoSDF [58]	$\checkmark$	10.90	9.87	48.74	11.27	83.48	11.00	10.98	50.92	11.37	82.62
Splatfacto (no cues) [19]	-	8.32	5.45	38.47	10.23	89.75	8.06	5.39	38.61	10.05	90.51
Splatfacto + $\mathcal{L}_{\hat{D}}$ (Ours)	$\checkmark$	3.71	3.08	30.80	4.27	95.52	3.78	3.08	31.35	4.26	95.47
Splatfacto + $\mathcal{L}_{\hat{D}}$ + $\mathcal{L}_{\hat{N}}$ (Ours)	$\checkmark$	3.64	3.02	30.33	4.17	95.60	3.69	2.97	30.57	4.15	95.64

Table 11: Depth evaluation metrics compared to ground truth Faro scanner data for the MuSHRoom dataset. Instead of evaluating on noisy captured iPhone depth maps, we rely on more accurate depth maps reconstructed from a Faro lidar scanner. We show that our depth regularization strategy, utilizing low-resolution iPhone depths, greatly outperforms other baselines. Results are averaged over 10 scenes.

(a) We load every 3rd/5th/8th/12th view from the whole training sequence (around 260 views). Results are evaluated on “Courtroom” from Tanks & Temples.

Methods	load every 3			load every 5			load every 8			load every 12
Methods	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$
Splatfacto	20.68	.7445	.1921	18.50	.6991	.2110	16.86	.6459	.2474	14.76	.5580	.3332
Ours + Zoe-Depth [3]	20.88	.7518	.1833	19.58	.7118	.2007	17.60	.6568	.2433	15.90	.5835	.2971
Ours + DepthAnything [53]	20.91	.7528	.1830	19.60	.7153	.1997	17.44	.6568	.2456	16.24	.5902	.2924

(b) We load every 5th/8th/12th/20th view from the whole training sequence (around 270 views). Results are evaluated on “8b5caf3398” from the ScanNet++ DSLR sequence.

Methods	load every 5			load every 8			load every 12			load every 20
Methods	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$
Splatfacto	24.68	.8810	.1169	22.81	.8568	.1559	21.08	.8357	.1816	18.90	.8059	.2375
Ours + Zoe-Depth [3]	24.72	.8821	.1163	23.04	.8591	.1521	21.81	.8415	.1755	19.10	.8059	.2332
Ours + DepthAnything [53]	24.66	.8826	.1194	23.21	.8595	.1507	21.76	.8406	.1751	19.51	.8101	.2321

Table 12: Comparison of DN-Splatter performance with monocular depth supervision. We ablate the Zoe-Depth [3] and DepthAnything [53] monocular estimators with sparse views on (a) the “Courtroom” sequence of the Tanks & Temples advanced dataset and (b) the “8b5caf3398” DSLR sequence of ScanNet++. Monocular depth supervision aids novel-view synthesis under sparse settings.

Appendix C Additional quantitative results

Here we provide additional quantitative results for DN-Splatter. In Table 11, we show the depth evaluation performance of our proposed regularization scheme on the MuSHRoom dataset, evaluated against ground truth Faro lidar scanner data instead of the low-resolution iPhone depths. This corresponds to Table 3 from the main paper, which compares depth metrics on iPhone depth captures for the same scenes and baselines. When compared to laser scanner depths, our method still outperforms the other baseline methods on depth estimation.

We also consider the performance of DN-Splatter supervised by only monocular depth estimates on a large-scale Tanks & Temples scene in Table 12. We consider training with dense and sparse captures and conclude that, although monocular depth supervision provides minimal improvements for dense captures, the improvement in novel view synthesis under sparse settings is notable.

Lastly, in Table 13 we compare the performance of various depth losses described in Section B.3 on depth estimation and novel view synthesis. There are several interesting observations. First, the logarithmic depth loss $\mathcal{L}_{\text{LogL1}}$ outperforms other popular variants like $\mathcal{L}_{\mathrm{1}}$ or $\mathcal{L}_{\text{MSE}}$ on depth and RGB synthesis. Second, the gradient-aware logarithmic depth variant $\mathcal{L}_{\hat{D}}$ outperforms the simpler variant, validating our assumption that captured sensor depths, like those from iPhone cameras, tend to contain noise and inaccuracies at edges or sharp boundaries. Therefore, the gradient-aware variant mitigates these inaccurate sensor readings.

(a) Test split obtained by sampling uniformly every 10 frames within the training sequence.

	Depth estimation					Novel view synthesis
	Abs Rel $\downarrow$	Sq Rel $\downarrow$	RMSE $\downarrow$	RMSE log $\downarrow$	$\delta<1.25$ $\uparrow$	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$
$\mathcal{L}_{\text{MSE}}$	.0587	.0229	.2313	.0618	.9534	22.32	.7995	.1653
$\mathcal{L}_{\mathrm{1}}$	.0419	.0233	.2286	.0435	.9629	22.46	.8041	.1594
$\mathcal{L}_{\text{DSSIML1}}$	.0476	.0331	.2773	.0523	.9476	21.77	.7802	.1879
$\mathcal{L}_{\text{LogL1}}$	.0430	.0267	.2414	.0444	.9609	22.48	.8053	.1580
$\mathcal{L}_{\text{HuberL1}}$	.0536	.0239	.2335	.0561	.9579	22.39	.8017	.1625
$\mathcal{L}_{\text{EAS}}$	.0954	.0572	.3581	.1103	.8726	22.18	.7951	.1780
$\mathcal{L}_{\hat{D}}$ (Ours)	.0338	.0212	.2170	.0350	.9691	22.49	.8031	.1630

(b) Test split obtained from a different camera trajectory with no overlap with the training sequence.

	Depth estimation					Novel view synthesis
	Abs Rel $\downarrow$	Sq Rel $\downarrow$	RMSE $\downarrow$	RMSE log $\downarrow$	$\delta<1.25$ $\uparrow$	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$
$\mathcal{L}_{\text{MSE}}$	.0572	.0282	.2506	.0570	.9585	19.37	.7088	.2329
$\mathcal{L}_{\mathrm{1}}$	.0449	.0248	.2364	.0449	.9639	19.45	.7164	.2253
$\mathcal{L}_{\text{DSSIML1}}$	.0482	.0330	.2775	.0527	.9495	18.98	.7040	.2430
$\mathcal{L}_{\text{LogL1}}$	.0451	.0269	.2454	.0453	.9629	19.50	.7183	.2228
$\mathcal{L}_{\text{HuberL1}}$	.0526	.0267	.2483	.0533	.9617	19.45	.7128	.2285
$\mathcal{L}_{\text{EAS}}$	.0724	.0442	.3142	.0819	.9329	19.30	.7108	.2351
$\mathcal{L}_{\hat{D}}$ (Ours)	.0427	.0252	.2335	.0420	.9632	19.53	.7187	.2286

Table 13: Ablation on depth losses on the MuSHRoom dataset. We consider various depth losses as defined in Section B.3 and their impact on depth estimation and novel view synthesis. We achieve the best performance with our proposed edge-aware $\mathcal{L}_{\hat{D}}$ loss.

Appendix D Additional qualitative results

Lastly, we provide additional qualitative results for mesh performance in Fig. 6 as well as depth and novel view renders in Fig. 7, Fig. 8, and Fig. 9, respectively.