Stanford University

View-Consistent Hierarchical 3D Segmentation Using Ultrametric Feature Fields

Haodi He (ORCID: 0009-0001-5534-7795), Colton Stearns (ORCID: 0000-0002-3297-2870), Adam W. Harley (ORCID: 0000-0002-9851-4645), Leonidas J. Guibas (ORCID: 0000-0002-8315-4886)
Abstract

Large-scale vision foundation models such as Segment Anything (SAM) demonstrate impressive performance in zero-shot image segmentation at multiple levels of granularity. However, these zero-shot predictions are rarely 3D-consistent. As the camera viewpoint changes in a scene, so do the segmentation predictions, as well as the characterizations of “coarse” or “fine” granularity. In this work, we address the challenging task of lifting multi-granular and view-inconsistent image segmentations into a hierarchical and 3D-consistent representation. We learn a novel feature field within a Neural Radiance Field (NeRF) representing a 3D scene, whose segmentation structure can be revealed at different scales by simply using different thresholds on feature distance. Our key idea is to learn an ultrametric feature space, which, unlike a Euclidean space, exhibits transitivity in distance-based grouping, naturally leading to a hierarchical clustering. Put together, our method takes view-inconsistent multi-granularity 2D segmentations as input and produces a hierarchy of 3D-consistent segmentations as output. We evaluate our method and several baselines on synthetic datasets with multi-view images and multi-granular segmentation, showcasing improved accuracy and viewpoint-consistency. We additionally provide qualitative examples of our model’s 3D hierarchical segmentations in real world scenes. The code and dataset are available at: https://github.com/hardyho/ultrametric_feature_fields

1 Introduction

Different applications often need different semantic understandings of a scene. This fact necessitates that segmentation methods offer a diverse set of predictions that span different modalities, showcase multiple levels of granularity, and offer hierarchical relationships. With the advent of the “Segment Anything Model” (SAM) [24], reliable multi-granular single-view segmentation might be described as accomplished. Yet, in a multi-view or moving-camera system, multi-granular segmentation of each image can produce an overwhelming total number of segments, many of which disagree with each other and most of which are not useful for the downstream application of interest. In this work, we attempt to distill these thousands of 2D segmentation options (which may be conflicting) into an organized 3D segmentation which is view-consistent and hierarchical.

Figure 1: Our method takes as input multi-view posed images, paired with segmentation masks from the recent “Segment Anything Model” (SAM), and merges these into a coherent 3D representation where segmentation is view-consistent and hierarchical.

As demonstrated in many recent works, view-consistency can be achieved by distilling the segmentation information into a 3D implicit field, such as a Neural Radiance Field (NeRF) [33]. The first works in this area simply optimized a semantic labelling branch alongside the color branch of the NeRF [63, 32, 27]. This delivers view-consistent semantic segmentation, but does not generalize to unseen categories. More recent methods have proposed to distill generic image features into the implicit field [26, 21], allowing these features to be queried in 2D or 3D for segmentation or other tasks. However, these methods often struggle to capture precise segmentation boundaries, perhaps because language-aligned features are region-based and not pixel-based. Furthermore, most of these works do not directly account for multiple levels of granularity, and those that do [21] must re-render the feature field for every scale of interest. Nonetheless, the expressiveness and wide adoption of NeRFs make them a natural choice as the underlying 3D representation, and we adopt this same choice here.

Unlike prior work, we explicitly aim for our scene segmentation to be hierarchical. This means that the scene segmentation has a tree structure, where the root group is the full scene, and any group can be recursively divided into smaller groups, all the way down to the point level. A group is defined as a spatially connected neighborhood where all pairwise feature distances are within some threshold. At first glance, this is a familiar contrastive learning problem: within-segment feature pairs should have small distances, and cross-segment feature pairs should have large distances, so that thresholding yields groups that follow segmentation boundaries. However, we demonstrate this typical setup is not sufficient to create a consistent hierarchy, and we find it is crucial to use ultrametric distances, rather than standard Euclidean distances, in the contrastive loss. In an ultrametric space, for any three points $x$, $y$, and $z$, distances satisfy a condition stronger than the standard triangle inequality $d(x,z) \leq d(x,y) + d(y,z)$, namely that $d(x,z) \leq \max\{d(x,y), d(y,z)\}$. Ultrametric spaces are ideally suited for hierarchical clustering because distance-based groupings are transitive: if $d(x,y) \leq \epsilon$ and $d(y,z) \leq \epsilon$, then it follows that $d(x,z) \leq \epsilon$. In other words, if we group $x$ and $y$ together and $y$ and $z$ together, then we automatically also group $x$ and $z$ together. Thus, as $\epsilon$ grows, smaller clusters are naturally merged into larger clusters, automatically giving rise to a hierarchy.
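To make the transitivity argument concrete, the following minimal Python sketch checks the strong triangle inequality on three points. The point set, the helper names, and the hand-made ultrametric are illustrative assumptions and are not taken from the method itself.

```python
from itertools import permutations

def satisfies_ultrametric(dist, points):
    """True if d(x, z) <= max(d(x, y), d(y, z)) holds for every triple."""
    return all(
        dist(x, z) <= max(dist(x, y), dist(y, z))
        for x, y, z in permutations(points, 3)
    )

def threshold_pairs(dist, points, eps):
    """Unordered pairs that a distance threshold eps would group together."""
    return {
        frozenset((a, b))
        for a in points for b in points
        if a != b and dist(a, b) <= eps
    }

# Three points on a line, compared with the ordinary Euclidean distance.
points = [0.0, 1.0, 2.0]

def euclid(a, b):
    return abs(a - b)

# A hand-made ultrametric on the same points (illustrative values):
# 0 and 1 merge at level 1, and 2 joins them at level 2.
levels = {
    frozenset((0.0, 1.0)): 1.0,
    frozenset((0.0, 2.0)): 2.0,
    frozenset((1.0, 2.0)): 2.0,
}

def ultra(a, b):
    return 0.0 if a == b else levels[frozenset((a, b))]

# With eps = 1 the Euclidean threshold groups (0, 1) and (1, 2) but not
# (0, 2), so the grouping is not transitive.
euclid_pairs = threshold_pairs(euclid, points, eps=1.0)
euclid_ok = satisfies_ultrametric(euclid, points)   # False
ultra_ok = satisfies_ultrametric(ultra, points)     # True
```

Under the Euclidean distance, the threshold $\epsilon = 1$ links the outer points only through the middle one, which is exactly the ambiguity an ultrametric removes.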

We optimize our 3D ultrametric feature fields by rendering them to 2D, and using SAM as a noisy supervision signal for grouping. After optimization, we produce view-consistent hierarchical segmentations at arbitrary levels of granularity, by simply specifying a threshold $\epsilon$, and running a Watershed transform to retrieve all groups that exist under this threshold.

We evaluate our approach on models from the PartNet dataset, showing that our method recovers a view-consistent hierarchy of segmentations that captures the natural part decomposition of a 3D object. Furthermore, we introduce a synthetic dataset with hierarchical segmentation annotations based on the NeRF Blender Dataset. Unlike PartNet, our proposed dataset offers hierarchical decomposition of more complex scenes. In all evaluations, we compare our method to a set of competitive baselines, measuring IoU accuracy and 3D consistency, and we demonstrate that our method outperforms existing open-vocabulary 3D segmentation methods, such as DFF [26], LeRF [21], and SAM-3D [7]. Additionally, we introduce a metric for measuring the quality of a hierarchy, and demonstrate that the segmentations in prior work lack hierarchical structure, while our output is hierarchical by construction.

In sum, our key contribution is a novel formulation for 3D scene segmentation: ultrametric feature fields. Using this formulation, we are able to distill view-inconsistent 2D masks into a 3D representation which is not only view-consistent but also hierarchical, allowing arbitrary-granularity segmentation at test time. We also contribute a new synthetic dataset, and propose new evaluation metrics, to quantify our progress and facilitate future work in this direction. Finally, we provide qualitative examples of our model’s 3D hierarchical segmentations in real world scenes.

Figure 2: Method Overview: We train a NeRF with an ultrametric feature field using images and view-inconsistent segmentation masks from SAM [24]. After training, we use the depth estimation and feature maps from training views to construct a 3D point cloud. At inference, for a specified threshold $t$ representing the granularity level, we apply a 3D watershed transform to segment the 3D point cloud. Then, we can query the point clouds in novel views and obtain view-consistent segmentation results.

2 Related Work

2.0.1 Neural Radiance and Feature Fields

Neural Radiance Fields (NeRFs) [33] are a popular representation for novel view synthesis and 3D reconstruction, and in the last few years there has been an explosion of research surrounding NeRFs. To name a few research directions, NeRF has been extended to improve rendering quality [3, 15, 43], accelerate training and inference [37, 30, 17, 9, 10], improve geometry [52, 57, 61, 40], and learn with fewer viewpoints [39, 42, 50]. In this work we are most interested in related research that integrates segmentation into NeRF [26, 63, 21, 55, 32, 31, 35, 59, 7, 27, 45].

One line of research aimed to learn a semantic or instance branch alongside the NeRF’s color and density, thus yielding view-consistent segmentation [63, 32, 51, 27, 45]. Semantic-NeRF [63] was a seminal work that learned a semantic field to propagate 2D segmentations into new views. Instance-NeRF [32] improved Semantic-NeRF by handling panoptic segmentation from Mask2Former [12]. Nevertheless, these methods are restricted to a closed vocabulary of instance categories, and additionally do not attempt a hierarchical understanding of the segmented instances.

Another recent direction has been to learn a generic volumetric feature field alongside NeRF. DFF [26], N3F [49], and LeRF [21] distilled 2D image features generated by off-the-shelf feature extractors such as CLIP [41], DINO [5], and LSeg [28] into feature fields that enable 3D segmentation and editing in NeRF. However, these methods often struggle to recover precise segmentation boundaries, and also do not establish a hierarchical structure on the features. In contrast, we learn a feature field that distills a hierarchy of segmentations from noisy multi-view SAM predictions.

Finally, other works [59, 7, 6] used NeRF or 3D Gaussians [20] to propagate a single SAM query into novel views and establish view-consistency. However, these methods handle one segmentation at a time, while our approach trains a feature field to jointly aggregate and reconcile hundreds of noisy 2D masks.

2.0.2 Hierarchical Segmentation

Hierarchical segmentation, a specialized domain within image segmentation, partitions an image into regions that exhibit a hierarchical tree structure (i.e., each segmentation can be recursively divided into smaller segmentations). In the pre-deep learning era, researchers utilized non-parametric approaches to generate contours and enable hierarchical segmentation [46, 4, 38, 1, 2]. Ultrametric distance and the watershed transform were employed to identify hierarchical clusters based on RGB values [38, 4, 1], and subsequent advancements by Yarkony et al. [58] and Xu et al. [56] improved the efficiency and flexibility of ultrametric hierarchical segmentation methods. More recently, Zhao et al. [62] and Li et al. [29] shifted their focus from hierarchical segmentation within an image toward estimating the hierarchy of segmentation classes. This class hierarchy has also been of particular interest in the field of human parsing, where segmentation is performed according to the hierarchical structure of the human body [14, 53, 54, 16]. In the 3D domain, Mo et al. [36] introduced PartNet, a large scale 3D mesh dataset annotated with hierarchical segmentation, and researchers have subsequently explored hierarchical structure in 3D shapes, including multi-granularity segmentation on point clouds [48, 47, 65].

Most relevant to us, the recent Segment Anything Model [24] emphasized multi-granular segmentation in its open-vocabulary and zero-shot segmentation setting. Given its impressive performance, we use off-the-shelf SAM segmentations as supervision for our method. Furthermore, while concurrent works [23, 60] also attempt to distill SAM masks into NeRF, our ultrametric feature field constitutes a fundamentally different approach to the problem.

3 Preliminaries and Notation

In this section we describe the core building blocks of our approach: (1) implicit feature fields, (2) hierarchical segmentation via watershed transform and ultrametrics, and (3) an off-the-shelf image segmentation model.

3.0.1 NeRF and Feature Fields

A Neural Radiance Field (NeRF) [33] is a volumetric representation that outputs a density $\sigma$ and color $\textbf{c}$ given a 3D coordinate $\textbf{x} = (x, y, z)$ and 2D viewing direction $\textbf{d}$. Given a pixel’s camera ray $\textbf{r}$, the NeRF samples $N$ points along the ray $\textbf{x}_1, \ldots, \textbf{x}_N$ at corresponding intervals $\delta_1, \ldots, \delta_N$ and performs approximate volume rendering to estimate the pixel’s color: $\textbf{C}(\textbf{r}) = \sum_{k=1}^{N} T_k (1 - e^{-\sigma_k \delta_k}) \textbf{c}_k$, with $T_k = e^{-\sum_{i=1}^{k} \sigma_i \delta_i}$.

Recently, Distilled Feature Fields (DFF) [26] proposed to learn a volumetric feature field within a NeRF. Given a 3D coordinate $\textbf{x}$, DFF outputs a feature $\textbf{f}$ in addition to the original density $\sigma$ and color $\textbf{c}$. Using the same volume rendering equation, DFF renders features in addition to colors. We follow DFF and learn a NeRF with an accompanying feature field.
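As a rough illustration of how one set of rendering weights can composite both colors and features, consider the following pure-Python sketch. The function names `ray_weights` and `composite` are hypothetical, and the transmittance follows the formula as written in this section (sum up to and including sample $k$); the original NeRF formulation stops that sum at $k-1$.

```python
import math

def ray_weights(sigmas, deltas):
    """Per-sample weights T_k * (1 - exp(-sigma_k * delta_k)) along one ray.

    T_k = exp(-sum_{i<=k} sigma_i * delta_i), as written in the text above.
    """
    weights = []
    optical_depth = 0.0
    for sigma, delta in zip(sigmas, deltas):
        optical_depth += sigma * delta            # running sum of sigma_i * delta_i
        transmittance = math.exp(-optical_depth)  # T_k
        alpha = 1.0 - math.exp(-sigma * delta)    # opacity of sample k
        weights.append(transmittance * alpha)
    return weights

def composite(weights, values):
    """Weighted sum of per-sample vectors (colors or features alike)."""
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Toy ray with three samples (illustrative numbers only).
sigmas = [0.0, 2.0, 5.0]
deltas = [0.1, 0.1, 0.1]
colors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
features = [[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]]

w = ray_weights(sigmas, deltas)
pixel_color = composite(w, colors)      # rendered RGB
pixel_feature = composite(w, features)  # rendered feature, same weights
```

Because the weights depend only on density, the feature branch adds no extra sampling cost beyond evaluating the feature head at the same points.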

3.0.2 Watershed Transform and Ultrametrics

The Watershed transform [4, 38] is a traditional hierarchical segmentation method. The method interprets edge energies as a heightmap, and initiates a flooding process, wherein energy basins beneath a given threshold are merged into regions, and higher thresholds cause regions to merge. Since the “water level” is uniform across the whole space, this yields a hierarchical segmentation.

Representing an image or scene as a graph, denoted as $\mathcal{G} = (V, E)$, where $V$ includes all of the pixels/points and $E$ connects points which are adjacent, the minimum water level that merges two points can be expressed as:

$$ d(v_i, v_j) = \min_{p \in P} \max_{e \in p} |e| \,, \tag{1} $$

where $P$ denotes all paths that connect $v_i$ and $v_j$ in the graph, and $e$ is an edge on the path $p$. Computing the distance between $v_i$ and $v_j$ means first finding the shortest path between the nodes, where path length is determined by the maximum edge along the path, and then reporting that crucial edge length. This is sometimes called the minimax path problem. This distance is an ultrametric distance [34, 18], which satisfies a triangle inequality of the form

$$ d(x, y) \leq \max\{d(x, z), d(y, z)\} \,. \tag{2} $$

Fig. 3 provides an illustration of paths and distances in a simple scene graph.
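The minimax distance of Eq. 1 can be computed for all node pairs by processing edges in increasing order of length and recording the length of the edge that first joins two components. The sketch below uses a plain Kruskal-style union-find for this; it is an illustrative stand-in with hypothetical names, not the implementation used in the paper.

```python
def ultrametric_distances(num_nodes, edges):
    """All-pairs minimax path distance (Eq. 1) on an undirected graph.

    edges: list of (length, i, j). Pairs in different connected components
    keep a distance of infinity.
    """
    parent = list(range(num_nodes))
    members = {i: [i] for i in range(num_nodes)}  # component root -> nodes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    inf = float("inf")
    dist = [[0.0 if i == j else inf for j in range(num_nodes)]
            for i in range(num_nodes)]
    for length, i, j in sorted(edges):
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # a cheaper path already connects these nodes
        # This edge is the "water level" at which the two basins merge.
        for a in members[ri]:
            for b in members[rj]:
                dist[a][b] = dist[b][a] = length
        parent[rj] = ri
        members[ri].extend(members.pop(rj))
    return dist

# Toy graph: a 4-cycle with one long edge between nodes 1 and 2.
example_edges = [(1.0, 0, 1), (3.0, 1, 2), (1.5, 2, 3), (2.0, 0, 3)]
D = ultrametric_distances(4, example_edges)
# Nodes 1 and 2 share a direct edge of length 3.0, but the detour
# 1-0-3-2 never exceeds 2.0, so D[1][2] is 2.0.
```

Thresholding `D` at any level yields groups that are nested across levels, since the resulting distances satisfy Eq. 2 by construction.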

Figure 3: Ultrametric Segmentation: Left: We overlay a simple graph on the image, showing edge lengths corresponding to feature distances between points. Right: The hierarchical segmentation derived from the graph on the left. The numbers on the tree indicate the ultrametric distance between nodes on the two branches.

In our setup, we use Eq. 2 to define an ultrametric contrastive loss for our scene features. We use feature distances as edge lengths in Eq. 1, enabling us to obtain hierarchical segmentations at test time via the Watershed transform.

3.0.3 Segment Anything Model (SAM)

SAM [24] is a state-of-the-art vision foundation model for image segmentation. Given a query in the form of points, a mask, a box, or a language prompt, SAM predicts a segmentation that best reflects the prompt. SAM generates segmentations at three levels of granularity: instance, part, and subpart.

While SAM produces excellent results on single images, it is nontrivial to lift its predictions into 3D. This is because SAM’s predictions are not consistent across viewpoints. For example, a pen may be segmented into multiple parts in a close-up view, then segmented as a single instance in a wider view, and then be completely missed when viewed from farther away. Furthermore, it is unclear how to best lift outputs from a “queryable” design into a well-organized 3D representation: given an abundance of queries, SAM will generate an abundance of masks, which overlap with one another unpredictably, and have no straightforward unification.

In our setup, we use SAM to provide a noisy signal of viable segmentations within each viewpoint, and rely on multi-view feature field optimization, using ultrametrics, to distill this knowledge into a 3D scene segmentation which is view-consistent and hierarchical.

Figure 4: Hierarchical Segmentation: Our method can hierarchically segment real world scenes at various levels of granularity.

4 Method

The previous section described the core components of our approach. In this section we describe how these pieces fit together, and explain the training and inference pipelines.

4.1 Learning an Ultrametric Feature Field

4.1.1 Problem Formulation

We take as input a set of multi-view images, along with their camera parameters. We run SAM on every image, typically yielding 50-150 masks per image. Our goal is to learn an implicit feature field that encodes a hierarchical understanding of these masks.

4.1.2 Contrastive Learning

The SAM masks are inconsistent across views, but carry a great deal of information within each view. We use contrastive learning [22, 11, 41] to distill this information into a 3D feature field.

We sample a pair of pixels in an image, and define it as a positive pair if both lie within the same mask, and a negative pair otherwise. We supervise the features of positive pairs to be more similar than those of negative pairs. Concretely, we follow Chen et al. [11] and minimize a binary cross entropy loss on distances between positive pairs and negative pairs. Given a positive pair $s_p = \{v_{p1}, v_{p2}\}$ and a negative pair $s_n = \{v_{n1}, v_{n2}\}$, our contrastive loss is

$$
\begin{split}
\ell(s_p, s_n) &= -\log\left(\frac{e^{d(v_{p1}, v_{p2})/\tau}}{e^{d(v_{p1}, v_{p2})/\tau} + e^{d(v_{n1}, v_{n2})/\tau}}\right) \\
&\quad + \log\left(\frac{e^{d(v_{n1}, v_{n2})/\tau}}{e^{d(v_{p1}, v_{p2})/\tau} + e^{d(v_{n1}, v_{n2})/\tau}}\right),
\end{split}
\tag{3}
$$

where $d$ is a distance metric, and $\tau$ is the temperature.

Minimizing the loss in Eq. 3 across many pairs $s_p$ and $s_n$ yields a feature space where segmentations can be recovered by querying a point and finding the neighborhood where pairwise feature distances are all below some threshold. As discussed in earlier sections, whether or not the resulting segmentation will be hierarchical depends on whether the distance metric $d$ is an ultrametric or a standard Euclidean metric. Our main goal is to reduce ultrametric distances, but to improve optimization, we apply the loss in both ultrametric space and Euclidean space:

$$ L_{feat} = \sum_{(s_p, s_n) \in \mathcal{S}} \ell_{\textrm{ultra}}(s_p, s_n) + \alpha\, \ell_{\textrm{Euclid}}(s_p, s_n) \,. \tag{4} $$

The ultrametric term forces the optimization to find a hierarchical decomposition of the scene, while the Euclidean term serves as regularization. Note that the Euclidean term has the advantage of directly providing a gradient to all considered feature pairs, whereas the ultrametric loss only provides a gradient to the active (maximal) edge along the path between two segments.

As described in Eqs. 1 and 2, ultrametric distances are defined on a graph. At each training step, we form an approximate graph by sampling 4096 pixels within the image, connecting each pixel to its 10 nearest neighbors, and using feature distances as edge weights. We use the binary partition tree algorithm [13] to efficiently compute ultrametric distances during training.

4.1.3 Hierarchical Sampling

Because our input segmentations overlap within and across views, it is ambiguous whether a pair of pixels “lie within the same mask” and therefore represent a positive pair in our contrastive formulation. For example, two pixels may be within the same mask at a coarse granularity but not for a finer granularity. We address this ambiguity with a simple tree-based strategy for sampling positive and negative pairs of pixels.

For each image, we organize the segmentation masks into a hierarchical structure determined by the inclusion ratio between them. We additionally include an all-positive mask as the root of the tree. Then, for each view, we sample positive and negative pairs starting from the leaf nodes of the hierarchical tree, i.e. masks at the finest granularity. For a leaf mask $A$ with parent mask $B$, we randomly select two pixels in $A$ and designate these as a positive pair. We then randomly select one pixel in $A$ and one pixel in $\bar{A} \cap B$ and designate these as a negative pair. We then move to the parent mask and repeat. Additional details are provided in the supplementary.

We note that masks grow in size as we rise up the hierarchy, and pairs which were declared negative at finer granularity will be declared positive in coarser segmentations. We mitigate this conflict by computing the contrastive loss within each level. This asks that the positive distances be smaller than the negative distances within a single level, and avoids the possibility of a pair being simultaneously positive and negative in Eq. 3. Taken together, our sampling strategy reflects an ultrametric structure and helps the segmentation hierarchy propagate across the scene.
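The tree-based sampling can be sketched in a few lines of Python. The sketch below represents masks as sets of pixel ids and assigns each mask to the smallest mask that contains most of it; the inclusion threshold `min_inclusion=0.9`, the function names, and the toy masks are all assumptions made for illustration, since the exact rule is left to the supplementary.

```python
import random

def build_mask_tree(masks, num_pixels, min_inclusion=0.9):
    """Arrange non-empty masks (sets of pixel ids) into a tree by inclusion.

    Node 0 is the all-positive root. Returns (nodes, parent), where
    parent[i] is the index of the smallest mask containing most of node i.
    """
    root = set(range(num_pixels))
    nodes = [root] + sorted((set(m) for m in masks), key=len, reverse=True)
    parent = {0: None}
    for i in range(1, len(nodes)):
        # Earlier nodes are at least as large; the root always qualifies.
        candidates = [
            j for j in range(i)
            if len(nodes[i] & nodes[j]) / len(nodes[i]) >= min_inclusion
        ]
        parent[i] = min(candidates, key=lambda j: len(nodes[j]))
    return nodes, parent

def sample_along_path(leaf, nodes, parent, rng):
    """Walk from a leaf mask A up to the root, sampling one positive pair
    (both pixels in A) and one negative pair (one in A, one in B minus A)
    at every level, where B is the parent of A."""
    samples = []
    a = leaf
    while parent[a] is not None:
        b = parent[a]
        inside = sorted(nodes[a])
        outside = sorted(nodes[b] - nodes[a])  # complement of A within B
        if len(inside) >= 2 and outside:
            positive = tuple(rng.sample(inside, 2))
            negative = (rng.choice(inside), rng.choice(outside))
            samples.append((a, positive, negative))
        a = b  # move to the parent mask and repeat
    return samples

# Toy image with 12 pixels: a "chair" mask containing "seat" and "leg".
toy_masks = [set(range(8)), set(range(4)), {4, 5}]
nodes, parent = build_mask_tree(toy_masks, num_pixels=12)
pairs = sample_along_path(2, nodes, parent, random.Random(0))
```

Each returned tuple is tagged with the mask it was drawn from, so that the contrastive loss can be evaluated separately per level, as described above.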

4.1.4 Improving Depth with Segmentation

Our approach shows that the 3D structure of a scene can resolve conflicts in segmentation cues. Conversely, can segmentation cues help resolve ambiguity in 3D structure? To explore this possibility, we add an assumption that regions belonging to the same semantic mask have smoothly changing depth. We propose a regularization that penalizes changes in curvature (i.e. the third derivative of depth) within a segment. Concretely, at each training iteration, we sample $k_{depth}$ local patches of $4 \times 4$ pixels, ensuring that each sampled patch lies within one of SAM’s finest-grained mask predictions. The depth continuity loss is defined as

$$ L_{dc} = \sum_{p_0 \in P_D} \max\left(\frac{d_{p_0} - 3d_{p_1} + 3d_{p_2} - d_{p_3}}{\left(\max(d_{p_0}, d_{p_1}, d_{p_2}, d_{p_3})\, \Delta\theta\right)^3} - t,\ 0\right) \tag{5} $$

where $\Delta\theta$ represents the ray angle difference between adjacent pixels, $t$ denotes a threshold, and $P_D$ contains the sampled pixels, with $p_0, \ldots, p_3$ denoting adjacent pixels in a row or column. For training stability, we start using this loss halfway through training. Fig. 5 presents the benefits of the depth continuity loss.
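A single summand of Eq. 5 is a hinge on a normalized third finite difference of depth. The following sketch evaluates it exactly as the equation is written; the function names and the numeric values are illustrative assumptions.

```python
def depth_continuity_term(depths, delta_theta, t):
    """One summand of Eq. 5 for four adjacent depths (d_p0, d_p1, d_p2, d_p3)."""
    d0, d1, d2, d3 = depths
    # Third finite difference: zero for any depth profile that is
    # constant, linear, or quadratic in the pixel index.
    third_difference = d0 - 3.0 * d1 + 3.0 * d2 - d3
    # Normalize by (largest depth * angular pixel spacing) cubed.
    scale = (max(depths) * delta_theta) ** 3
    return max(third_difference / scale - t, 0.0)

def depth_continuity_loss(pixel_runs, delta_theta, t):
    """Sum of the hinge terms over all sampled runs of four adjacent pixels."""
    return sum(depth_continuity_term(run, delta_theta, t) for run in pixel_runs)

# Illustrative values only: a planar run, a quadratic run, and a kinked run.
runs = [(1.0, 2.0, 3.0, 4.0), (1.0, 4.0, 9.0, 16.0), (27.0, 8.0, 1.0, 0.0)]
loss = depth_continuity_loss(runs, delta_theta=0.01, t=0.0)
```

In this toy case only the last run contributes, because the first two have a vanishing third difference; raising the threshold `t` tolerates larger deviations before any penalty applies.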

4.2 Segmentation from Ultrametric Features

After training our ultrametric feature field, we can perform 2D or 3D hierarchical segmentation by applying the Watershed transform on either rendered feature maps or the 3D feature field itself.

4.2.1 Segmenting in 2D

To segment a 2D image from our feature field, we begin by rendering our feature field to the viewpoint of the image. We then construct a graph from the feature map, denoted as $\mathcal{G} = (V, E)$, where $V$ contains the pixel features, and $E$ contains edges connecting each pixel to its 4 spatial neighbors. Edge lengths are defined as feature distances. With a given threshold $t$ as the indicator of granularity, we remove all edges longer than this threshold, resulting in a new graph denoted as $\mathcal{G}_t$. We then identify all connected components within this graph, and return these components as our segmentation. We use this approach to compare against 2D segmentation methods, but we note that 2D-based segmentation is not view-consistent.
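The thresholding step amounts to a connected-components pass over the 4-neighbor pixel graph. A minimal pure-Python sketch follows, assuming the rendered feature map is given as nested lists of feature vectors and that feature distance is Euclidean; the function name and the toy map are hypothetical.

```python
import math

def segment_feature_map(features, t):
    """Label connected components of the 4-neighbor pixel graph after
    removing every edge whose feature distance exceeds t.

    features: H x W nested list of feature vectors.
    Returns an H x W nested list of integer labels in raster order.
    """
    h, w = len(features), len(features[0])
    parent = list(range(h * w))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for r in range(h):
        for c in range(w):
            # Visiting the down and right neighbors covers every 4-neighbor edge once.
            for rr, cc in ((r + 1, c), (r, c + 1)):
                if rr < h and cc < w:
                    if math.dist(features[r][c], features[rr][cc]) <= t:
                        parent[find(r * w + c)] = find(rr * w + cc)

    labels = {}
    return [
        [labels.setdefault(find(r * w + c), len(labels)) for c in range(w)]
        for r in range(h)
    ]

# Toy 2 x 4 feature map with 1-D features (illustrative values).
toy_map = [
    [[0.0], [0.1], [1.0], [1.1]],
    [[0.0], [0.1], [1.0], [2.0]],
]
fine = segment_feature_map(toy_map, t=0.2)    # [[0, 0, 1, 1], [0, 0, 1, 2]]
coarse = segment_feature_map(toy_map, t=1.5)  # [[0, 0, 0, 0], [0, 0, 0, 0]]
```

Raising `t` can only add edges back to the graph, so every fine segment is contained in exactly one coarse segment.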

4.2.2 Segmenting in 3D

To achieve 3D-consistent segmentation, we create a featurized 3D point cloud by unprojecting feature maps and depth maps rendered from the optimized implicit field. Following related NeRF works, we remove outliers and downsample the point cloud using Open3D [64]. Then, we construct a k-nearest-neighbor (KNN) graph on the 3D point cloud, and we set each edge weight in the KNN graph to the feature distance between the connected vertices. For a threshold $t$ indicating the level of granularity, we remove all edges longer than that threshold, and the remaining connected components represent segments at the specified level of granularity. In practice, we keep the $N$ largest components as our final segmentation. Using different values of $t$ allows for segmentation at varying levels of granularity.

To propagate the 3D segmentation into a novel view, we first render a depth map of the novel view and unproject the render into 3D points. Then, for each unprojected pixel, we find its $k$ nearest neighbors among the points of the previously estimated KNN graph and assign a segmentation label using the mode of the neighborhood.
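The label transfer can be illustrated with a small majority-vote routine. This is only a sketch: `propagate_labels` is a hypothetical name, the neighbor search is brute force, and ties in the vote are broken by whichever label is encountered first, which the text does not specify.

```python
import math
from collections import Counter


def propagate_labels(query_points, cloud_points, cloud_labels, k):
    """Give each query point the most common label among its k nearest cloud points."""
    out = []
    for q in query_points:
        nearest = sorted(range(len(cloud_points)),
                         key=lambda j: math.dist(q, cloud_points[j]))[:k]
        votes = Counter(cloud_labels[j] for j in nearest)
        out.append(votes.most_common(1)[0][0])
    return out


# Two small clusters of labeled points and two unprojected query pixels.
cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0),
         (1.0, 0.0, 0.0), (1.1, 0.0, 0.0), (1.2, 0.0, 0.0)]
cloud_labels = [0, 0, 0, 1, 1, 1]
queries = [(0.05, 0.0, 0.0), (1.15, 0.0, 0.0)]
novel_view_labels = propagate_labels(queries, cloud, cloud_labels, k=3)
```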

Figure 5: Depth Continuity. Panels: (a) Image, (b) Original, (c) DC, (d) DC & COLMAP. Our depth continuity loss (labeled as DC above) leads to smoother and more plausible depth estimation, and can be seamlessly combined with additional depth cues such as COLMAP [44], resulting in even greater accuracy.

5 Experiments

We report qualitative and quantitative evaluations for our method and a set of relevant baselines. In Sec. 5.1 we give an overview of the datasets we evaluate on, and we define our evaluation metrics. In Sec. 5.2 and Sec. 5.3, we present quantitative comparisons and ablation studies. We present implementation details in Sec. 5.4.

5.1 Datasets and Metrics

PartNet Dataset

The official PartNet dataset [36, 8] contains 26,671 3D models with professionally verified hierarchical part decomposition. Furthermore, each object model is rendered into 24 viewpoints that uniformly cover the viewing sphere. PartNet offers us the unique ability to evaluate on hierarchical segmentation that verifiably aligns with human perception of structural hierarchy. We evaluate on five objects from each of the chair, table, and storage furniture categories, totaling 15 object models.

Blender with Hierarchical Segmentation

While PartNet contains 3D objects with hierarchical part annotations, each model is visually simplistic and lacks the photometric complexity often associated with NeRFs. To the best of our knowledge, no publicly available dataset exists that displays both complex photometric structure and 3D-consistent hierarchical segmentation labels. Thus, we create a new synthetic dataset based on the Blender Dataset [33] which we call Blender with Hierarchical Segmentation (Blender-HS). We choose three scenes from the Blender Dataset that exhibit clear hierarchical structure: Lego, Hotdog, and Drums, and for each we define three granularity levels, which we denote as “scene”, “collection”, and “object”. The “object” level consists of each distinct 3D asset; the “collection” level contains 3D asset groupings that were decided by the original artist; the “scene” level consists of the entire scene. While many potential hierarchies exist in general, the artist-defined groupings naturally represent a hierarchy that makes sense for humans. We present visualizations of the hierarchical ground truth in the supplementary. We use 100 views from the training set for training and 10 views from the validation set for evaluation.

Normalized Covering Score

To assess the quality of hierarchical segmentation, we use the Normalized Covering (NC) score [19]. This metric averages the Intersection over Union (IoU) between each ground truth mask and the best-matching (i.e. most-overlapping) predicted mask. We calculate a separate NC score for each granularity and report the mean of these scores. We also adapt the metric to evaluate the accuracy of point cloud segmentation on PartNet [36].
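As a concrete illustration of the score for one granularity level, the following sketch represents each mask as a set of pixel (or point) indices. The function names are hypothetical and the handling of empty inputs is an assumption.

```python
def iou(a, b):
    """Intersection over Union of two masks given as sets of indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def normalized_covering(gt_masks, pred_masks):
    """Mean over ground-truth masks of the best IoU with any predicted mask."""
    if not gt_masks or not pred_masks:
        return 0.0  # assumption: an empty input scores zero
    best = [max(iou(g, p) for p in pred_masks) for g in gt_masks]
    return sum(best) / len(gt_masks)


gt = [{0, 1, 2, 3}, {4, 5}]
pred = [{0, 1, 2}, {3, 4, 5}, {6}]
score = normalized_covering(gt, pred)  # averages the best IoUs 3/4 and 2/3
```

The reported number would then be the mean of this score over the granularity levels. Because each ground-truth mask only looks for its best match, extra predicted masks are never penalized, which is the motivation for pairing NC with a second metric.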

Segmentation Injectivity Score

In hierarchical segmentation, it is important that each pixel belongs to only one mask for each level of granularity. To measure this, we propose a metric which we call Segmentation Injectivity (SI). For each ground truth segmentation, we retrieve its best-matching predicted mask and sample two random points $p_1$ and $p_2$ from that mask. We then query the model for a new mask at each of these points, at the same granularity. We compute the Intersection over Union (IoU) between the two resulting masks. Note that a perfect model will return the same mask at both locations, since these points belong to the same ground-truth segment, whereas a worse model will return masks which may not even overlap. Since the random point sampling introduces randomness, we run this 100 times per ground truth mask and average the scores across all ground truth masks and all viewpoints. We find this is more tractable than estimating masks densely, and in practice exhibits low variance. Note there is a trade-off between the SI score and the NC score: predicting an excessive number of overlapping masks could inflate the NC score but would significantly lower the SI score.
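The sampling loop behind SI can be sketched as below for a single view. The callback `query_mask`, which stands in for "query the model for a mask at a point", and all other names are hypothetical; masks are sets of pixel indices.

```python
import random


def iou(a, b):
    """Intersection over Union of two masks given as sets of pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def segmentation_injectivity(gt_masks, pred_masks, query_mask, trials=100, seed=0):
    """Average IoU between masks queried at two random points of the best match.

    query_mask(p) is a hypothetical callback returning the model's mask (a set
    of pixels) for query pixel p at the evaluated granularity.
    """
    rng = random.Random(seed)
    scores = []
    for gt in gt_masks:
        best = max(pred_masks, key=lambda m: iou(gt, m))
        pixels = sorted(best)  # fixed order keeps sampling reproducible
        for _ in range(trials):
            p1, p2 = rng.choice(pixels), rng.choice(pixels)
            scores.append(iou(query_mask(p1), query_mask(p2)))
    return sum(scores) / len(scores)


parts = [{0, 1, 2}, {3, 4, 5}]


def partition_query(p):
    """A model whose masks partition the image: one mask per pixel."""
    return next(m for m in parts if p in m)


def overlap_query(p):
    """A model whose answer depends on the query point, giving overlapping masks."""
    return {p, (p + 1) % 6}


si_partition = segmentation_injectivity(parts, parts, partition_query)
si_overlap = segmentation_injectivity(parts, parts, overlap_query)
```

A model that partitions the image, as connected components do, returns the same mask for both points and therefore scores 1, while the point-dependent model scores lower.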

View Consistency Score

We use a View Consistency (VC) score to measure the 3D consistency of image segmentations. The key idea of this score is to estimate segmentations in two nearby viewpoints independently, and then warp these estimates onto one another, and measure their agreement. To compute this score, we begin by defining a pixel transformation $T$ from a source viewpoint to another one shifted 10 degrees, using ground truth depth and camera pose. Next, for each ground truth segmentation, we retrieve its best-matching predicted mask, sample a random point $p_1$ from that mask, and compute the point’s location in the shifted viewpoint $p_2=T(p_1)$. We then obtain segmentation estimates to compare, centered on $p_1$ in the first view and centered on $p_2$ in the second. Finally we warp $p_2$’s mask into the $p_1$ viewpoint, and measure IoU with the $p_1$ mask. To ensure that the IoU does not merely reveal occlusion/disocclusion differences, we use ground truth visibility masks to remove pixels which are invisible in either view, before computing the IoU. Similar to the other scores, we compute this score 100 times per ground truth mask and average the scores across all ground truth masks and all viewpoints.
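The final comparison step can be sketched as follows, assuming the warp has been precomputed as a lookup table. The dictionary `src_to_tgt` is a hypothetical stand-in for $T$ restricted to pixels visible in both views; in the paper it comes from ground truth depth and camera pose.

```python
def view_consistency_iou(mask_src, mask_tgt, src_to_tgt):
    """IoU between a source-view mask and a target-view mask warped to the source.

    mask_src, mask_tgt: sets of pixel indices in their own views.
    src_to_tgt: dict mapping each source pixel visible in both views to its
    target-view pixel; source pixels missing from the dict are ignored.
    """
    visible = set(src_to_tgt)
    a = mask_src & visible
    # A source pixel is in the warped mask if its target pixel lies in mask_tgt.
    b = {p for p in visible if src_to_tgt[p] in mask_tgt}
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Toy 1D "image" with 6 pixels; the second view is shifted by one pixel, so
# source pixel 5 has no counterpart and is treated as invisible.
warp = {p: p + 1 for p in range(5)}
consistent = view_consistency_iou({1, 2, 3}, {2, 3, 4}, warp)
inconsistent = view_consistency_iou({1, 2, 3}, {2, 3}, warp)
```

Averaging this value over repeated point samples, ground truth masks, and viewpoints would give the reported score.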

5.2 View-Consistent Hierarchical Segmentation

Table 1: Results on Blender-HS. We report the Normalized Covering (NC), Segmentation Injectivity (SI), and View Consistency (VC) as percentages. NC_obj, NC_coll, and NC_scene refer to the NC score on objects, collections, and scenes, respectively.

Method            NC_obj ↑   NC_coll ↑   NC_scene ↑   NC_mean ↑   SI ↑    VC ↑
LSeg [28]         21.8       40.6        67.0         43.2        69.8    53.6
LSeg + DFF [26]   16.4       42.7        83.9         47.7        84.8    82.7
LeRF [21]         26.6       41.3        85.7         51.2        59.8    64.1
SAM [24]          44.3       65.2        78.7         62.7        80.8    46.2
SAM3D [7]         46.0       59.3        53.2         52.8        -       72.4
Ours, 2D          50.0       63.0        94.0         69.0        100.0   67.8
Ours              48.0       67.7        97.1         70.9        100.0   78.9

Baselines

Because the exact task of 3D-consistent hierarchical segmentation from image collections is less explored, we adapt state-of-the-art NeRF baselines to this setting. We adapt language-based methods LSeg [28], DFF [26], and LeRF [21] to predict a segmentation mask based on a 2D pixel query, by querying the rendered feature map and estimating a segmentation mask via feature similarity. We generate multi-granular predictions by using different thresholds on the feature similarity. We also adapt SAM3D [7] as a baseline, which we modify to produce multiple segmentations instead of only one. We propagate SAM outputs from 20 training views into the novel view to generate a variety of masks.

Quantitative Evaluation

In Table 1, we report segmentation metrics for our 2D inference mode, our 3D inference mode, and our baselines. Our method significantly outperforms all other methods in Normalized Covering score (i.e. segmentation accuracy). We even outperform SAM, suggesting that our ultrametric feature field not only distills thousands of SAM predictions into a compact set of hierarchical and view-consistent masks, but that our masks are more accurate than any of the original SAM predictions. Due to the Watershed transform, we also achieve a perfect score on segmentation injectivity (SI), while the other methods (which are non-hierarchical) tend to predict overlapping masks. Finally, our 3D inference shows significant improvements in view consistency over all methods except DFF. We also point out that DFF’s object-level NC score suggests that it does not segment small objects, which are significantly more challenging to segment in a view-consistent manner, and this fact may inflate the VC score for the method.

We additionally visualize our ultrametric feature field on the Lego scene in the supplementary, showing sharper features than DFF.

Qualitative Analysis

We provide qualitative validation of our method on two real-world scenes from the Tanks and Temples Dataset [25]. In Fig. 4, we visualize three segmentation granularities on the truck scene. In Fig. 1, we show our segmentations on the Bobcat tractor scene, which resolve inconsistencies in the source SAM masks. Our method’s segmentations tend to have sharp boundaries, and the hierarchical structure decomposes the truck and tractor into intuitive parts and subparts.

PartNet Experiments

Each object in PartNet has a unique hierarchical part decomposition that may vary in both the number of parts and number of relationships between parts. To evaluate the part hierarchy using our NC metric, we categorize the part hierarchy of each object into three levels: finest, middle, and coarse. The finest level corresponds to the leaf level, while the coarse level represents the root level (i.e., full object), and the rest are the middle level. In Table 2, we compare our approach with LSeg+DFF [26] and SAM3D [7], two other methods that produce 3D-consistent segmentation. Our method surpasses both in mean NC score, while also being the only method to produce hierarchical segmentation. We also report the NC score based on point clouds and compare our method with 3D-PIS [48], a supervised point cloud segmentation network. Our method achieves a comparable NC score on the finest level. It falls behind on the middle level because the ground truth hierarchical tree is constructed semantically: 3D-PIS learns these semantics from supervision, while ours does not.

Table 2: Evaluation on PartNet. We report the Normalized Covering (NC) as a percentage on the PartNet dataset. NC_fine, NC_mid, and NC_coar refer to the NC score on three different hierarchy levels. Rows marked † are evaluated based on point clouds.

Method            NC_fine ↑   NC_mid ↑   NC_coar ↑   NC_mean ↑
LSeg + DFF [26]   22.4        48.5       70.6        47.2
SAM3D [7]         51.0        60.8       66.8        59.6
Ours              49.9        63.3       82.8        65.3
Ours †            48.9        60.4       -           -
3D-PIS [48] †     52.2        75.0       -           -

5.3 Ablation Experiments

Hierarchical Data Sampling

In Tab. 3 we evaluate the influence of hierarchical data sampling for contrastive learning. In the baseline approach, positive pairs are randomly sampled within each mask, and negative pairs are sampled with one point inside and one point outside. Results indicate that using the hierarchical sampling strategy improves the mean NC score by 5.7 points (from 65.2 to 70.9).

Ultrametric training

In Tab. 3 we report the impact of the ultrametric loss, $\ell_{\textrm{ultra}}$. In the baseline method, we train the feature space using only the Euclidean loss, $\ell_{\textrm{Euclid}}$. Results show that the ultrametric training improves the mean NC score by 2.1 points (from 68.8 to 70.9).

Depth Continuity Loss

Tab. 4 shows the influence of using the depth continuity loss during training. In addition to evaluating the NC score, we evaluate depth estimation accuracy, using mean $\ell_2$ error. Incorporating the depth continuity loss improves both segmentation accuracy and depth accuracy. Fig. 5 qualitatively illustrates the benefit of the depth continuity loss in real data.

Table 3: Ablation on Hierarchical Sampling (HS) and Ultrametric Training (UT).

Method    NC_obj   NC_coll   NC_scene   NC_mean
Ours      48.0     67.7      97.1       70.9
w/o UT    47.1     62.8      96.4       68.8
w/o HS    42.6     64.0      89.0       65.2

Table 4: Ablation on Depth Continuity (DC) loss.

Method    NC_mean   Depth Error
Ours      70.9      0.059
w/o DC    67.1      0.089

5.4 Implementation Details

We implement our method with help from the publicly available codebase from DFF [26]. We model our feature branch after the RGB branch of Instant-NGP [37]: we employ a multi-resolution grid hash encoder to transform 3D coordinates into features, followed by an MLP. For the grid hash encoder, we configure the number of levels to 17, features per level to 4, and the hash map size to $2^{20}$. The MLP comprises three hidden layers with 128 dimensions each, producing a final output feature of 256 dimensions. We provide additional implementation details in the supplementary.

For our quantitative evaluation on Blender-HS, we compute the NC score of LSeg, DFF, LeRF, and our method across 50 distance thresholds ranging from 0.01 to 0.50. We follow DFF [26] and perform all evaluation on $4\times$ downsampled images. For all methods, we exclude masks containing fewer than 20 pixels.
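The evaluation sweep can be set up along the following lines. Even spacing of the 50 thresholds is an assumption (the text only gives the range and the count), and both helper names are hypothetical.

```python
def threshold_grid(start=0.01, stop=0.50, count=50):
    """Evenly spaced thresholds including both endpoints (spacing is assumed)."""
    step = (stop - start) / (count - 1)
    return [round(start + i * step, 10) for i in range(count)]


def drop_small_masks(masks, min_pixels=20):
    """Keep only masks (sets of pixels) with at least min_pixels pixels."""
    return [m for m in masks if len(m) >= min_pixels]


thresholds = threshold_grid()
kept = drop_small_masks([set(range(19)), set(range(20)), set(range(500))])
```

Under the even-spacing assumption the step is 0.01, so each method would be segmented at 0.01, 0.02, and so on up to 0.50, with the small-mask filter applied to every method's output before scoring.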

6 Discussion and Limitations

A key limitation of our approach is its dependence on high-quality point clouds produced by the NeRF. While the depth continuity loss ablated in Section 5.3 improves point cloud quality, there is much room for further improvement. Regarding our evaluation, we find there is a scarcity of high-quality datasets for hierarchical 3D segmentation that exhibit complex scene structure. Additionally, it is ambiguous which hierarchies are the most meaningful and appropriate in complex environments, without defining end-tasks that rely on these hierarchies. While our Blender-HS Dataset is a first step in providing ground truth on hierarchical 3D segmentation in complex scenes, it is limited to only three scenes. We hope that future efforts can develop better and larger hierarchical 3D datasets, to enable more comprehensive evaluations.

7 Conclusion

Consistent hierarchical 3D segmentation is essential for many applications involving mobile agents interacting with the real world at scale, such as robotics and augmented reality. In this work we have demonstrated significant progress towards achieving consistent hierarchical segmentation, building on state-of-the-art systems that perform multi-granularity segmentation in images, whose output predictions are neither hierarchical nor consistent across different views. Our ultrametric feature field distills this inconsistent 2D information into a representation that can be queried at will for arbitrary-granularity segmentations that are consistent across views.

Acknowledgements

This work was supported by a Vannevar Bush Faculty Fellowship and ARL grant W911NF-21-2-0104.

References

  • [1] Arbelaez, P.: Boundary extraction in natural images using ultrametric contour maps. In: 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW’06). pp. 182–182 (2006). https://doi.org/10.1109/CVPRW.2006.48
  • [2] Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour detection and hierarchical image segmentation. IEEE transactions on pattern analysis and machine intelligence 33(5), 898–916 (2010)
  • [3] Barron, J.T., Mildenhall, B., Tancik, M., Hedman, P., Martin-Brualla, R., Srinivasan, P.P.: Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5855–5864 (2021)
  • [4] Beucher, S.: Watershed, hierarchical segmentation and waterfall algorithm. Mathematical morphology and its applications to image processing pp. 69–76 (1994)
  • [5] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers (2021)
  • [6] Cen, J., Fang, J., Yang, C., Xie, L., Zhang, X., Shen, W., Tian, Q.: Segment any 3d gaussians (2024), https://arxiv.org/abs/2312.00860
  • [7] Cen, J., Zhou, Z., Fang, J., Shen, W., Xie, L., Jiang, D., Zhang, X., Tian, Q.: Segment anything in 3d with nerfs (2023)
  • [8] Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012 (2015)
  • [9] Chen, A., Xu, Z., Geiger, A., Yu, J., Su, H.: Tensorf: Tensorial radiance fields. In: European Conference on Computer Vision. pp. 333–350. Springer (2022)
  • [10] Chen, A., Xu, Z., Wei, X., Tang, S., Su, H., Geiger, A.: Factor fields: A unified framework for neural fields and beyond. ArXiv abs/2302.01226 (2023), https://api.semanticscholar.org/CorpusID:256503583
  • [11] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International conference on machine learning. pp. 1597–1607. PMLR (2020)
  • [12] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1290–1299 (2022)
  • [13] Cousty, J., Najman, L., Kenmochi, Y., Guimarães, S.: Hierarchical segmentations with graphs: quasi-flat zones, minimum spanning trees, and saliency maps. Journal of Mathematical Imaging and Vision 60(4), 479–502 (2018)
  • [14] De Goes, F., Goldenstein, S., Velho, L.: A hierarchical segmentation of articulated bodies. Computer graphics forum 27(5), 1349–1356 (2008)
  • [15] Deng, K., Liu, A., Zhu, J., Ramanan, D.: Depth-supervised nerf: Fewer views and faster training for free. CoRR abs/2107.02791 (2021), https://arxiv.org/abs/2107.02791
  • [16] Gong, K., Gao, Y., Liang, X., Shen, X., Wang, M., Lin, L.: Graphonomy: Universal human parsing via graph transfer learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7450–7459 (2019)
  • [17] Hedman, P., Srinivasan, P.P., Mildenhall, B., Barron, J.T., Debevec, P.E.: Baking neural radiance fields for real-time view synthesis. CoRR abs/2103.14645 (2021), https://arxiv.org/abs/2103.14645
  • [18] Johnson, S.C.: Hierarchical clustering schemes. Psychometrika 32(3), 241–254 (1967)
  • [19] Ke, T.W., Hwang, J.J., Guo, Y., Wang, X., Yu, S.X.: Unsupervised hierarchical semantic segmentation with multiview cosegmentation and clustering transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2571–2581 (2022)
  • [20] Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics 42(4) (July 2023), https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/
  • [21] Kerr, J., Kim, C.M., Goldberg, K., Kanazawa, A., Tancik, M.: Lerf: Language embedded radiance fields (2023)
  • [22] Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., Krishnan, D.: Supervised contrastive learning. Advances in neural information processing systems 33, 18661–18673 (2020)
  • [23] Kim, C.M., Wu, M., Kerr, J., Tancik, M., Goldberg, K., Kanazawa, A.: Garfield: Group anything with radiance fields. In: arXiv (2024)
  • [24] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything (2023)
  • [25] Knapitsch, A., Park, J., Zhou, Q.Y., Koltun, V.: Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics 36(4) (2017)
  • [26] Kobayashi, S., Matsumoto, E., Sitzmann, V.: Decomposing nerf for editing via feature field distillation. In: Advances in Neural Information Processing Systems. vol. 35 (2022), https://arxiv.org/pdf/2205.15585.pdf
  • [27] Kundu, A., Genova, K., Yin, X., Fathi, A., Pantofaru, C., Guibas, L., Tagliasacchi, A., Dellaert, F., Funkhouser, T.: Panoptic neural fields: A semantic object-aware neural scene representation. In: CVPR (2022)
  • [28] Li, B., Weinberger, K.Q., Belongie, S., Koltun, V., Ranftl, R.: Language-driven semantic segmentation (2022)
  • [29] Li, L., Zhou, T., Wang, W., Li, J., Yang, Y.: Deep hierarchical semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1246–1257 (2022)
  • [30] Liu, L., Gu, J., Lin, K.Z., Chua, T.S., Theobalt, C.: Neural sparse voxel fields. NeurIPS (2020)
  • [31] Liu, X., Chen, J., Yu, H., Tai, Y.W., Tang, C.K.: Unsupervised multi-view object segmentation using radiance field propagation (2022)
  • [32] Liu, Y., Hu, B., Huang, J., Tai, Y.W., Tang, C.K.: Instance neural radiance field (2023)
  • [33] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis (2020)
  • [34] Milligan, G.W.: Ultrametric hierarchical clustering algorithms. Psychometrika 44(3), 343–346 (1979)
  • [35] Mirzaei, A., Aumentado-Armstrong, T., Derpanis, K.G., Kelly, J., Brubaker, M.A., Gilitschenski, I., Levinshtein, A.: Spin-nerf: Multiview segmentation and perceptual inpainting with neural radiance fields (2023)
  • [36] Mo, K., Zhu, S., Chang, A.X., Yi, L., Tripathi, S., Guibas, L.J., Su, H.: PartNet: A large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)
  • [37] Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. 41(4), 102:1–102:15 (Jul 2022). https://doi.org/10.1145/3528223.3530127
  • [38] Najman, L., Schmitt, M.: Geodesic saliency of watershed contours and hierarchical segmentation. IEEE Transactions on pattern analysis and machine intelligence 18(12), 1163–1173 (1996)
  • [39] Niemeyer, M., Barron, J.T., Mildenhall, B., Sajjadi, M.S.M., Geiger, A., Radwan, N.: Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 5470–5480 (2021), https://api.semanticscholar.org/CorpusID:244773517
  • [40] Oechsle, M., Peng, S., Geiger, A.: Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In: International Conference on Computer Vision (ICCV) (2021)
  • [41] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision (2021)
  • [42] Roessle, B., Barron, J.T., Mildenhall, B., Srinivasan, P.P., Nießner, M.: Dense depth priors for neural radiance fields from sparse input views. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 12882–12891 (2021), https://api.semanticscholar.org/CorpusID:244921004
  • [43] Roessle, B., Barron, J.T., Mildenhall, B., Srinivasan, P.P., Nießner, M.: Dense depth priors for neural radiance fields from sparse input views (2022)
  • [44] Schönberger, J.L., Zheng, E., Pollefeys, M., Frahm, J.M.: Pixelwise view selection for unstructured multi-view stereo. In: European Conference on Computer Vision (ECCV) (2016)
  • [45] Siddiqui, Y., Porzi, L., Bul’o, S.R., Muller, N., Nießner, M., Dai, A., Kontschieder, P.: Panoptic lifting for 3d scene understanding with neural fields. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 9043–9052 (2022), https://api.semanticscholar.org/CorpusID:254877618
  • [46] Sudderth, E., Torralba, A., Freeman, W., Willsky, A.: Learning hierarchical models of scenes, objects, and parts. In: Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1. vol. 2, pp. 1331–1338 Vol. 2 (2005). https://doi.org/10.1109/ICCV.2005.137
  • [47] Sun, C.Y., Yang, Y.Q., Guo, H.X., Wang, P.S., Tong, X., Liu, Y., Shum, H.Y.: Semi-supervised 3d shape segmentation with multilevel consistency and part substitution (2022)
  • [48] Sun, C., Tong, X., Liu, Y.: Semantic segmentation-assisted instance feature fusion for multi-level 3d part instance segmentation (2022)
  • [49] Tschernezki, V., Laina, I., Larlus, D., Vedaldi, A.: Neural feature fusion fields: 3d distillation of self-supervised 2d image representations. In: 2022 International Conference on 3D Vision (3DV). pp. 443–453. IEEE (2022)
  • [50] Uy, M.A., Martin-Brualla, R., Guibas, L., Li, K.: Scade: Nerfs from space carving with ambiguity-aware depth estimates. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
  • [51] Vora, S., Radwan, N., Greff, K., Meyer, H., Genova, K., Sajjadi, M.S.M., Pot, E., Tagliasacchi, A., Duckworth, D.: Nesf: Neural semantic fields for generalizable semantic segmentation of 3d scenes (2021)
  • [52] Wang, P., Liu, L., Liu, Y., Theobalt, C., Komura, T., Wang, W.: Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. ArXiv abs/2106.10689 (2021), https://api.semanticscholar.org/CorpusID:235490453
  • [53] Wang, W., Zhang, Z., Qi, S., Shen, J., Pang, Y., Shao, L.: Learning compositional neural information fusion for human parsing. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 5703–5713 (2019)
  • [54] Wang, W., Zhu, H., Dai, J., Pang, Y., Shen, J., Shao, L.: Hierarchical human parsing with typed part-relation reasoning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8929–8939 (2020)
  • [55] Xu, X., Yang, Y., Mo, K., Pan, B., Yi, L., Guibas, L.: Jacobinerf: Nerf shaping with mutual information gradients (2023)
  • [56] Xu, Y., Carlinet, E., Géraud, T., Najman, L.: Hierarchical segmentation using tree-based shape spaces. IEEE transactions on pattern analysis and machine intelligence 39(3), 457–469 (2016)
  • [57] Yariv, L., Gu, J., Kasten, Y., Lipman, Y.: Volume rendering of neural implicit surfaces. In: Thirty-Fifth Conference on Neural Information Processing Systems (2021)
  • [58] Yarkony, J.E., Fowlkes, C.: Planar ultrametrics for image segmentation. In: Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 28. Curran Associates, Inc. (2015), https://proceedings.neurips.cc/paper_files/paper/2015/file/3416a75f4cea9109507cacd8e2f2aefc-Paper.pdf
  • [59] Yin, Y., Fu, Z., Yang, F., Lin, G.: Or-nerf: Object removing from 3d scenes guided by multiview segmentation with neural radiance fields (2023)
  • [60] Ying, H., Yin, Y., Zhang, J., Wang, F., Yu, T., Huang, R., Fang, L.: Omniseg3d: Omniversal 3d segmentation via hierarchical contrastive learning (2023)
  • [61] Yu, Z., Chen, A., Antic, B., Peng, S.P., Bhattacharyya, A., Niemeyer, M., Tang, S., Sattler, T., Geiger, A.: Sdfstudio: A unified framework for surface reconstruction (2022), https://github.com/autonomousvision/sdfstudio
  • [62] Zhao, H., Puig, X., Zhou, B., Fidler, S., Torralba, A.: Open vocabulary scene parsing (2017)
  • [63] Zhi, S., Laidlow, T., Leutenegger, S., Davison, A.J.: In-place scene labelling and understanding with implicit scene representation (2021)
  • [64] Zhou, Q.Y., Park, J., Koltun, V.: Open3D: A modern library for 3D data processing. arXiv:1801.09847 (2018)
  • [65] Zhou, Y., Gu, J., Li, X., Liu, M., Fang, Y., Su, H.: Partslip++: Enhancing low-shot 3d part segmentation via multi-view instance segmentation and maximum likelihood estimation (2023), https://arxiv.org/abs/2312.03015

8 Supplementary Material

We provide additional implementation details in Sec. A. We present additional qualitative results on the Blender-HS, PartNet, and LLFF datasets in Sec. B. Additionally, we include a video attachment with visualizations of the view-consistent hierarchical segmentation results.

A. Additional Implementation Details

A.1. Hyperparameters

On Blender-HS and PartNet, we train our model for 20,000 iterations with a batch size of 4096 and use the same optimization parameters as DFF. In contrastive learning (see Eq. 3), we set the temperature $\tau$ to 0.1 and sample 64 positive and negative pairs from each mask. We set the loss weight $\alpha$ of the Euclidean loss in Eq. 4 to 1. For depth continuity loss, we sample 16 patches per mask, and begin using the depth continuity loss after 5000 iterations. During 3D inference, we extract a point cloud from training-view depth maps, apply voxel downsampling with voxel size $2\times 10^{-3}$, and run outlier removal with distance threshold $4\times 10^{-3}$ and number threshold of 1. We build the graph of points using $k_{graph}=16$ nearest neighbors, and we transfer point segmentation labels into a novel view using the mode of $k_{query}=5$ nearest neighbors. We retain $N=200$ graph components and set the distance threshold $d$ to be $5\times 10^{-3}$. Please refer to Sec. 4 for definitions of the above hyperparameters.

A.2. Hierarchical Sampling

We first organize the segmentation masks into a hierarchical structure determined by the inclusion ratio between them. One mask $A$ is designated as a child of another mask $B$ when $\frac{|A\cap B|}{|A|}>p_{in}$ and $\frac{|A\cap B|}{|A\cup B|}<p_{IoU}$. We empirically set $p_{in}=0.95$ and $p_{IoU}=0.85$. We present the hierarchical sampling algorithm we introduced in Sec. 4.1.3 in Algorithm 1.
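The parent-child test can be written directly from the two inequalities. In this sketch masks are sets of pixel indices and the function name `is_child` is hypothetical; only the thresholds come from the text.

```python
def is_child(a, b, p_in=0.95, p_iou=0.85):
    """True if mask a is a child of mask b under the two ratio tests."""
    if not a:
        return False  # assumption: an empty mask is never a child
    inter = len(a & b)
    inclusion = inter / len(a)   # |A ∩ B| / |A|
    iou = inter / len(a | b)     # |A ∩ B| / |A ∪ B|
    return inclusion > p_in and iou < p_iou


wheel = set(range(0, 20))
car = set(range(0, 100))
near_duplicate = set(range(0, 99))
```

Here `is_child(wheel, car)` holds because the wheel lies entirely inside the car and covers only a fifth of it. `is_child(near_duplicate, car)` fails the IoU test, so two almost identical masks are not treated as parent and child.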

Algorithm 1 Data Sampling
  pos_samples ← []
  neg_samples ← []
  for all A ∈ leaf_masks do
     # Sampling positive pairs from the leaf node
     sample ← (Random(A), Random(A))
     while A has Parent do
        B ← A.Parent
        pos_samples += sample
        # Sampling negative pairs for the current level
        sample ← (Random(A), Random(Ā ∩ B))
        neg_samples += sample
        A ← B
     end while
  end for
  return pos_samples, neg_samples
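
Algorithm 1 can be transcribed into Python as follows. This is a sketch under stated assumptions rather than the released code: the `MaskNode` class, the list-of-pixel-ids representation, and the guard for an empty complement are hypothetical additions. As in the pseudocode, the pair drawn as a negative at one level is appended as a positive at the next level, since both of its pixels lie inside the parent mask.

```python
import random


class MaskNode:
    """Hypothetical tree node: `pixels` is a list of pixel ids, `parent` a MaskNode or None."""

    def __init__(self, pixels, parent=None):
        self.pixels = list(pixels)
        self.parent = parent


def sample_pairs(leaf_masks, rng=random):
    pos_samples, neg_samples = [], []
    for leaf in leaf_masks:
        a = leaf
        # Positive pair: two random pixels inside the leaf mask.
        sample = (rng.choice(a.pixels), rng.choice(a.pixels))
        while a.parent is not None:
            b = a.parent
            pos_samples.append(sample)
            # Negative pair for this level: one pixel in A, one in B but outside A.
            a_set = set(a.pixels)
            outside = [p for p in b.pixels if p not in a_set]
            if outside:  # guard not present in Algorithm 1 (assumption)
                sample = (rng.choice(a.pixels), rng.choice(outside))
                neg_samples.append(sample)
            a = b
    return pos_samples, neg_samples
```

Passing a seeded `random.Random` instance as `rng` makes the sampled pairs reproducible.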

For training efficiency, we sample the same number of positive and negative pairs. Using a 4:1 ratio ($4\times$ more negatives than positives) instead of 1:1 increases the Normalized Covering (NC) score from 0.709 to 0.720. However, this slows down our training by 43%, primarily due to the time-consuming computation of ultrametric distances and the associated minimum spanning tree.

A.3. Additional Details on Evaluation Metrics

Normalized Covering Score

As described in Sec. 5.1, we measure the quality of hierarchical segmentation with the Normalized Covering (NC) score [19]. This metric averages the Intersection over Union (IoU) between each ground truth mask and the best-matching (i.e., most-overlapping) predicted mask. The metric is defined as

$$\mathrm{NC}(S'\rightarrow S)=\frac{1}{|S|}\sum_{A\in S}\max_{A'\in S'}\frac{|A\cap A'|}{|A\cup A'|} \tag{6}$$

where $S$ denotes the set of ground truth segmentation masks and $S'$ denotes the set of predicted masks. LSeg, DFF, and LeRF only output feature fields and do not directly provide segmentation results ($S'$). For these baselines, we adopt an approach similar to our method and extract segmentations by thresholding feature distances.
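
Eq. 6 translates into a few lines of code. The sketch below assumes masks are Python sets of pixel ids; the function names `iou` and `normalized_covering` and the zero return value for empty inputs are hypothetical conventions of this illustration.

```python
def iou(a, b):
    """Intersection over Union of two masks given as sets of pixel ids."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def normalized_covering(gt_masks, pred_masks):
    """NC(S' -> S): mean over ground truth masks of the best IoU with any prediction."""
    if not gt_masks or not pred_masks:
        return 0.0  # convention chosen for this sketch
    best = [max(iou(a, p) for p in pred_masks) for a in gt_masks]
    return sum(best) / len(gt_masks)
```

Because the maximum is taken per ground truth mask, extra predicted masks that match nothing do not lower the score; only missing or poorly matched ground truth masks do.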

Segmentation Injectivity Score

We propose the Segmentation Injectivity (SI) score to measure whether each pixel belongs to only one mask at each level of granularity. As described in Sec. 5.1, given a ground truth mask, we first randomly sample two points $p_1$ and $p_2$ from that mask, and then query the model at each of these points, at the same granularity, for a mask prediction. We then measure the IoU between the two resulting masks. We repeat this process $N=100$ times for each ground truth mask, computing a score for each run. The final SI score is obtained by averaging the scores across all ground truth masks and viewpoints.

We represent the segmentation model as $F(v,p,t)\rightarrow A'$, where $v$ denotes the viewpoint, $p$ represents the pixel query, $t$ corresponds to the granularity level, and $A'$ is the resulting segmentation mask. The SI score is defined as

$$\mathrm{SI}(S'\rightarrow S)=\frac{1}{|N||S|}\sum_{A\in S}\sum_{i=1}^{N}\frac{|F(v,p_1^i,t)\cap F(v,p_2^i,t)|}{|F(v,p_1^i,t)\cup F(v,p_2^i,t)|}$$
$$\text{where}\quad t=\arg\max_{t}\frac{|A\cap F(v,p_1^i,t)|}{|A\cup F(v,p_1^i,t)|}$$

Here, $v$ represents the view corresponding to the ground truth mask $A$.
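
The SI computation for a single view can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code: `query(p, t)` is a hypothetical callable standing in for $F(v,p,t)$ with the view fixed, masks are sets of pixel ids, and `thresholds` is an assumed finite list of candidate granularities.

```python
import random


def iou(a, b):
    """Intersection over Union of two masks given as sets of pixel ids."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def segmentation_injectivity(gt_masks, query, thresholds, n_samples=100, rng=random):
    """Average IoU between predictions queried at two pixels of the same ground truth mask."""
    scores = []
    for a in gt_masks:
        pixels = sorted(a)
        for _ in range(n_samples):
            p1, p2 = rng.choice(pixels), rng.choice(pixels)
            # Granularity whose prediction at p1 best matches the ground truth mask.
            t = max(thresholds, key=lambda s: iou(a, query(p1, s)))
            scores.append(iou(query(p1, t), query(p2, t)))
    return sum(scores) / len(scores) if scores else 0.0
```

A model whose segments form a consistent partition at each granularity returns the same mask for both pixels and therefore scores 1, while a model whose prediction depends on the query pixel scores lower.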

View Consistency Score

We use the View Consistency (VC) score to measure the 3D consistency of image segmentations. Starting with the source view, we rotate the camera by 10 degrees, rendering both a new image and the corresponding ground truth visibility mask in the shifted view – Fig. 6 provides an example of two viewpoints and their visibility mask on the Blender Hotdog scene.

For a given point query $p_1$ at granularity $t$, with mask prediction $A_1=F(v,p_1,t)$ in the source view, we leverage the ground truth camera parameters to warp the point to $p_2=T(p_1)$ and the mask prediction to $T(A_1)$ in the shifted view, where $T$ denotes the pixel transformation. Following this, we query the model in the shifted view $v'$ with $p_2$ using the same threshold $t$, resulting in $A_2=F(v',p_2,t)$.

Using the visibility mask $V$, we eliminate pixels that are occluded in either view from $A_2$ and $T(A_1)$. The Intersection over Union (IoU) between the remaining masks is the VC score for this sample. To reduce the noise induced by random sampling, we compute this score $N=100$ times per ground truth mask.

Taken together, the VC score is defined as

$$\mathrm{VC}(S'\rightarrow S)=\frac{1}{|N||S|}\sum_{A\in S}\sum_{i=1}^{N}\frac{|T(A_1^i)\cap A_2^i\cap V|}{|(T(A_1^i)\cup A_2^i)\cap V|}$$
$$\text{where}\quad A_1^i=F\Big(v,p_1^i,\arg\max_{t}\frac{|A\cap F(v,p_1^i,t)|}{|A\cup F(v,p_1^i,t)|}\Big)$$
$$A_2^i=F\Big(v',p_2^i,\arg\max_{t}\frac{|A\cap F(v,p_1^i,t)|}{|A\cup F(v,p_1^i,t)|}\Big)$$

For additional details, please refer to Sec. 5.1.
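
The per-sample term of the VC score reduces to a masked IoU. The sketch below assumes that the warped source mask $T(A_1)$, the shifted-view prediction $A_2$, and the visibility mask $V$ are all given as sets of pixel ids in the shifted view; the function name and the zero return value for an empty union are hypothetical conventions.

```python
def view_consistency_sample(warped_a1, a2, visible):
    """IoU between T(A1) and A2, restricted to pixels visible in both views."""
    inter = len(warped_a1 & a2 & visible)
    union = len((warped_a1 | a2) & visible)
    # An empty visible union is scored as 0 here (assumption of this sketch).
    return inter / union if union else 0.0
```

Restricting both numerator and denominator to visible pixels keeps occlusion and disocclusion from being counted as segmentation disagreement.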

We also evaluate the View Consistency score at multiple rotation angles in Tab. 5. The ranking of the methods is the same at every angle.

Table 5: View Consistency score with different view angles.

| Method | $\mathrm{VC}_{10^\circ}\uparrow$ | $\mathrm{VC}_{45^\circ}\uparrow$ | $\mathrm{VC}_{90^\circ}\uparrow$ | $\mathrm{VC}_{135^\circ}\uparrow$ |
|---|---|---|---|---|
| LSeg [26] | 0.536 | 0.522 | 0.510 | 0.498 |
| LSeg + DFF [24] | 0.827 | 0.813 | 0.810 | 0.808 |
| SAM3D [6] | 0.724 | 0.601 | 0.578 | 0.539 |
| Ours | 0.789 | 0.763 | 0.742 | 0.712 |

Depth Error

We use the ground truth depth map rendered in Blender to compute the depth error of our method. The depth error is reported in the scale of the normalized NeRF scene.

Figure 6: View Consistency: (a) Image, (b) Shifted Image, (c) Visibility Mask. We evaluate the view consistency between the source viewpoint (a) and another viewpoint shifted by 10 degrees (b). We render the ground truth visibility mask (c) with ray casting to exclude pixels that are occluded or disoccluded between views.

Figure 7: NC Score of SAM3D: We plot the Normalized Covering (NC) score (y-axis) of SAM3D against the number of views (x-axis) from which we propagate the SAM segmentation masks.

Table 6: Distilled Feature Fields: We present the NC score of DFF with volume rendering (VR) and the NC score of the official codebase on our BlenderHS dataset.

| Method | $\mathrm{NC}_{\text{obj}}\uparrow$ | $\mathrm{NC}_{\text{coll}}\uparrow$ | $\mathrm{NC}_{\text{scene}}\uparrow$ | $\mathrm{NC}_{\text{mean}}\uparrow$ |
|---|---|---|---|---|
| DFF (VR) | 0.164 | 0.427 | 0.839 | 0.477 |
| DFF (Official) | 0.082 | 0.286 | 0.666 | 0.345 |

A.4. Additional Details on Baselines

DFF

We configure DFF to use a white background and employ uniform ray sampling on the BlenderHS dataset. All other hyperparameters directly adhere to the official implementation. Furthermore, DFF’s official code (https://github.com/pfnet-research/distilled-feature-fields) does not apply volume rendering to the feature branch. Instead, it generates a feature map by directly querying the volumetric features at the 3D locations of the predicted surface points. We extended their code to perform volume rendering, and we show that using volume rendering leads to improved performance on the BlenderHS dataset (see Tab. 6).
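
For reference, volume rendering a feature along a ray uses the same compositing weights as color rendering in a standard NeRF. The sketch below shows that standard quadrature in plain Python; it is not taken from the DFF codebase or from our extension of it, and the function name `render_feature` and the list-based inputs are hypothetical.

```python
import math


def render_feature(sigmas, deltas, features):
    """Alpha-composite per-sample features along one ray.

    sigmas:   densities at the ray samples, ordered near to far
    deltas:   distances between consecutive samples
    features: one feature vector (list of floats) per sample
    """
    dim = len(features[0])
    out = [0.0] * dim
    transmittance = 1.0  # probability that the ray reaches the current sample
    for sigma, delta, feat in zip(sigmas, deltas, features):
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        for k in range(dim):
            out[k] += weight * feat[k]
        transmittance *= 1.0 - alpha
    return out
```

Querying the feature only at the predicted surface point corresponds to placing all of the weight on a single sample, whereas compositing blends the features of every sample that contributes to the pixel.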

LeRF

We use LeRF’s reported NSVF hyperparameters for the Blender synthetic dataset. This includes configuring the background to white, selecting uniform sampling as the ray sampling strategy, disabling space distortion, and setting average appearance embedding to off. We train the model for 20000 steps. For the Normalized Covering Score, we report the highest result among all 30 semantic scales available in the LeRF feature field for each ground truth granularity. For the Segmentation Injectivity score and View Consistency scores, we evaluate LeRF at the semantic scale corresponding to the ground truth granularity which yields the highest NC score.

SAM3D

Given a pretrained NeRF and a segmentation mask from a single view, SAM3D optimizes a binary voxel grid using mask inverse rendering and cross-view self-prompting to propagate the mask into 3D. In our experiments, we propagate the SAM masks from the segment-everything mode using 20 training views (while still using all 100 training images to pretrain the NeRF). We observed saturation in SAM3D’s NC score after 20 views, and, on an A6000 GPU, it takes approximately a day per scene to propagate the segmentation maps from 20 views. In contrast, our method takes around 2 hours.

SAM

We employ the ViT-H model from the official SAM GitHub repository (https://github.com/facebookresearch/segment-anything) to generate mask predictions. To generate the training supervision for our model, we use the segment-everything mode.

In the evaluation process, when querying segmentation models with a randomly sampled point, we employ the point as a prompt for SAM to generate the segmentation prediction. This approach, compared to evaluating based on the output of the segment-everything mode, yields a higher NC score and provides a clearer granularity level.

A.5. Training and Inference Time

We train and perform inference on a Titan RTX GPU. Training typically takes $\sim$70 minutes, while inference takes 5 seconds per granularity for 10 views. The main expense in inference is the watershed algorithm running on 3D point clouds, which is executed once per granularity and is view-independent.

B. Additional Qualitative Results

B.1. BlenderHS Dataset

We first visualize the ground truth segmentations for the Drums scene in the BlenderHS Dataset [33] in Fig. 8. We then present qualitative results on the BlenderHS dataset [33] in Fig. 9. Our segmentations exhibit a hierarchical structure and maintain consistency across different views. We also visualize our ultrametric feature field on the Lego scene in Fig. 11, showing sharper features than DFF.

Figure 8: Blender with Hierarchical Segmentation (Blender-HS): We render hierarchical segmentation maps at three levels of granularity, namely Scene, Collection, and Object, using information saved into the Blender file by the scene artist.

B.2. PartNet Dataset

We present qualitative results on the PartNet dataset [36] in Fig. 10. Our method is able to generate hierarchical segmentation results for different objects. Leveraging the 2D masks predicted with SAM as guidance, our method segments various surfaces of sub-parts within the object, even though these are not included in the PartNet ground truth annotations.

B.3. LLFF Dataset

We present qualitative results on the LLFF dataset [33] in Fig. 12. Our approach is able to generate view-consistent hierarchical segmentation results for real-world scenes.

Figure 9: BlenderHS Dataset: (a) Image, (b) Hierarchical Segmentation. We present the qualitative results obtained from our BlenderHS dataset. The segmentation results demonstrate both view consistency and hierarchical structure.

Figure 10: PartNet Dataset: (a) Image, (b) Hierarchical Segmentation. We showcase the qualitative results on the PartNet dataset.

Figure 11: Feature Visualization: (a) Image, (b) DFF [26], (c) Ours. We visualize rendered feature maps using PCA. The feature map generated by DFF [26] fails to distinguish between different parts of the Lego. In contrast, our method learns features that can distinguish various Lego bricks.

Figure 12: LLFF Dataset: We showcase the qualitative results on the LLFF dataset.