
IllumiNeRF: 3D Relighting without Inverse Rendering

Xiaoming Zhao¹,³ Pratul P. Srinivasan² Dor Verbin²
Keunhong Park¹ Ricardo Martin-Brualla¹ Philipp Henzler¹
¹Google Research ²Google DeepMind ³University of Illinois Urbana-Champaign
Work done as an intern at Google.

Abstract

Existing methods for relightable view synthesis (using a set of images of an object under unknown lighting to recover a 3D representation that can be rendered from novel viewpoints under a target illumination) are based on inverse rendering, and attempt to disentangle the object geometry, materials, and lighting that explain the input images. Furthermore, this typically involves optimization through differentiable Monte Carlo rendering, which is brittle and computationally expensive. In this work, we propose a simpler approach: we first relight each input image using an image diffusion model conditioned on lighting and then reconstruct a Neural Radiance Field (NeRF) with these relit images, from which we render novel views under the target lighting. We demonstrate that this strategy is surprisingly competitive and achieves state-of-the-art results on multiple relighting benchmarks. Please see our project page at illuminerf.github.io.

Figure 1: Given a set of posed input images under an unknown lighting (four exemplar images from the set are shown on top), IllumiNeRF produces high-quality novel views (bottom) relit under a target lighting (illustrated as chrome balls). Inputs obtained from the Stanford-ORB dataset [24].

1 Introduction

Capturing an object’s appearance so that it can be accurately rendered in novel environments is a central problem in computer vision whose solution would democratize 3D content creation for augmented and virtual reality, photography, filmmaking, and game development. Recent advances in view synthesis [33] have made impressive progress in reconstructing a 3D representation that can be rendered from novel viewpoints, using just a set of observed images. However, those methods typically only recover the appearance of the object under the captured illumination, and relightable view synthesis — rendering novel views of the captured object under arbitrary target environments — remains challenging.

Recent methods for recovering relightable 3D representations treat this task as inverse rendering, and attempt to estimate the geometry, materials, and illumination that jointly explain the input images using physically-based rendering methods. These approaches typically involve gradient-based optimization through differentiable Monte Carlo rendering procedures, which are noisy and computationally expensive. Moreover, the inverse rendering optimization problem is brittle and inherently ambiguous; many potential sets of geometry, materials, and lighting can explain the input images, but many of these incorrect explanations produce obviously implausible renderings when rendered under novel unobserved illumination.

We propose a different approach that avoids inverse rendering and instead leverages a generative image model fine-tuned for the task of relighting. Given a set of images viewing an object and a desired target illumination, we use a 2D Relighting Diffusion Model that outputs relit images of the object under the target illumination. Due to the ambiguous nature of the problem, each sample of the generative model encodes a different explanation of the object’s materials, geometry and the input illumination. However, as opposed to optimization-based inverse rendering, such samples are all plausible relit images since they are the output of the trained diffusion model.

Instead of attempting to recover a single explanation of the underlying object’s appearance, we sample multiple plausible relit images for each observed viewpoint, and treat the underlying explanations as samples of unobserved latent variables. To recover a final consistent 3D representation of the relit object, we use the full set of sampled relit images from all viewpoints to train a “latent NeRF” that reconciles all the samples into a single 3D representation, which can be rendered to produce plausible relit images from novel viewpoints.

The key contribution of our work is a new paradigm for relightable 3D reconstruction that replaces 3D inverse rendering with two steps: generating samples with a 2D Relighting Diffusion Model, then distilling these samples into a 3D latent NeRF representation. We demonstrate that this strategy is surprisingly competitive and outperforms most existing 3D inverse rendering baselines on the TensoIR [20] and Stanford-ORB [24] relighting and view synthesis benchmarks.

2 Related Work

Our work addresses the task of relightable 3D reconstruction by using a lighting-conditioned diffusion model as a generative prior for single-image relighting. It is closely related to prior work in relightable 3D reconstruction, inverse rendering, and single-image relighting. Below, we review these lines of work and discuss how they relate to our proposed approach.

Relightable 3D Reconstruction

The goal of relightable 3D reconstruction is to reconstruct a 3D representation of an object that can be relit by novel illumination conditions and rendered from novel camera poses. In scenarios where an object is observed under multiple lighting conditions [10], it is trivial to render its appearance under novel illumination that is a linear combination of the observed lighting conditions, due to the linear behavior of light. This approach is generally limited to laboratory capture scenarios where it is possible to observe an object under a lighting basis.
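The linearity argument above can be illustrated with a small sketch. It assumes a hypothetical stack `basis_images` of linear-radiance (not tone-mapped) images, one per basis light, and a hypothetical vector `weights` that expresses the target illumination in that basis; these names and the toy data are illustrative only.

```python
import numpy as np

def relight_from_basis(basis_images: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Relight by superposition.

    basis_images: (K, H, W, 3) linear-radiance images, one per basis light.
    weights: (K,) coefficients of the target lighting in the same basis.
    """
    # Light transport is linear, so the image under a weighted sum of lights
    # equals the same weighted sum of the per-light images.
    return np.tensordot(weights, basis_images, axes=1)

rng = np.random.default_rng(0)
basis_images = rng.random((3, 4, 4, 3))  # toy data, not real captures
weights = np.array([0.5, 0.0, 2.0])      # hypothetical target-light coefficients
relit = relight_from_basis(basis_images, weights)
```

The superposition only holds for linear pixel values, so tone-mapped images would need to be linearized first.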

In more casual capture scenarios, the object is observed under just a single lighting condition or a small handful of them. Existing works typically address this setting using methods based on inverse rendering that explicitly factor an object’s appearance into the underlying 3D geometry, object material properties, and lighting that jointly explain the observed images. State-of-the-art approaches to 3D inverse rendering [7, 8, 15, 20, 23, 30, 35, 42, 43] generally utilize the following strategy: they start with a neural field representation of 3D geometry (typically volume density as in NeRF [33], hybrid volume-surface representations as in NeuS [53] and VolSDF [55], or meshes extracted from neural field representations) from the input images, equip the model with a representation of surface materials (e.g. spatially-varying BRDF parameters) and lighting, and jointly optimize these factors through a differentiable physics-based rendering procedure [37]. While methods may differ in their choice of geometry, material, and lighting representations, and employ different techniques to accelerate the evaluation of the rendering integral, they generally all follow this same high-level inverse rendering strategy. Unfortunately, even if the geometry is known, inverse rendering is a notoriously ambiguous problem [39, 48] and many combinations of materials and lighting can explain an object’s appearance. However, not all of these combinations are plausible, and incorrect factorizations that explain observed images under one lighting condition may produce glaring artifacts when rendered under different lighting. Furthermore, differentiable physics-based rendering is computationally expensive, as thousands of samples are needed for Monte Carlo estimates of the rendering integral; it typically requires custom implementations [2, 3, 19, 25, 29, 32, 50]; and the resulting inverse rendering loss landscape is non-smooth and difficult to optimize effectively with gradient descent [12].

Single Image Relighting

Instead of using inverse rendering to recover object material parameters which can be relit with physically-based rendering techniques, we train a diffusion model that can directly sample from the distribution of relit images conditioned on a target lighting condition. This diffusion model is essentially a generative single-image relighting model. Early single image relighting techniques employed optimization-based inverse rendering [4]. Subsequent methods trained deep convolutional neural networks to output image geometry, materials, and lighting [26, 27], or in some cases, to directly output relit images [44].

Most related to our method are a few recent works that have trained diffusion models for single image relighting. LightIt [22] trains a model similar to ControlNet [59] to relight outdoor images under arbitrary sun positions conditioned on input normals and shading. DiffusionLight [38] estimates the lighting of an image by using a ControlNet to inpaint the color pixels of a chrome ball in the middle of the scene, from which an environment map can be recovered.

Most similar to our work is the concurrent method of DiLightNet [57] that focuses on single image relighting. DiLightNet uses a ControlNet-based [59] approach to condition a single-image relighting diffusion model on a target environment map. As conditioning, DiLightNet uses a set of “radiance cues” [13]: renderings of the object’s geometry (obtained from an off-the-shelf monocular depth network) with various roughness levels under the target environment illumination. Our method instead focuses on 3D relighting, where multiple images of an object are available. It uses a similar single-image relighting diffusion model conditioned on radiance cues. Unlike DiLightNet, which uses geometry from monocular depth estimation to render radiance cues, we use geometry estimated from the input views using a state-of-the-art surface reconstruction method [52]. This allows our model to better model complex light transport effects such as interreflections caused by occluded geometry.

3 Method

3.1 Problem Formulation

Given a dataset of images of an object and corresponding camera poses $\mathcal{D}=\left\{\left(I_{i},\pi_{i}\right)\right\}_{i=1}^{N}$ , the general goal of relightable 3D reconstruction is to estimate a model with parameters $\theta$ that when rendered, produces relit versions of the dataset under unobserved target illumination $L^{T}$ . This can be expressed as:

$$\theta^{\star}=\mathop{\mathrm{argmax}}_{\theta}p(\mathcal{D}_{\theta}^{T}|\mathcal{D}), \tag{1}$$

where $\mathcal{D}_{\theta}^{T}=\left\{\left(\operatorname{relight}(I_{i},L^{T},\theta),\pi_{i}\right)\right\}_{i=1}^{N}$ is a relit version of the original dataset under target illumination $L^{T}$ using model $\theta$. Note that Eq. (1) only maximizes the likelihood of the relit images at the original given poses. However, by using view synthesis, we can then turn the collection of relit images into a 3D representation which can be rendered from arbitrary poses. For brevity, we omit the implicit dependence of $\mathcal{D}^{T}$ on $\theta$ in what follows.

This relighting problem has traditionally been solved by using inverse rendering. Inverse rendering techniques do not maximize the probability of the relit renderings, but instead recover a single point estimate of the most likely scene geometry $G$ , materials $M$ , and lighting $L$ (note that this is the “source” lighting condition for the observed images) that together explain the input dataset, and then use physically-based rendering to relight this factorized explanation under the target lighting. Inverse rendering seeks to recover $\theta^{\operatorname{IR}}=\left(G^{\star},M^{\star}\right)$ , where:

$$G^{\star},M^{\star},L^{\star}=\mathop{\mathrm{argmax}}_{G,M,L}p(G,M,L|\mathcal{D})=\mathop{\mathrm{argmax}}_{G,M,L}p(\mathcal{D}|G,M,L)\,p(G,M,L). \tag{2}$$

Figure 2: Overview: Given a set of images $I$ and camera poses $\pi$ in (a), we run NeRF to extract the 3D geometry as in (b). Based on this geometry and a target light shown in (c), we create radiance cues for each given input view as in (d). Next, we independently relight each input image using a Relighting Diffusion Model illustrated in (e) and sample $S$ possible solutions for each given image displayed in (f). Finally, we distill the relit set of images into a 3D representation through a Latent NeRF optimization as in (g) and (h).

The first data likelihood term is computed by physics-based rendering of the estimated model and the second prior term is often factorized into separate handcrafted priors on geometry, materials, and lighting [20, 30, 39].

A relighting approach based on inverse rendering then renders each image $I$ in $\mathcal{D}$ using the recovered geometry and materials, illuminated by the target lighting $L^{T}$, resulting in $\operatorname{relight}(\mathcal{D},L^{T},\theta^{\operatorname{IR}})$. This approach has three main issues. First, the differentiable rendering procedures used to compute the gradient of the likelihood term are computationally expensive. Second, it requires careful modeling of light transport, which is cumbersome, and existing differentiable renderers do not account for many types of lighting and material effects seen in the real world. Third, there are often ambiguities between $M$ and $L$, meaning that any errors in their decomposition may be apparent in the relit data. It is quite difficult to design effective handcrafted priors on geometry, materials, and lighting, so inverse rendering procedures frequently recover explanations that have a high data likelihood (are able to render the observed data) but produce clearly incorrect results when re-rendered under different illumination.

3.2 Model Overview

We propose an approach that attempts to maximize the probability of relit images in Eq. (1) without using an explicit physically-based model of the object’s lighting or materials. First, let us introduce a latent variable $Z$ that can be thought of as implicitly representing the input images’ lighting along with the object’s material and geometry parameters. We can then write the likelihood of the relit data as:

$$p(\mathcal{D}^{T}|\mathcal{D})=\int p(\mathcal{D}^{T},Z|\mathcal{D})\,dZ=\int p(\mathcal{D}^{T}|Z,\mathcal{D})\,p(Z|\mathcal{D})\,dZ. \tag{3}$$

Introducing these latent variables lets us consider all relit renderings in the dataset, $\mathcal{D}^{T}_{i}\triangleq(I^{T}_{i},\pi_{i})$ , as conditionally independent, since the rendering under the target lighting $L^{T}$ is deterministic given the object’s geometry and materials. This enables writing the likelihood as:

$$p(\mathcal{D}^{T}|\mathcal{D})=\int\underbrace{\Bigg[\prod_{i=1}^{N}\;p(\mathcal{D}^{T}_{i}|Z_{i},\mathcal{D}_{i})\Bigg]}_{\text{latent NeRF}}\;\underbrace{\vphantom{\Bigg[}p(Z|\mathcal{D})}_{\text{latent prior}}\,dZ. \tag{4}$$

We propose to model this with a latent NeRF model, as used by Martin-Brualla et al. [31] that is able to render novel views under the target illumination for any sampled latent vector. We describe this model in Sec. 3.3. We train this NeRF model by generating a large quantity of sampled relit images with the same target lighting but with different (unknown) latent vectors using a Relighting Diffusion Model which we will describe in Sec. 3.4. In this way, the latent NeRF model effectively distills a large dataset of relit images sampled by the diffusion model into a single 3D representation that can render novel views of the object under the target lighting for any sampled latent.
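The overall procedure (see also Fig. 2) can be summarized as a short control-flow sketch. Every callable passed in below (`estimate_geometry`, `render_cues`, `sample_rdm`, `fit_latent_nerf`) is a hypothetical placeholder for a component described in the text, not an actual API; only the ordering of the steps is meant to be illustrated.

```python
from typing import Callable, List, Sequence, Tuple

def relight_then_distill(
    images: Sequence,
    poses: Sequence,
    target_light,
    estimate_geometry: Callable,   # hypothetical: (images, poses) -> geometry
    render_cues: Callable,         # hypothetical: (geometry, pose, light) -> cues
    sample_rdm: Callable,          # hypothetical: (image, cues) -> one relit sample
    fit_latent_nerf: Callable,     # hypothetical: (relit samples) -> 3D model
    num_samples: int,
):
    """Relight every input view S times, then distill all samples into one model."""
    geometry = estimate_geometry(images, poses)            # Fig. 2 (b)
    relit: List[Tuple[int, int, object, object]] = []
    for i, (image, pose) in enumerate(zip(images, poses)):
        cues = render_cues(geometry, pose, target_light)   # Fig. 2 (d)
        for s in range(num_samples):                       # Fig. 2 (e)-(f)
            # Each sample keeps its (view, sample) index so that it can be
            # paired with its own latent code during distillation.
            relit.append((i, s, sample_rdm(image, cues), pose))
    return fit_latent_nerf(relit)                          # Fig. 2 (g)-(h)
```

Because the diffusion sampling and the NeRF optimization both depend on the target lighting, the whole function has to be rerun for every new target environment map.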

3.3 Latent NeRF Model

We wish to model the distribution in Eq. (4) in a manner that lets us render images that correspond to relit views of the object for any sampled latent $Z$ . We choose to model this with a latent code NeRF 3D representation, inspired by prior works that condition NeRFs on latent codes to represent sources of variation such as the time of day during capture [31]. This latent NeRF optimizes a set of latent codes that are used to condition the view-dependent color function represented by the NeRF, enabling it to render novel views of the relit object under the target illumination for any sampled latent code. In our implementation, the latent NeRF’s geometry does not depend on the latent code, so the latent code may be interpreted as only representing the object’s material properties.

To optimize the parameters $\theta$ of the latent NeRF model, we maximize the log-likelihood, which by using Eq. (4), can be written as the following maximization problem:

$$\theta^{\star}=\mathop{\mathrm{argmax}}_{\theta}\log p(\mathcal{D}_{\theta}^{T}|\mathcal{D})=\mathop{\mathrm{argmax}}_{\theta}\log\int\left[\prod_{i=1}^{N}p(\mathcal{D}^{T}_{i}|Z_{i},\mathcal{D}_{i})\right]p(Z|\mathcal{D})\,dZ. \tag{5}$$

Because integrating over all possible latents $Z$ is intractable, we use a heuristic inference strategy and replace the integral with the maximum a posteriori (MAP) estimate of $Z$ :

$$\theta^{\star}\approx\mathop{\mathrm{argmax}}_{\theta}\max_{Z}\left\{\sum_{i=1}^{N}\log p(\mathcal{D}^{T}_{i}|Z_{i},\mathcal{D}_{i})+\log p(Z|\mathcal{D})\right\}. \tag{6}$$

Assuming a Gaussian model over the data given the materials, the first term in Eq. (6) becomes a reconstruction loss over the images. Since we do not have access to the true latent vectors $Z$, we assume a uniform prior over them, turning the second term in Eq. (6) into a constant. In practice, similar to prior work on NeRFs optimized to generate new views given a dataset containing images with varying appearance, we rely on the NeRF model to resolve any mismatches in the appearance of different images [31]. The minimization of the negative log-likelihood can then be written as:

$$\theta^{\star}=\mathop{\mathrm{argmin}}_{\theta}\min_{Z}\sum_{i=1}^{N}\left\|\mathcal{D}^{T}_{i}-\operatorname{latent\text{-}NeRF}(\theta,Z_{i},\pi_{i})\right\|^{2}. \tag{7}$$
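To make the step from Eq. (6) to Eq. (7) explicit, the following sketch assumes an isotropic Gaussian likelihood with a fixed standard deviation $\sigma$. The symbol $\sigma$ is introduced here for illustration only; the text does not specify it.

```latex
\begin{aligned}
p(\mathcal{D}^{T}_{i}|Z_{i},\mathcal{D}_{i})
  &\propto \exp\!\left(-\frac{1}{2\sigma^{2}}
     \left\|\mathcal{D}^{T}_{i}-\operatorname{latent\text{-}NeRF}(\theta,Z_{i},\pi_{i})\right\|^{2}\right),\\
-\log p(\mathcal{D}^{T}_{i}|Z_{i},\mathcal{D}_{i})
  &= \frac{1}{2\sigma^{2}}
     \left\|\mathcal{D}^{T}_{i}-\operatorname{latent\text{-}NeRF}(\theta,Z_{i},\pi_{i})\right\|^{2}
     +\text{const}.
\end{aligned}
```

Under the uniform prior, $\log p(Z|\mathcal{D})$ is also constant, so maximizing the objective in Eq. (6) is equivalent to minimizing the sum of squared errors in Eq. (7). The positive factor $1/(2\sigma^{2})$ and the additive constants do not change the minimizer.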

3.4 Relighting Diffusion Model

In order to train the latent NeRF model described in Sec. 3.3, we use a Relighting Diffusion Model (RDM) to generate $S$ samples for each viewpoint from $p(\mathcal{D}^{T}_{i}|\mathcal{D}_{i})$. In other words, given an input image and target lighting $L^{T}$, the RDM samples $S$ images corresponding to relit versions of $\mathcal{D}_{i}$ that have a high likelihood given the new target light $L^{T}$. We then associate each sample $s\in\{1,\dots,S\}$ with its own latent code $Z_{i,s}$ and sum the reconstruction loss of Eq. (7) over all samples when training the latent NeRF.
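A minimal sketch of the resulting objective follows, assuming that the per-sample extension of Eq. (7) is a plain sum of squared errors over views $i$ and samples $s$. The names `latent_nerf_loss`, `render`, and `toy_render` are hypothetical, and the toy renderer merely stands in for a NeRF so that the example runs.

```python
import numpy as np

def latent_nerf_loss(render, theta, latents, relit_samples, poses):
    """Sum of squared errors over all views i and RDM samples s.

    render: hypothetical callable (theta, z, pose) -> image array.
    latents: (N, S, d) array, one code Z[i, s] per relit sample.
    relit_samples: (N, S, H, W, 3) array of diffusion samples.
    poses: length-N sequence of camera poses.
    """
    total = 0.0
    num_views, num_samples = latents.shape[:2]
    for i in range(num_views):
        for s in range(num_samples):
            pred = render(theta, latents[i, s], poses[i])
            total += float(np.sum((relit_samples[i, s] - pred) ** 2))
    return total

def toy_render(theta, z, pose):
    # Stand-in for a NeRF: a base image plus a latent-dependent brightness offset.
    return theta + z[0]

theta = np.zeros((2, 2, 3))
poses = [None]                                           # N = 1 view (pose unused by the toy)
relit_samples = np.array([[theta + 0.0, theta + 1.0]])   # S = 2 inconsistent samples
fitted_latents = np.array([[[0.0], [1.0]]])              # one code per sample
shared_latents = np.zeros((1, 2, 1))                     # same code for both samples

loss_fitted = latent_nerf_loss(toy_render, theta, fitted_latents, relit_samples, poses)
loss_shared = latent_nerf_loss(toy_render, theta, shared_latents, relit_samples, poses)
```

In this toy setting, per-sample codes absorb the disagreement between the two samples, whereas a single shared code leaves a residual error. In a real system both `theta` and the codes would be optimized jointly by gradient descent.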

Figure 3: Example radiance cues for a view of the ‘hotdog’ scene.

Our RDM is implemented as an image denoising diffusion model that is conditioned by the input image and target lighting. To encode the target lighting, we use image-space radiance cues [13, 40, 57], visualized in Fig. 3. These radiance cues are generated by using a simple shading model to render a handful of images of the object’s estimated geometry under the target lighting. This procedure is designed to provide information about the effects of specularities, shadows, and global illumination, without requiring the diffusion network to learn these effects from scratch. In our experiments, we use four different pre-defined materials to render radiance cues: one diffuse material with a pure white albedo, and three purely-specular materials with roughness values $\{0.05,0.13,0.34\}$ . We use GGX [51] as the shading model.
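As an illustration of how the three specular roughness values shape the radiance cues, the sketch below evaluates the GGX normal distribution function, which is only one factor of a full GGX microfacet BRDF (Fresnel and masking-shadowing terms are omitted). The remapping `alpha = roughness ** 2` is an assumption; the text does not state which roughness parameterization is used.

```python
import numpy as np

CUE_ROUGHNESS = (0.05, 0.13, 0.34)  # specular roughness values quoted in the text

def ggx_ndf(n_dot_h, roughness):
    """GGX (Trowbridge-Reitz) normal distribution function D(h)."""
    alpha = roughness ** 2  # assumed remapping; some renderers use alpha = roughness
    n_dot_h = np.clip(n_dot_h, 0.0, 1.0)
    denom = n_dot_h ** 2 * (alpha ** 2 - 1.0) + 1.0
    return alpha ** 2 / (np.pi * denom ** 2)

# Height of the specular lobe when the half vector is aligned with the normal.
peaks = [float(ggx_ndf(1.0, r)) for r in CUE_ROUGHNESS]
```

Lower roughness gives a taller and narrower lobe, so the three specular cues range from sharp, mirror-like reflections of the environment map to broad, blurry highlights.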

The RDM architecture consists of a pretrained latent image diffusion model, similar to StableDiffusion [41], and uses a ControlNet [59] based approach to condition on the radiance cues. Please refer to Sec. A.3 for more architecture details.

Figure 4: (a) Samples of the Relighting Diffusion Model for the same target environment map, and (b) renderings from the optimized Latent NeRF for a fixed value of the latent. The diffusion samples correspond to different latent explanations of the scene and our latent NeRF optimization is able to effectively optimize these latent variables along with the NeRF model’s parameters to produce consistent renderings for each latent explanation.

4 Experiments

4.1 Experimental Setup

Relighting Dataset

We render objects from Objaverse [11] under varying poses and illuminations. For each object, we randomly sample 4 poses, and render each under 4 different lighting conditions. We represent the lighting as HDR environment maps, and randomly sample from a dataset of 509 environment maps from Polyhaven [56]. For more details, see Sec. A.4.

Evaluation datasets. We evaluate our method on two datasets: TensoIR [20], a synthetic benchmark, and Stanford-ORB [24], a real-world benchmark. TensoIR contains renders of four synthetic objects rendered under six lighting conditions. Following [20], we use the training split of 100 renderings with ‘sunset’ lighting as input $\{I_{i}\}$ . We then evaluate on 200 poses, each of which has renderings under five different environment maps, i.e., ‘bridge’, ‘city’, ‘fireplace’, ‘forest’, and ‘night’, for a total of $4000$ renderings. Stanford-ORB is a real-world benchmark for inverse rendering on data captured in the wild. It contains 14 objects with various materials and captures each object under three different lighting settings, resulting in 42 (object, lighting) pairs. For the task of relighting, we are given images of an object under a single lighting condition and follow the benchmark protocol to evaluate relit images of the object under the two target lighting settings.

Baselines. We compare our method to several existing inverse rendering approaches. On both benchmarks, we compare to NeRFactor [61] and InvRender [62]. On the synthetic benchmark, we additionally compare to TensoIR [20], the current top-performing approach on that benchmark. For the Stanford-ORB benchmark, we additionally compare to PhySG [58], NVDiffRec [35], NeRD [8], NVDiffRecMC [15], and Neural-PBIR [43].

Our model inference. At inference time, since we do not have access to the ground truth relit images, we set the embedding vectors for all views to $Z=0$ when rendering test images. This is in contrast with prior work on latent NeRF models, which usually optimize the embedding vectors to match with (a subset of) the test-set images [31], and is instead consistent with other methods for relighting.

Evaluation metrics. For both benchmarks, we evaluate the quality of 3D relighting by reporting image metrics for rendered images. We report PSNR, SSIM [54], and LPIPS-VGG [60] on low dynamic range (LDR) images. Additionally, we report PSNR on high dynamic range (HDR) images on Stanford-ORB following the benchmark protocol, denoted as PSNR-H while the PSNR on LDR images is marked as PSNR-L. For approaches that do not produce HDR renderings, including ours, we convert the LDR renderings to linear values by using the inverse of the sRGB tone mapping curve. Due to the inherent ambiguities for the relighting task, we follow prior works [20, 24] and apply a channel-wise scale factor to RGB channels to match the ground truth image before computing metrics. Following established evaluation practices on Stanford-ORB, we compute the scale per output image individually whereas for TensoIR we compute a global scale factor that is used for all output images.
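The metric pipeline described above can be sketched as follows. The inverse sRGB curve is the standard piecewise formula; the least-squares estimator in `channel_scale` is an assumption, since the text only states that a channel-wise scale factor is applied, and all function names are illustrative.

```python
import numpy as np

def srgb_to_linear(c):
    """Inverse of the standard sRGB transfer curve, for values in [0, 1]."""
    c = np.asarray(c, dtype=np.float64)
    return np.where(c <= 0.04045, c / 12.92, ((c + 0.055) / 1.055) ** 2.4)

def channel_scale(pred, gt):
    """Per-channel scale minimizing ||scale * pred - gt||^2 (assumed estimator).

    Pass one (H, W, 3) image for a per-image scale, or a whole (M, H, W, 3)
    stack for a single global scale shared by all images.
    """
    axes = tuple(range(pred.ndim - 1))  # reduce over everything but the channel axis
    num = np.sum(pred * gt, axis=axes)
    den = np.sum(pred * pred, axis=axes)
    return num / np.maximum(den, 1e-12)

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB; assumes pred and gt are not identical."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(0)
gt = rng.random((2, 4, 4, 3)) * 0.5 + 0.1   # toy ground truth, M = 2 images
pred = gt / np.array([2.0, 4.0, 0.5])       # prediction off by a per-channel scale
global_scale = channel_scale(pred, gt)      # one scale for the whole stack
per_image_scale = channel_scale(pred[0], gt[0])
```

Calling `channel_scale` on a single image corresponds to the per-image protocol described for Stanford-ORB, while calling it on the full stack corresponds to the global scale described for TensoIR.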

Table 1: TensoIR benchmark [20]. We evaluate four objects. Each object has five target lightings, each of which is associated with 200 poses, resulting in evaluating 4000 renderings in total. Best and 2nd-best are highlighted.

	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$
NeRFactor [61]	23.383	0.908	0.131
InvRender [62]	23.973	0.901	0.101
TensoIR [20]	28.580	0.944	0.081
Ours	29.709	0.947	0.072

Table 2: Stanford-ORB benchmark [24]. We evaluate 14 objects, each of which was captured under three different lightings. For each (object, lighting) pair, we evaluate renderings of the same object under the other two lightings, resulting in evaluating 836 renderings. † denotes models trained with the ground-truth 3D scans and pseudo materials optimized from light-box captures. Best and 2nd-best are highlighted.

	PSNR-H $\uparrow$	PSNR-L $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$
NVDiffRecMC [15]†	25.08	32.28	0.974	0.027
NVDiffRec [35]†	24.93	32.42	0.975	0.027
PhySG [58]	21.81	28.11	0.960	0.055
NVDiffRec [35]	22.91	29.72	0.963	0.039
NeRD [8]	23.29	29.65	0.957	0.059
NeRFactor [61]	23.54	30.38	0.969	0.048
InvRender [62]	23.76	30.83	0.970	0.046
NVDiffRecMC [15]	24.43	31.60	0.972	0.036
Neural-PBIR [43]	26.01	33.26	0.979	0.023
Ours	25.42	32.62	0.976	0.027

4.2 Benchmarking

We report quantitative results on the TensoIR benchmark in Tab. 1, and show qualitative examples in Fig. 5. We outperform all competing methods on all metrics. Visually, our method is capable of recovering specular highlights, whereas prior methods struggle to model these. Similarly, we report results on Stanford-ORB in Tab. 2 and Fig. 6. Our proposed approach quantitatively improves upon all baselines except Neural-PBIR [43], indicating the effectiveness of IllumiNeRF in real-world scenarios. Note that although Neural-PBIR achieves better metrics than ours, Fig. 6 shows that its relighting results are mostly diffuse, even for highly glossy objects, and that they lack many of the strong specular highlights that our method is able to recover. This behavior may explain its better metrics despite worse qualitative performance, because the illumination maps provided by Stanford-ORB do not correspond to the incident illumination at the object’s location, since they were captured using a light probe which was moved for each image in the dataset [24]. This means that even given perfect materials and geometry, the images relit by any method cannot match the true captured images, which is most noticeable in specular highlights. This mismatch penalizes methods like ours, which recover such specularities, over ones that recover mostly diffuse appearance with no apparent specular highlights [48].

Table 3: Ablations. We conduct ablation studies on the Hotdog scene from TensoIR [20]. We evaluate renderings of 200 novel test camera poses, each under five target environment map lighting conditions, resulting in evaluating 1000 renderings in total. Best is highlighted.

S	Latent	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$
1	✗	24.957	0.921	0.099
1	✓	26.321	0.925	0.097
4	✓	27.409	0.936	0.087
16	✓	27.950	0.939	0.082

4.3 Ablations

We evaluate ablations of our model on TensoIR’s hotdog scene in Tab. 3, and visualize them in Fig. 7. We reach the following conclusions: 1) The latent NeRF model is essential: optimizing a standard NeRF cannot reconcile variations across views, even if we only generate a single sample per viewpoint for optimization ( $S=1$ ). 2) More diffusion samples help: by increasing $S$ , the number of samples from the RDM per viewpoint, we observe consistent improvements across almost all metrics. This corroborates our intuition that using an increased number of samples helps the latent NeRF effectively fit the target distribution (Eq. (4)) in a more stable way.

4.4 Limitations

Our model relies on high-quality geometry estimated by UniSDF [52] (see Sec. A.1) to provide sufficiently good radiance cues for conditioning the RDM (Sec. 3.4). Any missing structure will lead our model to miss specular reflections, as seen on the top left of the salt can result in Fig. 6’s second column. Errors in geometry also affect the quality of synthesized novel views, e.g. the missing thin branches from the plant in Fig. 5. Our approach is not suited for real-time relighting, as it requires generating new samples with the RDM and optimizing a NeRF for any new target lighting condition.

5 Conclusion

We have proposed a new paradigm for the task of relightable 3D reconstruction. Instead of decomposing an object’s appearance into lighting and material factors and only then relighting the object with physically-based rendering, we use a Relighting Diffusion Model (RDM) to sample a varied collection of proposed relit images given a target illumination, and distill these samples into a single consistent 3D latent NeRF representation. This 3D representation can then be rendered to synthesize novel views of the object under the target lighting. Perhaps surprisingly, this paradigm consistently outperforms existing inverse rendering methods on synthetic and real-world object relighting benchmarks. This new paradigm’s success is likely due to the RDM’s ability to generate a large number of proposals for the new relit image. This is in contrast with prior work for relighting based on inverse rendering, which first estimates a single material model and then uses it for relighting, since errors in material estimation may propagate to the relit images. We believe that this paradigm may be used to improve data capture, material and lighting estimation, and that it may be used to do so robustly on real-world data.

Acknowledgements

We would like to thank Ben Poole and Ruiqi Gao for insightful discussions. We thank Yunzhi Zhang and Zhengfei Kuang for providing their qualitative results for the Stanford-ORB [24] baseline, and Haian Jin for the TensoIR [20] baseline results. We are also grateful to Abhijit Kundu and Henna Nandwani for their infrastructure support.

References

Aanæs et al. [2016] H. Aanæs, R. R. Jensen, G. Vogiatzis, E. Tola, and A. B. Dahl. Large-Scale Data for Multiple-View Stereopsis. IJCV, 2016.
Bangaru et al. [2022] S. Bangaru, M. Gharbi, T.-M. Li, F. Luan, K. Sunkavalli, M. Hasan, S. Bi, Z. Xu, G. Bernstein, and F. Durand. Differentiable Rendering of Neural SDFs through Reparameterization. In SIGGRAPH Asia, 2022.
Bangaru et al. [2023] S. Bangaru, L. Wu, T.-M. Li, J. Munkberg, G. Bernstein, J. Ragan-Kelley, F. Durand, A. Lefohn, and Y. He. SLANG.D: Fast, Modular and Differentiable Shader Programming. ACM TOG, 2023.
Barron and Malik [2014] J. T. Barron and J. Malik. Shape, Illumination, and Reflectance from Shading. TPAMI, 2014.
Barron et al. [2021] J. T. Barron, B. Mildenhall, D. Verbin, P. P. Srinivasan, and P. Hedman. Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields. In CVPR, 2021.
Barron et al. [2023] J. T. Barron, B. Mildenhall, D. Verbin, P. P. Srinivasan, and P. Hedman. Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields. In ICCV, 2023.
Bi et al. [2020] S. Bi, Z. Xu, P. Srinivasan, B. Mildenhall, K. Sunkavalli, M. Hašan, Y. Hold-Geoffroy, D. Kriegman, and R. Ramamoorthi. Neural Reflectance Fields for Appearance Acquisition. ArXiv, 2020.
Boss et al. [2021] M. Boss, R. Braun, V. Jampani, J. T. Barron, C. Liu, and H. P. A. Lensch. NeRD: Neural Reflectance Decomposition from Image Collections. In ICCV, 2021.
Bradbury et al. [2018] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang. JAX: Composable Transformations of Python+NumPy Programs, 2018. URL http://github.com/google/jax.
Debevec et al. [2000] P. Debevec, T. Hawkins, C. Tchou, H.-P. Duiker, W. Sarokin, and M. Sagar. Acquiring the Reflectance Field of a Human Face. In ACM CGIT, 2000.
Deitke et al. [2023] M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi. Objaverse: A Universe of Annotated 3D Objects. In CVPR, 2023.
Fischer and Ritschel [2023] M. Fischer and T. Ritschel. Plateau-Reduced Differentiable Path Tracing. In CVPR, 2023.
Gao et al. [2020] D. Gao, G. Chen, Y. Dong, P. Peers, K. Xu, and X. Tong. Deferred Neural Lighting: Free-viewpoint Relighting from Unstructured Photographs. ACM TOG, 2020.
Greff et al. [2022] K. Greff, F. Belletti, L. Beyer, C. Doersch, Y. Du, D. Duckworth, D. J. Fleet, D. Gnanapragasam, F. Golemo, C. Herrmann, T. Kipf, A. Kundu, D. Lagun, I. Laradji, H.-T. D. Liu, H. Meyer, Y. Miao, D. Nowrouzezahrai, C. Oztireli, E. Pot, N. Radwan, D. Rebain, S. Sabour, M. S. M. Sajjadi, M. Sela, V. Sitzmann, A. Stone, D. Sun, S. Vora, Z. Wang, T. Wu, K. M. Yi, F. Zhong, and A. Tagliasacchi. Kubric: a scalable dataset generator. In CVPR, 2022.
Hasselgren et al. [2022] J. Hasselgren, N. Hofmann, and J. Munkberg. Shape, Light, and Material Decomposition from Images using Monte Carlo Rendering and Denoising. In NeurIPS, 2022.
Heek et al. [2023] J. Heek, A. Levskaya, A. Oliver, M. Ritter, B. Rondepierre, A. Steiner, and M. van Zee. Flax: A Neural Network Library and Ecosystem for JAX, 2023. URL http://github.com/google/flax.
Hendrycks and Gimpel [2016] D. Hendrycks and K. Gimpel. Gaussian Error Linear Units (GELUs). ArXiv, 2016.
Ho et al. [2020] J. Ho, A. Jain, and P. Abbeel. Denoising Diffusion Probabilistic Models. In NeurIPS, 2020.
Jakob et al. [2022] W. Jakob, S. Speierer, N. Roussel, and D. Vicini. Dr.Jit: A Just-In-Time Compiler for Differentiable Rendering. ACM TOG, 2022.
Jin et al. [2023] H. Jin, I. Liu, P. Xu, X. Zhang, S. Han, S. Bi, X. Zhou, Z. Xu, and H. Su. TensoIR: Tensorial Inverse Rendering. In CVPR, 2023.
Kingma and Ba [2014] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. ArXiv, 2014.
Kocsis et al. [2024] P. Kocsis, J. Philip, K. Sunkavalli, M. Nießner, and Y. Hold-Geoffroy. LightIt: Illumination Modeling and Control for Diffusion Models. In CVPR, 2024.
Kuang et al. [2022] Z. Kuang, K. Olszewski, M. Chai, Z. Huang, P. Achlioptas, and S. Tulyakov. NeROIC: Neural Rendering of Objects from Online Image Collections. ACM TOG, 2022.
Kuang et al. [2023] Z. Kuang, Y. Zhang, H.-X. Yu, S. Agarwala, E. Wu, J. Wu, et al. Stanford-ORB: A Real-World 3D Object Inverse Rendering Benchmark. In NeurIPS, 2023.
Li et al. [2018a] T.-M. Li, M. Aittala, F. Durand, and J. Lehtinen. Differentiable Monte Carlo Ray Tracing through Edge Sampling. ACM TOG, 2018a.
Li et al. [2018b] Z. Li, Z. Xu, R. Ramamoorthi, K. Sunkavalli, and M. Chandraker. Learning to reconstruct shape and spatially-varying reflectance from a single image. ACM TOG, 2018b.
Li et al. [2020] Z. Li, M. Shafiei, R. Ramamoorthi, K. Sunkavalli, and M. Chandraker. Inverse Rendering for Complex Indoor Scenes: Shape, Spatially-varying Lighting and SVBRDF from a Single Image. In CVPR, 2020.
Lorensen and Cline [1987] W. E. Lorensen and H. E. Cline. Marching Cubes: A High Resolution 3D Surface Construction Algorithm. In ACM CGIT, 1987.
Loubet et al. [2019] G. Loubet, N. Holzschuch, and W. Jakob. Reparameterizing Discontinuous Integrands for Differentiable Rendering. ACM TOG, 2019.
Mai et al. [2023] A. Mai, D. Verbin, F. Kuester, and S. Fridovich-Keil. Neural Microfacet Fields for Inverse Rendering. In ICCV, 2023.
Martin-Brualla et al. [2021] R. Martin-Brualla, N. Radwan, M. S. M. Sajjadi, J. T. Barron, A. Dosovitskiy, and D. Duckworth. NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections. In CVPR, 2021.
Nimier-David et al. [2020] M. Nimier-David, S. Speierer, B. Ruiz, and W. Jakob. Radiative Backpropagation: An Adjoint Method for Lightning-Fast Differentiable Rendering. ACM TOG, 2020.
Mildenhall et al. [2020] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV, 2020.
Müller et al. [2022] T. Müller, A. Evans, C. Schied, and A. Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM TOG, 2022.
Munkberg et al. [2022] J. Munkberg, J. Hasselgren, T. Shen, J. Gao, W. Chen, A. Evans, T. Müller, and S. Fidler. Extracting Triangular 3D Models, Materials, and Lighting From Images. In CVPR, 2022.
Oechsle et al. [2021] M. Oechsle, S. Peng, and A. Geiger. UNISURF: Unifying Neural Implicit Surfaces and Radiance Fields for Multi-View Reconstruction. In ICCV, 2021.
Pharr et al. [2023] M. Pharr, W. Jakob, and G. Humphreys. Physically Based Rendering: From Theory to Implementation. MIT Press, 2023.
Phongthawee et al. [2023] P. Phongthawee, W. Chinchuthakun, N. Sinsunthithet, A. Raj, V. Jampani, P. Khungurn, and S. Suwajanakorn. DiffusionLight: Light Probes for Free by Painting a Chrome Ball. ArXiv, 2023.
Ramamoorthi and Hanrahan [2001] R. Ramamoorthi and P. Hanrahan. A Signal-Processing Framework for Inverse Rendering. In ACM CGIT, 2001.
Ren et al. [2011] P. Ren, J. Wang, J. M. Snyder, X. Tong, and B. Guo. Pocket Reflectometry. ACM TOG, 2011.
Rombach et al. [2022] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In CVPR, 2022.
Srinivasan et al. [2021] P. P. Srinivasan, B. Deng, X. Zhang, M. Tancik, B. Mildenhall, and J. T. Barron. NeRV: Neural Reflectance and Visibility Fields for Relighting and View Synthesis. In CVPR, 2021.
Sun et al. [2023] C. Sun, G. Cai, Z. Li, K. Yan, C. Zhang, C. S. Marshall, J.-B. Huang, S. Zhao, and Z. Dong. Neural-PBIR Reconstruction of Shape, Material, and Illumination. In ICCV, 2023.
Sun et al. [2019] T. Sun, J. T. Barron, Y.-T. Tsai, Z. Xu, X. Yu, G. Fyffe, C. Rhemann, J. Busch, P. Debevec, and R. Ramamoorthi. Single Image Portrait Relighting. ACM TOG, 2019.
Tang et al. [2024] J. Tang, Z. Chen, X. Chen, T. Wang, G. Zeng, and Z. Liu. LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation. ArXiv, 2024.
[46] The Blender Foundation. Blender 2.93. URL https://www.blender.org/.
Verbin et al. [2022] D. Verbin, P. Hedman, B. Mildenhall, T. E. Zickler, J. T. Barron, and P. P. Srinivasan. Ref-NeRF: Structured View-Dependent Appearance for Neural Radiance Fields. In CVPR, 2022.
Verbin et al. [2024a] D. Verbin, B. Mildenhall, P. Hedman, J. T. Barron, T. Zickler, and P. P. Srinivasan. Eclipse: Disambiguating Illumination and Materials using Unintended Shadows. In CVPR, 2024a.
Verbin et al. [2024b] D. Verbin, P. P. Srinivasan, P. Hedman, B. Mildenhall, B. Attal, R. Szeliski, and J. T. Barron. NeRF-Casting: Improved View-Dependent Appearance with Consistent Reflections. ArXiv, 2024b.
Vicini et al. [2022] D. Vicini, S. Speierer, and W. Jakob. Differentiable Signed Distance Function Rendering. ACM TOG, 2022.
Walter et al. [2007] B. Walter, S. Marschner, H. Li, and K. E. Torrance. Microfacet Models for Refraction through Rough Surfaces. In Rendering Techniques, 2007.
Wang et al. [2023] F. Wang, M.-J. Rakotosaona, M. Niemeyer, R. Szeliski, M. Pollefeys, and F. Tombari. UniSDF: Unifying Neural Representations for High-Fidelity 3D Reconstruction of Complex Scenes with Reflections. ArXiv, 2023.
Wang et al. [2021] P. Wang, L. Liu, Y. Liu, C. Theobalt, T. Komura, and W. Wang. NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction. In NeurIPS, 2021.
Wang et al. [2004] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 2004.
Yariv et al. [2021] L. Yariv, J. Gu, Y. Kasten, and Y. Lipman. Volume Rendering of Neural Implicit Surfaces. In NeurIPS, 2021.
Zaal et al. [2021] G. Zaal, R. Tuytel, R. Cilliers, J. R. Cock, A. Mischok, S. Majboroda, D. Savva, and J. Burger. Polyhaven: a Curated Public Asset Library for Visual Effects Artists and Game Designers, 2021.
Zeng et al. [2024] C. Zeng, Y. Dong, P. Peers, Y. Kong, H. Wu, and X. Tong. DiLightNet: Fine-grained Lighting Control for Diffusion-based Image Generation. In SIGGRAPH, 2024.
Zhang et al. [2021a] K. Zhang, F. Luan, Q. Wang, K. Bala, and N. Snavely. PhySG: Inverse Rendering with Spherical Gaussians for Physics-based Material Editing and Relighting. In CVPR, 2021a.
Zhang et al. [2023] L. Zhang, A. Rao, and M. Agrawala. Adding Conditional Control to Text-to-Image Diffusion Models. In ICCV, 2023.
Zhang et al. [2018] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR, 2018.
Zhang et al. [2021b] X. Zhang, P. P. Srinivasan, B. Deng, P. E. Debevec, W. T. Freeman, and J. T. Barron. NeRFactor: Neural Factorization of Shape and Reflectance Under an Unknown Illumination. ACM TOG, 2021b.
Zhang et al. [2022] Y. Zhang, J. Sun, X. H. He, H. Fu, R. Jia, and X. Zhou. Modeling Indirect Illumination for Inverse Rendering. In CVPR, 2022.

Appendix – IllumiNeRF: 3D Relighting without Inverse Rendering

Appendix A Additional Implementation Details

A.1 Latent NeRF Model and Geometry Estimator

We use JAX [9] to implement both the geometry estimator and Latent NeRF model as UniSDF [52], a state-of-the-art volume rendering approach based on a signed distance function (SDF). The advantage of using UniSDF is that it enables easily extracting a mesh from the SDF, which we can then import into a standard rendering engine such as Blender [46] in order to compute radiance cues. Additionally, UniSDF decouples geometry from appearance, allowing us to fix the weights related to geometry and only optimize for weights that model the appearance.

Our parameterization of the UniSDF model is similar to the one used in the original paper for the DTU dataset [1], with four key changes. First, we reduce the number of rounds of proposal sampling (as introduced by mip-NeRF 360 [5]) from two to one, using 64 proposal samples. Second, we use the asymmetric predicted normal loss from NeRF-Casting [49]:

$$\mathcal{L}_{p}=\sum_{i}\left(\lambda_{1}\omega_{i}\|\cancel{\nabla}\mathbf{n}_{i}-\cancel{\nabla}\mathbf{n}_{i}^{\prime}\|^{2}+\lambda_{2}\cancel{\nabla}\omega_{i}\|\mathbf{n}_{i}-\cancel{\nabla}\mathbf{n}_{i}^{\prime}\|^{2}+\lambda_{3}\cancel{\nabla}\omega_{i}\|\cancel{\nabla}\mathbf{n}_{i}-\mathbf{n}_{i}^{\prime}\|^{2}\right), \tag{S1}$$

where $\omega_{i}$ is the volume rendering weight of the $i$-th sample, $\cancel{\nabla}$ denotes the stop-gradient operator, $\mathbf{n}_{i}$ and $\mathbf{n}_{i}^{\prime}$ are the $i$-th sample's density normals and predicted normals respectively (see [47]), and we set $\lambda_{1}=\lambda_{2}=10^{-3}$ and $\lambda_{3}=10^{-2}$. Third, like NeRF-Casting [49], we use an additional hash grid encoding [34] with 15 scales between resolutions of $32$ and $4096$, used only for outputting predicted normals. Fourth, we further encourage local smoothness of the predicted normals $\mathbf{n}^{\prime}$ with a smoothness loss similar to [36, 61]:

$$\mathcal{L}_{s}=\lambda_{4}\sum_{i}\omega_{i}\|\mathbf{n}^{\prime}(\mathbf{x}_{i}+\bm{\varepsilon})-\mathbf{n}^{\prime}(\mathbf{x}_{i})\|^{2}, \tag{S2}$$

where $\mathbf{x}_{i}$ is the 3D position of the $i$-th sample, and $\bm{\varepsilon}\sim\mathcal{N}(0,\sigma^{2}I)$ is an isotropic Gaussian random variable used to perturb the sample locations. We set $\lambda_{4}=0.1$ and $\sigma=0.01$.
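To make the role of the stop-gradient operator in Eq. (S1) concrete, the following worked derivative is a sketch that treats $\omega_{i}$, $\mathbf{n}_{i}$, and $\mathbf{n}_{i}^{\prime}$ as independent variables. This is an assumption made only for illustration, since in the actual model these quantities depend on shared network parameters. Under that assumption, each of the three terms passes a gradient to exactly one quantity:

```latex
\begin{aligned}
\frac{\partial \mathcal{L}_{p}}{\partial \omega_{i}}
  &= \lambda_{1}\,\|\mathbf{n}_{i}-\mathbf{n}_{i}^{\prime}\|^{2}, \\
\frac{\partial \mathcal{L}_{p}}{\partial \mathbf{n}_{i}}
  &= 2\lambda_{2}\,\omega_{i}\,(\mathbf{n}_{i}-\mathbf{n}_{i}^{\prime}), \\
\frac{\partial \mathcal{L}_{p}}{\partial \mathbf{n}_{i}^{\prime}}
  &= 2\lambda_{3}\,\omega_{i}\,(\mathbf{n}_{i}^{\prime}-\mathbf{n}_{i}).
\end{aligned}
```

With the stated weights, $\lambda_{3}=10\lambda_{2}$, so the predicted normals are pulled toward the density normals ten times more strongly than the density normals are pulled toward the predicted ones, which is the asymmetry the loss is named for.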

We find that these modifications yield more accurate and smoother geometry, which is necessary for our model to relight objects with specular highlights.

Finally, to incorporate the GLO embeddings, we utilize an MLP to predict an element-wise scale and shift value to be applied to the ‘bottleneck’ feature of UniSDF, similar to AffineGLO in Zip-NeRF [6].

For both geometry estimation and Latent NeRF optimization, we use the Adam [21] optimizer with $\beta_{1}=0.9$, $\beta_{2}=0.99$, and $\varepsilon=1\times 10^{-15}$. We decay the learning rate logarithmically from $5\times 10^{-3}$ to $5\times 10^{-4}$ over 25k training iterations, with a cosine-scheduled warmup over the first 500 steps. We optimize with a batch size of $2^{14}$ rays. The optimization takes around 45 minutes on 16 A100 40GB GPUs.
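The learning-rate schedule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `learning_rate` is hypothetical, and the exact shape of the warmup (a cosine ramp from 0 to 1 that multiplies the decayed rate) is an assumption, since the text only states that the warmup is cosine-scheduled.

```python
import math


def learning_rate(step, lr_init=5e-3, lr_final=5e-4,
                  max_steps=25_000, warmup_steps=500):
    """Log-linear decay from lr_init to lr_final with a cosine warmup ramp."""
    # Interpolate linearly in log space, so the rate shrinks by a
    # constant factor per step.
    t = min(max(step / max_steps, 0.0), 1.0)
    decayed = math.exp((1.0 - t) * math.log(lr_init) + t * math.log(lr_final))
    # ASSUMPTION: the warmup is a cosine ramp from 0 to 1 over warmup_steps.
    w = min(max(step / warmup_steps, 0.0), 1.0)
    ramp = 0.5 * (1.0 - math.cos(math.pi * w))
    return ramp * decayed


if __name__ == "__main__":
    for step in (0, 250, 500, 12_500, 25_000):
        print(step, f"{learning_rate(step):.3e}")
```

Halfway through training the decayed rate equals the geometric mean of the two endpoints, which is the defining property of a logarithmic (log-linear) decay.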

A.2 Radiance Cues

Geometry

To extract radiance cues, we first optimize UniSDF [52] on the input images. After optimization, we convert the SDF representation to a mesh using marching cubes [28] with the threshold set to zero.

Rendering

We use Blender Cycles [46], a physically based path tracer, to render the radiance cues. We run Blender via the Kubric Python wrapper [14], and we use the estimated geometry with the predefined materials based on the GGX material model [51], as described in Sec. 3.4.

Shading normals

To produce smoothly varying, realistic specular highlights, the normals used for shading must be smooth. By default, Blender computes shading normals from the input geometry, which may be noisy. To mitigate this, we can pass the predicted normals $\mathbf{n}^{\prime}$ described in Sec. A.1 to Blender and enable its shading-normal smoothing function, which smooths the predicted normals and uses them for shading. However, over-smoothing may harm the photorealism of the rendered shadows. See Fig. S3 for a qualitative comparison of radiance cues rendered without shading-normal smoothing (Fig. S1) and with it enabled (Fig. S2). In our implementation, we use a hybrid strategy: radiance cues without smoothing for the diffuse material and radiance cues with smoothing for the specular materials. Concretely, our final radiance cues consist of the first rendering in Fig. S1 and the three rightmost renderings in Fig. S2.

A.3 Relighting Diffusion Model

Figure S1: w/o smoothness.

Figure S2: w/ smoothness.

Figure S3: Effects of shading normal smoothing function.

We implement our relighting diffusion model in JAX [9] and illustrate its inference-time architecture in Fig. S4. We build upon a text-to-image latent diffusion model similar to that of Rombach et al. [41]. It denoises Gaussian noise of size $64\times 64\times 8$ and decodes the output latent features into a relit image of size $512\times 512\times 3$. The model is not conditioned on text: it receives only empty strings via a CLIP text encoder. During training, the base model is frozen.

Following ControlNet [59], we create a trainable copy of the base diffusion model's UNet encoder and middle blocks and attach it to the frozen base model through ZeroConv-based blocks. The given masked image and radiance cues are first fed through ConvNet 2 (see Fig. 4 in [57] for details) and ConvNet 3 (see Tab. S2). The resulting output is added to the latent noise after it has passed through ConvNet 4, which consists of a single convolution layer with kernel size 3, stride 1, padding 1, and 320 output channels. Because the trainable copy was designed for tokenized text input, the masked image is first fed through ConvNet 1 (see Tab. S1) to generate representative embeddings. To make the output of ConvNet 1 (64 tokens) compatible with the shape of the CLIP encoder's text output, we append zero-valued tensors, increasing the number of tokens to 77.
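The zero-padding step at the end of the paragraph above can be sketched in plain Python. The helper name `pad_to_text_length` is hypothetical, and placing the zero rows at the end of the token sequence is an assumption, since the text only says that zero-valued tensors are appended. The feature width of 1024 is taken from the last layer of Table S1.

```python
def pad_to_text_length(tokens, text_len=77):
    """Append all-zero rows so a token sequence matches the text-encoder length."""
    if not tokens:
        raise ValueError("expected at least one token")
    if len(tokens) > text_len:
        raise ValueError("sequence is longer than the target length")
    width = len(tokens[0])
    # ASSUMPTION: the zero rows go after the image tokens.
    padding = [[0.0] * width for _ in range(text_len - len(tokens))]
    return [list(row) for row in tokens] + padding


H = W = 512
num_tokens = (H // 64) * (W // 64)  # flattened spatial grid of ConvNet 1
image_tokens = [[1.0] * 1024 for _ in range(num_tokens)]
padded = pad_to_text_length(image_tokens)
print(len(image_tokens), "->", len(padded))
```

In an array library the same operation would be a single pad call along the token axis. Nested lists are used here only to keep the sketch dependency-free.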

Table S1: Structure of ConvNet 1 in Fig. S4. Each convolution layer is specified as (kernel size, stride, padding). We use SiLU [17] as the activation function between layers. Layer 8 uses zero initialization, while the other layers use Flax's [16] default initialization (https://github.com/google/flax/blob/144486b5fa7b3dfb/flax/core/nn/linear.py#L27). In our implementation, $H=W=512$.

| Index | Layer | Output Shape |
| --- | --- | --- |
| 0 (input) | - | $H\times W\times 4$ |
| 1 | (3, 1, 1) | $H\times W\times 16$ |
| 2-1 | (3, 1, 1) | $H\times W\times 16$ |
| 2-2 | (3, 2, 1) | $H/2\times W/2\times 32$ |
| 3-1 | (3, 1, 1) | $H/2\times W/2\times 32$ |
| 3-2 | (3, 2, 1) | $H/4\times W/4\times 64$ |
| 4-1 | (3, 1, 1) | $H/4\times W/4\times 64$ |
| 4-2 | (3, 2, 1) | $H/8\times W/8\times 128$ |
| 5-1 | (3, 1, 1) | $H/8\times W/8\times 128$ |
| 5-2 | (3, 2, 1) | $H/16\times W/16\times 256$ |
| 6-1 | (3, 1, 1) | $H/16\times W/16\times 256$ |
| 6-2 | (3, 2, 1) | $H/32\times W/32\times 512$ |
| 7-1 | (3, 1, 1) | $H/32\times W/32\times 512$ |
| 7-2 | (3, 2, 1) | $H/64\times W/64\times 512$ |
| 8 | (3, 1, 1) | $H/64\times W/64\times 1024$ |
| 9 | flatten | $(H/64\times W/64)\times 1024$ |

Table S2: Structure of ConvNet 3 in Fig. S4. Each convolution layer is specified as (kernel size, stride, padding). We use SiLU [17] as the activation function between layers. Layer 5 uses zero initialization, while the other layers use Flax's [16] default initialization (see footnote 2). In our implementation, $H=W=512$.

| Index | Layer | Output Shape |
| --- | --- | --- |
| 0 (input) | - | $H\times W\times 12$ |
| 1 | (3, 1, 1) | $H\times W\times 16$ |
| 2-1 | (3, 1, 1) | $H\times W\times 16$ |
| 2-2 | (3, 2, 1) | $H/2\times W/2\times 32$ |
| 3-1 | (3, 1, 1) | $H/2\times W/2\times 32$ |
| 3-2 | (3, 2, 1) | $H/4\times W/4\times 96$ |
| 4-1 | (3, 1, 1) | $H/4\times W/4\times 96$ |
| 4-2 | (3, 2, 1) | $H/8\times W/8\times 256$ |
| 5 | (3, 1, 1) | $H/8\times W/8\times 320$ |
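The output shapes listed in Tables S1 and S2 follow from the standard convolution size formula, $\lfloor (n + 2p - k)/s \rfloor + 1$. The sketch below re-derives them for $H=W=512$. The names `conv_out`, `trace_shapes`, `CONVNET_1`, and `CONVNET_3` are hypothetical and introduced only for this illustration. The layer lists are transcribed from the two tables.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(height, width, layers):
    """Return the (H, W, C) shape after every layer in `layers`."""
    shapes = []
    for (kernel, stride, padding), channels in layers:
        height = conv_out(height, kernel, stride, padding)
        width = conv_out(width, kernel, stride, padding)
        shapes.append((height, width, channels))
    return shapes


KEEP = (3, 1, 1)  # keeps the spatial resolution
DOWN = (3, 2, 1)  # halves the spatial resolution

# Layers 1 to 8 of Table S1 (the final flatten is not a convolution).
CONVNET_1 = [(KEEP, 16), (KEEP, 16), (DOWN, 32), (KEEP, 32), (DOWN, 64),
             (KEEP, 64), (DOWN, 128), (KEEP, 128), (DOWN, 256), (KEEP, 256),
             (DOWN, 512), (KEEP, 512), (DOWN, 512), (KEEP, 1024)]
# Layers 1 to 5 of Table S2.
CONVNET_3 = [(KEEP, 16), (KEEP, 16), (DOWN, 32), (KEEP, 32), (DOWN, 96),
             (KEEP, 96), (DOWN, 256), (KEEP, 320)]

h1, w1, c1 = trace_shapes(512, 512, CONVNET_1)[-1]
h3, w3, c3 = trace_shapes(512, 512, CONVNET_3)[-1]
print("ConvNet 1:", (h1, w1, c1), "->", h1 * w1, "tokens")
print("ConvNet 3:", (h3, w3, c3))
```

The trace shows why the two branches fit the rest of the model: ConvNet 1 ends in an $8\times 8$ grid that flattens to 64 tokens, and ConvNet 3 ends at the $64\times 64$ latent resolution with 320 channels, matching the output channels of ConvNet 4.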

We train the diffusion model using an approach similar to ControlNet [59], with a large dataset of synthetic objects rendered under multiple lighting conditions. Each training example for fine-tuning consists of a pair of images that view the same object with the same camera parameters, illuminated by two different environment maps (see Sec. A.4). We fine-tune the diffusion model to predict one of these two images, given the other image as well as the corresponding radiance cues rendered using the synthetic object's geometry. Note that for synthetic objects, we need neither to estimate the geometry $G$ nor to enable Blender's normal smoothing function when computing the radiance cues, since we already have the ground-truth meshes and their normals are smooth enough. We fine-tune the base model for 150k steps using a batch size of 256 examples and a learning rate of $10^{-4}$, which is linearly warmed up from 0 over the first 1k steps. The fine-tuning takes around 2 days on 32 TPUv5 chips. In addition, we always use the empty string as the text input, which effectively makes the fine-tuned model image-based.

At inference time, we use the DDPM scheduler [18] without classifier-free guidance to produce samples at $512\times 512$ resolution.

A.4 Training Data Processing

We use Objaverse [11] as the synthetic dataset. To filter out low-quality objects, we use the list from [45] (https://github.com/ashawkey/objaverse_filter/tree/dc9e7cd0df8626f30df02bb) to obtain an initial set of 156,330 objects. After additionally removing (semi-)transparent objects, we have a final set of 152,649 objects. If an object contains only geometry, we manually assign a homogeneous texture (ShaderNodeBsdfDiffuse) with a color uniformly sampled from $[0,1]^{3}$. Further, if the object has no material information, we assign it a Blender Glossy BSDF material (ShaderNodeBsdfGlossy), whose roughness value is uniformly sampled from $[0.02,0.5]$ and whose base color is set to that of the homogeneous texture. The mixing factor between the specular and diffuse materials (ShaderNodeMixShader) is uniformly sampled from $[0,1]$.
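The random material assignment described above can be sketched as a small sampler. This is an illustration only: the function name `sample_fallback_material` and the dictionary keys are hypothetical, the sketch covers the case of an object with neither texture nor material, and it returns plain parameters rather than calling the Blender API.

```python
import random


def sample_fallback_material(rng):
    """Sample parameters for an object that has neither texture nor material."""
    # Homogeneous color, uniform in [0, 1]^3, shared by both shaders.
    base_color = tuple(rng.random() for _ in range(3))
    return {
        "diffuse_color": base_color,                 # ShaderNodeBsdfDiffuse
        "glossy_color": base_color,                  # ShaderNodeBsdfGlossy
        "glossy_roughness": rng.uniform(0.02, 0.5),
        "mix_factor": rng.random(),                  # ShaderNodeMixShader
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
material = sample_fallback_material(rng)
for key, value in material.items():
    print(key, value)
```

Passing an explicit `random.Random` instance keeps the sampling reproducible per object, which is a design choice of this sketch rather than something stated in the text.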

As discussed in Sec. A.3, training our diffusion model requires image pairs under different lighting conditions. For this, we select 509 equirectangular environment maps from [56]. For each object, we sample four camera poses on a sphere centered around it. For each camera, we randomly sample two environment maps and augment them with a random horizontal shift, vertical flip, and RGB channel shuffle. We then use Blender's Cycles path tracer to render, for each environment map, an image of resolution $512\times 512$ with 512 samples per pixel, using a camera whose focal length is set to 512.
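The three environment-map augmentations named above can be sketched on a tiny nested-list image. The function name `augment_envmap` is hypothetical, and two details are assumptions: the horizontal shift wraps around (natural for an equirectangular map, whose horizontal axis is periodic), and the vertical flip is applied with probability 0.5.

```python
import random


def augment_envmap(envmap, rng):
    """Augment an equirectangular map given as rows of (r, g, b) tuples."""
    width = len(envmap[0])
    shift = rng.randrange(width)   # horizontal shift in pixels
    flip = rng.random() < 0.5      # ASSUMPTION: flip half of the time
    order = [0, 1, 2]
    rng.shuffle(order)             # random permutation of the RGB channels

    # Circular shift to the right; shift == 0 leaves each row unchanged.
    rows = [row[width - shift:] + row[:width - shift] for row in envmap]
    if flip:
        rows = rows[::-1]          # vertical flip reverses the row order
    return [[tuple(px[c] for c in order) for px in row] for row in rows]


# A 2 x 4 toy map in which every channel value is unique.
toy = [[(10 * (4 * y + x), 10 * (4 * y + x) + 1, 10 * (4 * y + x) + 2)
        for x in range(4)] for y in range(2)]
augmented = augment_envmap(toy, random.Random(0))
for row in augmented:
    print(row)
```

All three operations only rearrange existing values, so the augmented map contains exactly the same set of channel values as the input.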