License: arXiv.org perpetual non-exclusive license
arXiv:2402.01950v2 [cs.CV] 06 Mar 2024


ConRF: Zero-shot Stylization of 3D Scenes with Conditioned Radiation Fields

Xingyu Miao (1), Yang Bai (2), Haoran Duan (1), Fan Wan (1), Yawen Huang (3), Yang Long (1), Yefeng Zheng (3)

(1) Durham University, UK
(2) IHPC, A*STAR, Singapore
(3) Tencent Jarvis Research Center, China

https://xingy038.github.io/ConRF/

Abstract

Most existing works on arbitrary 3D NeRF style transfer require retraining for each style condition. This work aims to achieve zero-shot controlled stylization of 3D scenes, using text or visual input as conditioning factors. We introduce ConRF, a novel method for zero-shot stylization. Specifically, due to the ambiguity of CLIP features, we employ a conversion process that maps the CLIP feature space to the style space of a pre-trained VGG network and then refine the CLIP multi-modal knowledge into a style transfer neural radiance field. Additionally, we use a 3D volumetric representation to perform local style transfer. By combining these operations, ConRF can use either text or images as references, generating novel-view sequences with global or local stylization. Our experiments demonstrate that ConRF outperforms other existing methods for 3D scene and single-text stylization in terms of visual quality.

Keywords:
NeRF, Style transfer, VGG

1 Introduction

The use of 3D implicit neural radiance fields has led to significant progress in generating realistic scene representations [35], and one of the ongoing challenges in this field is applying various artistic styles to control these representations. Style transfer for 2D images based on neural networks has been extensively studied [13, 21, 12, 45, 30, 40], and state-of-the-art methods enable zero-shot arbitrary style transfer [19, 38, 49, 6]. Most recently, style transfer for the 3D modality has attracted increasing research attention, where the style is changed by shifting 2D image statistics [18, 20, 36]. Existing image-driven 3D scene stylization methods fall primarily into two types: zero-shot arbitrary style transfer methods, which need to be trained only once to perform any style transfer, and arbitrary style transfer methods, which require retraining for each new style. Zero-shot 3D style transfer has become more popular since it only requires side information as the style reference (e.g., images or text), without retraining for each new style.

Several works have explored zero-shot 3D style transfer [27, 5]. Although these methods can produce continuous and visually appealing 3D artistic scenes, they all require specific style references. Users who want to transfer a 3D scene to a specific art style usually need to spend time searching for a suitable reference image. Therefore, more recent work explores text-based style transfer [24, 3, 11], because text can convey the desired style more precisely without relying solely on reference images. However, there is no recent work on 3D zero-shot style transfer using text as the reference style, and most work only supports arbitrary style transfer. CLIPstyler [24] is a pioneer in text-based style transfer; however, it can only be applied to 2D images and needs to be retrained for each text, which is inefficient and impractical. In the realm of 3D implicit neural fields, CLIP-NeRF [46] is a groundbreaking work that uses text to control changes in 3D scenes. While this approach enables the manipulation of 3D scenes, its capabilities are limited to manipulating color attributes, and it cannot imbue artworks with distinctive stylistic elements. Although CLIP [39] has excellent image-text matching performance, it is not specifically designed for style matching, so directly leveraging CLIP lacks such information to a certain extent. Moreover, it is impractical to fine-tune CLIP for the style transfer task, because artistic style is subjective and there is no uniform standard.

Figure 1: Zero-shot 3D style transfer of single condition. Given a set of multi-view content images of a 3D scene, ConRF can transfer an arbitrary text reference style or an arbitrary image reference style to the 3D scene in a zero-shot manner.

Our goal is to map the CLIP feature space to the style space, simplifying the use of text or images as references to convey style. In this work, we introduce ConRF to achieve controllable zero-shot style transfer of neural radiance fields. We leverage the CLIP encoder to extract features, which are subsequently transformed into corresponding style features as input. These style features are then employed to transfer the style of the 3D scene. To achieve zero-shot transfer, we use only images to train the model in the training phase. In the inference phase, we can use text or images with the help of CLIP’s text-image matching capabilities. To this end, we use a mapping network to map the feature embedding obtained from the CLIP encoder to the VGG feature space widely used in 2D style transfer tasks. In addition, we introduce a novel 3D selection volume that allows local style transfer to be controlled via text. In this way, our method can perform style transfer in a variety of ways, including using content text with style text or content text with a style image to stylize 3D scenes. To summarize, the main contributions of this work are as follows:

  • We propose a novel method that leverages CLIP for zero-shot 3D scene artistic style transfer by a single condition (i.e. image or text).

  • We introduce a mapping network to alleviate the ambiguity in CLIP features related to style.

  • We present a 3D selection volume that allows for localized style manipulation within 3D scenes, expanding the possibilities in scene stylization and manipulation.

2 Related Work

2.1 Style Transfer on Image and Video

Artistic image stylization is a long-standing problem in computer vision. The primary objective is to naturally blend the content features from one image with the style features from another, resulting in a unique image that seamlessly incorporates both elements. Early traditional methods used hand-crafted techniques to simulate style [15, 16]. With the development of deep learning, [12] was the first work to use a CNN to separate and manipulate the content and style of images for style transfer. Since then, style transfer based on neural networks has gradually evolved from single-style transfer to multi-style transfer, and now to arbitrary style transfer [7, 19, 25, 26, 29, 38, 42, 48]. In video style transfer, enforcing temporal smoothness constraints defined on the optical flow can successfully transfer style to video [2, 17, 41, 47]. Nevertheless, all of these stylization methods, whether for images or videos, rely on a predetermined reference style, which substantially constrains the creative possibilities.

2.2 Style Transfer on NeRF

Recently, neural implicit representation methods, such as NeRF [28, 33, 50, 35, 51, 52, 53, 1, 31, 8, 32], have shown great potential for high-quality rendering. NeRF leverages multi-layer perceptrons (MLPs) to implicitly model continuous scenes, leading to impressive results in view synthesis. The emergence of NeRF has brought significant progress in the neural representation of 3D scenes. More recently, several works [5, 20, 37, 54, 57, 27, 4, 10] combine NeRF with neural style transfer [13] to handle artistic 3D scene stylization, while [22, 14] are based on diffusion models and can use text to stylize 3D scenes. In particular, [5, 37, 54] incorporate style loss functions from 2D images to fine-tune pre-trained NeRF models. Meanwhile, [27, 20] refine pre-trained NeRF models using a 2D style transfer model based on previous work. The approach presented in [4] employs a pre-trained 2D realistic network to constrain the realistic styles of different views and style images within a 3D scene. It’s worth noting, however, that the primary focus of this method is on transferring only the color tone of the style image. SNeRF [37] addresses the memory limitations associated with whole-image training in NeRF and enhances visual quality by employing a cyclic process of stylization and NeRF training. On the other hand, ARF [54] and Ref-NPR [56] aim to transfer detailed style features by matching features between style images and scenes. However, SNeRF, ARF, and Ref-NPR all entail a time-intensive optimization process for each reference style. While [10] and [20] leverage latent coding to achieve commendable results in style transfer, their methods are limited to known styles and cannot adapt to unseen ones. [5] can achieve arbitrary style transfer by implicitly injecting style information into MLP parameters; however, similar to [4], it can only transfer the color tone of the style image.

Most of the neural style transfer-based methods mentioned above require style reference images to steer the style transfer, whereas our approach can leverage text or images. To accomplish text-driven style transfer, we use CLIP-encoded text embeddings to manipulate the scene, leveraging only the text-image alignment already present in the pre-trained CLIP latent space. Furthermore, our work focuses only on artistic style transfer and does not have the creative capabilities of generative models.

3 Method

In this section, we introduce ConRF, a neural radiance field method for image-text style transfer. Our goal is to use text and images to infuse the overall style into a 3D scene effectively. ConRF achieves this through pre-trained CLIP. However, the text-image feature space of CLIP is usually ambiguous, as illustrated in Fig. 2. To alleviate this problem, we propose to leverage the feature space of the pre-trained VGG model [43], which is widely adopted in style transfer tasks, as prior knowledge to facilitate the transformation of the CLIP feature space into a style feature space. This approach enables us to use pre-trained CLIP directly for style information extraction.

Figure 2: Mitigating the ambiguity of CLIP features via the mapping module. Features obtained from the CLIP extractor capture clear high-level semantics, resulting in highly similar distributions for similar sunflowers or Shiba Inu with different styles, and therefore lack fine-level features (e.g., textures). In contrast, VGG features better reveal the differences between images of the same sunflower or Shiba Inu in different styles. To alleviate this problem, we use a mapping module to map CLIP’s feature space into a style space. Features in the style space reduce this ambiguity and encourage differentiation among similar feature distributions.

In addition, we propose a multi-spatial feature extraction strategy to achieve local 3D scene style transfer. This strategy adapts the image-level features of the 3D CLIP volume to pixel-level queries, enabling precise matching between text and points in the 3D scene. The training process of ConRF is shown in Fig. 3, and the inference process is shown in Fig. 4.

3.1 Preliminaries

Following previous work [27], we leverage the featured NeRF model to represent 3D scenes. In contrast to the original NeRF framework [35], for every queried 3D position $x \in \mathbb{R}^{3}$, the model furnishes not only an RGB color $c$ but also a volume density $\sigma(x)$ and a multi-channel feature vector $F(x) \in \mathbb{R}^{C}$, where $C$ signifies the number of feature channels. Subsequently, we compute the feature representation for any ray $\mathbf{r}$ intersecting the volume by integrating over sampled points along the ray, using an approximated volume rendering technique [35]:

$$F(\mathbf{r}) = \sum_{i=1}^{N} w_i F_i, \tag{1}$$

$$\text{where} \quad w_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)\left(1 - \exp(-\sigma_i \delta_i)\right), \tag{2}$$

where $\sigma_i$ and $F_i$ refer to the volume density and feature of the sampled point $i$, with $w_i$ indicating the weight of $F_i$ within the ray $\mathbf{r}$, and $\delta_i$ denoting the distance between adjacent samples. In addition, following [27], we use the same style transfer module to transfer style onto the content features of the pre-trained feature NeRF, which can be expressed as:

$$F_{cs} = F_c \times \sigma_I + w_{\mathbf{r}} \times \mu_I, \tag{3}$$

$$\text{where} \quad F_c = \sum_{i=1}^{N} w_i F_i, \quad w_{\mathbf{r}} = \sum_{i=1}^{N} w_i, \quad \mathbf{r} \in \mathcal{R}, \tag{4}$$

where $\sigma_I$ and $\mu_I$ denote the standard deviation and mean of the style image, $w_i$ denotes the weight assigned to sampled point $i$, $F_i$ stands for the feature associated with sample $i$, and $\mathcal{R}$ denotes the set of rays within each training batch.
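Eqs. 1 to 4 can be sketched for a single ray in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function names, the toy densities, and the array shapes are all assumptions made for the example.

```python
import numpy as np

def render_weights(sigma, delta):
    """Eq. 2: w_i = exp(-sum_{j<i} sigma_j * delta_j) * (1 - exp(-sigma_i * delta_i))."""
    tau = sigma * delta  # optical thickness of each sample, shape (N,)
    # Exclusive cumulative sum: the transmittance only sees samples before i.
    accumulated = np.concatenate([[0.0], np.cumsum(tau)[:-1]])
    return np.exp(-accumulated) * (1.0 - np.exp(-tau))

def render_feature(weights, features):
    """Eq. 1 / Eq. 4: F_c = sum_i w_i * F_i, with features shaped (N, C)."""
    return weights @ features

def stylize_ray(weights, features, sigma_style, mu_style):
    """Eq. 3: F_cs = F_c * sigma_I + w_r * mu_I, where w_r = sum_i w_i."""
    f_c = render_feature(weights, features)
    w_r = weights.sum()
    return f_c * sigma_style + w_r * mu_style

# Toy ray with N = 4 samples and C = 3 feature channels (all values made up).
sigma = np.array([0.0, 0.5, 2.0, 10.0])
delta = np.full(4, 0.25)
features = np.arange(12, dtype=float).reshape(4, 3)
weights = render_weights(sigma, delta)
stylized = stylize_ray(weights, features, sigma_style=np.ones(3), mu_style=np.zeros(3))
```

With an identity style (unit standard deviation, zero mean) the stylized feature equals the plain rendered feature, which is a convenient sanity check for the weights.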

Figure 3: The pipeline of ConRF. ConRF performs style transfer on a pre-trained feature NeRF. It consists of two branches, VGG and CLIP, which use the same style transfer module and share a decoder. The VGG branch uses a pre-trained VGG19 [43] to extract style features and weakly supervises the CLIP branch, which uses a CLIP image encoder to extract features, so that the mapping module of the CLIP branch learns to separate style features. Finally, the two branches jointly optimize the decoder to obtain a stylized image. Additionally, to enable local style transfer, we optimize an additional branch of the featured NeRF.

3.2 Global stylization

CLIP [39] represents a recent breakthrough in the realm of language-image pre-training. It employs two encoders, extensively trained on paired text and images, to construct a text-image embedding space. This space serves as a bridge connecting textual and image features, encompassing a diverse array of visual concepts. Given its ability to establish meaningful connections between text and image attributes, a natural question arises: can this feature space be harnessed to facilitate text-based style transfer? On the one hand, because CLIP is not specifically designed for style transfer tasks, it is challenging to directly decouple content and style features. On the other hand, directly fine-tuning CLIP for style transfer tasks is impractical due to the highly subjective nature of style and the absence of corresponding fine-tuning data. To this end, we propose a novel method that projects the text-image space onto a subspace of style features, so that text or images can be exchanged for style features and used to control style transfer.

Project CLIP space to style space

It is challenging to create a text representation that matches styles in the text-image space of CLIP because of insufficient data. Therefore, constructing a direct mapping from text style representations to style features through text input is not an effective method. However, because the features of text and images are shared in the text-image space, we can learn such a mapping using the style representation of images. To this end, we leverage a mapping module denoted as $f_{\theta}$. Initially, we use the CLIP image encoder $E_{\mathcal{I}}$ to extract the feature vector $F_s$ from the style image $I_s$:

$$F_s = E_{\mathcal{I}}(I_s). \tag{5}$$

Subsequently, we use the mapping module $f_{\theta}$ to establish a correspondence between the CLIP text-image feature vector $F_s$ and the style feature representation $\hat{F}_s$:

$$\hat{F}_s = f_{\theta}(F_s). \tag{6}$$

Following prior research [27, 49], styles are typically characterized by the mean ($\mu_I$) and standard deviation ($\sigma_I$), which can be expressed as:

$$(\sigma_I, \mu_I) = \hat{F}_s. \tag{7}$$

Consequently, the content style can be transferred utilizing the style module, in accordance with Eq. 4.
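The mapping in Eqs. 5 to 7 can be pictured with a small stand-in for $f_{\theta}$. The text does not specify the architecture of the mapping module, so everything in the sketch below is an assumption: the two-layer MLP, the 512-dimensional CLIP embedding, the 256 style channels (the channel count of VGG19 ReLU_3_1), the softplus that keeps $\sigma_I$ positive, and the random vector that stands in for $E_{\mathcal{I}}(I_s)$.

```python
import numpy as np

class MappingModule:
    """Hypothetical stand-in for f_theta: a two-layer MLP on a CLIP embedding."""

    def __init__(self, clip_dim=512, hidden_dim=512, style_channels=256, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.02, (clip_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        # The output holds sigma and mu side by side, hence 2 * style_channels.
        self.w2 = rng.normal(0.0, 0.02, (hidden_dim, 2 * style_channels))
        self.b2 = np.zeros(2 * style_channels)

    def __call__(self, clip_feature):
        hidden = np.maximum(clip_feature @ self.w1 + self.b1, 0.0)  # ReLU
        out = hidden @ self.w2 + self.b2                            # Eq. 6
        raw_sigma, mu = np.split(out, 2, axis=-1)                   # Eq. 7
        sigma = np.logaddexp(0.0, raw_sigma)  # softplus keeps sigma positive (assumption)
        return sigma, mu

f_theta = MappingModule()
clip_embedding = np.random.default_rng(1).normal(size=512)  # placeholder for E_I(I_s)
sigma_I, mu_I = f_theta(clip_embedding)
```

Because the module only consumes a CLIP embedding, the same call would accept a text embedding at inference time, which is the property the method relies on.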

Figure 4: The inference of ConRF. After the training phase, ConRF is equipped to apply 3D stylistic transformations directly using text or images. Additionally, users can input specific content selection prompts to control the stylized region.

Weak supervision for optimizing the style space

Previous 2D studies usually use a pre-trained VGG [43] encoder, which is widely employed in style transfer, to extract features from style images. To enhance the capabilities of our mapping module and enable the projection of the text-image space into the style space, we leverage a pre-trained VGG [43] for weak supervision. To this end, we employ a two-branch approach to optimize the mapping module. The VGG branch takes as input the features from the style transfer module and uses a pre-trained VGG [43] to extract style features. Meanwhile, the CLIP branch takes as input the features that the mapping module has projected. To train the CLIP branch, we introduce a style feature loss to ensure consistency between the output of the mapping module and the style features of the VGG branch, which can be expressed as:

$$\mathcal{L}_{f} = \left\| F_s^{v} - \hat{F}_s^{c} \right\|_2^2, \tag{8}$$

$$\text{where} \quad F_s^{v} = (\sigma_v, \mu_v), \quad \hat{F}_s^{c} = (\sigma_c, \mu_c), \tag{9}$$

where $F_s^{v}$ denotes the style features that the VGG branch extracts from the style image using the ReLU_3_1 layer of the pre-trained VGG [43], $\hat{F}_s^{c}$ represents the output of the mapping module of the CLIP branch, and $\sigma$ and $\mu$ represent the standard deviation and mean, respectively. Thus, $\mathcal{L}_{f}$ is:

$$\mathcal{L}_{f} = \left\| \sigma_v - \sigma_c \right\|_2^2 + \left\| \mu_v - \mu_c \right\|_2^2. \tag{10}$$
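As a rough illustration of Eq. 10, the sketch below computes channel-wise statistics from a feature map and compares them with a placeholder mapping-module output. It assumes that $(\sigma_v, \mu_v)$ are per-channel statistics over the spatial dimensions; the random array standing in for ReLU_3_1 activations, the small `eps`, and the function names are illustrative and not taken from the paper.

```python
import numpy as np

def channel_stats(feature_map, eps=1e-5):
    """Channel-wise (sigma, mu) of a feature map shaped (C, H, W)."""
    flat = feature_map.reshape(feature_map.shape[0], -1)
    mu = flat.mean(axis=1)
    sigma = np.sqrt(flat.var(axis=1) + eps)  # eps avoids a zero std (assumption)
    return sigma, mu

def style_feature_loss(sigma_v, mu_v, sigma_c, mu_c):
    """Eq. 10: squared L2 distance between VGG and mapped CLIP statistics."""
    return np.sum((sigma_v - sigma_c) ** 2) + np.sum((mu_v - mu_c) ** 2)

rng = np.random.default_rng(0)
vgg_features = rng.normal(size=(256, 32, 32))  # placeholder for ReLU_3_1 activations
sigma_v, mu_v = channel_stats(vgg_features)
sigma_c, mu_c = sigma_v + 0.1, mu_v - 0.1      # placeholder mapping-module output
loss = style_feature_loss(sigma_v, mu_v, sigma_c, mu_c)
```

The loss is zero exactly when the mapped CLIP statistics match the VGG statistics channel by channel, which is the sense in which the VGG branch supervises the mapping module.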

Furthermore, we adopt the same loss function employed in prior studies [49, 27, 19] for both the VGG and CLIP branches to train the shared-weight decoder. Specifically, the content loss, denoted as $\mathcal{L}_{c}$, is computed as the Mean Squared Error (MSE) of the feature map, while the style loss, represented as $\mathcal{L}_{s}$, is computed as the MSE of the mean and standard deviation of the channel features:

$$\mathcal{L}_{stylized} = \mathcal{L}_{c} + \lambda \mathcal{L}_{s}, \tag{11}$$

where $\lambda$ controls the balance between content preservation and stylization effects. It’s important to note that both $\mathcal{L}_{c}$ and $\mathcal{L}_{s}$ incorporate terms from both the VGG and CLIP branches. Finally, in order to reduce the uncertainty of the optimized decoder caused by using different feature extractors for the VGG and CLIP branches, we use a consistency loss, denoted as $\mathcal{L}_{consis} = \left\| I_s^{v} - I_s^{c} \right\|_1$, where $I_s^{v}$ is the final stylized image of the VGG branch and $I_s^{c}$ is the final stylized image of the CLIP branch.
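The decoder objectives can be sketched for one branch as follows. Several details are assumptions rather than statements from the paper: which feature maps the content loss compares, the value `lam=10.0`, the use of a sum (rather than a mean) for the $L_1$ norm, and all function names.

```python
import numpy as np

def channel_stats(feature_map):
    """Channel-wise (sigma, mu) of a feature map shaped (C, H, W)."""
    flat = feature_map.reshape(feature_map.shape[0], -1)
    return flat.std(axis=1), flat.mean(axis=1)

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def stylized_loss(output_feat, content_feat, style_feat, lam=10.0):
    """Eq. 11: L_stylized = L_c + lambda * L_s (lam=10.0 is a made-up default)."""
    loss_c = mse(output_feat, content_feat)           # MSE of the feature maps
    sigma_o, mu_o = channel_stats(output_feat)
    sigma_s, mu_s = channel_stats(style_feat)
    loss_s = mse(mu_o, mu_s) + mse(sigma_o, sigma_s)  # MSE of channel statistics
    return loss_c + lam * loss_s

def consistency_loss(image_vgg, image_clip):
    """L_consis = || I_s^v - I_s^c ||_1, summed over all pixels and channels."""
    return float(np.abs(image_vgg - image_clip).sum())

rng = np.random.default_rng(0)
output_feat = rng.normal(size=(8, 4, 4))
content_feat = rng.normal(size=(8, 4, 4))
style_feat = rng.normal(size=(8, 4, 4))
total = stylized_loss(output_feat, content_feat, style_feat)
```

In a two-branch setup this function would be evaluated once per branch and the results added, with the consistency term tying the two decoded images together.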

Figure 5: Comparison with four SOTA 3D style transfer methods using reference style images. For the four scenes in the example our method produces significantly better 3D style transfer.

3.3 Local stylization

Prior work [27] allowed different styles on different objects. However, this approach necessitated the provision of precise masks in advance to regulate the styles for achieving the desired combinations. Taking inspiration from recent promising prompt-based methods, we introduce a simple approach using text prompts to apply styles directly to specific 3D regions.

Utilizing CLIP features to weakly supervise a 3D selection volume

Given multi-view images of a scene and a selected textual content description, our objective is to perform partial style transfer on the reconstructed NeRF. This process assigns a matching style to each designated 3D point. To achieve this, we exploit the multimodal capabilities of the CLIP model to associate each 3D point with CLIP features and their corresponding semantic information. Specifically, we introduce an additional branch that produces a 3D CLIP feature volume, from which we render the CLIP feature of each ray $\mathbf{r}$ using volume rendering [35]:

$$F(\mathbf{r})_{\text{CLIP}} = \sum_{i=1}^{N} w_i F_i. \tag{12}$$

Next, we weakly supervise the CLIP features rendered from NeRF. We employ the CLIP image encoder to obtain content features $F_{con}$ and optimize the rendered features using the $L_1$ loss:

$$\mathcal{L}_{\text{CLIP}} = \left\| F(\mathbf{r})_{\text{CLIP}} - F_{con} \right\|_1. \tag{13}$$

However, since CLIP only generates image-level features, when using text for query, it can only match image-level features but not pixel-level features. Therefore, we segment training images into patches of varying sizes, extract CLIP features from these patches, and subsequently integrate these CLIP features to derive multi-spatial features and enable the 3D selection volume to learn the ability to be used for pixel-level feature queries.

Multi-spatial feature extraction strategy

We suggest an innovative approach that relies on multi-spatial strategies to enhance the adaptability of CLIP’s image-level features for pixel-level queries. The multi-spatial component is designed to extract features from patches, each containing pixels located at distinct positions. To achieve this, we employ the sliding window algorithm for multi-spatial feature extraction and then calculate the average of these multi-spatial features. First, we initialize two tensors $\mathcal{C}$ and $\mathcal{F}$ to record the counts and features within the sliding window. We adopt a sliding window approach to partition the input image. Within each window, we extract image data and use a CLIP image encoder to obtain features, denoted as $E_{\mathcal{I}}(x, y)$, where $x$ and $y$ represent pixel positions. The counts and features for each window are cumulatively updated in $\mathcal{C}$ and $\mathcal{F}$ via:

$$\mathcal{C}(x, y) \mathrel{+}= 1, \tag{14}$$

$$\mathcal{F}(x, y) \mathrel{+}= E_{\mathcal{I}}(x, y). \tag{15}$$

Finally, the normalized feature representation is obtained, which is expressed as:

$$\mathcal{F}_{normal}(x, y) = \frac{\mathcal{F}(x, y)}{\mathcal{C}(x, y)}, \tag{16}$$

where $\mathcal{F}_{normal}$ encapsulates the comprehensive average feature representation of the entire image, integrating information from all windows.
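The accumulation in Eqs. 14 to 16 can be sketched as below. The encoder here is a deliberately trivial placeholder (a per-channel mean) so that the example runs without CLIP; the window sizes, the stride, and the function names are likewise assumptions, since the text only states that patches of varying sizes are used.

```python
import numpy as np

def dummy_encoder(patch):
    """Placeholder for the CLIP image encoder: per-channel mean of the patch."""
    return patch.mean(axis=(0, 1))  # shape (D,), here D = 3

def multi_spatial_features(image, encoder, window_sizes=(4, 8), stride=2):
    height, width, _ = image.shape
    dim = encoder(image).shape[0]
    counts = np.zeros((height, width, 1))   # C in Eq. 14
    feats = np.zeros((height, width, dim))  # F in Eq. 15
    for size in window_sizes:
        for top in range(0, height - size + 1, stride):
            for left in range(0, width - size + 1, stride):
                patch = image[top:top + size, left:left + size]
                embedding = encoder(patch)  # one image-level feature per window
                # Every pixel inside the window receives the window's feature.
                counts[top:top + size, left:left + size] += 1
                feats[top:top + size, left:left + size] += embedding
    return feats / np.maximum(counts, 1)    # Eq. 16; guard for uncovered pixels

image = np.random.default_rng(0).random((16, 16, 3))
pixel_features = multi_spatial_features(image, dummy_encoder)
```

Each pixel ends up with the average of the features of all windows that contain it, so overlapping windows of different sizes give neighbouring pixels different, spatially varying features even though the encoder itself is image-level.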

Figure 6: Comparison with two SOTA style transfer methods using text prompt on Synthetic NeRF. For the two scenes in the example, our approach produces 3D style transfers that are significantly closer to textual descriptions.

3.4 Inference process

As shown in Fig. 4, once training is completed, we can use text to perform style transfer on 3D scenes. Specifically, because CLIP embeds text and images in a shared feature space, we can obtain the mean and variance of the style directly from a text input, that is,

$$(\sigma_{T},\mu_{T})=f_{\theta}(E_{\mathcal{T}}(T_{s})), \tag{17}$$

where $T_{s}$ denotes the text input, $E_{\mathcal{T}}$ denotes the CLIP text encoder, and $(\sigma_{T},\mu_{T})$ is the style representation obtained from the text description. Thus, using CLIP for style transfer can be expressed as:

$$F_{cs}=\sum_{i=1}^{N}w_{i}\left(F_{i}\times\sigma(\hat{F}_{s})+\mu(\hat{F}_{s})\right). \tag{18}$$
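As a rough illustration of Eq. 18, the sketch below applies the style statistics to every sample point on one ray and then sums with the rendering weights. The array shapes and the function name `stylize_ray` are assumptions for illustration only; in particular, we assume the style mean and deviation are per-channel vectors and that $w_i$ are the usual volume rendering weights.

```python
import numpy as np


def stylize_ray(point_feats, weights, sigma_s, mu_s):
    """Eq. 18 for a single ray.

    point_feats: (N, D) features F_i of the N sample points
    weights:     (N,)   rendering weights w_i
    sigma_s:     (D,)   per-channel style deviation
    mu_s:        (D,)   per-channel style mean
    """
    styled = point_feats * sigma_s + mu_s            # style each point, (N, D)
    return (weights[:, None] * styled).sum(axis=0)   # weighted sum, (D,)


# Usage with two sample points and two feature channels.
feats = np.array([[1.0, 2.0], [3.0, 4.0]])
w = np.array([0.25, 0.75])
f_cs = stylize_ray(feats, w, sigma_s=np.array([2.0, 2.0]), mu_s=np.array([1.0, 0.0]))
```

Because the affine style transform is applied before the weighted sum, every sample point is stylized individually, and the rendered feature is the weighted combination of those stylized points.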

Eq. 18 can be interpreted as transferring the style individually to every sampling point along the ray prior to volume rendering. Finally, we employ a decoder to project these stylized features back into RGB space, yielding stylized novel views. In addition, our local style transfer is optional and becomes active only once a content text prompt is entered. When the local style transfer branch is activated, we use the content text prompt and $F_{c}$ to calculate the similarity $z$:

$$z=\cos\left\langle F_{c},F_{\text{CLIP}}\right\rangle, \tag{19}$$

where $\cos\left\langle\cdot,\cdot\right\rangle$ is the cosine similarity. We can then obtain the mask $M$:

$$M=z\leq t, \tag{20}$$

where $t$ is the threshold. Thus, Eq. 18 can be replaced by:

$$\begin{split}F_{cs}^{\text{local}}={}&M\sum_{i=1}^{N}w_{i}\left(F_{i}\times\sigma(\hat{F}_{s1})+\mu(\hat{F}_{s1})\right)\\&+(1-M)\sum_{i=1}^{N}w_{i}\left(F_{i}\times\sigma(\hat{F}_{s2})+\mu(\hat{F}_{s2})\right).\end{split} \tag{21}$$
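The masking in Eqs. 19 to 21 can be sketched as follows. This is a simplified illustration under several assumptions: the mask is computed per ray, `content_feats` stands for $F_c$ and `prompt_feat` for $F_{\text{CLIP}}$ (the text does not spell out their shapes), each region uses the mean and deviation of a single style, and all function names are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity along the last axis (Eq. 19)."""
    a_n = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return (a_n * b_n).sum(axis=-1)


def stylize_rays(ray_feats, weights, sigma, mu):
    """Eq. 18 for a batch of rays: (R, N, D) features, (R, N) weights."""
    styled = ray_feats * sigma + mu
    return (weights[..., None] * styled).sum(axis=1)    # (R, D)


def local_stylize(ray_feats, weights, content_feats, prompt_feat, t, style1, style2):
    """Blend two styles with a thresholded similarity mask (Eqs. 20 and 21)."""
    z = cosine_similarity(content_feats, prompt_feat)    # (R,)
    mask = (z <= t).astype(ray_feats.dtype)[:, None]     # M, shape (R, 1)
    out1 = stylize_rays(ray_feats, weights, *style1)     # style s1
    out2 = stylize_rays(ray_feats, weights, *style2)     # style s2
    return mask * out1 + (1.0 - mask) * out2, mask


# Usage: two rays, one sample each; only the second ray falls under the threshold.
ray_feats = np.ones((2, 1, 2))
weights = np.ones((2, 1))
content_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
prompt_feat = np.array([1.0, 0.0])
style1 = (np.array([2.0, 2.0]), np.array([0.0, 0.0]))   # (sigma, mu) of s1
style2 = (np.array([1.0, 1.0]), np.array([5.0, 5.0]))   # (sigma, mu) of s2
out, mask = local_stylize(ray_feats, weights, content_feats, prompt_feat, 0.5, style1, style2)
```

The sketch follows the inequality direction of Eq. 20 as written, so rays whose similarity is at most $t$ receive the first style and the remaining rays receive the second.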

Our method supports both reference image style transfer and text prompt style transfer, enabling versatile multi-combination style transfers. After inputting content text prompts, our method allows for the use of various prompt combinations to transform different aspects. It supports combinations such as text-text, image-image, and text-image for style transfers.

Figure 7: Local style transfer results. We use the text prompt to select content, and then use images and text to perform style transfer on different areas.

4 Experiment

In this section, we assessed the proposed method through both qualitative and quantitative experiments. Given the versatility of our approach, which supports style transfer with text and image references, we conducted a comparison against the SOTA methods in these two aspects.

4.1 Qualitative Experiments

We conducted a comprehensive evaluation of ConRF using two publicly available datasets: LLFF [34], which comprises real scenes with intricate geometry, and Synthetic NeRF [33], featuring $360^{\circ}$ views of objects. For methods that rely on reference-style images, we benchmarked against the state-of-the-art StyleRF [27], Ref-NPR [56], SNeRF [37], and ARF [54]. In the case of methods employing text prompts, ConRF was evaluated against the 2D single-text condition method, CLIPStyler [24], and a 3D text control editing method, CLIP-NeRF [46].

| Methods | Short-range SSIM | Short-range LPIPS | Long-range SSIM | Long-range LPIPS |
|---|---|---|---|---|
| Ref-NPR [56] | 0.782 | 0.028 | 0.337 | 0.088 |
| SNeRF [37] | 0.852 | 0.029 | 0.401 | 0.107 |
| ARF [54] | 0.816 | 0.056 | 0.336 | 0.148 |
| StyleRF [27] | 0.495 | 0.059 | 0.050 | 0.217 |
| Our | 0.763 | 0.041 | 0.442 | 0.088 |
| CLIPStyler$^{*}$ [24] | 0.606 | 0.048 | 0.356 | 0.254 |
| CLIP-NeRF$^{*}$ [46] | 0.641 | 0.043 | 0.389 | 0.162 |
| Our$^{*}$ | 0.787 | 0.038 | 0.524 | 0.091 |

Table 1: Results on consistency. We compare ConRF with the state-of-the-art on consistency using LPIPS ($\downarrow$) and SSIM ($\uparrow$). ($*$) denotes style transfer using a text prompt. The best score is red, and the second score is blue.

The qualitative comparisons are shown in Fig. 5 and Fig. 6. Among methods that use reference style images, our method transfers the color and artistic style of the reference image while keeping the original content relatively complete. Although the single-style transfer methods SNeRF [37] and ARF [54] show better transfer results than our method (as in the Trex scene), our method can also transfer style features in localized regions and can perform arbitrary style transfer without retraining. In addition, Ref-NPR [56] depends heavily on an existing 2D style transfer method; to make the comparison fair, we directly use the style image as its reference style. Among methods that use text prompts, our results are closer to the text description than those of CLIPStyler [24] and CLIP-NeRF [46]. Although our method achieves good results, it can be further improved in some aspects, which are described in the limitations in the supplementary material. We present locally transferred images in Fig. 7. Our method uses text prompts to give direct control over specific sections of the content, and the selection volume lets users apply distinct styles within localized areas. At present, our approach serves as an initial exploration, but it holds great promise.

Figure 8: Ablation studies. (a) shows the stylization without the mapping module, (b) shows the stylization without style feature loss, (c) shows the stylization without consistency loss, and (d) shows the stylization of our full pipeline.

4.2 Quantitative Results

At present, 3D style transfer is a nascent and scarcely explored domain, with few metrics available for quantitatively evaluating stylization quality. Therefore, following [5, 4], we compare multi-view consistency: the short-range consistency score is computed between adjacent views, and the long-range consistency score between distant views. In our experiments, we warp one view to another based on optical flow [44] and then compute masked SSIM and LPIPS scores [55] to measure the stylization consistency.
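The warp-then-compare protocol can be illustrated with a small NumPy sketch. It is deliberately simplified and is not the evaluation code used in the paper: the flow field is supplied by hand rather than estimated with RAFT [44], sampling is nearest-neighbour, the mask only marks pixels whose correspondence falls inside the image, and a masked L1 error stands in for the masked SSIM and LPIPS scores. All function names are hypothetical.

```python
import numpy as np


def warp_with_flow(target, flow):
    """Warp `target` into the source frame with nearest-neighbour sampling.

    flow[y, x] = (dx, dy) sends source pixel (x, y) to (x + dx, y + dy) in target.
    Returns the warped image and a boolean mask of valid (in-bounds) pixels.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.rint(xs + flow[..., 0]).astype(int)
    yt = np.rint(ys + flow[..., 1]).astype(int)
    valid = (xt >= 0) & (xt < w) & (yt >= 0) & (yt < h)
    warped = np.zeros_like(target)
    warped[valid] = target[yt[valid], xt[valid]]
    return warped, valid


def masked_l1(a, b, mask):
    """Mean absolute difference over valid pixels (stand-in for masked SSIM/LPIPS)."""
    if not mask.any():
        return float("nan")
    return float(np.abs(a - b)[mask].mean())


# Usage: the second view is the first one shifted right by one pixel.
rng = np.random.default_rng(0)
view_a = rng.random((8, 8, 3))
view_b = np.roll(view_a, 1, axis=1)
flow = np.zeros((8, 8, 2))
flow[..., 0] = 1.0                      # every pixel moves one step to the right
warped_b, valid = warp_with_flow(view_b, flow)
error = masked_l1(view_a, warped_b, valid)
```

For a perfectly consistent pair of views the masked error is zero; inconsistent stylization across views shows up as a non-zero residual inside the mask.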

As shown in Tab. 1, we compare with six methods. Compared to our baseline StyleRF [27], our short-range LPIPS score improves by 31% and our long-range LPIPS score improves by 59.4%. Note that SNeRF [37] and Ref-NPR [56] outperform our method on short-range consistency. This is primarily because, in certain scenarios, the style transfer effect of SNeRF [37] is less pronounced, whereas Ref-NPR [56] directly transfers the style image's characteristics into the content. Among methods using text prompts, our method outperforms both CLIPStyler [24] and CLIP-NeRF [46] on all four scores.
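The improvement percentages over StyleRF can be reproduced from the LPIPS columns of Tab. 1. The short sketch below assumes the usual relative-reduction formula, (baseline - ours) / baseline, which the text does not state explicitly; the function name is illustrative.

```python
def relative_improvement(baseline: float, ours: float) -> float:
    """Relative reduction of a lower-is-better score, in percent."""
    return (baseline - ours) / baseline * 100.0


# LPIPS values for StyleRF [27] and our image-guided results, taken from Tab. 1.
short_range = relative_improvement(0.059, 0.041)
long_range = relative_improvement(0.217, 0.088)
print(f"short-range: {short_range:.1f}%, long-range: {long_range:.1f}%")
```

Under this formula the short-range value is about 30.5%, which rounds to the reported 31%, and the long-range value matches the reported 59.4%.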

Figure 9: Ablation studies for the multi-spatial feature extraction strategy. The text prompt is 'a flower', and the mask is generated using a selection volume constructed with different window sizes. When $size=32$, the mask is the cleanest.

4.3 Ablation Studies

To evaluate the effectiveness of our approach, we perform ablation experiments separately for global style transfer and local style transfer functions.

Global stylization

In Fig. 8, we present a comparison between our full system and its variants, where we removed specific modules individually: A) mapping module, B) style feature loss, and C) consistency loss. As illustrated, the absence of a mapping module impedes the transfer of style from any reference image. The lack of a style feature loss compromises the quality of the transferred style, even though some elements of the reference picture’s style may still be obtained. Moreover, the inclusion of a consistency loss noticeably enhances the performance of style transfer.

Local stylization

We control local style transfer through text prompts. As illustrated in Fig. 9, we conduct a comprehensive evaluation of the Multi-spatial feature extraction strategy by experimenting with different sliding window sizes. Our observations indicate that utilizing a window size of 32 leads to the generation of a notably refined mask, effectively achieving our goal of fine-grained style control in localized areas.

Figure 10: Limitation. While the red circle contains textual style information, the resultant overall outcome is unsatisfactory.

4.4 Limitations

Our model benefits from CLIP but also inherits certain limitations from it. For example, as shown in Fig. 10, we were not able to transfer the blue text style well because we used CLIP directly without modifying it. Our work currently focuses only on artistic style transfer and does not have creative capabilities based on generative models. Additionally, our local style transfer currently only works in forward-facing scenes. We expect future work to address these limitations.

5 Conclusion

In this work, we introduce ConRF, a novel approach for achieving zero-shot stylization using either image or text as a reference. ConRF maps the CLIP feature space to the VGG style space, enabling style transfer within the feature space of the scene, and can generate high-quality stylized novel views. Furthermore, we utilize a 3D volume to create local transfer effects, offering potential applications in artistic 3D design.

References

  • [1] Barron, J.T., Mildenhall, B., Tancik, M., Hedman, P., Martin-Brualla, R., Srinivasan, P.P.: Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5855–5864 (2021)
  • [2] Chen, D., Liao, J., Yuan, L., Yu, N., Hua, G.: Coherent online video style transfer. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1105–1114 (2017)
  • [3] Chen, J., Ji, B., Zhang, Z., Chu, T., Zuo, Z., Zhao, L., Xing, W., Lu, D.: Testnerf: Text-driven 3d style transfer via cross-modal learning
  • [4] Chen, Y., Yuan, Q., Li, Z., Liu, Y., Wang, W., Xie, C., Wen, X., Yu, Q.: Upst-nerf: Universal photorealistic style transfer of neural radiance fields for 3d scene. arXiv preprint arXiv:2208.07059 (2022)
  • [5] Chiang, P.Z., Tsai, M.S., Tseng, H.Y., Lai, W.S., Chiu, W.C.: Stylizing 3d scene via implicit representation and hypernetwork. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1475–1484 (2022)
  • [6] Deng, Y., Tang, F., Dong, W., Ma, C., Pan, X., Wang, L., Xu, C.: Stytr2: Image style transfer with transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11326–11336 (2022)
  • [7] Deng, Y., Tang, F., Dong, W., Sun, W., Huang, F., Xu, C.: Arbitrary style transfer via multi-adaptation network. In: Proceedings of the 28th ACM international conference on multimedia. pp. 2719–2727 (2020)
  • [8] Duan, H., Long, Y., Wang, S., Zhang, H., Willcocks, C.G., Shao, L.: Dynamic unary convolution in transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
  • [9] small yellow duck, W.K.: Painter by numbers (2016), https://kaggle.com/competitions/painter-by-numbers
  • [10] Fan, Z., Jiang, Y., Wang, P., Gong, X., Xu, D., Wang, Z.: Unified implicit neural stylization. In: European Conference on Computer Vision. pp. 636–654. Springer (2022)
  • [11] Fu, T.J., Wang, X.E., Wang, W.Y.: Language-driven artistic style transfer. In: European Conference on Computer Vision. pp. 717–734. Springer (2022)
  • [12] Gatys, L.A., Ecker, A.S., Bethge, M.: A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576 (2015)
  • [13] Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2414–2423 (2016)
  • [14] Haque, A., Tancik, M., Efros, A., Holynski, A., Kanazawa, A.: Instruct-nerf2nerf: Editing 3d scenes with instructions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2023)
  • [15] Hertzmann, A.: Painterly rendering with curved brush strokes of multiple sizes. In: Proceedings of the 25th annual conference on Computer graphics and interactive techniques. pp. 453–460 (1998)
  • [16] Hertzmann, A., Jacobs, C.E., Oliver, N., Curless, B., Salesin, D.H.: Image analogies. In: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. pp. 327–340. SIGGRAPH ’01, Association for Computing Machinery, New York, NY, USA (2001). https://doi.org/10.1145/383259.383295
  • [17] Huang, H., Wang, H., Luo, W., Ma, L., Jiang, W., Zhu, X., Li, Z., Liu, W.: Real-time neural style transfer for videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 783–791 (2017)
  • [18] Huang, H.P., Tseng, H.Y., Saini, S., Singh, M., Yang, M.H.: Learning to stylize novel views. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13869–13878 (2021)
  • [19] Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: Proceedings of the IEEE international conference on computer vision. pp. 1501–1510 (2017)
  • [20] Huang, Y.H., He, Y., Yuan, Y.J., Lai, Y.K., Gao, L.: Stylizednerf: consistent 3d scene stylization as stylized nerf via 2d-3d mutual learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18342–18352 (2022)
  • [21] Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14. pp. 694–711. Springer (2016)
  • [22] Kamata, H., Sakuma, Y., Hayakawa, A., Ishii, M., Narihira, T.: Instruct 3d-to-3d: Text instruction guided 3d-to-3d conversion. arXiv preprint arXiv:2303.15780 (2023)
  • [23] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [24] Kwon, G., Ye, J.C.: Clipstyler: Image style transfer with a single text condition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18062–18071 (2022)
  • [25] Li, X., Liu, S., Kautz, J., Yang, M.H.: Learning linear transformations for fast image and video style transfer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3809–3817 (2019)
  • [26] Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.H.: Universal style transfer via feature transforms. Advances in neural information processing systems 30 (2017)
  • [27] Liu, K., Zhan, F., Chen, Y., Zhang, J., Yu, Y., El Saddik, A., Lu, S., Xing, E.P.: Stylerf: Zero-shot 3d style transfer of neural radiance fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8338–8348 (2023)
  • [28] Liu, L., Gu, J., Zaw Lin, K., Chua, T.S., Theobalt, C.: Neural sparse voxel fields. Advances in Neural Information Processing Systems 33, 15651–15663 (2020)
  • [29] Liu, S., Lin, T., He, D., Li, F., Wang, M., Li, X., Sun, Z., Li, Q., Ding, E.: Adaattn: Revisit attention mechanism in arbitrary neural style transfer. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 6649–6658 (2021)
  • [30] Luan, F., Paris, S., Shechtman, E., Bala, K.: Deep photo style transfer. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4990–4998 (2017)
  • [31] Miao, X., Bai, Y., Duan, H., Huang, Y., Wan, F., Long, Y., Zheng, Y.: Ctnerf: Cross-time transformer for dynamic neural radiance field from monocular video. arXiv preprint arXiv:2401.04861 (2024)
  • [32] Miao, X., Bai, Y., Duan, H., Huang, Y., Wan, F., Xu, X., Long, Y., Zheng, Y.: Ds-depth: Dynamic and static depth estimation via a fusion cost volume. IEEE Transactions on Circuits and Systems for Video Technology (2023)
  • [33] Mildenhall, B., Hedman, P., Martin-Brualla, R., Srinivasan, P.P., Barron, J.T.: Nerf in the dark: High dynamic range view synthesis from noisy raw images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16190–16199 (2022)
  • [34] Mildenhall, B., Srinivasan, P.P., Ortiz-Cayon, R., Kalantari, N.K., Ramamoorthi, R., Ng, R., Kar, A.: Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG) 38(4), 1–14 (2019)
  • [35] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65(1), 99–106 (2021)
  • [36] Mu, F., Wang, J., Wu, Y., Li, Y.: 3d photo stylization: Learning to generate stylized novel views from a single image. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16273–16282 (2022)
  • [37] Nguyen-Phuoc, T., Liu, F., Xiao, L.: Snerf: stylized neural implicit representations for 3d scenes. arXiv preprint arXiv:2207.02363 (2022)
  • [38] Park, D.Y., Lee, K.H.: Arbitrary style transfer with style-attentional networks. In: proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5880–5888 (2019)
  • [39] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
  • [40] Risser, E., Wilmot, P., Barnes, C.: Stable and controllable neural texture synthesis and style transfer using histogram losses. arXiv preprint arXiv:1701.08893 (2017)
  • [41] Ruder, M., Dosovitskiy, A., Brox, T.: Artistic style transfer for videos and spherical images. International Journal of Computer Vision 126(11), 1199–1219 (2018)
  • [42] Sheng, L., Lin, Z., Shao, J., Wang, X.: Avatar-net: Multi-scale zero-shot style transfer by feature decoration. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 8242–8250 (2018)
  • [43] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  • [44] Teed, Z., Deng, J.: Raft: Recurrent all-pairs field transforms for optical flow. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16. pp. 402–419. Springer (2020)
  • [45] Ulyanov, D., Lebedev, V., Vedaldi, A., Lempitsky, V.: Texture networks: Feed-forward synthesis of textures and stylized images. arXiv preprint arXiv:1603.03417 (2016)
  • [46] Wang, C., Chai, M., He, M., Chen, D., Liao, J.: Clip-nerf: Text-and-image driven manipulation of neural radiance fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3835–3844 (2022)
  • [47] Wang, W., Xu, J., Zhang, L., Wang, Y., Liu, J.: Consistent video style transfer via compound regularization. In: Proceedings of the AAAI conference on artificial intelligence. vol. 34, pp. 12233–12240 (2020)
  • [48] Wu, X., Hu, Z., Sheng, L., Xu, D.: Styleformer: Real-time arbitrary style transfer via parametric style composition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14618–14627 (2021)
  • [49] Wu, Z., Zhu, Z., Du, J., Bai, X.: Ccpl: contrastive coherence preserving loss for versatile style transfer. In: European Conference on Computer Vision. pp. 189–206. Springer (2022)
  • [50] Xian, W., Huang, J.B., Kopf, J., Kim, C.: Space-time neural irradiance fields for free-viewpoint video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9421–9431 (2021)
  • [51] Xiangli, Y., Xu, L., Pan, X., Zhao, N., Rao, A., Theobalt, C., Dai, B., Lin, D.: Bungeenerf: Progressive neural radiance field for extreme multi-scale scene rendering. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXII. pp. 106–122. Springer (2022)
  • [52] Xu, Q., Xu, Z., Philip, J., Bi, S., Shu, Z., Sunkavalli, K., Neumann, U.: Point-nerf: Point-based neural radiance fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5438–5448 (2022)
  • [53] Yu, A., Ye, V., Tancik, M., Kanazawa, A.: pixelnerf: Neural radiance fields from one or few images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4578–4587 (2021)
  • [54] Zhang, K., Kolkin, N., Bi, S., Luan, F., Xu, Z., Shechtman, E., Snavely, N.: Arf: Artistic radiance fields. In: European Conference on Computer Vision. pp. 717–733. Springer (2022)
  • [55] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 586–595 (2018)
  • [56] Zhang, Y., He, Z., Xing, J., Yao, X., Jia, J.: Ref-npr: Reference-based non-photorealistic radiance fields for controllable scene stylization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4242–4251 (2023)
  • [57] Zhang, Z., Liu, Y., Han, C., Pan, Y., Guo, T., Yao, T.: Transforming radiance field with lipschitz network for photorealistic 3d scene stylization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20712–20721 (2023)

6 Supplementary

6.1 Implementation details

Our model is trained on a single RTX 3090 GPU using approximately 80,000 samples from the WikiArt dataset [9] as style images. During the training phase, the Adam optimizer [23] is employed with a learning rate of 1e-4. We set the loss weights for style and content to 20 and 1, respectively, while all other loss weights are kept at 1. For CLIP [39], we use the pre-trained ViT-B/32 model.

Mapping network architecture

The architecture of the mapping network is shown in Fig. 11.

Figure 11: Mapping network architecture. $F_{s}$ represents the style feature, which consists of the mean and standard deviation obtained by mapping the CLIP feature ($F_{\text{CLIP}}$). Meanwhile, $\hat{F}_{s}$ denotes the stylized feature.

| Method | Type | Need Retrain | Local Transfer | Guided with Image | Guided with Text |
|---|---|---|---|---|---|
| ARF [54] | S | | | | |
| SNeRF [37] | S | | | | |
| Ref-NPR [56] | M | | | | |
| StyleRF [27] | A | | | | |
| CLIP-NeRF [46] | A | | | | |
| Our | A | | | | |

Table 2: Comparison of 3D style transfer methods. We compare different types of 3D artistic style transfer methods. S = single-style transfer, M = multi-style transfer, and A = arbitrary style transfer.

6.2 User study

We conduct a user study, following previous research [27], where we compare our approach to a 3D style transfer baseline. We engage 25 participants with diverse demographic backgrounds, including different professions, age ranges, and ethnicities, to ensure a broad perspective. Each participant is shown a series of stylizations that include original scene videos, their stylized counterparts, and videos processed using both our ConRF method and the baseline technique.

Participants are tasked with two evaluations: identifying the stylized video that most accurately reflects the style of a given image, and selecting the video that exhibits the best multi-view consistency. For a comprehensive analysis, we prepare 24 unique scene-style pairings, which we then randomly distribute into 6 categories. Each participant evaluates one category, providing us with a diverse set of opinions on the effectiveness of our approach. The aggregated findings from this study are reported in Table 3.

| Methods | Stylization Score | Consistency Score |
|---|---|---|
| Ref-NPR [56] | 0.473 | 0.766 |
| SNeRF [37] | 0.526 | 0.696 |
| ARF [54] | 0.450 | 0.795 |
| StyleRF [27] | 0.510 | 0.800 |
| Our | 0.693 | 0.843 |
| CLIPStyler$^{*}$ [24] | 0.605 | 0.333 |
| CLIP-NeRF$^{*}$ [46] | 0.560 | 0.920 |
| Our$^{*}$ | 0.620 | 0.880 |

Table 3: User study. We compare ConRF with the state-of-the-art 3D style transfer baseline. ($*$) denotes style transfer using a text prompt. The best score is red, and the second score is blue.

Figure 12: Comparison with StyleRF [27] with different views on LLFF. Here, we present the flower scene style transfer results.

6.3 Additional Experiment

We present more qualitative results here and video results in the supplementary material.

6.3.1 Qualitative results

Text prompt style transfer comparison on LLFF database

As shown in Fig. 13, we additionally perform evaluations on the LLFF dataset [34].

Figure 13: Comparison with two SOTA style transfer methods using text prompt on LLFF. Here, we present the horns and room scenes style transfer results.
Figure 14: Local style transfer. The main text presents the transfer results of the text-image combination; here we show the transfer results of the text-text and image-image combinations.

Local style transfer qualitative results

As outlined in the main text, our model supports text-image, text-text, and image-image combinations to generate diverse stylization outcomes. The text-text and image-image combinations are shown in Fig. 14.

More qualitative results

To further demonstrate the performance of our method, we provide more visualization results in Fig. 12, Fig. 15, and Fig. 16.

Figure 15: Style transfer result on Synthetic NeRF dataset. We present the text and image style transfer results of the hotdog scene.
Figure 16: Style transfer result on Synthetic NeRF dataset. We present the text and image style transfer results of the mic scene.