License: arXiv.org perpetual non-exclusive license
arXiv:2403.06793v2 [cs.CV] 19 Mar 2024

Boosting Image Restoration via Priors from Pre-trained Models

Xiaogang Xu$^{1,2,3}$  Shu Kong$^{5,6,7}$  Tao Hu$^{3,8}$  Zhe Liu$^{1}$ (corresponding author)  Hujun Bao$^{1,4}$
$^{1}$Zhejiang Lab  $^{2}$CUHK  $^{3}$RealityEdge  $^{4}$Zhejiang University  $^{5}$University of Macau
$^{6}$Institute of Collaborative Innovation  $^{7}$Texas A&M University  $^{8}$National University of Singapore
Abstract

Pre-trained models with large-scale training data, such as CLIP and Stable Diffusion, have demonstrated remarkable performance in various high-level computer vision tasks such as image understanding and generation from language descriptions. Yet, their potential for low-level tasks such as image restoration remains relatively unexplored. In this paper, we explore such models to enhance image restoration. As off-the-shelf features (OSF) from pre-trained models do not directly serve image restoration, we propose to learn an additional lightweight module called Pre-Train-Guided Refinement Module (PTG-RM) to refine restoration results of a target restoration network with OSF. PTG-RM consists of two components, Pre-Train-Guided Spatial-Varying Enhancement (PTG-SVE), and Pre-Train-Guided Channel-Spatial Attention (PTG-CSA). PTG-SVE enables optimal short- and long-range neural operations, while PTG-CSA enhances spatial-channel attention for restoration-related learning. Extensive experiments demonstrate that PTG-RM, with its compact size ($<$1M parameters), effectively enhances restoration performance of various models across different tasks, including low-light enhancement, deraining, deblurring, and denoising.

1 Introduction

Image restoration plays a vital role in real-world scenarios, aiming to reconstruct high-quality images by eliminating degradations. It has broad applications in various fields, such as denoising [43, 44] and low-light enhancement [42, 41] for improving smartphone-captured photos. While effective restoration networks have been proposed [19, 52], the inherently ill-posed nature of image restoration makes it challenging to achieve significant improvements by merely modifying network structures. Simply increasing model parameters does not guarantee better results, as the model may tend to overfit to the training data.

Figure 1: Our method leverages pre-trained models, such as CLIP and Stable Diffusion, and significantly improves image restoration across various tasks. More results on different tasks/models can be seen in experiments. Pre-trained models are involved during the training and not required during the inference.
Figure 2: We present a lightweight plugin, the pre-training guided refinement module (PTG-RM), to leverage pre-trained models for enhancing image restoration. The desired prior is the OSF $\mathcal{G}(I_d)$. PTG-RM has two components: PTG spatial-varying enhancement (PTG-SVE) and PTG channel-spatial attention (PTG-CSA). Fig. 3 depicts their details. Our PTG-RM significantly improves restoration in various tasks as listed in the top-right (see quantitative results previewed in Fig. 1).

Restoration performance relies on strong image priors, such as the noise level in denoising [38] or the blur kernel in deblurring [14, 50]. However, estimating these priors is challenging, especially with real-world data. Some approaches utilize physical variables as priors, like depth information [46] and semantic features [41, 1, 36] derived from pre-trained networks. Nevertheless, these physical variables are not robust enough, since the dense depth/semantic prediction networks do not generalize sufficiently across different scenes in restoration tasks. As a result, employing them requires complex and specific mechanisms, limiting their applicability across various tasks. In this paper, we propose a novel approach that extracts degradation-related information from pre-trained models (with various training objectives) exposed to different degradations during pre-training, all without requiring explicit annotations.

Motivation. Two types of pre-trained models may capture degradation-related information during training: restoration models, and models pre-trained on large-scale data (e.g., CLIP [27], BLIP [16], and BLIP2 [17]). Using the former is the evident choice, but models trained with some types of degradation may not effectively help restore images with other types of degradation. Using the latter remains unexplored. CLIP-IQA [33] finds that CLIP features contain degradation-related information and can therefore be useful for image quality assessment, yet no restoration approaches have been proposed. Existing pre-trained multi-modality models may have been trained on various degraded images. Although restoration-related annotations are presumably unavailable during pre-training, the resulting features likely contain valuable information for image restoration. The key is to leverage such information to help the target restoration learning. However, the heterogeneity of pre-trained models and restoration models poses difficulties in using the off-the-shelf features extracted from pre-trained models.

Technical novelty. We introduce a novel pre-training guided refinement module (PTG-RM) that leverages off-the-shelf features (OSF) computed by a pre-trained model $\mathcal{G}$ to improve image restoration tasks. The PTG-RM $\mathcal{R}$ is a lightweight plugin (Fig. 2) that adds fewer than 1M parameters in total. PTG-RM enables us to determine optimal operation ranges and spatial-channel attention, thus facilitating image restoration. It takes as input the initially enhanced image from the target restoration network $\mathcal{F}$, the input image, and its OSF extracted by a pre-trained model. It is trained with $\mathcal{F}$ (using the same loss as $\mathcal{F}$) and adaptively enhances its output. PTG-RM $\mathcal{R}$ consists of two components: Pre-Train-Guided Spatial-Varying Enhancement (PTG-SVE), and Pre-Train-Guided Channel-Spatial Attention (PTG-CSA).

PTG-SVE employs spatial-varying operations to refine the initially enhanced results differently from region to region. Unlike previous methods [42] that rely on fixed references to determine optimal operation ranges, we establish a spatial-aware learnable mapping for OSF and utilize the mapped features as spatial-wise guidance. This adaptively fuses the features extracted from short- and long-range operations, allowing different regions to be refined appropriately and yielding more effective enhancement.

Following PTG-SVE, PTG-CSA further enhances the results by formulating effective channel- and spatial-attention with OSF. We note that different areas may require varying degrees of feature correction via the attention mechanism. Hence, we propose to generate spatial-varying convolution kernels to synthesize the spatial weights. Our approach tailors the attention process to different regions.

Contributions. We make three major contributions.

  • We present a novel and general method that leverages pre-trained models to enhance various restoration tasks. Our work opens up possibilities for improving performance across various domains.

  • We propose a novel paradigm that leverages pre-trained priors to formulate effective neural operation ranges and attention mechanisms.

  • We validate our method through extensive experiments on different datasets, networks, and tasks, and show remarkable improvements over prior methods (cf. Fig. 1).

2 Related Work

Image Priors for Restoration. Different restoration tasks demand distinct image priors, such as noise levels for denoising and blurring kernels for deblurring. Due to the ill-posed nature of restoration, estimating priors is difficult. In real-world scenarios, these priors are typically intertwined, adding further complexity to the restoration process. Recent literature introduces several methods to improve restoration by leveraging multi-modal maps as unified priors. These methods predominantly rely on pre-computed physical multi-modal maps. For instance, SKF [41] uses semantic maps to optimize the feature space for low-light enhancement. SMG [46] employs a generative framework to integrate edge, depth, and semantic information, enhancing the initial appearance modeling for low-light scenarios. Additionally, some approaches use Near-Infrared (NIR) information to refine imaging results [12, 32]. These priors are also applied to other restoration tasks, such as image denoising [20] and deraining [18]. However, aligning these priors with the input image can be challenging, and errors in the priors may adversely impact performance. Different from existing works, we propose to leverage pre-trained models as priors to enhance image restoration.

Pre-Trained Models for Downstream Tasks. Recently, a series of pre-trained models with large-scale training datasets have emerged, particularly in the form of multi-modal models such as CLIP [27], BLIP [16], and BLIP2 [17]. The feature space learned by these models offers rich knowledge that can benefit various tasks. While previous work has demonstrated the effectiveness of CLIP in high-level tasks like zero-shot classification [6, 53], image editing [25, 4], open-world segmentation [60, 39], and 3D classification [59, 47], its potential for aiding low-level restoration tasks remains unexplored. So far, only the use of such models for image quality assessment has been explored, as demonstrated in CLIP-IQA. We propose a general framework to leverage pre-trained models to improve various restoration tasks.

3 Methods

Background. Let $I_d$ represent a degraded image, and $I_c$ denote the corresponding ground truth (without degradation). A restoration network $\mathcal{F}$ produces the restored image $\hat{I}_c = \mathcal{F}(I_d)$. Although various effective network structures $\mathcal{F}$ have been proposed, there are currently upper bounds on performance in these tasks. Breaking through these bounds often requires designing more complex networks and training strategies, which can be arduous. Additionally, innovations in network architecture or training strategies for one task might not translate to another. While different priors $g$ have been introduced into the restoration process, including image and physical priors, estimating these priors is difficult.

Motivation. We hypothesize that the prior $g$ can be effectively represented as the feature extracted from various pre-trained models $\mathcal{G}$, as $g = \mathcal{G}(I_d)$. Note that $\mathcal{G}$ is typically not trained with restoration targets but might have been exposed to images with diverse degradations. So it is likely to learn useful information to help image restoration. We propose a novel approach that uses $g$ to improve the initial restoration by $\mathcal{F}$, even if these networks have already reached their current upper bounds.

Challenge. Using $g$ to assist $\mathcal{F}$ is non-trivial. Primarily, the feature $g$ is not inherently aligned with the restoration tasks because it might represent different aspects. For instance, features from CLIP focus more on semantic information, making direct alignment to restoration challenging. Moreover, these priors exhibit varying shapes, such as the one-dimensional (1D) features from the CLIP model, while the features in $\mathcal{F}$ are typically 2D. To reconcile the discrepancies in both representation and shape, we propose a refinement module $\mathcal{R}$ to refine the initial restoration by $\mathcal{F}$. This eliminates the need to align $g$ to distinct features of $\mathcal{F}$ and allows for a unified 1D representation for $g$. Furthermore, we introduce a novel approach to utilize $g$ to formulate optimal neural operating ranges via an effective attention mechanism in $\mathcal{R}$. This implicitly distills restoration-related information, effectively boosting the final performance.

Figure 3: The pipeline of PTG-SVE and PTG-CSA. In PTG-SVE, we use the learnable spatial embedding $\mathcal{S}_m$, OSF $g$, and input feature $f$ to adaptively formulate spatial weights ($M$, Eq. 2) for fusing the features $f_s$ and $f_l$ produced by the short- and long-range operations $\mathcal{R}_s$ and $\mathcal{R}_l$, yielding $\hat{f}$ (Eq. 3). In PTG-CSA, OSF $g$ conditions channel attention $M_c$ for $\hat{f}$ through $\mathcal{R}_c$ (Eq. 4). Additionally, $g$ combines with learnable spatial representation $\mathcal{S}_c$ and $\hat{f}$ to generate spatial attention map $M_s$, using spatial-wise convolutions $\mathcal{C}_p$ (obtained via $\mathcal{R}_p$) to derive $\hat{M}_s$ that is further processed with $\mathcal{R}_o$ (Eqs. 5 and 6). Channel- and spatial-attention outputs ($\hat{f}_c$ and $\hat{f}_s$) merge via $\mathcal{R}_f$ to yield the enhanced feature $\bar{f}$ (Eq. 7).

3.1 Overview of Refinement Module

Fig. 2 depicts the restoration pipeline using our method. Given an input image $I_d$, we have an initial restoration result $\hat{I}_c = \mathcal{F}(I_d)$. We aim to refine the result using the proposed pre-training guided refinement module (PTG-RM) $\mathcal{R}$, resulting in $\bar{I}_c = \mathcal{R}(\hat{I}_c, I_d, g)$. The key to this approach is to distill restoration-related information from the prior $g$.

$\mathcal{R}$ is a simple encoder-decoder structure. The encoder and decoder of $\mathcal{R}$ are denoted as $\mathcal{R}_e$ and $\mathcal{R}_d$, respectively. To ensure lightweight implementation, distillation occurs in the latent space, avoiding the need to align $g$ with restoration-related features. The latent feature $f$ is derived through a comparison between the initial enhanced results and the original input images, given as $f = \mathcal{R}_e(\hat{I}_c \oplus I_d)$, where $\oplus$ denotes the concatenation operation. The resulting $f$ is in $\mathbb{R}^{h \times w \times c}$, with $h$, $w$, and $c$ representing feature height, width, and channel number, respectively. The priors are used in further learning the latent feature as $\bar{f} = \mathcal{C}(\mathcal{A}(f, g), g)$, where $\mathcal{A}$ and $\mathcal{C}$ represent the Pre-Train-Guided Spatial-Varying Enhancement (PTG-SVE) and Pre-Train-Guided Channel-Spatial Attention (PTG-CSA) modules, respectively. The final enhancement is obtained from the decoder as $[I_m, I_r] = \mathcal{R}_d(\bar{f})$, comprising two components. The first component, $I_m$, represents the correction mask used to mitigate errors in the initial enhancement results. The second component, $I_r$, is the residual refinement that addresses artifacts and adds additional details. The final result is denoted as

$$\bar{I}_c = I_d + (\hat{I}_c - I_d) \times I_m + I_r. \tag{1}$$
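
To make Eq. 1 concrete, the following minimal sketch applies the correction mask and the residual element-wise. It is an illustration rather than the released implementation: plain nested Python lists stand in for image tensors, and the function name `compose_refined` is hypothetical.

```python
def compose_refined(i_d, i_hat, i_m, i_r):
    """Eq. 1: I_bar = I_d + (I_hat - I_d) * I_m + I_r, applied element-wise.

    All arguments are equally sized 2D lists (rows of pixel values).
    """
    return [
        [d + (h - d) * m + r for d, h, m, r in zip(row_d, row_h, row_m, row_r)]
        for row_d, row_h, row_m, row_r in zip(i_d, i_hat, i_m, i_r)
    ]


# A 1 x 2 toy image (values are illustrative, not taken from the paper).
i_d = [[0.2, 0.2]]    # degraded input
i_hat = [[0.8, 0.8]]  # initial restoration from F
i_m = [[1.0, 0.0]]    # correction mask: keep F's output at pixel 0, reject it at pixel 1
i_r = [[0.0, 0.1]]    # residual refinement
refined = compose_refined(i_d, i_hat, i_m, i_r)
```

Setting $I_m = 1$ and $I_r = 0$ everywhere reproduces the initial restoration $\hat{I}_c$, while $I_m = 0$ falls back to the degraded input $I_d$ plus the residual.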

3.2 Pre-Train-Guided Spatial-Varying Operations

In PTG-SVE, we argue that $g = \mathcal{G}(I_d)$ may contain information reflecting the pixel-level image quality of $I_d$. For areas with poor quality, long-range operations are used to capture non-local features, while regions with relatively good quality prioritize local features for accurate restoration.

As shown in Fig. 3, the primary objective of PTG-SVE is to predict the optimal neural operation range for each location of the feature map $f$, which we refer to as the "range score map", denoted as $M$. To ensure a general $\mathcal{R}$ with unified 1D priors $g$ from various models, we propose adding location-aware embeddings for the priors, thereby adaptively discovering quality information for different pixels. Let $S = \{(x, y) \mid x \in [1, w],\, y \in [1, h]\}$ represent the 2D coordinate map with dimensions $h \times w \times 2$. We use a position embedding module $\mathcal{P}$ to generate the spatial representation, denoted as $\mathcal{S}_m = \mathcal{P}(S)$, where $\mathcal{S}_m \in \mathbb{R}^{h \times w \times c}$. Furthermore, to determine the desired neural operation range for each location of $f$, we use a learnable mapping function $\mathcal{T}_m$ to transform the priors to another space that can more effectively decide the optimal range. To obtain $M$, we use a range-learning module $\mathcal{R}_m$, which takes the encoder's feature $f$, the pre-trained prior $g$, and the spatial representation $\mathcal{S}_m$ as inputs. The procedure is denoted as

$$M = \mathcal{R}_m(f \oplus \mathcal{T}_m(g) \oplus \mathcal{S}_m). \tag{2}$$
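
One way to read Eq. 2 is sketched below: the mapped 1D prior is broadcast to every location, concatenated with the local feature and the spatial embedding, and turned into a per-location score. This is only a sketch under stated assumptions. The text does not specify the layers, so $\mathcal{T}_m$ is assumed to be a single dense layer and $\mathcal{R}_m$ a per-pixel dense layer followed by a sigmoid (chosen because Eq. 3 uses $M$ and $1 - M$ as blending weights). All function names are hypothetical, and nested lists stand in for tensors.

```python
import math


def coordinate_map(h, w):
    """The coordinate map S: an h x w x 2 grid of 1-based (x, y) positions."""
    return [[[x, y] for x in range(1, w + 1)] for y in range(1, h + 1)]


def linear(vec, weights, bias):
    """Dense layer: `weights` holds one row of coefficients per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def range_score_map(f, g, s_m, t_w, t_b, r_w, r_b):
    """Eq. 2 sketch: M = R_m(f (+) T_m(g) (+) S_m), with (+) as channel concatenation.

    f, s_m: h x w x c nested lists; g: 1D prior vector.
    ASSUMPTION: T_m is one dense layer; R_m is a per-pixel dense layer + sigmoid.
    """
    g_mapped = linear(g, t_w, t_b)  # T_m(g), still a 1D vector
    score_map = []
    for f_row, s_row in zip(f, s_m):
        out_row = []
        for f_px, s_px in zip(f_row, s_row):
            feat = f_px + g_mapped + s_px  # broadcast the prior to this pixel, then concatenate
            logit = linear(feat, [r_w], [r_b])[0]
            out_row.append(1.0 / (1.0 + math.exp(-logit)))  # squash into (0, 1)
        score_map.append(out_row)
    return score_map


# Toy 1 x 2 x 1 feature map; s_m is a placeholder for the learned embedding P(S).
f = [[[1.0], [0.0]]]
s_m = [[[0.0], [0.0]]]
m = range_score_map(f, [2.0], s_m, [[0.5]], [0.0], [1.0, -1.0, 0.0], 0.0)
```

Because the same mapped prior is concatenated at every location, the per-location variation in $M$ comes from $f$ and $\mathcal{S}_m$, which is why the location-aware embedding is needed when $g$ is 1D.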

Following [42], we use a CNN for the short-range operation, denoted as $\mathcal{R}_s$, and a transformer for the long-range operation, denoted as $\mathcal{R}_l$. Specifically, we employ the Restormer backbone for $\mathcal{R}_l$ and ResNet for $\mathcal{R}_s$. Let the features after the short- and long-range operations be $f_s$ and $f_l$, respectively. We obtain the refined feature $\hat{f}$ as

$$f_s = \mathcal{R}_s(f), \quad f_l = \mathcal{R}_l(f), \quad \hat{f} = M \times f_s + (1 - M) \times f_l. \tag{3}$$
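
The blending step of Eq. 3 can be illustrated as follows. This is a sketch, assuming one scalar score per location that is shared across channels (the text does not state the channel dimension of $M$); the name `fuse_ranges` is hypothetical, and $f_s$ and $f_l$ are passed in rather than computed by a CNN or transformer.

```python
def fuse_ranges(m, f_s, f_l):
    """Eq. 3: f_hat = M * f_s + (1 - M) * f_l.

    m: h x w scores in [0, 1]; f_s, f_l: h x w x c nested lists.
    ASSUMPTION: each location's score is broadcast over all c channels.
    """
    fused = []
    for m_row, s_row, l_row in zip(m, f_s, f_l):
        fused_row = []
        for weight, s_px, l_px in zip(m_row, s_row, l_row):
            fused_row.append(
                [weight * s_val + (1.0 - weight) * l_val for s_val, l_val in zip(s_px, l_px)]
            )
        fused.append(fused_row)
    return fused


# Three locations: fully short-range, fully long-range, and a 25/75 mix.
m = [[1.0, 0.0, 0.25]]
f_s = [[[4.0, 8.0], [4.0, 8.0], [4.0, 8.0]]]
f_l = [[[0.0, 4.0], [0.0, 4.0], [0.0, 4.0]]]
f_hat = fuse_ranges(m, f_s, f_l)
```

A score near 1 keeps the short-range (local) feature and a score near 0 keeps the long-range (non-local) feature, so the two branches are mixed differently from region to region.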

The previous approach [42] relies on pre-computed SNR values, which may not always be accurate and can fail to enhance results, especially when the initial results from $\mathcal{F}$ have reached their upper bound. In contrast, our range score map is learned online based on the input image, restoration-related priors, and explicit spatial features that are learnable. This flexibility allows us to handle various situations, resulting in better performance and generalization (as demonstrated in the ablation study).

| Dataset | Method | Original PSNR | Original SSIM | +Ours-c PSNR | +Ours-c SSIM | +Ours-b PSNR | +Ours-b SSIM | +Ours-s PSNR | +Ours-s SSIM | +Ours-r PSNR | +Ours-r SSIM | +Ours-f PSNR | +Ours-f SSIM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LOL | UHD | 19.87 | 0.706 | 22.91 (+3.04) | 0.767 (+6.1) | 21.83 (+1.96) | 0.732 (+2.6) | 22.35 (+2.48) | 0.758 (+5.2) | 21.71 (+1.84) | 0.737 (+3.1) | 22.74 (+2.87) | 0.764 (+5.8) |
| LOL | URetinex | 21.16 | 0.840 | 24.70 (+3.54) | 0.878 (+3.8) | 23.57 (+2.41) | 0.869 (+2.9) | 24.23 (+3.07) | 0.866 (+2.6) | 23.99 (+2.83) | 0.862 (+2.2) | 24.56 (+3.40) | 0.870 (+3.0) |
| LOL | SNR | 21.48 | 0.849 | 25.50 (+4.02) | 0.892 (+4.3) | 25.61 (+4.13) | 0.891 (+4.2) | 25.19 (+3.71) | 0.874 (+2.5) | 25.24 (+3.76) | 0.887 (+3.8) | 24.90 (+3.42) | 0.888 (+3.9) |
| SID | UHD | 20.46 | 0.614 | 20.99 (+0.53) | 0.616 (+0.2) | 21.06 (+0.60) | 0.619 (+0.5) | 22.34 (+1.88) | 0.625 (+1.1) | 21.11 (+0.65) | 0.618 (+0.4) | 21.08 (+0.62) | 0.619 (+0.5) |
| SID | URetinex | 21.56 | 0.619 | 22.34 (+0.78) | 0.623 (+0.4) | 22.02 (+0.46) | 0.621 (+0.2) | 22.21 (+0.65) | 0.623 (+0.4) | 22.17 (+0.61) | 0.625 (+0.6) | 22.40 (+0.84) | 0.626 (+0.7) |
| SID | SNR | 22.87 | 0.625 | 23.34 (+0.47) | 0.630 (+0.5) | 23.15 (+0.28) | 0.627 (+0.2) | 23.08 (+0.21) | 0.631 (+0.6) | 23.06 (+0.19) | 0.632 (+0.7) | 23.17 (+0.30) | 0.636 (+1.1) |

Table 1: Comparisons on the LOL-real and SID datasets. The suffixes -c, -b, -s, and -r refer to using CLIP, BLIP2, Stable Diffusion, and restoration models trained on SDSD, respectively; -f denotes applying refinement on the features of $\mathcal{F}$. Values in parentheses (+) indicate improvements over the original model in PSNR and in SSIM ($\times$100).
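
The parenthesized gains in Table 1 can be reproduced from the raw scores. The short sketch below does so for the UHD row on LOL with CLIP priors (+Ours-c); the helper name `gains` is hypothetical.

```python
def gains(orig_psnr, orig_ssim, new_psnr, new_ssim):
    """Return (PSNR gain in dB, SSIM gain scaled by 100), rounded as in Table 1."""
    return round(new_psnr - orig_psnr, 2), round((new_ssim - orig_ssim) * 100, 1)


# UHD on LOL: original scores vs. +Ours-c scores from Table 1.
psnr_gain, ssim_gain = gains(19.87, 0.706, 22.91, 0.767)
```

Here `psnr_gain` matches the reported +3.04 and `ssim_gain` matches the reported +6.1.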

3.3 Pre-Train-Guided Attention

As shown in Fig. 3, we further introduce a lightweight component that utilizes pre-trained priors $g$ to create an effective attention mechanism in $\mathcal{R}$. Optimizing the feature attention in $\mathcal{R}$ is crucial for effectively identifying helpful features to enhance the initial results $\hat{I}_c$. This involves both spatial-level and channel-level attention. The hidden restoration-related information in $g$ can be uncovered by conditioning the restoration features in $\mathcal{R}$ on $g$.

We begin by formulating the attention computation at the channel level. We introduce a mapping function $\mathcal{T}_c$ to transform $g$ into the attention-prediction space, and utilize the channel attention computation module $\mathcal{R}_c$. The formulation of the channel attention is

$$M_c = \mathcal{R}_c(\mathcal{O}(\hat{f}) \oplus \mathcal{T}_c(g)), \quad \hat{f}_c = \hat{f} \times M_c, \tag{4}$$

where $\mathcal{O}$ is the pooling operation, and $M_c \in \mathbb{R}^{c}$.
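
A minimal sketch of Eq. 4 follows. The layer choices are assumptions, since the text only names a pooling operation $\mathcal{O}$ and a module $\mathcal{R}_c$: here $\mathcal{O}$ is global average pooling and $\mathcal{R}_c$ is a single dense layer followed by a sigmoid. The function name `channel_attention` is hypothetical, and the mapped prior $\mathcal{T}_c(g)$ is passed in directly.

```python
import math


def channel_attention(f_hat, g_mapped, weights, bias):
    """Eq. 4 sketch: M_c = R_c(O(f_hat) (+) T_c(g)); f_hat_c = f_hat * M_c.

    f_hat: h x w x c nested lists; g_mapped: T_c(g) as a 1D list.
    ASSUMPTION: O is global average pooling; R_c is a dense layer + sigmoid,
    with `weights` of shape c x (c + len(g_mapped)).
    """
    h, w, c = len(f_hat), len(f_hat[0]), len(f_hat[0][0])
    pooled = [
        sum(f_hat[y][x][k] for y in range(h) for x in range(w)) / (h * w)
        for k in range(c)
    ]
    feat = pooled + g_mapped  # concatenate pooled statistics with the mapped prior
    m_c = [
        1.0 / (1.0 + math.exp(-(sum(wt * v for wt, v in zip(row, feat)) + b)))
        for row, b in zip(weights, bias)
    ]
    # Scale every pixel's channels by the shared attention vector M_c.
    f_hat_c = [[[v * a for v, a in zip(px, m_c)] for px in row] for row in f_hat]
    return m_c, f_hat_c


# Toy 1 x 2 x 2 feature with zero weights, so every channel gets attention 0.5.
f_hat = [[[2.0, 4.0], [4.0, 8.0]]]
m_c, f_hat_c = channel_attention(f_hat, [1.0], [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0.0, 0.0])
```

Since $M_c \in \mathbb{R}^{c}$ holds one weight per channel, the same vector rescales every spatial location; location-dependent reweighting is left to the spatial attention described next.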

As for the spatial-attention computation, we utilize the 1D pre-trained prior $g$ to predict location-wise attention based on the feature distribution of each location in $\hat{f}$. Simply using the spatial location information, as shown in Eq. 2, results in each pixel's feature considering a similar condition for neighboring features, limiting the elimination of spatial artifacts. Therefore, we propose an alternative strategy by predicting the neural operation parameters for each location, optimizing the spatial attention based on the varying location-wise feature distribution. We denote the spatial attention computation module as $\mathcal{R}_p$, and first formulate the location-wise convolution map as

$$\mathcal{C}_p = \mathcal{R}_p(\hat{f}, \mathcal{T}_c(g), \mathcal{S}_c), \tag{5}$$

where the obtained convolution map $\mathcal{C}_p \in \mathbb{R}^{h \times w \times (k_h \times k_w \times c)}$, $k_h$ and $k_w$ are the convolution kernel sizes, and $\mathcal{S}_c$ is another learnable position embedding. The obtained convolution maps can be utilized to optimize the feature, and spatial attention can be obtained as

$$\hat{M}_s = \hat{f} * \mathcal{C}_p, \quad M_s = \mathcal{R}_o(\hat{M}_s), \tag{6}$$

where $*$ is the convolution operation for each location, and $\mathcal{R}_o$ is another learnable operation that maps the feature channel $c$ to 1, eliminating the influence of channel-level dependency. Further, the feature after spatial attention can be described as $\hat{f}_s = \hat{f} \times M_s$.
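
The location-wise convolution in Eq. 6 differs from an ordinary convolution in that every position has its own kernel. The sketch below illustrates this under assumptions not stated in the text: the $k_h \times k_w \times c$ weights at each location are applied depthwise (channel by channel) with zero padding, so $\hat{M}_s$ keeps $c$ channels before $\mathcal{R}_o$ reduces them to one. The function name `locationwise_conv` is hypothetical, and the kernels are passed in rather than predicted by $\mathcal{R}_p$.

```python
def locationwise_conv(f_hat, kernels, k_h, k_w):
    """Sketch of M_hat_s = f_hat * C_p with a distinct kernel per location.

    f_hat:   h x w x c nested lists.
    kernels: kernels[y][x][i][j][k] weights the neighbor at offset
             (i - k_h // 2, j - k_w // 2) from location (y, x) in channel k.
    ASSUMPTION: depthwise application with zero padding at the borders.
    """
    h, w, c = len(f_hat), len(f_hat[0]), len(f_hat[0][0])
    out = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for i in range(k_h):
                for j in range(k_w):
                    yy, xx = y + i - k_h // 2, x + j - k_w // 2
                    if 0 <= yy < h and 0 <= xx < w:  # neighbors outside the map count as zero
                        for k in range(c):
                            out[y][x][k] += kernels[y][x][i][j][k] * f_hat[yy][xx][k]
    return out


# A 1 x 3 x 1 feature with a different 1 x 3 kernel at each of the three locations.
f_hat = [[[1.0], [2.0], [3.0]]]
kernels = [[
    [[[0.0], [1.0], [0.0]]],  # location 0: identity (copy the center value)
    [[[1.0], [1.0], [1.0]]],  # location 1: sum over the 3-pixel neighborhood
    [[[1.0], [0.0], [0.0]]],  # location 2: copy the left neighbor
]]
m_hat_s = locationwise_conv(f_hat, kernels, 1, 3)
```

Because the kernel changes from pixel to pixel, two locations with identical neighborhoods can still receive different attention values, which is the property the spatial attention relies on.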

The features after spatial and channel attention can be merged via a fusion module as

f¯=f(f^cf^s),¯𝑓subscript𝑓direct-sumsubscript^𝑓𝑐subscript^𝑓𝑠\bar{f}=\mathcal{R}_{f}(\hat{f}_{c}\oplus\hat{f}_{s}),over¯ start_ARG italic_f end_ARG = caligraphic_R start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ( over^ start_ARG italic_f end_ARG start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ⊕ over^ start_ARG italic_f end_ARG start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) , (7)

where fsubscript𝑓\mathcal{R}_{f}caligraphic_R start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT denotes the fusin module. The obtained feature f¯¯𝑓\bar{f}over¯ start_ARG italic_f end_ARG can be processed via a decoder dsubscript𝑑\mathcal{R}_{d}caligraphic_R start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT to obtain the residual refinement Irsubscript𝐼𝑟I_{r}italic_I start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT and the mask Imsubscript𝐼𝑚I_{m}italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT as indicated in Eq. 1.

3.4 Loss Function

Our designed $\mathcal{R}$ can be jointly trained with the model $\mathcal{F}$. Suppose the paired ground truth for the input image $I_d$ is $\mathcal{I}_c$, and the loss function for the model $\mathcal{F}$ is denoted as $\mathcal{L}_g(\hat{I}_c,\mathcal{I}_c)$ (usually a pixel-level reconstruction loss or a perceptual loss, and possibly an unsupervised loss). Then the loss function for the refinement module can be written as $\mathcal{L}_g(\bar{I}_c,\mathcal{I}_c)$. In summary, the overall loss is

$\mathcal{L}_g(\hat{I}_c,\mathcal{I}_c)+\lambda_1\mathcal{L}_g(\bar{I}_c,\mathcal{I}_c),$ (8)

where $\lambda_1$ is the loss weight; a single setting works across various tasks and networks (in our experiments, $\lambda_1$ is always set to 1).
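As a small illustration of Eq. 8, the sketch below sums the same loss applied to the target model's output and to the refined output. The L1 loss is only an assumed stand-in for $\mathcal{L}_g$, which in practice is whatever loss the target model already uses; the function names `l1_loss` and `overall_loss` are hypothetical.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error; an assumed stand-in for the target model's loss L_g."""
    return float(np.mean(np.abs(pred - target)))

def overall_loss(I_hat_c, I_bar_c, I_gt, loss_fn=l1_loss, lambda_1=1.0):
    """Eq. 8: L_g(I_hat_c, I_gt) + lambda_1 * L_g(I_bar_c, I_gt).

    I_hat_c: output of the target restoration model F.
    I_bar_c: output after the refinement module R.
    I_gt:    paired ground truth.
    """
    return loss_fn(I_hat_c, I_gt) + lambda_1 * loss_fn(I_bar_c, I_gt)
```

Because both terms share `loss_fn`, swapping in a perceptual or unsupervised loss only requires passing a different callable with the same two-argument form.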

Methods SNR +SKF +SMG +SMG(dep) +Ours-c
PSNR 21.48 23.05 24.84 24.12 25.50
SSIM 0.849 0.853 0.880 0.851 0.892
Methods URetinex +SKF +SMG +SMG(dep) +Ours-c
PSNR 21.16 23.51 23.74 23.25 24.70
SSIM 0.840 0.856 0.852 0.849 0.878
+Params 0 2.15M 16.76M 16.76M 0.67M

Table 2: Quantitative comparison on the LOL-real dataset. +Params means the number of additional parameters compared with the original $\mathcal{F}$.

4 Experiments

We first introduce the tasks and datasets used in our experiments, followed by a detailed analysis of our method using low-light image enhancement as an example. We then demonstrate the effectiveness of our method on other tasks.

4.1 Tasks and Datasets

For low-light enhancement, we use the SID [5] and LOL-real [49] datasets. For deraining, we train on the Rain13K [52] dataset and test on the Rain100H [48], Rain100L [48], Test100 [55], Test1200 [54], and Test2800 [8] datasets. For Gaussian denoising, we use two settings: synthetic noise on Set12 [56], BSD68 [23], CBSD68 [23], Kodak [7], McMaster [58], and Urban100 [10]; and real-world denoising on SIDD [2]. For single-image motion deblurring, we train on the GoPro [24] dataset and evaluate on synthetic datasets (GoPro [24], HIDE [30]) and real-world datasets (RealBlur-R [28], RealBlur-J [28]). For defocus deblurring, we use the DPDD [3] training data and test on the EBDB [13] and JNB [31] datasets.

Figure 4: Comparisons on LOL-real (top) and SID (bottom). Panels in each row: Input, SNR, SNR+Ours, Ground-truth. Results with “Ours” have less noise and clearer visibility.
Table 3: Ablation study results. We adopt CLIP as the pre-trained model. “SP” denotes PTG-SVE; “CA” and “SA” denote the channel and spatial attentions in PTG-CSA. “Con.” means concatenation.
LOL-real SID
URetinex SNR URetinex SNR
PSNR SSIM PSNR SSIM PSNR SSIM PSNR SSIM
w/o SP, with CA and SA 23.45 0.868 24.25 0.886 21.98 0.619 23.02 0.620
with SP, w/o CA, with SA 22.10 0.856 24.05 0.875 22.05 0.623 22.93 0.624
with SP and CA, w/o SA 23.76 0.850 23.86 0.879 21.92 0.620 23.07 0.621
Large $\mathcal{R}$ w/o SP/CA/SA 22.74 0.857 24.51 0.881 22.06 0.621 23.04 0.627
w/o Position Embedding $\mathcal{S}$ 23.66 0.843 24.13 0.874 22.13 0.620 22.92 0.622
SNR Value as Mask 22.66 0.855 24.77 0.887 22.01 0.617 22.94 0.627
Use 1D Priors via Con. 23.01 0.853 23.83 0.878 22.07 0.622 22.93 0.628
Use 2D Priors via Con. 22.68 0.862 24.11 0.880 22.08 0.618 23.06 0.625
Full Setting 24.70 0.878 25.50 0.892 22.34 0.623 23.34 0.630
Datasets LOL-real SID
Methods ZeroDCE RUAS SCI ZeroDCE RUAS SCI
PSNR 18.06 18.37 20.28 18.08 18.44 19.09
SSIM 0.580 0.723 0.752 0.576 0.581 0.585
Methods +Ours-c +Ours-c +Ours-c +Ours-c +Ours-c +Ours-c
PSNR 18.79 19.53 21.62 18.65 18.93 19.61
SSIM 0.614 0.747 0.781 0.593 0.590 0.598
Table 4: Quantitative comparison on the LOL-real and SID datasets for unsupervised methods. We adopt CLIP as the pre-trained model here.

4.2 Low-light Image Enhancement

Comparison. We choose current SOTA low-light image enhancement methods as baselines (UHD [35], URetinex [40], SNR [42]) and apply our refinement module to them to see whether their performance can be improved. The priors are taken from CLIP [27], BLIP2 [17], Stable Diffusion [29], and pre-trained restoration models (trained on another dataset, such as SDSD [34, 45]). We denote these results as $-c$, $-b$, $-s$, and $-r$, respectively. In Table 1, we observe that combining these priors with our refinement module significantly improves the performance of the baselines. Additionally, Fig. 4 provides visual comparisons.

Moreover, we conducted an experiment by adding the refinement module to an intermediate layer of $\mathcal{F}$, refining the features of the target model. The refinement module is added to the deepest feature layer for efficiency, producing the residual feature map and the mask information for refinement. These results are denoted as $-f$. The improvement achieved by this operation is also evident, as displayed in Table 1.

Comparison with Other Priors. Some works, such as SKF [41] and SMG [46], utilize additional information like semantic maps, edge maps, and depth maps to improve low-light image enhancement results. However, these methods require supervision with paired multi-modal information, whereas our method does not. Additionally, as shown in Table 2, our approach achieves a larger performance improvement for a given target model. Notably, the improvements achieved by the other methods rely on a large number of additional parameters (2.15M to 16.76M), while our approach only uses a lightweight refinement module with fewer than 1M parameters.

Method Test100 Rain100H Rain100L
PSNR ↑ SSIM ↑ PSNR ↑ SSIM ↑ PSNR ↑ SSIM ↑
SPAIR 30.35 0.909 30.95 0.892 36.93 0.969
SPAIR+Ours-c 30.62 0.917 31.20 0.901 37.26 0.973
Restormer 32.00 0.923 31.46 0.904 38.99 0.978
Restormer+Ours-c 32.30 0.934 31.77 0.913 39.27 0.985
Method Test2800 Test1200 Average
PSNR ↑ SSIM ↑ PSNR ↑ SSIM ↑ PSNR ↑ SSIM ↑
SPAIR 33.34 0.936 33.04 0.922 32.91 0.926
SPAIR+Ours-c 33.58 0.942 33.35 0.924 33.16 0.932
Restormer 34.18 0.944 33.19 0.926 33.96 0.935
Restormer+Ours-c 34.47 0.951 33.48 0.929 34.24 0.943
Table 5: Image deraining results.
Figure 5: Visual comparison on Rain100H showing the effects of our strategy. Panels: Input, Ground-truth, Restormer, Restormer+Ours.

Ablation Study: Ablation of Different Components. We first run experiments that remove individual components from our framework: PTG-SVE (abbreviated as “SP”), and the prior-guided channel and spatial attentions (abbreviated as “CA” and “SA”, respectively). As shown in Table 3, removing any component leads to a performance drop.

We also conduct an experiment without SP, CA, and SA to analyze whether the additional parameters or the priors play the more prominent role. Here, the short-range and long-range results are fused via a simple sum, and the spatial-channel attention is computed using only the features themselves. Additionally, we increase the number of feature channels fourfold to add more parameters. The results, denoted as “Large $\mathcal{R}$ w/o SP/CA/SA” in Table 3, are still lower than our full setting, indicating that our proposed approach is more effective than simply increasing parameters.

In addition, we perform an experiment that removes the learnable position embeddings $\mathcal{S}_m$ and $\mathcal{S}_c$, denoted as “w/o Position Embedding $\mathcal{S}$” in Table 3. This comparison highlights the importance of using spatial-aware representations for the pre-trained features.

Ablation Study: SNR Value as Mask. In contrast to previous methods that directly use the SNR value as the mask to fuse the short- and long-range results, our approach utilizes pre-trained priors to automatically discover restoration-related information and formulate the fusion mask adaptively. In this ablation study, we demonstrate that our strategy outperforms the direct SNR-based approach, as shown in Table 3 (row “SNR Value as Mask”).
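For readers unfamiliar with the baseline in this ablation, the sketch below shows one simple way an SNR-style map can be computed and used as a fusion mask. It is only an assumed illustration in the spirit of SNR-based fusion [42]: the box-blur denoiser, the normalization to [0, 1], and the names `box_blur`, `snr_mask`, and `fuse` are hypothetical and are not taken from this paper.

```python
import numpy as np

def box_blur(gray, k=5):
    """Mean filter with edge padding; an assumed stand-in for a simple denoiser."""
    pad = k // 2
    padded = np.pad(gray, pad, mode="edge")
    h, w = gray.shape
    out = np.zeros((h, w), dtype=np.float64)
    for i in range(k):
        for j in range(k):
            out += padded[i:i + h, j:j + w]
    return out / (k * k)

def snr_mask(gray, k=5, eps=1e-3):
    """Per-pixel SNR estimate: smoothed signal over the absolute residual."""
    smooth = box_blur(gray, k)
    noise = np.abs(gray - smooth)
    snr = smooth / (noise + eps)
    # Normalize to [0, 1] so the map can act as a blending weight (assumption).
    return snr / (snr.max() + eps)

def fuse(short_range, long_range, mask):
    """High-SNR pixels favor short-range features, low-SNR pixels long-range ones."""
    return mask * short_range + (1.0 - mask) * long_range
```

The contrast with the proposed method is that this mask is fixed by image statistics, whereas the refinement module learns the fusion mask from the pre-trained features.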

Method GoPro HIDE RealBlur-R RealBlur-J
PSNR SSIM PSNR SSIM PSNR SSIM PSNR SSIM
MPRNet 32.66 0.959 30.96 0.939 35.99 0.952 28.70 0.873
MPRNet+Ours-c 32.87 0.964 31.19 0.943 36.25 0.960 28.98 0.881
Restormer 32.92 0.961 31.22 0.942 36.19 0.957 28.96 0.879
Restormer+Ours-c 33.18 0.966 31.51 0.950 36.47 0.962 29.21 0.883
Table 6: Single-image motion deblurring results.
Figure 6: Visual comparison on HIDE. Panels: Input, Ground-truth, Restormer, Restormer+Ours.
Figure 7: Visual comparison on single-image defocus deblurring. Panels: Input, Ground-truth, GRL, GRL+Ours.

Ablation Study: Alternatives of Using Priors. In this study, we demonstrate the difficulty of directly aligning priors with the restoration features. We conduct an experiment in which the priors are concatenated with the features in the refinement module to implement the different components. However, the improvement obtained with this direct approach is smaller than that of our proposed method, as shown in Table 3 (rows “Use 1D Priors via Con.” and “Use 2D Priors via Con.”). This is because the pre-trained features are heterogeneous with respect to the restoration features, even when the priors are adopted as 2D feature maps. This study highlights the importance of our strategy for employing these priors.

$\mathcal{R}$ for Unsupervised Approach. Unlike existing refinement methods that need supervision to learn the additional features (e.g., SKF needs the semantic ground truth of the normal-light data, and SMG needs the depth and edge information of the normal-light data), our approach does not require features of the normal-light data during either training or inference. We only need the feature extracted from $I_d$ with the pre-trained model $\mathcal{G}$ during training. Also, the loss function for training the refinement module can be set the same as that of the target model. Thus, the unsupervised training of the target model can also be adopted in our framework. As shown in Table 4, our method can successfully improve the performance of various unsupervised low-light image enhancement methods with different unsupervised loss terms, including EnGAN [11], ZeroDCE [9], RUAS [21], and SCI [22].

4.3 Other Restoration Tasks

In this section, we conduct experiments using CLIP as the pre-trained model ($-c$). CLIP is chosen for its efficiency and convenience compared to other pre-trained models.

Deraining. For deraining tasks, we use SOTA methods such as SPAIR [26] and Restormer [52] as baselines. We compute PSNR/SSIM values using the Y channel in the YCbCr color space, similar to existing methods. Table 5 demonstrates that our approach improves the performance of these existing methods and consistently achieves significant performance gains across all five datasets. The qualitative comparison results are shown in Fig. 5.
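The Y-channel evaluation mentioned above can be sketched as follows. This is a minimal example assuming RGB inputs scaled to [0, 1] and the ITU-R BT.601 luma conversion that is commonly used for this protocol; the exact conversion and any border cropping used in the paper's evaluation code are not specified here, and the function names are hypothetical.

```python
import numpy as np

def rgb_to_y(img):
    """Luma (Y) of YCbCr from an RGB image in [0, 1], BT.601 studio range [16, 235]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + 65.481 * r + 128.553 * g + 24.966 * b

def psnr_y(restored, reference, peak=255.0):
    """PSNR in dB computed on the Y channel only."""
    mse = np.mean((rgb_to_y(restored) - rgb_to_y(reference)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))
```

Evaluating on luma only de-emphasizes chroma errors, which is why Y-channel scores are not directly comparable with PSNR computed over all three RGB channels.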

Motion Deblurring. We analyze our approach for deblurring tasks on synthetic datasets (GoPro, HIDE) and real-world datasets (RealBlur-R, RealBlur-J). The baselines include MPRNet [51] and Restormer [52]. Table 6 demonstrates that our approach improves the performance of all these methods on all four benchmark datasets. Although the enhanced network is trained only on the GoPro dataset, it generalizes more robustly to the other datasets. Qualitative comparisons are shown in Fig. 6, further supporting our claim.

Defocus Deblurring. Table 7 presents the image fidelity scores of SOTA approaches on the DPDD dataset [3], including IFAN [15], Restormer [52], and GRL [19]. Our refinement module achieves significant performance improvement for these SOTA schemes in both single-image and dual-pixel defocus deblurring settings across all scene categories. The qualitative results are depicted in Fig. 7.

Table 7: Defocus deblurring comparisons on the DPDD testset (containing 37 indoor and 39 outdoor scenes). S: single-image defocus deblurring. D: dual-pixel defocus deblurring.
Method Indoor Scenes Outdoor Scenes Combined
PSNR SSIM LPIPS PSNR SSIM LPIPS PSNR SSIM LPIPS
IFAN$_S$ 28.11 0.861 0.179 22.76 0.720 0.254 25.37 0.789 0.217
IFAN$_S$+Ours-c 28.32 0.870 0.171 23.08 0.727 0.248 25.72 0.795 0.213
Restormer$_S$ 28.87 0.882 0.145 23.24 0.743 0.209 25.98 0.811 0.178
Restormer$_S$+Ours-c 29.17 0.890 0.141 23.43 0.749 0.206 26.13 0.816 0.165
GRL$_S$-B 29.06 0.886 0.139 23.45 0.761 0.196 26.18 0.822 0.168
GRL$_S$-B+Ours-c 29.30 0.894 0.133 23.67 0.768 0.189 26.45 0.828 0.161
IFAN$_D$ 28.66 0.868 0.172 23.46 0.743 0.240 25.99 0.804 0.207
IFAN$_D$+Ours-c 28.94 0.875 0.167 23.70 0.748 0.235 26.20 0.811 0.203
Restormer$_D$ 29.48 0.895 0.134 23.97 0.773 0.175 26.66 0.833 0.155
Restormer$_D$+Ours-c 29.79 0.902 0.131 24.23 0.778 0.155 26.89 0.840 0.153
GRL$_D$-B 29.83 0.903 0.114 24.39 0.795 0.150 27.04 0.847 0.133
GRL$_D$-B+Ours-c 29.96 0.911 0.110 24.62 0.803 0.145 27.27 0.855 0.128
Method Set12 BSD68 Urban100
$\sigma=15$ $\sigma=25$ $\sigma=50$ $\sigma=15$ $\sigma=25$ $\sigma=50$ $\sigma=15$ $\sigma=25$ $\sigma=50$
DRUNet 33.25 30.94 27.90 31.91 29.48 26.59 33.44 31.11 27.96
DRUNet+Ours-c 33.51 31.18 28.27 32.20 29.73 26.84 33.65 31.34 28.16
Restormer 33.35 31.04 28.01 31.95 29.51 26.62 33.67 31.39 28.33
Restormer+Ours-c 33.57 31.28 28.36 32.11 29.78 26.91 33.96 31.67 28.58
Restormer 33.42 31.08 28.00 31.96 29.52 26.62 33.79 31.46 28.29
Restormer+Ours-c 33.70 31.29 28.35 32.24 29.81 26.86 33.97 31.73 28.58
GRL-B 33.47 31.12 28.03 32.00 29.54 26.60 34.09 31.80 28.59
GRL-B+Ours-c 33.74 31.30 28.37 32.29 29.76 26.91 34.22 31.95 28.74
Table 8: Gaussian grayscale image denoising comparisons. Top super rows: learning a single model to handle various noise levels. Bottom super rows: training a separate model for each noise level.

Gaussian Denoising. We conduct denoising experiments on synthetic benchmark datasets with additive white Gaussian noise. We choose DRUNet [57], Restormer [52], and GRL [19] as baselines, which are SOTA approaches in denoising. Tables 8 and 9 present PSNR scores of the different approaches on grayscale and color image denoising, respectively, for noise levels of 15, 25, and 50. We evaluate two experimental settings: (1) learning a single model to handle various noise levels and (2) learning a separate model for each noise level. Our method improves all of these baselines under both experimental settings across the datasets and noise levels. The visual results in Fig. 8 show the effectiveness of our strategy.
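The synthetic degradation used in this setting can be sketched as below. The example assumes images scaled to [0, 1] and noise levels $\sigma$ expressed on the 0 to 255 intensity scale, which is the usual convention for $\sigma\in\{15, 25, 50\}$; the function name, the fixed seed, and the choice not to clip the noisy result are assumptions for illustration.

```python
import numpy as np

def add_gaussian_noise(img, sigma, seed=0):
    """Add zero-mean white Gaussian noise to an image in [0, 1].

    sigma is given on the 0-255 scale (e.g. 15, 25, 50) and rescaled to [0, 1].
    The result is left unclipped in this sketch.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=0.0, scale=sigma / 255.0, size=img.shape)
    return img + noise

# Build noisy inputs at the three evaluated noise levels for one clean image.
clean = np.full((256, 256), 0.5)
noisy_by_level = {s: add_gaussian_noise(clean, s) for s in (15, 25, 50)}
```

In the single-model setting one network sees all three levels during training, while in the separate-model setting each level gets its own network.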

Table 9: Gaussian color image denoising. Equivalent notation meanings (top and bottom rows) as those in Table 8.
Method CBSD68 Kodak24 McMaster Urban100
$\sigma=15$ $\sigma=25$ $\sigma=50$ $\sigma=15$ $\sigma=25$ $\sigma=50$ $\sigma=15$ $\sigma=25$ $\sigma=50$ $\sigma=15$ $\sigma=25$ $\sigma=50$
DRUNet 34.30 31.69 28.51 35.31 32.89 29.86 35.40 33.14 30.08 34.81 32.60 29.61
+Ours-c 34.54 31.97 28.76 35.58 33.15 29.97 35.71 33.50 30.25 35.10 32.82 29.83
Restormer 34.39 31.78 28.59 35.44 33.02 30.00 35.55 33.31 30.29 35.06 32.91 30.02
+Ours-c 34.63 32.04 28.88 35.65 33.26 30.15 35.86 33.64 30.63 35.26 33.22 30.21
Restormer 34.40 31.79 28.60 35.47 33.04 30.01 35.61 33.34 30.30 35.13 32.96 30.02
+Ours-c 34.76 32.05 28.94 35.72 33.27 30.21 35.80 33.63 30.55 35.32 33.14 30.27
GRL-B 34.45 31.82 28.62 35.43 33.02 29.93 35.73 33.46 30.36 35.54 33.35 30.46
+Ours-c 34.73 32.07 28.90 35.71 33.24 30.18 35.96 33.75 30.62 35.70 33.57 30.64
Dataset Method MPRNet MPRNet+Ours-c Uformer Uformer+Ours-c Restormer Restormer+Ours-c
SIDD PSNR ↑ 39.71 39.93 39.77 39.94 40.02 40.22
SIDD SSIM ↑ 0.958 0.961 0.959 0.962 0.960 0.965
Table 10: Real image denoising on the SIDD dataset.
Figure 8: Visual comparisons on Kodak (top) and SIDD (bottom). Panels: Input, Ground-truth, GRL, GRL+Ours (top); Input, Ground-truth, Restormer, +Ours (bottom).
Figure 9: The user study results show that our strategy can effectively improve the performance of restoration approaches in terms of human subjective evaluation.

Real Denoising. We also conduct denoising experiments on the real-world SIDD dataset, with MPRNet [51], Uformer [37], and Restormer [52] as baselines. Table 10 demonstrates that our refinement method improves both PSNR and SSIM. Notably, on the SIDD dataset, our refinement enables the SOTA approach Restormer to achieve a PSNR surpassing 40.2 dB. The visual comparison is shown in Fig. 8.

User Study. Furthermore, we conducted a large-scale user study with an A/B test strategy involving 80 participants. Each participant is shown two restored results side by side, i.e., baseline and baseline+ours, and asked to judge which one is better. As shown in Fig. 9, participants prefer the results combined with our strategy.

5 Conclusion

In this work, we explore the utilization of features from a pre-trained model to enhance the performance of a restoration model. By unifying the shapes of the pre-trained features, we introduce a novel refinement module PTG-RM that employs PTG-SVE and PTG-CSA mechanisms. Unlike existing strategies, we focus on formulating optimal operation ranges and attention strategies guided by the pre-trained features. The extensive experiments conducted on various tasks, datasets, and networks demonstrate the effectiveness and generalization ability of our approach. We believe that our proposed principle of discovering hidden useful information in pre-trained models can be applicable to other domains as well.

Limitation and Future Work. While our proposed strategy has exhibited significant effects in enhancing the performance of diverse restoration networks across various architectures with its lightweight module, the extent of improvement appears to vary across different experiments. Some instances showcase noticeable enhancement, while others do not. Such differences correlate with the capacity of the target network and the difficulty/complexity of the target task. In future endeavors, we intend to delve into more effective approaches that specifically aid target restoration tasks. We aim to employ a tailored distillation framework to derive refined restoration feature priors, ultimately making significant strides beyond existing upper boundaries. We also aim to develop corresponding technical products.

Acknowledgements. This work is supported by the Natural Science Foundation of Zhejiang Province, China, under No. LD24F020002. SK is partially supported by University of Macau (SRG2023-00044-FST).

References

  • Aakerberg et al. [2022] Andreas Aakerberg, Anders S Johansen, Kamal Nasrollahi, and Thomas B Moeslund. Semantic segmentation guided real-world super-resolution. In WACV, 2022.
  • Abdelhamed et al. [2018] Abdelrahman Abdelhamed, Stephen Lin, and Michael S Brown. A high-quality denoising dataset for smartphone cameras. In CVPR, 2018.
  • Abuolaim and Brown [2020] Abdullah Abuolaim and Michael S Brown. Defocus deblurring using dual-pixel data. In ECCV, 2020.
  • Avrahami et al. [2022] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In CVPR, 2022.
  • Chen et al. [2018] Chen Chen, Qifeng Chen, Jia Xu, and Vladlen Koltun. Learning to see in the dark. In CVPR, 2018.
  • Esmaeilpour et al. [2022] Sepideh Esmaeilpour, Bing Liu, Eric Robertson, and Lei Shu. Zero-shot out-of-distribution detection based on the pre-trained model clip. In AAAI, 2022.
  • Franzen [1999] Rich Franzen. Kodak lossless true color image suite. http://r0k.us/graphics/kodak/, 1999. Online accessed 24 Oct 2021.
  • Fu et al. [2017] Xueyang Fu, Jiabin Huang, Delu Zeng, Yue Huang, Xinghao Ding, and John Paisley. Removing rain from single images via a deep detail network. In CVPR, 2017.
  • Guo et al. [2020] Chunle Guo, Chongyi Li, Jichang Guo, Chen Change Loy, Junhui Hou, Sam Kwong, and Runmin Cong. Zero-reference deep curve estimation for low-light image enhancement. In CVPR, 2020.
  • Huang et al. [2015] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Single image super-resolution from transformed self-exemplars. In CVPR, 2015.
  • Jiang et al. [2021] Yifan Jiang, Xinyu Gong, Ding Liu, Yu Cheng, Chen Fang, Xiaohui Shen, Jianchao Yang, Pan Zhou, and Zhangyang Wang. Enlightengan: Deep light enhancement without paired supervision. TIP, 2021.
  • Jin et al. [2022] Shuangping Jin, Bingbing Yu, Minhao Jing, Yi Zhou, Jiajun Liang, and Renhe Ji. Darkvisionnet: Low-light imaging via rgb-nir fusion with deep inconsistency prior. In AAAI, 2022.
  • Karaali and Jung [2017] Ali Karaali and Claudio Rosito Jung. Edge-based defocus blur estimation with adaptive scale selection. TIP, 2017.
  • Kong and Fowlkes [2018] Shu Kong and Charless Fowlkes. Image reconstruction with predictive filter flow. arXiv preprint arXiv:1811.11482, 2018.
  • Lee et al. [2021] Junyong Lee, Hyeongseok Son, Jaesung Rim, Sunghyun Cho, and Seungyong Lee. Iterative filter adaptive network for single image defocus deblurring. In CVPR, 2021.
  • Li et al. [2022a] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022a.
  • Li et al. [2023a] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint, 2023a.
  • Li et al. [2022b] Yi Li, Yi Chang, Changfeng Yu, and Luxin Yan. Close the loop: a unified bottom-up and top-down paradigm for joint image deraining and segmentation. In AAAI, 2022b.
  • Li et al. [2023b] Yawei Li, Yuchen Fan, Xiaoyu Xiang, Denis Demandolx, Rakesh Ranjan, Radu Timofte, and Luc Van Gool. Efficient and explicit modelling of image hierarchies for image restoration. In CVPR, 2023b.
  • Liu et al. [2018] Ding Liu, Bihan Wen, Xianming Liu, Zhangyang Wang, and Thomas S Huang. When image denoising meets high-level vision tasks: A deep learning approach. In IJCAI, 2018.
  • Liu et al. [2021] Risheng Liu, Long Ma, Jiaao Zhang, Xin Fan, and Zhongxuan Luo. Retinex-inspired unrolling with cooperative prior architecture search for low-light image enhancement. In CVPR, 2021.
  • Ma et al. [2022] Long Ma, Tengyu Ma, Risheng Liu, Xin Fan, and Zhongxuan Luo. Toward fast, flexible, and robust low-light image enhancement. In CVPR, 2022.
  • Martin et al. [2001] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, 2001.
  • Nah et al. [2017] Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In CVPR, 2017.
  • Patashnik et al. [2021] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. In ICCV, 2021.
  • Purohit et al. [2021] Kuldeep Purohit, Maitreya Suin, AN Rajagopalan, and Vishnu Naresh Boddeti. Spatially-adaptive image restoration using distortion-guided networks. In ICCV, 2021.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
  • Rim et al. [2020] Jaesung Rim, Haeyun Lee, Jucheol Won, and Sunghyun Cho. Real-world blur dataset for learning and benchmarking deblurring algorithms. In ECCV, 2020.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  • Shen et al. [2019] Ziyi Shen, Wenguan Wang, Xiankai Lu, Jianbing Shen, Haibin Ling, Tingfa Xu, and Ling Shao. Human-aware motion deblurring. In ICCV, 2019.
  • Shi et al. [2015] Jianping Shi, Li Xu, and Jiaya Jia. Just noticeable defocus blur detection and estimation. In CVPR, 2015.
  • Wan et al. [2022] Renjie Wan, Boxin Shi, Wenhan Yang, Bihan Wen, Ling-Yu Duan, and Alex C Kot. Purifying low-light images via near-infrared enlightened image. TMM, 2022.
  • Wang et al. [2023a] Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. In AAAI, 2023a.
  • Wang et al. [2021] Ruixing Wang, Xiaogang Xu, Chi-Wing Fu, Jiangbo Lu, Bei Yu, and Jiaya Jia. Seeing dynamic scene in the dark: A high-quality video dataset with mechatronic alignment. In ICCV, 2021.
  • Wang et al. [2023b] Tao Wang, Kaihao Zhang, Tianrun Shen, Wenhan Luo, Bjorn Stenger, and Tong Lu. Ultra-high-definition low-light image enhancement: A benchmark and transformer-based method. In AAAI, 2023b.
  • Wang et al. [2018] Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Recovering realistic texture in image super-resolution by deep spatial feature transform. In CVPR, 2018.
  • Wang et al. [2022a] Zhendong Wang, Xiaodong Cun, Jianmin Bao, and Jianzhuang Liu. Uformer: A general u-shaped transformer for image restoration. In CVPR, 2022a.
  • Wang et al. [2022b] Zejin Wang, Jiazheng Liu, Guoqing Li, and Hua Han. Blind2unblind: Self-supervised image denoising with visible blind spots. In CVPR, 2022b.
  • Wang et al. [2022c] Zhaoqing Wang, Yu Lu, Qiang Li, Xunqiang Tao, Yandong Guo, Mingming Gong, and Tongliang Liu. Cris: Clip-driven referring image segmentation. In CVPR, 2022c.
  • Wu et al. [2022] Wenhui Wu, Jian Weng, Pingping Zhang, Xu Wang, Wenhan Yang, and Jianmin Jiang. Uretinex-net: Retinex-based deep unfolding network for low-light image enhancement. In CVPR, 2022.
  • Wu et al. [2023] Yuhui Wu, Chen Pan, Guoqing Wang, Yang Yang, Jiwei Wei, Chongyi Li, and Heng Tao Shen. Learning semantic-aware knowledge guidance for low-light image enhancement. In CVPR, 2023.
  • Xu et al. [2022a] Xiaogang Xu, Ruixing Wang, Chi-Wing Fu, and Jiaya Jia. Snr-aware low-light image enhancement. In CVPR, 2022a.
  • Xu et al. [2022b] Xiaogang Xu, Yitong Yu, Nianjuan Jiang, Jiangbo Lu, Bei Yu, and Jiaya Jia. Pvdd: A practical video denoising dataset with real-world dynamic scenes. arXiv preprint, 2022b.
  • Xu et al. [2022c] Xiaogang Xu, Hengshuang Zhao, Philip Torr, and Jiaya Jia. General adversarial defense against black-box attacks via pixel level and feature level distribution alignments. arXiv preprint, 2022c.
  • Xu et al. [2023a] Xiaogang Xu, Ruixing Wang, Chi-Wing Fu, and Jiaya Jia. Deep parametric 3d filters for joint video denoising and illumination enhancement in video super resolution. In AAAI, 2023a.
  • Xu et al. [2023b] Xiaogang Xu, Ruixing Wang, and Jiangbo Lu. Low-light image enhancement via structure modeling and guidance. In CVPR, 2023b.
  • Xue et al. [2023] Le Xue, Mingfei Gao, Chen Xing, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. Ulip: Learning unified representation of language, image and point cloud for 3d understanding. In CVPR, 2023.
  • Yang et al. [2017] Wenhan Yang, Robby T Tan, Jiashi Feng, Jiaying Liu, Zongming Guo, and Shuicheng Yan. Deep joint rain detection and removal from a single image. In CVPR, 2017.
  • Yang et al. [2021] Wenhan Yang, Wenjing Wang, Haofeng Huang, Shiqi Wang, and Jiaying Liu. Sparse gradient regularized deep Retinex network for robust low-light image enhancement. TIP, 2021.
  • Yang et al. [2023] Yan Yang, Liyuan Pan, Liu Liu, and Miaomiao Liu. K3dn: Disparity-aware kernel estimation for dual-pixel defocus deblurring. In CVPR, 2023.
  • Zamir et al. [2021] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Multi-stage progressive image restoration. In CVPR, 2021.
  • Zamir et al. [2022] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. In CVPR, 2022.
  • Zhai et al. [2022] Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. In CVPR, 2022.
  • Zhang and Patel [2018] He Zhang and Vishal M Patel. Density-aware single image de-raining using a multi-stream dense network. In CVPR, 2018.
  • Zhang et al. [2019] He Zhang, Vishwanath Sindagi, and Vishal M Patel. Image de-raining using a conditional generative adversarial network. TCSVT, 2019.
  • Zhang et al. [2017] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. TIP, 2017.
  • Zhang et al. [2021] Kai Zhang, Yawei Li, Wangmeng Zuo, Lei Zhang, Luc Van Gool, and Radu Timofte. Plug-and-play image restoration with deep denoiser prior. TPAMI, 2021.
  • Zhang et al. [2011] Lei Zhang, Xiaolin Wu, Antoni Buades, and Xin Li. Color demosaicking by local directional interpolation and nonlocal adaptive thresholding. JEI, 2011.
  • Zhang et al. [2022] Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. Pointclip: Point cloud understanding by clip. In CVPR, 2022.
  • Zhou et al. [2023] Ziqin Zhou, Yinjie Lei, Bowen Zhang, Lingqiao Liu, and Yifan Liu. Zegclip: Towards adapting clip for zero-shot semantic segmentation. In CVPR, 2023.