Training-Free Large Model Priors for Multiple-in-One Image Restoration

Xuanhua He, Lang Li, Yingying Wang, Hui Zheng, Ke Cao, Keyu Yan,
Rui Li, Chengjun Xie, Jie Zhang, Man Zhou
This work was supported by the National Natural Science Foundation of China under grant number 32171888 and the HFIPS Director’s Fund under grant No. 2023YZGH04. Xuanhua He and Lang Li contributed equally. Corresponding authors: Jie Zhang and Man Zhou. Xuanhua He, Lang Li, Keyu Yan and Ke Cao are with Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, and also with the University of Science and Technology of China, Hefei 230026 (e-mail: hexuanhua, caoke200820, [email protected]). Yingying Wang and Hui Zheng are with the Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Xiamen University, Xiamen 361102, China (e-mail: wangyingying7, [email protected]). Man Zhou is with the University of Science and Technology of China, China (e-mail: [email protected], [email protected]). Jie Zhang, Rui Li and Chengjun Xie are with the Intelligent Agriculture Engineering Laboratory of Anhui Province, Institute of Intelligent Machines, and Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China (e-mail: zhangjie, lirui, [email protected]).
Abstract

Image restoration aims to reconstruct latent clear images from their degraded versions. Despite notable achievements, existing methods predominantly focus on specific degradation types and thus require specialized models, which impedes real-world application in dynamic degradation scenarios. To address this issue, we propose the Large Model Driven Image Restoration framework (LMDIR), a novel multiple-in-one image restoration paradigm that leverages the generic priors from large multi-modal language models (MMLMs) and pretrained diffusion models. In detail, LMDIR integrates three key priors: 1) global degradation knowledge from MMLMs, 2) scene-aware contextual descriptions generated by MMLMs, and 3) fine-grained high-quality reference images synthesized by diffusion models guided by MMLM descriptions. Building on these priors, our architecture comprises a query-based prompt encoder, a degradation-aware transformer block that injects global degradation knowledge, a content-aware transformer block that incorporates the scene description, and a reference-based transformer block that incorporates fine-grained image priors. This design enables a single-stage training paradigm that addresses various degradations while supporting both automatic and user-guided restoration. Extensive experiments demonstrate that our method outperforms state-of-the-art competitors on multiple evaluation benchmarks.

Index Terms:
All-in-one Image Restoration, Large Model, Diffusion Model.

I Introduction

Image restoration, a classical low-level vision task, aims to reconstruct the latent high-quality images from their corrupted counterparts affected by various types of degradation, such as rain streaks [30, 14], low-light conditions [12, 32], and noise [21, 5]. Traditional image restoration methods have developed various natural image priors, e.g., low-rank prior and total variation regularization [31, 2] to regularize the solution space of the latent clear image. However, designing and optimizing these priors is challenging, which limits their practical applicability. The advent of deep learning has brought significant advancements in the field of image restoration. However, in real-world scenarios, such as autonomous driving or surveillance monitoring, degradation types can be random and time-varying, resulting in a wide range of distortions across different scenes.

Existing image restoration models are predominantly tailored to handle specific degradation types, necessitating the training of specialized models for each type of degradation [22, 43]. Furthermore, these methods require complex mechanisms to match the input degraded image with the appropriate restoration model. This paradigm impedes the application of image restoration techniques in real-world applications, where the degradation type is often dynamic.

Figure 1: The overall pipeline of our proposed method. It achieves high-quality multiple-in-one image restoration with large model prior.

Recently, the image restoration community has shifted its focus toward multiple-in-one restoration tasks [19, 28, 23], where a single model is tasked with handling multiple degradation types. This multi-task capability is achieved by injecting degradation-relevant knowledge into the model, enabling it to discriminate between different degradation types and process image features dynamically. The performance of such models heavily relies on the accurate perception of degradation embeddings [17]. A pioneering work in this task, AirNet [19], learns explicit degradation representations through contrastive learning, while DA-CLIP [23] generates accurate embeddings by fine-tuning a pre-trained CLIP [29] model. In contrast to explicit embedding methods, PromptIR [28] and ProRes [24] utilize prompt learning for implicit embedding learning. However, the former explicit embedding methods typically require a two-stage training approach, consuming large computational resources, especially when fine-tuning large pre-trained models. On the other hand, the latter implicit embedding approaches struggle to generate accurate representations, and the training process itself can be challenging [17].

The recent emergence of multi-modal large language models (MMLM) [42] offers a promising solution to address these challenges. These models, trained on large-scale image-text-paired datasets, possess strong capabilities in image captioning, visual question answering, and scene understanding. Notably, they have demonstrated a powerful ability to comprehend low-level image features, as evidenced by their performance on the Q-Bench [38] benchmark. Leveraging this understanding, MMLMs can vividly and accurately describe image degradations and contents, providing reliable prior information to restoration models without the need for complex fine-tuning or multi-stage training procedures. In addition to the global prior derived from textual descriptions, local fine-grained priors obtained from reference images can further enhance the performance of restoration models. This approach has been explored in the context of reference-based super-resolution tasks [15]. Leveraging the powerful generative capabilities of state-of-the-art diffusion models [27], we can synthesize high-quality reference images that share similar content and semantic context with the input degraded image. These reference images are generated in a guided manner, informed by the contextual text descriptions produced by the multi-modal language models.

Motivated by the observations discussed earlier, we propose a novel multiple-in-one image restoration framework, dubbed LMDIR (Large Model Driven Image Restoration Framework), that leverages prior knowledge from large multi-modal models to tackle diverse image degradations. As illustrated in Figure 1, LMDIR incorporates three essential priors: 1) global degradation knowledge derived from MMLMs; 2) scene-aware contextual descriptions generated by MMLMs; and 3) fine-grained high-quality reference images synthesized by diffusion models guided by the MMLM-generated contextual descriptions. Building upon these priors, the proposed LMDIR architecture consists of four main components: a customized query-based prompt encoder that refines textual information from MMLMs by leveraging image low-level features, a degradation-aware transformer block that incorporates global degradation knowledge, a content-aware transformer block that utilizes the scene-aware content descriptions, and a reference-based transformer block that integrates fine-grained image priors from the synthesized reference images through global and local perspectives. This design empowers LMDIR to adopt a single-stage training strategy capable of addressing diverse and complex image restoration tasks, while also offering the flexibility of automatic or user-guided restoration based on provided prompts. Extensive experiments validate the superiority of LMDIR over other state-of-the-art multiple-in-one image restoration methods across multiple evaluation benchmarks.

Our key contributions can be summarized as follows:

  1. We introduce LMDIR, an innovative framework that harnesses the capabilities of multi-modal large language models and diffusion models to address the challenges of multiple-in-one image restoration. Extensive experiments show that LMDIR outperforms state-of-the-art methods for multiple-in-one image restoration tasks.

  2. We introduce a query-based prompt encoder that refines text from multi-modal large language models (MMLMs), enabling automatic or user-guided restoration. We also design degradation-aware transformer blocks that incorporate global degradation knowledge, enhancing the model’s capability to handle diverse types of degradation. Additionally, we use reference-based transformer blocks that leverage fine-grained image priors from synthesized reference images, further improving the quality of image restoration.

  3. Through extensive experiments, we demonstrate that LMDIR outperforms existing state-of-the-art methods on multiple evaluation metrics for multiple-in-one image restoration tasks.

II Related Work

II-A Multiple-in-one Image Restoration

Image restoration endeavors to reconstruct high-quality images from degraded versions that have been impacted by a range of degradations, including noise [21, 5], rain [30, 14], low-light conditions [12, 32], and other factors [9, 35]. Each degradation type exhibits unique characteristics and introduces distinct distortions during the imaging process. Consequently, previous studies have predominantly focused on designing specialized models tailored to handle specific restoration tasks by leveraging prior knowledge about the respective degradations. However, this approach limits the applicability of such models in real-world scenarios, where the degradation type can be dynamic and time-varying. Recently, the image restoration community has shifted its attention towards multiple-in-one restoration tasks, which involve developing a single model capable of handling various types of degradation. Early attempts in this direction employed networks with multiple encoders and decoders, where different encoder-decoder pairs were dedicated to specific degradation types [6, 20]. However, these methods required prior knowledge of the degradation type and were primarily focused on addressing adverse weather conditions. AirNet [19] introduced a two-stage training approach that combined an explicit degradation classifier with contrastive learning to adaptively recognize degradation types and simultaneously perform denoising, rain removal, and image dehazing tasks. Subsequently, PromptIR [28] and ProRes [24] leveraged prompt learning techniques [46] to achieve implicit degradation representation learning, eliminating the need for separate degradation classifiers and two-stage training. DA-CLIP [23], on the other hand, fine-tuned a pre-trained CLIP model to generate accurate degradation embeddings, which were then injected into the restoration network. Explicit classification methods in image restoration typically rely on computationally expensive two-stage training procedures. 
Further, implicit prompt learning methods often encounter difficulties in generating accurate representations of degradations, and the training process itself can be challenging.

II-B Text Driven Image Manipulation

In recent years, significant advancements have been made in text-based image generation and editing [18]. VQGAN-CLIP [10] combines pre-trained generative models and CLIP to guide the generation process toward a desired target description. Additionally, latent diffusion models [33] have been introduced, which can effectively follow user instructions and improve image quality through text guidance.

Beyond image generation, some tasks have also explored user-guided image editing and painting, such as InstructPix2Pix [4] and Imagic [16]. The progress of diffusion models led to the development of more sophisticated models such as Emu Edit [34]. This approach not only processes standard image inputs but also incorporates depth maps. Concurrently, methods like LEDITS++ [3] have pushed the boundaries of image generation fidelity by leveraging the power of DDPM inversion. However, the field of image restoration has not fully explored the potential of text-driven image restoration.

Figure 2: The framework of our network. Pretrained MMLM, CLIP, and diffusion models generate the prior information that guides the restoration process.

II-C Reference-based Image Super-Resolution

Another area related to our work is reference-based super-resolution, which differs from single-image super-resolution in that it leverages reference images to assist the super-resolution process by extracting similar textures and details. The reference images are highly similar to the ground-truth high-resolution images. Among representative works in this field, CrossNet [45] establishes inter-image correlations by estimating optical flow between reference and low-resolution images. SRNTT [44] computes similarities between these images and transfers textures from the reference to enhance the low-resolution counterparts. Furthermore, TTSR [39] introduces both hard and soft attention mechanisms to facilitate texture transfer and synthesis. Lastly, C2-Matching [15] pioneers the use of a contrastive correspondence network to learn image correlations, followed by teacher-student correlation distillation to refine the alignment between low-resolution and high-resolution images, thereby enhancing the overall quality of super-resolved images. While significant progress has been made in reference-based super-resolution, these methods require users to select reference images manually. In contrast, our approach leverages a diffusion model to adaptively generate reference images whose content is highly similar to the ground-truth image, and these generated reference images are then used to improve the performance of image restoration models.

III Method

In this section, we first introduce the three large model priors utilized, followed by a detailed description of the proposed framework.

III-A Large Model Prior

III-A1 Global degradation and content prior

Unlike previous methods that require fine-tuning large models or two-stage training, we obtain content and degradation embeddings in a training-free manner. We use prompt engineering to query a multi-modal language model, which outputs a description of the degradation present in the image as well as a description of the content that is unrelated to the degradation. We then encode the generated text with the CLIP encoder to obtain global degradation and content embeddings that guide the model’s training. We use the GPT-4o model [1], which has demonstrated strong performance in low-level tasks, to generate the global degradation prior $\mathbf{e}_{d}$ and the content text embedding $\mathbf{e}_{c}$, as shown at the top of Figure 2.

III-A2 Local content prior

In addition to the global prior knowledge provided by text, we also use images generated by the diffusion model as fine-grained content priors, providing detailed texture and feature references for image restoration models. Specifically, we feed the content text produced by the multi-modal large language model into the SDXL [27] model as the prompt and use the degradation text as the negative prompt, so that the generated image shares similar content with the ground truth.
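The mapping from the two MMLM texts to a text-to-image request can be sketched as follows. This is a minimal illustration rather than the authors' code: the `MMLMDescription` container, the `build_reference_request` helper, and the example texts are all hypothetical, and the dictionary keys simply mirror the `prompt` and `negative_prompt` roles described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MMLMDescription:
    """Hypothetical container for the two texts returned by the MMLM."""

    degradation_text: str  # what is wrong with the image
    content_text: str  # what the scene shows, independent of degradation


def build_reference_request(desc: MMLMDescription) -> dict:
    """Map the MMLM texts onto a text-to-image request.

    The content text becomes the positive prompt and the degradation text
    becomes the negative prompt, steering generation toward the scene and
    away from the described artifacts.
    """
    content = desc.content_text.strip()
    degradation = desc.degradation_text.strip()
    if not content:
        raise ValueError("content text must not be empty")
    return {"prompt": content, "negative_prompt": degradation}


# Made-up example texts, for illustration only.
example = MMLMDescription(
    degradation_text="heavy rain streaks, low contrast",
    content_text="a city street with parked cars and trees",
)
request = build_reference_request(example)
print(request)
```

The resulting dictionary would then be passed to whatever diffusion pipeline is in use; the actual sampler settings are not specified in the text and are left out here.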

III-B Model Architecture

Figure 2 illustrates the overall framework of our proposed method, comprising an image restoration network, a query-based prompt encoder, a multi-modal language model, a diffusion model, and a CLIP encoder. Given a degraded input image $\mathbf{I}\in\mathbb{R}^{\mathrm{H}\times\mathrm{W}\times 3}$, we first pass it and a prompt text through the multi-modal language model (MMLM) to generate a degradation text $\mathbf{T}_{d}$ and a content text $\mathbf{T}_{c}$. These texts are then encoded by the CLIP encoder to obtain a degradation embedding $\mathbf{e}_{d}\in\mathbb{R}^{\mathrm{N}\times\mathrm{C}}$ and a content embedding $\mathbf{e}_{c}\in\mathbb{R}^{\mathrm{N}\times\mathrm{C}}$. Concurrently, the input image $\mathbf{I}$ is processed by a simple image encoder built on residual blocks to obtain a degraded image representation $\mathbf{I}_{d}\in\mathbb{R}^{\mathrm{C}}$.

We feed the degradation embedding $\mathbf{e}_{d}$ and the degraded image representation $\mathbf{I}_{d}$ into the query-based prompt encoder to refine the degradation representation as $\mathbf{Z}_{d}\in\mathbb{R}^{\mathrm{N}\times\mathrm{C}}$. Concurrently, we input the content text $\mathbf{T}_{c}$ to the diffusion model to synthesize a high-quality reference image $\mathbf{I}_{r}\in\mathbb{R}^{\mathrm{H}\times\mathrm{W}\times 3}$. Finally, the backbone restoration network takes the refined degradation representation $\mathbf{Z}_{d}$, the content embedding $\mathbf{e}_{c}$, and the reference image $\mathbf{I}_{r}$ as conditions to restore the output image $\mathbf{Y}\in\mathbb{R}^{\mathrm{H}\times\mathrm{W}\times 3}$ from the degraded input $\mathbf{I}$.

Our framework effectively integrates global degradation priors $\mathbf{T}_{d}$ and scene-aware content priors $\mathbf{T}_{c}$ extracted from the multi-modal language model (MMLM), as well as fine-grained reference priors $\mathbf{I}_{r}$ generated by the diffusion model. This integrated approach enables robust multiple-in-one image restoration capabilities, leveraging complementary information from the language and diffusion models to tackle a variety of image degradation challenges.

III-C Key Components

III-C1 Query-based Prompt Encoder

The degradation embedding $\mathbf{e}_{d}$ extracted from the CLIP encoder cannot be applied directly as a prior for the image restoration network, for two reasons: 1) the CLIP encoder lacks awareness of specific degradation details such as rain streaks and noise distribution, providing only global classification knowledge; 2) the textual description generated by the multi-modal language model may not be entirely reliable. Therefore, we design a query-based prompt encoder to refine $\mathbf{e}_{d}$ into a more fine-grained degradation representation $\mathbf{Z}_{d}$ that can effectively guide the restoration network, while incorporating degradation information from the image itself. Specifically, given a learnable query embedding $\mathbf{E}_{p}\in\mathbb{R}^{\hat{\mathrm{N}}\times\mathrm{C}}$, the degradation text embedding $\mathbf{e}_{d}$ from CLIP, and the degraded image representation $\mathbf{I}_{d}$, the query-based prompt encoder computes the refined degradation representation $\mathbf{Z}_{d}$. In detail, $\mathbf{E}_{p}$ attends to itself via $\mathrm{SA}(\cdot)$ to obtain $\mathbf{E}_{p}^{\prime}$, which is projected to queries $\mathbf{Q}_{E_{p}}$. Then, cross-attention is performed between $\mathbf{Q}_{E_{p}}$ and keys/values from $\mathbf{e}_{d}$ to obtain $\mathbf{Z}_{\text{text}}$, which encodes degradation information from the text, and between $\mathbf{Q}_{E_{p}}$ and keys/values from $\mathbf{I}_{d}$ to obtain $\mathbf{Z}_{\text{image}}$, which encodes degradation information from the image:

$$\mathbf{E}_{p}^{\prime}=\mathrm{SA}(\mathbf{E}_{p}), \tag{1}$$
$$\mathbf{Q}_{E_{p}}=\mathbf{E}_{p}^{\prime}W_{qp}, \tag{2}$$
$$\mathbf{K}_{e_{d}},\mathbf{V}_{e_{d}}=\mathbf{e}_{d}W_{kd},\ \mathbf{e}_{d}W_{vd}, \tag{3}$$
$$\mathbf{K}_{I_{d}},\mathbf{V}_{I_{d}}=\mathbf{I}_{d}W_{ki},\ \mathbf{I}_{d}W_{vi}, \tag{4}$$

where $\mathrm{SA}(\cdot)$ and $\mathrm{CA}(\cdot)$ denote self-attention and cross-attention, respectively. Finally, $\mathbf{Z}_{\text{text}}$ and $\mathbf{Z}_{\text{image}}$ are fused and processed by a feed-forward network (FFN) [11] to yield the refined degradation representation $\mathbf{Z}_{d}$ as

$$\mathbf{Z}_{\text{text}}=\mathrm{CA}(\mathbf{Q}_{E_{p}},\mathbf{K}_{e_{d}},\mathbf{V}_{e_{d}}), \tag{5}$$
$$\mathbf{Z}_{\text{image}}=\mathrm{CA}(\mathbf{Q}_{E_{p}},\mathbf{K}_{I_{d}},\mathbf{V}_{I_{d}}), \tag{6}$$
$$\mathbf{Z}_{d}=\mathrm{FFN}(\mathbf{Z}_{\text{text}}+\mathbf{Z}_{\text{image}}). \tag{7}$$

This representation $\mathbf{Z}_{d}$ combines information from the text prior and the image features, providing the restoration network with a better degradation representation.
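Equations (1) to (7) can be traced end to end with a small dependency-free sketch. Several details here are assumptions rather than facts from the text: attention is single-head scaled dot-product, the self-attention step uses no learned projections, the FFN is a two-layer ReLU network, the weights are random, and $\mathbf{I}_{d}$ is treated as a single token of width C. All function and variable names are illustrative.

```python
import math
import random

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: list[float]) -> list[float]:
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Single-head scaled dot-product attention (assumed form of SA/CA)."""
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)


def ffn(x: Matrix, w1: Matrix, w2: Matrix) -> Matrix:
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]  # ReLU
    return matmul(hidden, w2)


def query_prompt_encoder(e_p: Matrix, e_d: Matrix, i_d: Matrix, w: dict) -> Matrix:
    e_p_prime = attention(e_p, e_p, e_p)  # Eq. (1), without learned projections
    q = matmul(e_p_prime, w["qp"])  # Eq. (2)
    k_text, v_text = matmul(e_d, w["kd"]), matmul(e_d, w["vd"])  # Eq. (3)
    k_img, v_img = matmul(i_d, w["ki"]), matmul(i_d, w["vi"])  # Eq. (4)
    z_text = attention(q, k_text, v_text)  # Eq. (5)
    z_image = attention(q, k_img, v_img)  # Eq. (6)
    fused = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(z_text, z_image)]
    return ffn(fused, w["ffn1"], w["ffn2"])  # Eq. (7)


rng = random.Random(0)


def rand(rows: int, cols: int) -> Matrix:
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


C, N, N_HAT = 4, 3, 2  # toy sizes chosen for the example
weights = {name: rand(C, C) for name in ("qp", "kd", "vd", "ki", "vi", "ffn1", "ffn2")}
z_d = query_prompt_encoder(rand(N_HAT, C), rand(N, C), rand(1, C), weights)
print(len(z_d), len(z_d[0]))  # 2 4: one refined row per learnable query
```

In this sketch the output has one row per learnable query. Because $\mathbf{I}_{d}$ is a single token here, the image branch's softmax runs over one key, so every query receives the same projected image vector; a spatial feature map would give the queries something to select from.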

Figure 3: Our proposed content-aware transformer block and degradation-aware transformer block. These blocks inject prior knowledge from the content and degradation priors.

III-C2 Degradation-Aware Transformer Block

In the encoder of our image restoration model, we employ degradation-aware transformer blocks to inject degradation information and enable dynamic feature processing based on the degradation type. Specifically, each degradation-aware transformer block consists of three components: transposed self-attention [43], a gated feed-forward network [43], and a degradation embedding adapter, as shown in Figure 3. Given the input feature map $\mathbf{F}_{i}$ and degradation embedding $\mathbf{Z}_{d}$, the operations are defined as follows:

$$\mathcal{G}_{\mathcal{A}},\mathcal{G}_{\mathcal{F}},\gamma_{a},\beta_{a},\gamma_{f},\beta_{f}=\mathrm{DEA}(\mathbf{Z}_{d}), \tag{8}$$
$$\tilde{\mathbf{F}}_{i}=\mathcal{G}_{\mathcal{A}}\odot\mathrm{TSA}(\gamma_{a}\odot\mathbf{F}_{i}+\beta_{a})+\mathbf{F}_{i}, \tag{9}$$
$$\hat{\mathbf{F}}_{i}=\mathcal{G}_{\mathcal{F}}\odot\mathrm{GFN}(\gamma_{f}\odot\tilde{\mathbf{F}}_{i}+\beta_{f})+\tilde{\mathbf{F}}_{i}, \tag{10}$$

where transposed self-attention $\mathrm{TSA}(\cdot)$ captures long-range dependencies in $\mathbf{F}_{i}$ and the gated feed-forward network $\mathrm{GFN}(\cdot)$ refines local features. The degradation embedding adapter $\mathrm{DEA}(\cdot)$ projects $\mathbf{Z}_{d}$ to the same channel dimension as $\hat{\mathbf{F}}_{i}$ and generates the degradation-aware parameters as follows:

$$\tilde{\mathbf{Z}}_{d}=\mathrm{SiLU}(W_{\text{adapt}}\mathbf{Z}_{d}), \tag{11}$$
$$E=W_{\text{linear}}\tilde{\mathbf{Z}}_{d}, \tag{12}$$
$$(\mathcal{G}_{\mathcal{A}},\mathcal{G}_{\mathcal{F}},\gamma_{a},\beta_{a},\gamma_{f},\beta_{f})=\mathrm{split}(E,6), \tag{13}$$

where $\gamma$, $\beta$, and $\mathcal{G}$ are scale, shift, and gate parameters modulated by $\mathbf{Z}_{d}$, and $\mathrm{split}(\cdot)$ is the split operator along the channel dimension. By integrating $\mathbf{Z}_{d}$ into the transformer blocks, the model can dynamically adapt its feature processing based on the specific degradation representation, enabling effective restoration for diverse degradations using a single model.
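The scale, shift, and gate modulation of Eqs. (8) to (13) can be illustrated with a short sketch. It rests on stated assumptions: $\mathbf{Z}_{d}$ is taken as one already-pooled vector, the feature map is a list of per-pixel channel vectors, and TSA and GFN are passed in as arbitrary callables (identity stand-ins in the example) because their internals are not the point here. All names are illustrative.

```python
import math
import random

Vector = list[float]
Features = list[Vector]  # one channel vector per pixel


def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))


def linear(w: list[Vector], x: Vector) -> Vector:
    """Matrix-vector product; each row of w produces one output channel."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def degradation_adapter(z_d: Vector, w_adapt, w_linear) -> list[Vector]:
    z_tilde = [silu(v) for v in linear(w_adapt, z_d)]  # Eq. (11)
    e = linear(w_linear, z_tilde)  # Eq. (12)
    c = len(e) // 6
    # Eq. (13): order is (G_A, G_F, gamma_a, beta_a, gamma_f, beta_f)
    return [e[i * c:(i + 1) * c] for i in range(6)]


def modulated_residual(feats: Features, gate, scale, shift, sublayer) -> Features:
    """Compute gate * sublayer(scale * F + shift) + F, channel-wise per pixel."""
    modulated = [[s * f + b for s, f, b in zip(scale, px, shift)] for px in feats]
    out = sublayer(modulated)
    return [[g * o + f for g, o, f in zip(gate, po, pf)] for po, pf in zip(out, feats)]


def degradation_aware_block(feats, z_d, w_adapt, w_linear, tsa, gfn) -> Features:
    g_a, g_f, gamma_a, beta_a, gamma_f, beta_f = degradation_adapter(
        z_d, w_adapt, w_linear
    )  # Eq. (8)
    f_tilde = modulated_residual(feats, g_a, gamma_a, beta_a, tsa)  # Eq. (9)
    return modulated_residual(f_tilde, g_f, gamma_f, beta_f, gfn)  # Eq. (10)


def identity(x: Features) -> Features:
    """Stand-in for TSA / GFN, used only to exercise the modulation."""
    return x


rng = random.Random(0)
c, c_z = 3, 4  # toy channel widths
features = [[rng.uniform(-1, 1) for _ in range(c)] for _ in range(5)]
z_d = [rng.uniform(-1, 1) for _ in range(c_z)]
w_adapt = [[rng.gauss(0, 0.5) for _ in range(c_z)] for _ in range(c)]
w_zero = [[0.0] * c for _ in range(6 * c)]  # zero projection gives zero gates
out = degradation_aware_block(features, z_d, w_adapt, w_zero, identity, identity)
print(out == features)  # True: with zero gates only the residual path remains
```

The zero-projection case is a sanity check chosen for this example, not a setting taken from the text: it shows that the gates control how much of each sublayer's output is added to the residual path.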

III-C3 Content-aware Transformer Block

In the bottleneck of our image restoration network, we design content-aware transformer blocks that incorporate content features to enhance restoration performance, using the content text embedding $\mathbf{e}_{c}$ as a reference. As shown in Figure 3, we first project $\mathbf{e}_{c}$ to the same dimension as the feature map $\mathbf{F}_{i}$ using a multi-layer perceptron. Then, we perform self-attention on $\mathbf{F}_{i}$ and calculate the similarity between $\mathbf{F}_{i}$ and the projected $\mathbf{e}_{c}$. Based on this similarity, we adaptively select and integrate useful features from $\mathbf{e}_{c}$ into $\mathbf{F}_{i}$, followed by a gated FFN for local feature processing. This operation injects global content priors from the text embedding into the network. The content-aware transformer block can be formulated as:

$$\tilde{\mathbf{F}}_{i}=\textrm{TSA}(\mathbf{F}_{i})+\mathbf{F}_{i}, \tag{14}$$
$$\hat{\mathbf{F}}_{i}=\textrm{RA}(\tilde{\mathbf{F}}_{i},\mathbf{e}_{c})+\tilde{\mathbf{F}}_{i}, \tag{15}$$
$$\mathbf{F}_{i+1}=\textrm{GFN}(\hat{\mathbf{F}}_{i}). \tag{16}$$

Here, we use the reference attention $\textrm{RA}(.)$ to inject the reference feature, because the token length of $\mathbf{e}_{c}$ is fixed. The integrated features $\hat{\mathbf{F}}_{i}$ are further processed by a gated FFN to produce the output $\mathbf{F}_{i+1}$. Given the input feature $\tilde{\mathbf{F}}_{i}$ and the reference feature $\mathbf{F}_{Ref}$ (here, the projected $\mathbf{e}_{c}$), the operation of $\textrm{RA}(.)$ is defined as follows:

$$Q=\tilde{\mathbf{F}}_{i}W_{q}, \tag{17}$$
$$K,V=\mathbf{F}_{Ref}W_{k},\ \mathbf{F}_{Ref}W_{v}, \tag{18}$$
$$Sim=\texttt{softmax}(QK^{T}), \tag{19}$$
$$\mathbf{F}_{out}=Sim*V. \tag{20}$$
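The reference attention of Eqs. (17)-(20) can be sketched in NumPy as follows. This is a single-head sketch under assumptions: the feature map is flattened to $n = H \times W$ tokens, the product in Eq. (20) is read as a matrix product, no scaling factor is applied inside the softmax (none appears in Eq. (19)), and all sizes and weights are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reference_attention(f, f_ref, w_q, w_k, w_v):
    """Cross-attention from image tokens to reference tokens (Eqs. 17-20).

    f:     (n, c) flattened feature map, n = H*W tokens
    f_ref: (m, c) reference tokens (e.g. the projected text embedding), m fixed
    """
    q = f @ w_q                         # Eq. (17)
    k, v = f_ref @ w_k, f_ref @ w_v     # Eq. (18)
    sim = softmax(q @ k.T, axis=-1)     # Eq. (19): shape (n, m), rows sum to 1
    return sim @ v                      # Eq. (20): shape (n, c)

# Hypothetical sizes: 16 image tokens, 5 reference tokens, 8 channels.
rng = np.random.default_rng(0)
n, m, c = 16, 5, 8
f_tilde = rng.standard_normal((n, c))
e_c = rng.standard_normal((m, c))
w_q, w_k, w_v = (rng.standard_normal((c, c)) * 0.1 for _ in range(3))

# Residual connection of Eq. (15).
f_hat = reference_attention(f_tilde, e_c, w_q, w_k, w_v) + f_tilde
print(f_hat.shape)
```

Because the similarity matrix has shape $(n, m)$ and $m$ is fixed, the cost of this step grows linearly with the number of image tokens.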
Figure 4: Our proposed reference-based Transformer block. This block incorporates details from the reference image through local and global reference attention.

III-C4 Reference-based Transformer Block

In the decoder of our image restoration network, we introduce reference-based transformer blocks to integrate fine-grained reference features, as shown in Figure 4. Specifically, we use the reference image $\mathbf{I}_{r}$, generated by the diffusion model, as our reference. This block is designed to extract both globally and locally similar features from the reference image. To achieve this, we employ a hybrid approach that combines global and local attention mechanisms. The global reference attention uses a transposed cross-attention mechanism to compute the similarity between the two images along the channel dimension. In contrast, the local reference attention employs convolution to fuse similarity features along the spatial dimension. Given the input feature $\mathbf{F}_{i}$ and the reference image $\mathbf{I}_{r}$, this process can be described as follows:

$$\mathbf{F}_{ref}=\phi(\mathbf{I}_{r}), \tag{21}$$
$$\tilde{\mathbf{F}}_{i}=\textrm{TSA}(\mathbf{F}_{i})+\mathbf{F}_{i}, \tag{22}$$
$$\tilde{\mathbf{F}}^{l}_{i},\tilde{\mathbf{F}}^{g}_{i}=\textrm{split}(\tilde{\mathbf{F}}_{i},2), \tag{23}$$
$$\hat{\mathbf{F}}_{i}=\Theta([\textrm{LRA}(\tilde{\mathbf{F}}^{l}_{i},\mathbf{F}_{ref}),\textrm{GRA}(\tilde{\mathbf{F}}^{g}_{i},\mathbf{F}_{ref})])+\tilde{\mathbf{F}}_{i}, \tag{24}$$
$$\mathbf{F}_{i+1}=\textrm{GFN}(\hat{\mathbf{F}}_{i})+\hat{\mathbf{F}}_{i}. \tag{25}$$

Here, $\phi(.)$ is the convolution operator that projects $\mathbf{I}_{r}$ to $\mathbf{F}_{ref}$ for dimension alignment. $\textrm{TSA}(.)$ and $\textrm{split}(.)$ are the transposed self-attention and channel split operators, respectively. After generating the outputs of the local reference attention $\textrm{LRA}(.)$ and the global reference attention $\textrm{GRA}(.)$, the two features are concatenated and fused through the linear projection $\Theta(.)$. Finally, a gated forward network, $\textrm{GFN}(.)$, is used to enhance the locality of the features. $\textrm{GRA}(.)$ is the cross-attention version of $\textrm{TSA}(.)$, where $Q$ is derived from $\tilde{\mathbf{F}}_{i}$ and $K,V$ are generated from $\mathbf{F}_{ref}$. The local reference attention can be described as follows:

$$\mathbf{F}_{j}=\mathbf{W}_{2}*\text{ReLU}(\mathbf{W}_{1}*\tilde{\mathbf{F}}^{l}_{i}) \tag{26}$$
$$\mathbf{F}_{k}=\mathbf{W}_{2}*\text{ReLU}(\mathbf{W}_{1}*\mathbf{F}_{ref}) \tag{27}$$
$$Sim=\text{Softmax}(\mathbf{W}_{a}*(\mathbf{F}_{j}+\mathbf{F}_{k})) \tag{28}$$
$$\mathbf{F}_{\text{agg}}=\mathbf{F}_{j}+Sim\odot\mathbf{F}_{k} \tag{29}$$

where $\mathbf{W}_{1}$, $\mathbf{W}_{2}$, and $\mathbf{W}_{a}$ are convolutional filters, $*$ denotes the convolution operation, $\odot$ denotes element-wise multiplication, and ReLU and Softmax are the activation and softmax functions, respectively.
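The split, local and global reference attention, and fusion steps of Eqs. (23), (24), and (26)-(29) can be sketched in NumPy as below. Several details are assumptions made only for illustration: all convolutions use 1×1 kernels (the kernel sizes are not stated), the softmax in Eq. (28) normalizes over channels at each pixel (the axis is not stated), $\mathbf{F}_{ref}$ is taken to have half the channels of $\tilde{\mathbf{F}}_{i}$ so that the shared filters $\mathbf{W}_{1}$ and $\mathbf{W}_{2}$ apply to both inputs, and the global branch omits any normalization or temperature. The parameter names in the dictionary `p` are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def conv1x1(w, x):
    """Pointwise convolution: w is (c_out, c_in), x is (c_in, H, W)."""
    return np.einsum("oc,chw->ohw", w, x)

def local_reference_attention(f_l, f_ref, w1, w2, w_a):
    """Eqs. (26)-(29), with 1x1 kernels standing in for the conv filters."""
    f_j = conv1x1(w2, np.maximum(conv1x1(w1, f_l), 0.0))     # Eq. (26)
    f_k = conv1x1(w2, np.maximum(conv1x1(w1, f_ref), 0.0))   # Eq. (27)
    sim = softmax(conv1x1(w_a, f_j + f_k), axis=0)           # Eq. (28), channel axis assumed
    return f_j + sim * f_k                                   # Eq. (29)

def global_reference_attention(f_g, f_ref, w_q, w_k, w_v):
    """Transposed cross-attention: the attention map is (c, c), not (HW, HW)."""
    c, h, w = f_g.shape
    q = conv1x1(w_q, f_g).reshape(c, -1)      # queries from the input feature
    k = conv1x1(w_k, f_ref).reshape(c, -1)    # keys from the reference feature
    v = conv1x1(w_v, f_ref).reshape(c, -1)    # values from the reference feature
    attn = softmax(q @ k.T, axis=-1)          # channel-to-channel similarity
    return (attn @ v).reshape(c, h, w)

def reference_fusion(f_tilde, f_ref, p):
    """Eqs. (23)-(24): split, attend locally and globally, fuse, add residual."""
    f_l, f_g = np.split(f_tilde, 2, axis=0)
    lra = local_reference_attention(f_l, f_ref, p["w1"], p["w2"], p["w_a"])
    gra = global_reference_attention(f_g, f_ref, p["w_q"], p["w_k"], p["w_v"])
    fused = conv1x1(p["theta"], np.concatenate([lra, gra], axis=0))
    return fused + f_tilde

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
half = C // 2
p = {name: rng.standard_normal((half, half)) * 0.1
     for name in ("w1", "w2", "w_a", "w_q", "w_k", "w_v")}
p["theta"] = rng.standard_normal((C, C)) * 0.1
f_tilde = rng.standard_normal((C, H, W))
f_ref = rng.standard_normal((half, H, W))   # stands in for phi(I_r)
f_hat = reference_fusion(f_tilde, f_ref, p)
print(f_hat.shape)
```

The global branch builds a $c \times c$ attention map, so its cost is linear in the number of pixels, while the local branch operates per pixel and preserves the spatial layout of the reference.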

III-D Loss Function

Following widely adopted practice, we use the L1 norm between the output $Y$ and the ground truth $G$ as our loss function:

$$L=\|Y-G\|_{1} \tag{30}$$
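A minimal NumPy sketch of the loss in Eq. (30) follows. The `reduction` argument is an assumption added for illustration: Eq. (30) as written is a sum of absolute differences, whereas deep learning frameworks commonly average over elements, which only rescales the loss.

```python
import numpy as np

def l1_loss(y, g, reduction="sum"):
    """L1 distance between the restored output y and the ground truth g."""
    diff = np.abs(np.asarray(y, dtype=np.float64) - np.asarray(g, dtype=np.float64))
    if reduction == "sum":
        return float(diff.sum())     # Eq. (30) as written
    if reduction == "mean":
        return float(diff.mean())    # common per-element average
    raise ValueError(f"unknown reduction: {reduction!r}")

y = [[1.0, 2.0], [3.0, 4.0]]
g = [[0.0, 2.0], [5.0, 4.0]]
print(l1_loss(y, g), l1_loss(y, g, reduction="mean"))
```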
TABLE I: Quantitative comparison of our method with other state-of-the-art approaches in noise-rain-lowlight settings. PSNR/SSIM values are reported. The best results are marked in bold.

| Method | Denoise (BSD68) σ=15 | Denoise (BSD68) σ=25 | Denoise (BSD68) σ=50 | Denoise (Urban100) σ=15 | Denoise (Urban100) σ=25 | Denoise (Urban100) σ=50 | Derain | Lowlight | Average |
|---|---|---|---|---|---|---|---|---|---|
| HINet | 32.35/0.925 | 26.09/0.869 | 25.91/0.767 | 33.68/0.938 | 30.63/0.908 | 27.50/0.850 | 37.63/0.980 | 16.55/0.769 | 28.79/0.875 |
| NAFNet | 32.93/0.915 | 30.36/0.862 | 27.22/0.759 | 31.98/0.920 | 29.56/0.881 | 26.24/0.795 | 32.22/0.939 | 20.72/0.777 | 28.90/0.856 |
| SwinIR | 33.62/0.926 | 31.00/0.879 | 27.68/0.780 | 33.57/0.938 | 31.13/0.906 | 27.60/0.835 | 34.32/0.965 | 18.86/0.800 | 29.72/0.878 |
| Restormer | 33.67/0.924 | 31.07/0.876 | 27.86/0.782 | 33.46/0.934 | 31.09/0.904 | 27.80/0.837 | 36.55/0.974 | 21.49/0.822 | 30.37/0.881 |
| AirNet | 33.66/0.923 | 31.10/0.881 | 27.72/0.780 | 33.55/0.937 | 31.10/0.905 | 27.77/0.837 | 35.80/0.971 | 16.21/0.673 | 29.61/0.863 |
| PromptIR | 33.63/0.927 | 31.02/0.880 | 27.77/0.782 | 33.45/0.937 | 31.05/0.907 | 27.71/0.839 | 36.37/0.975 | 21.14/0.831 | 30.27/0.884 |
| DA-CLIP | 30.30/0.837 | 27.54/0.758 | 24.77/0.619 | 29.30/0.819 | 25.18/0.634 | 23.71/0.613 | 36.37/0.965 | 19.06/0.789 | 27.03/0.754 |
| Ours | 34.00/0.930 | 31.38/0.886 | 28.15/0.798 | 34.15/0.945 | 31.84/0.919 | 28.62/0.873 | 38.64/0.983 | 23.24/0.850 | 31.25/0.898 |

Figure 5: Visual comparison of multiple-in-one methods on image denoising, low light enhancement, and deraining.
Figure 6: Visual comparison of multiple-in-one methods on image denoising, low light enhancement, and deraining.

IV Experiments

IV-A Datasets and Benchmark

We evaluate our method on a multiple-in-one image restoration task comprising three representative subtasks: image deraining, image denoising, and low-light image enhancement. For image deraining, we train on the Rain1800 [40] dataset and evaluate on the 100 test images of the Rain100L [41] dataset. For denoising, we train on synthetically generated noisy images with noise levels $\sigma\in\{15,25,50\}$ from the WED [25] dataset, and evaluate on the Urban100 [13] and BSD68 [26] datasets. For low-light enhancement, we train on the LOL [37] dataset and test on its corresponding test set. During training, we sample from these three datasets uniformly at random. We compare our method against classic image restoration networks (HINet [8], NAFNet [7], SwinIR [22], Restormer [43]) and recent multiple-in-one approaches (AirNet [19], PromptIR [28], DA-CLIP [23]). We adopt PSNR and SSIM to assess model performance.
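To make the denoising setup and the PSNR metric concrete, the following NumPy sketch synthesizes a noisy image at a given level $\sigma$ and scores it against the clean image. It assumes additive white Gaussian noise with $\sigma$ expressed on the 0-255 intensity scale, and it clips and quantizes the noisy image to 8 bits. These are common conventions, but the text does not state them, so they are labeled assumptions rather than the protocol used here.

```python
import numpy as np

def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise to a uint8 image; sigma is on the 0-255 scale."""
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).round().astype(np.uint8)   # clipping is an assumption

def psnr(x, y, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

rng = np.random.default_rng(0)
clean = np.full((64, 64), 128, dtype=np.uint8)   # synthetic mid-gray test image
for sigma in (15, 25, 50):
    noisy = add_gaussian_noise(clean, sigma, rng)
    print(f"sigma={sigma}: PSNR of the noisy input = {psnr(clean, noisy):.2f} dB")
```

The PSNR of the noisy input gives a reference point for reading the denoising columns of the result tables: a restored image is useful only if it scores above this value.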

IV-B Implementation Details

We train our model using the PyTorch framework on a single NVIDIA RTX 3090 GPU with the Adam optimizer. During training, images are randomly cropped into 128×128 patches with a batch size of 2. The total number of training iterations is 300,000. The learning rate is set to 2e-4 for the whole training process.
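The data side of this training setup can be sketched as follows: a task is drawn uniformly at random, and the same 128×128 window is cropped from a degraded image and its ground truth. This is a sketch under assumptions, not the actual data loader. The function names and the task labels are hypothetical, and "uniform" is read here as giving each of the three datasets equal probability.

```python
import numpy as np

def paired_random_crop(degraded, clean, patch, rng):
    """Crop the same random patch x patch window from an aligned image pair."""
    h, w = degraded.shape[:2]
    if clean.shape[:2] != (h, w):
        raise ValueError("degraded and clean images must have the same size")
    if h < patch or w < patch:
        raise ValueError("image is smaller than the requested patch")
    top = int(rng.integers(0, h - patch + 1))    # upper bound is exclusive
    left = int(rng.integers(0, w - patch + 1))
    window = (slice(top, top + patch), slice(left, left + patch))
    return degraded[window], clean[window]

def sample_task(rng, tasks=("derain", "denoise", "lowlight")):
    """Pick one of the three training tasks with equal probability."""
    return tasks[int(rng.integers(0, len(tasks)))]

rng = np.random.default_rng(0)
degraded = rng.integers(0, 255, size=(200, 300, 3))
clean = degraded + 1                      # synthetic stand-in for a ground-truth image
d_patch, c_patch = paired_random_crop(degraded, clean, 128, rng)
print(sample_task(rng), d_patch.shape, c_patch.shape)
```

Cropping both images with one shared window keeps the degraded patch and its target pixel-aligned, which the L1 loss requires.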

To generate the degradation text $\mathbf{T_{d}}$ and the content text $\mathbf{T_{c}}$, we use the GPT-4o multi-modal language model. To synthesize the reference images $\mathbf{I_{r}}$, we employ the Stable Diffusion XL (SDXL) v1.0 diffusion model with 30 sampling steps. We generate all reference images and text descriptions before training our model.

To ensure a fair comparison, we retrain all baseline models using the same framework from PromptIR [28] and identical hyperparameters.

IV-C Comparison with SOTA Methods

IV-C1 Multiple-in-one restoration evaluation

Table I compares our proposed LMDIR with existing state-of-the-art methods and shows substantial improvements across tasks. Notably, compared with PromptIR, our method improves PSNR on the image deraining task by about 2.3 dB (38.64 dB vs. 36.37 dB). The denoising and low-light image enhancement tasks also improve markedly. We attribute PromptIR’s less accurate restoration to its implicit degradation feature learning. DA-CLIP requires a large volume of training data because it relies on diffusion models, and it fails to yield satisfactory results in our setting. In contrast, our approach leverages both the prior knowledge provided by large models and the information intrinsically present in the degraded image, resulting in superior performance.
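The figures quoted from Table I can be checked with a few lines of arithmetic. The snippet below copies the PSNR values of PromptIR and our method from Table I and recomputes the deraining gain and the Average column, assuming that the Average column is the unweighted mean over the eight test settings.

```python
# PSNR values (dB) copied from Table I, in column order:
# BSD68 sigma=15/25/50, Urban100 sigma=15/25/50, derain, low-light.
psnr_table = {
    "PromptIR": [33.63, 31.02, 27.77, 33.45, 31.05, 27.71, 36.37, 21.14],
    "Ours":     [34.00, 31.38, 28.15, 34.15, 31.84, 28.62, 38.64, 23.24],
}

DERAIN = 6  # index of the deraining column
derain_gain = psnr_table["Ours"][DERAIN] - psnr_table["PromptIR"][DERAIN]
average = {name: sum(vals) / len(vals) for name, vals in psnr_table.items()}

print(f"deraining gain over PromptIR: {derain_gain:.2f} dB")
print(f"average PSNR: ours {average['Ours']:.2f} dB, PromptIR {average['PromptIR']:.2f} dB")
```

The deraining gain evaluates to 2.27 dB, which rounds to the 2.3 dB stated above, and the unweighted means reproduce the Average column (31.25 dB and 30.27 dB).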

Figures 5 and 6 show a visual comparison of the different methods. For each task, we select two representative images. For the denoising task, we set the noise level to $\sigma=50$. As the figures show, our method achieves better restoration results than the others. In denoising, DA-CLIP fails to remove the noise completely, whereas PromptIR loses high-frequency details. In low-light image enhancement, the colors produced by our method closely match the ground truth, while the results generated by AirNet exhibit dark textures. In deraining, residual rain streaks are visible in the images produced by AirNet and PromptIR, whereas our method removes the rain most cleanly.

Figure 7: Visual comparison of multiple-in-one methods on OOD dataset.
Figure 8: The reference image generated by the diffusion model, given the content text as prompt.

IV-C2 Model generalization performance

We further evaluate the generalization of multiple-in-one restoration models on out-of-distribution (OOD) data to assess their practical performance in real-world applications. We analyze the impact of unseen noise levels and rain streak intensities. Specifically, for the denoising task, we select two unseen noise levels, $\sigma=60$ and $\sigma=75$. For image deraining, we test on the Rain100H [41] and Test100 [41] datasets, both of which differ substantially from Rain100L and feature more intense rainfall. The results are presented in Table II and Figure 7. The performance of both AirNet and DA-CLIP, whose degradation knowledge is based solely on limited classification knowledge, drops significantly on OOD data. In contrast, the implicit degradation representation of PromptIR adapts to OOD data to some degree, so PromptIR outperforms AirNet on average (22.73 dB vs. 21.91 dB PSNR in Table II). Our proposed method, which combines the prior knowledge of large-scale models with the information inherent in degraded images, achieves the best performance on OOD data.

TABLE II: Performance on unseen noise levels ($\sigma$ = 60, 75) and severe rain conditions from the Rain100H and Test100 datasets. PSNR/SSIM values are reported. The best results are marked in bold.

| Method | Denoise (BSD68) σ=60 | Denoise (BSD68) σ=75 | Denoise (Urban100) σ=60 | Denoise (Urban100) σ=75 | Derain (Rain100H) | Derain (Test100) | Average |
|---|---|---|---|---|---|---|---|
| AirNet | 26.11/0.715 | 20.87/0.421 | 26.38/0.782 | 21.04/0.495 | 15.13/0.508 | 21.92/0.698 | 21.91/0.603 |
| PromptIR | 26.72/0.746 | 23.75/0.569 | 26.57/0.802 | 23.85/0.655 | 13.60/0.416 | 21.91/0.692 | 22.73/0.647 |
| DA-CLIP | 22.18/0.454 | 19.92/0.301 | 22.21/0.540 | 19.65/0.419 | 16.17/0.509 | 21.71/0.674 | 20.31/0.483 |
| Ours | 27.24/0.761 | 24.96/0.625 | 27.87/0.825 | 25.28/0.693 | 17.51/0.552 | 22.12/0.701 | 24.16/0.693 |

TABLE III: Ablation experiment results evaluated with PSNR/SSIM values. Best results are marked in bold.

| Config | degradation | content | reference | Denoise (BSD68) σ=15 | Denoise (BSD68) σ=25 | Denoise (BSD68) σ=50 | Denoise (Urban100) σ=15 | Denoise (Urban100) σ=25 | Denoise (Urban100) σ=50 | Derain | Low light |
|---|---|---|---|---|---|---|---|---|---|---|---|
| (I) | | | | 33.67/0.924 | 31.07/0.879 | 27.86/0.782 | 33.46/0.934 | 31.11/0.904 | 27.80/0.837 | 36.55/0.974 | 21.49/0.822 |
| (II) | | | | 33.73/0.927 | 31.45/0.887 | 27.89/0.786 | 33.61/0.938 | 32.03/0.919 | 27.90/0.843 | 36.84/0.986 | 21.94/0.829 |
| (III) | | | | 33.84/0.928 | 31.22/0.882 | 28.04/0.792 | 33.77/0.940 | 31.74/0.915 | 28.15/0.866 | 38.06/0.981 | 22.17/0.836 |
| Ours | | | | 33.99/0.930 | 31.36/0.885 | 28.13/0.798 | 34.03/0.944 | 31.84/0.919 | 28.62/0.873 | 38.64/0.983 | 23.34/0.850 |

IV-D Visualization of Reference Images

Figure 8 shows the reference images used by our model. The visualization includes the degraded image, the ground truth image, the reference image generated by the diffusion model, and the content description generated by the MLLM. The content description generated by the MLLM aptly captures the semantic information of the image. Moreover, the generated reference image aligns semantically with the ground truth, thereby providing local details.

Figure 9: LMDIR can remove particular degradation depending on the human instructions.
Figure 10: t-SNE visualization of embeddings for different conditions. Left: Text embeddings. Middle: Image embeddings. Right: Refined degradation embeddings provide a clearer separation.
Figure 11: The visualization of feature map of our model. The degradation embedding is observed to concentrate the feature extraction on the rain streaks, while the content embedding directs the focus towards the primary subject of the image. The reference attention enhances the clarity and sharpness of the resulting feature representation.
Figure 12: Heatmap illustrating the similarity between reference image embeddings and ground truth image embeddings. The diagonal trend indicates a higher similarity between corresponding images.

IV-E On the Effectiveness of User Instructions

Our proposed model can not only generate the degradation prior automatically but also accept manual user instructions to guide the restoration process, as illustrated in Figure 9. We provide the model with images affected by mixed degradations: the first row shows an image with low lighting and noise, while the second row shows an image with rain streaks and noise. For both samples, we first manually provide a degradation description for noise to obtain a result with one type of degradation removed. Subsequently, we provide a degradation prior for low light or rain to obtain the final reconstructed result.

IV-F Ablation Experiment

To evaluate the efficacy of our proposed modules, we conduct a series of ablation studies divided into four parts. Starting from a baseline without any prior, we progressively integrate the degradation prior, the content prior, and the fine-grained image prior to show how each module influences the overall performance of the model.

IV-F1 Baseline

We establish a baseline model based on Restormer, which operates without any prior information. Leveraging the inherent capacity of self-attention for input-adaptive feature extraction, the baseline demonstrates a preliminary ability to handle diverse degradations, as shown in the first row of Table III.

IV-F2 Effectiveness of degradation prior

Our query-based prompt encoder is designed to provide the model with a comprehensive degradation prior. In the second configuration of the ablation study, we replace the transformer blocks in the encoder of the baseline model with our degradation-aware transformer blocks, providing the network with degradation-specific information. As shown in the second row of Table III, introducing this degradation knowledge leads to marked improvements across multiple tasks, demonstrating the efficacy of the query-based prompt encoder in providing the model with degradation knowledge.

IV-F3 Effectiveness of textual content prior

Beyond generating degradation descriptions, our MLLM also produces a content description after receiving the visual input. These descriptions serve as a source of global prior information, strengthening the model’s scene understanding. Building upon Model (I), we replace the transformer blocks in the bottleneck of the UNet with our content-aware transformer blocks. The results in the third row of Table III show that, after integrating this contextual information, the model improves across various tasks, affirming the role of content priors in enhancing overall model performance.

IV-F4 Effectiveness of local image prior

Building upon Model (II), we incorporate fine-grained image priors into our network by replacing the Transformer blocks in the decoder of Model (II) with our reference-based Transformer blocks. We use an MLLM to generate contextually rich descriptions, which serve as prompts for SDXL. This process yields high-fidelity images that share the semantic content of the input image yet are free of degradation. Integrating such high-quality images offers detailed guidance to the model. As shown in the fourth row of Table III, the integration of fine-grained prior knowledge improves the overall performance of the model. Furthermore, we visualize the similarity map between the CLIP features of the ground truth images and the reference images on the Test100 dataset. As illustrated in Fig. 12, each reference image is highly similar to its ground truth in the CLIP feature space. This observation supports the efficacy of our generated reference images and the associated content description text.
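The similarity analysis behind Fig. 12 can be sketched as a cosine-similarity matrix between two sets of embeddings. In the sketch below, random vectors stand in for the CLIP features of the ground truth and reference images, and the embedding size, noise scale, and diagonal check are hypothetical choices made for illustration, not the evaluation protocol of the paper.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a (n, d) and b (m, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

rng = np.random.default_rng(0)
gt = rng.standard_normal((5, 32))               # stand-in for ground-truth embeddings
ref = gt + 0.3 * rng.standard_normal((5, 32))   # stand-in for reference embeddings

sim = cosine_similarity_matrix(ref, gt)         # sim[i, j]: reference i vs. ground truth j

# A diagonal trend means each reference is most similar to its own ground truth.
diagonal_is_row_max = bool((sim.argmax(axis=1) == np.arange(len(gt))).all())
print(sim.shape, diagonal_is_row_max)
```

Plotting `sim` as a heatmap gives the kind of view shown in Fig. 12, where a bright diagonal indicates that each reference image matches its own ground truth more closely than the others.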

IV-F5 Effectiveness of query-based prompt encoder

Our query-based prompt encoder is motivated by the observation that neither text embeddings nor image embeddings alone can fully capture the information necessary to discriminate between different degradations, as illustrated in Fig. 10. We employ t-SNE [36] to visualize the distributions of $e_{d}$ and $I_{d}$. The results show that both $I_{d}$ and $e_{d}$ individually fail to distinguish effectively between the various degradations. However, after processing by our query-based prompt encoder, the boundaries between degradation types become clearly delineated, which supports the efficacy of the proposed query-based prompt encoder.

IV-G Visualization of feature maps

To underscore the efficacy of our LMDIR approach, we present a visual representation of the feature maps produced by the key components of our proposed architecture in Figure 11. This illustration demonstrates the functionality of our designed blocks. Upon the integration of the global degradation knowledge, the feature map exhibits a pronounced emphasis on the degradation details, highlighting the model’s ability to comprehend the image impairments. Incorporating the content information from the scene descriptions further enhances the model’s capacity to discern the primary subject within the image. Moreover, the incorporation of the fine-grained reference image priors imparts a heightened level of sharpness and clarity to the feature representation, demonstrating the complementary benefits of the multi-modal priors utilized in our LMDIR framework. These visualizations underscore the efficacy of our proposed approach in leveraging the synergistic combination of the MMLMs’ generic knowledge and the diffusion models’ generative capabilities to enable robust and versatile image restoration, overcoming the limitations of specialized models in dynamic degradation scenarios.

V Conclusion

In this work, we proposed a novel multiple-in-one image restoration framework, termed LMDIR. This approach capitalizes on the wealth of prior knowledge offered by both MLLMs and diffusion models. To integrate this information, we designed a query-based prompt encoder, a degradation-aware transformer block, a content-aware transformer block, and a reference-based transformer block. Extensive experiments across a diverse range of datasets demonstrate that our method not only surpasses existing state-of-the-art techniques but also generalizes well to out-of-distribution datasets, showing the benefit of leveraging large-model priors for low-level vision tasks.

References

  • [1] Hello, gpt-4. https://openai.com/index/hello-gpt-4/, 2024. Accessed: 2024-06-24.
  • [2] S. D. Babacan, R. Molina, and A. K. Katsaggelos. Variational bayesian blind deconvolution using a total variation prior. IEEE Transactions on Image Processing, 18(1):12–26, 2009.
  • [3] M. Brack, F. Friedrich, K. Kornmeier, L. Tsaban, P. Schramowski, K. Kersting, and A. Passos. Ledits++: Limitless image editing using text-to-image models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8861–8870, 2024.
  • [4] T. Brooks, A. Holynski, and A. A. Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
  • [5] H. Chen, J. Gu, Y. Liu, S. A. Magid, C. Dong, Q. Wang, H. Pfister, and L. Zhu. Masked image training for generalizable deep image denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1692–1703, 2023.
  • [6] H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, C. Xu, and W. Gao. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12299–12310, 2021.
  • [7] L. Chen, X. Chu, X. Zhang, and J. Sun. Simple baselines for image restoration. In European conference on computer vision, pages 17–33. Springer, 2022.
  • [8] L. Chen, X. Lu, J. Zhang, X. Chu, and C. Chen. Hinet: Half instance normalization network for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 182–192, 2021.
  • [9] W.-T. Chen, H.-Y. Fang, C.-L. Hsieh, C.-C. Tsai, I. Chen, J.-J. Ding, S.-Y. Kuo, et al. All snow removed: Single image desnowing algorithm using hierarchical dual-tree complex wavelet representation and contradict channel loss. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4196–4205, 2021.
  • [10] K. Crowson, S. Biderman, D. Kornis, D. Stander, E. Hallahan, L. Castricato, and E. Raff. Vqgan-clip: Open domain image generation and editing with natural language guidance. In European Conference on Computer Vision, pages 88–105. Springer, 2022.
  • [11] M. Geva, R. Schuster, J. Berant, and O. Levy. Transformer feed-forward layers are key-value memories. arXiv preprint arXiv:2012.14913, 2020.
  • [12] C. Guo, C. Li, J. Guo, C. C. Loy, J. Hou, S. Kwong, and R. Cong. Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1780–1789, 2020.
  • [13] J.-B. Huang, A. Singh, and N. Ahuja. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5197–5206, 2015.
  • [14] K. Jiang, Z. Wang, P. Yi, C. Chen, B. Huang, Y. Luo, J. Ma, and J. Jiang. Multi-scale progressive fusion network for single image deraining. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8346–8355, 2020.
  • [15] Y. Jiang, K. C. Chan, X. Wang, C. C. Loy, and Z. Liu. Robust reference-based super-resolution via c2-matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2103–2112, 2021.
  • [16] B. Kawar, S. Zada, O. Lang, O. Tov, H. Chang, T. Dekel, I. Mosseri, and M. Irani. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6007–6017, 2023.
  • [17] X. Kong, C. Dong, and L. Zhang. Towards effective multiple-in-one image restoration: A sequential and prompt learning strategy. arXiv preprint arXiv:2401.03379, 2024.
  • [18] H. Lee, U. Ullah, J.-S. Lee, B. Jeong, and H.-C. Choi. A brief survey of text driven image generation and manipulation. In 2021 IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), pages 1–4. IEEE, 2021.
  • [19] B. Li, X. Liu, P. Hu, Z. Wu, J. Lv, and X. Peng. All-in-one image restoration for unknown corruption. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17452–17462, 2022.
  • [20] R. Li, R. T. Tan, and L.-F. Cheong. All in one bad weather removal using architectural search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3175–3185, 2020.
  • [21] Y. Li, Y. Zhang, R. Timofte, L. Van Gool, Z. Tu, K. Du, H. Wang, H. Chen, W. Li, X. Wang, et al. NTIRE 2023 challenge on image denoising: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1904–1920, 2023.
  • [22] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte. SwinIR: Image restoration using Swin transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1833–1844, 2021.
  • [23] Z. Luo, F. K. Gustafsson, Z. Zhao, J. Sjölund, and T. B. Schön. Controlling vision-language models for universal image restoration. arXiv preprint arXiv:2310.01018, 2023.
  • [24] J. Ma, T. Cheng, G. Wang, Q. Zhang, X. Wang, and L. Zhang. ProRes: Exploring degradation-aware visual prompt for universal image restoration. arXiv preprint arXiv:2306.13653, 2023.
  • [25] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, volume 2, pages 416–423. IEEE, 2001.
  • [26] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int’l Conf. Computer Vision, volume 2, pages 416–423, July 2001.
  • [27] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  • [28] V. Potlapalli, S. W. Zamir, S. Khan, and F. S. Khan. PromptIR: Prompting for all-in-one blind image restoration. arXiv preprint arXiv:2306.13090, 2023.
  • [29] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [30] D. Ren, W. Zuo, Q. Hu, P. Zhu, and D. Meng. Progressive image deraining networks: A better and simpler baseline. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3937–3946, 2019.
  • [31] W. Ren, X. Cao, J. Pan, X. Guo, W. Zuo, and M.-H. Yang. Image deblurring via enhanced low-rank prior. IEEE Transactions on Image Processing, 25(7):3426–3437, 2016.
  • [32] W. Ren, S. Liu, L. Ma, Q. Xu, X. Xu, X. Cao, J. Du, and M.-H. Yang. Low-light image enhancement via a deep hybrid network. IEEE Transactions on Image Processing, 28(9):4364–4375, 2019.
  • [33] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • [34] S. Sheynin, A. Polyak, U. Singer, Y. Kirstain, A. Zohar, O. Ashual, D. Parikh, and Y. Taigman. Emu edit: Precise image editing via recognition and generation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024.
  • [35] X. Tao, H. Gao, X. Shen, J. Wang, and J. Jia. Scale-recurrent network for deep image deblurring. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8174–8182, 2018.
  • [36] L. Van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of machine learning research, 9(11), 2008.
  • [37] C. Wei, W. Wang, W. Yang, and J. Liu. Deep Retinex decomposition for low-light enhancement. arXiv preprint arXiv:1808.04560, 2018.
  • [38] H. Wu, Z. Zhang, E. Zhang, C. Chen, L. Liao, A. Wang, C. Li, W. Sun, Q. Yan, G. Zhai, et al. Q-bench: A benchmark for general-purpose foundation models on low-level vision. arXiv preprint arXiv:2309.14181, 2023.
  • [39] F. Yang, H. Yang, J. Fu, H. Lu, and B. Guo. Learning texture transformer network for image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5791–5800, 2020.
  • [40] W. Yang, R. T. Tan, J. Feng, Z. Guo, S. Yan, and J. Liu. Joint rain detection and removal from a single image with contextualized deep networks. IEEE transactions on pattern analysis and machine intelligence, 42(6):1377–1393, 2019.
  • [41] W. Yang, R. T. Tan, J. Feng, J. Liu, Z. Guo, and S. Yan. Deep joint rain detection and removal from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1357–1366, 2017.
  • [42] S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, and E. Chen. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549, 2023.
  • [43] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, and M.-H. Yang. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5728–5739, 2022.
  • [44] Z. Zhang, Z. Wang, Z. Lin, and H. Qi. Image super-resolution by neural texture transfer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7982–7991, 2019.
  • [45] H. Zheng, M. Ji, H. Wang, Y. Liu, and L. Fang. Crossnet: An end-to-end reference-based super resolution network using cross-scale warping. In Proceedings of the European conference on computer vision (ECCV), pages 88–104, 2018.
  • [46] K. Zhou, J. Yang, C. C. Loy, and Z. Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16816–16825, 2022.