Training-Free Large Model Priors for Multiple-in-One Image Restoration

Xuanhua He, Lang Li, Yingying Wang, Hui Zheng, Ke Cao, Keyu Yan,
Rui Li, Chengjun Xie, Jie Zhang, Man Zhou

This work was supported by the National Natural Science Foundation of China under Grant No. 32171888 and the HFIPS Director’s Fund under Grant No. 2023YZGH04. Xuanhua He and Lang Li contributed equally. Corresponding authors: Jie Zhang and Man Zhou. Xuanhua He, Lang Li, Keyu Yan and Ke Cao are with the Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China, and also with the University of Science and Technology of China, Hefei 230026, China (e-mail: hexuanhua, caoke200820, [email protected]). Yingying Wang and Hui Zheng are with the Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Xiamen University, Xiamen 361102, China (e-mail: wangyingying7, [email protected]). Man Zhou is with the University of Science and Technology of China, China (e-mail: [email protected], [email protected]). Jie Zhang, Rui Li and Chengjun Xie are with the Intelligent Agriculture Engineering Laboratory of Anhui Province, Institute of Intelligent Machines, and Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China (e-mail: zhangjie, lirui, [email protected]).

Abstract

Image restoration aims to reconstruct latent clear images from their degraded versions. Despite notable achievements, existing methods predominantly focus on specific degradation types and thus require specialized models, impeding real-world applications in dynamic degradation scenarios. To address this issue, we propose the Large Model Driven Image Restoration framework (LMDIR), a novel multiple-in-one image restoration paradigm that leverages generic priors from large multi-modal language models (MMLMs) and pretrained diffusion models. In detail, LMDIR integrates three key priors: 1) global degradation knowledge from MMLMs, 2) scene-aware contextual descriptions generated by MMLMs, and 3) fine-grained high-quality reference images synthesized by diffusion models guided by MMLM descriptions. Building on these priors, our architecture comprises a query-based prompt encoder, a degradation-aware transformer block that injects global degradation knowledge, a content-aware transformer block that incorporates scene descriptions, and a reference-based transformer block that incorporates fine-grained image priors. This design enables a single-stage training paradigm that addresses various degradations while supporting both automatic and user-guided restoration. Extensive experiments demonstrate that our method outperforms state-of-the-art competitors on multiple evaluation benchmarks.

Index Terms:

All-in-one Image Restoration, Large Model, Diffusion Model.

I Introduction

Image restoration, a classical low-level vision task, aims to reconstruct latent high-quality images from their corrupted counterparts affected by various types of degradation, such as rain streaks [30, 14], low-light conditions [12, 32], and noise [21, 5]. Traditional image restoration methods have developed various natural image priors, e.g., the low-rank prior and total variation regularization [31, 2], to regularize the solution space of the latent clear image. However, designing and optimizing these priors is challenging, which limits their practical applicability. Deep learning has since brought significant advances to image restoration. Yet in real-world scenarios, such as autonomous driving or surveillance monitoring, degradation types can be random and time-varying, resulting in a wide range of distortions across different scenes.

Existing image restoration models are predominantly tailored to specific degradation types, necessitating the training of a specialized model for each type of degradation [22, 43]. Furthermore, these methods require complex mechanisms to match the input degraded image with the appropriate restoration model. This paradigm impedes the deployment of image restoration techniques in real-world settings, where the degradation type is often dynamic.

Figure 1: The overall pipeline of our proposed method. It achieves high-quality multiple-in-one image restoration with large model priors.

Recently, the image restoration community has shifted its focus toward multiple-in-one restoration tasks [19, 28, 23], where a single model is tasked with handling multiple degradation types. This multi-task capability is achieved by injecting degradation-relevant knowledge into the model, enabling it to discriminate between different degradation types and process image features dynamically. The performance of such models heavily relies on the accurate perception of degradation embeddings [17]. A pioneering work in this task, AirNet [19], learns explicit degradation representations through contrastive learning, while DA-CLIP [23] generates accurate embeddings by fine-tuning a pre-trained CLIP [29] model. In contrast to explicit embedding methods, PromptIR [28] and ProRes [24] utilize prompt learning for implicit embedding learning. However, the former explicit embedding methods typically require a two-stage training approach, consuming large computational resources, especially when fine-tuning large pre-trained models. On the other hand, the latter implicit embedding approaches struggle to generate accurate representations, and the training process itself can be challenging [17].

The recent emergence of multi-modal large language models (MMLM) [42] offers a promising solution to address these challenges. These models, trained on large-scale image-text-paired datasets, possess strong capabilities in image captioning, visual question answering, and scene understanding. Notably, they have demonstrated a powerful ability to comprehend low-level image features, as evidenced by their performance on the Q-Bench [38] benchmark. Leveraging this understanding, MMLMs can vividly and accurately describe image degradations and contents, providing reliable prior information to restoration models without the need for complex fine-tuning or multi-stage training procedures. In addition to the global prior derived from textual descriptions, local fine-grained priors obtained from reference images can further enhance the performance of restoration models. This approach has been explored in the context of reference-based super-resolution tasks [15]. Leveraging the powerful generative capabilities of state-of-the-art diffusion models [27], we can synthesize high-quality reference images that share similar content and semantic context with the input degraded image. These reference images are generated in a guided manner, informed by the contextual text descriptions produced by the multi-modal language models.

Motivated by the observations discussed earlier, we propose a novel multiple-in-one image restoration framework, dubbed LMDIR (Large Model Driven Image Restoration Framework), that leverages prior knowledge from large multi-modal models to tackle diverse image degradations. As illustrated in Figure 1, LMDIR incorporates three essential priors: 1) global degradation knowledge derived from MMLMs; 2) scene-aware contextual descriptions generated by MMLMs; and 3) fine-grained high-quality reference images synthesized by diffusion models guided by the MMLM-generated contextual descriptions. Building upon these priors, the proposed LMDIR architecture consists of four main components: a customized query-based prompt encoder that refines textual information from MMLMs by leveraging image low-level features, a degradation-aware transformer block that incorporates global degradation knowledge, a content-aware transformer block that utilizes the scene-aware content descriptions, and a reference-based transformer block that integrates fine-grained image priors from the synthesized reference images through global and local perspectives. This design empowers LMDIR to adopt a single-stage training strategy capable of addressing diverse and complex image restoration tasks, while also offering the flexibility of automatic or user-guided restoration based on provided prompts. Extensive experiments validate the superiority of LMDIR over other state-of-the-art multiple-in-one image restoration methods across multiple evaluation benchmarks.

Our key contributions can be summarized as follows:

1. We introduce LMDIR, an innovative framework that harnesses the capabilities of multi-modal large language models and diffusion models to address the challenges of multiple-in-one image restoration.

2. We introduce a query-based prompt encoder that refines text from multi-modal large language models (MMLMs), enabling automatic or user-guided restoration. We also design degradation-aware transformer blocks to incorporate global degradation knowledge, enhancing the model’s capability to handle diverse types of degradation. Additionally, we utilize reference-based transformer blocks that leverage fine-grained image priors from synthesized reference images, further improving the quality of image restoration.

3. Through extensive experiments, we demonstrate that LMDIR outperforms existing state-of-the-art methods on multiple evaluation metrics for multiple-in-one image restoration tasks.

II Related Work

II-A Multiple-in-one Image Restoration

Image restoration endeavors to reconstruct high-quality images from degraded versions that have been impacted by a range of degradations, including noise [21, 5], rain [30, 14], low-light conditions [12, 32], and other factors [9, 35]. Each degradation type exhibits unique characteristics and introduces distinct distortions during the imaging process. Consequently, previous studies have predominantly focused on designing specialized models tailored to handle specific restoration tasks by leveraging prior knowledge about the respective degradations. However, this approach limits the applicability of such models in real-world scenarios, where the degradation type can be dynamic and time-varying. Recently, the image restoration community has shifted its attention towards multiple-in-one restoration tasks, which involve developing a single model capable of handling various types of degradation. Early attempts in this direction employed networks with multiple encoders and decoders, where different encoder-decoder pairs were dedicated to specific degradation types [6, 20]. However, these methods required prior knowledge of the degradation type and were primarily focused on addressing adverse weather conditions. AirNet [19] introduced a two-stage training approach that combined an explicit degradation classifier with contrastive learning to adaptively recognize degradation types and simultaneously perform denoising, rain removal, and image dehazing tasks. Subsequently, PromptIR [28] and ProRes [24] leveraged prompt learning techniques [46] to achieve implicit degradation representation learning, eliminating the need for separate degradation classifiers and two-stage training. DA-CLIP [23], on the other hand, fine-tuned a pre-trained CLIP model to generate accurate degradation embeddings, which were then injected into the restoration network. Explicit classification methods in image restoration typically rely on computationally expensive two-stage training procedures. 
Further, implicit prompt learning methods often encounter difficulties in generating accurate representations of degradations, and the training process itself can be challenging.

II-B Text Driven Image Manipulation

In recent years, significant advancements have been made in text-based image generation and editing [18]. VQGAN-CLIP [10] combines pre-trained generative models and CLIP to guide the generation process toward a desired target description. Additionally, latent diffusion models [33] have been introduced, which can effectively follow user instructions and improve image quality through text guidance.

Beyond image generation, several works have also explored user-guided image editing and inpainting, such as InstructPix2Pix [4] and Imagic [16]. The progress of diffusion models led to the development of more sophisticated models such as Emu Edit [34]. This approach not only processes standard image inputs but also incorporates depth maps. Concurrently, methods like LEDITS++ [3] have pushed the boundaries of image generation fidelity by leveraging the power of DDPM inversion. However, the potential of text-driven image restoration has not yet been fully explored.

II-C Reference-based Image Super-Resolution

Another area related to our work is reference-based super-resolution, which differs from single-image super-resolution in that it leverages reference images to assist the super-resolution process by extracting similar textures and details. The reference images are highly similar to the ground-truth high-resolution images. Among representative works in this field, CrossNet [45] establishes inter-image correlations by estimating optical flow between reference and low-resolution images. SRNTT [44] computes similarities between these images and transfers textures from the reference to enhance the low-resolution counterparts. Furthermore, TTSR [39] introduces both hard and soft attention mechanisms to facilitate texture transfer and synthesis. Lastly, C2-Matching [15] uses a contrastive correspondence network to learn image correspondences, followed by teacher-student correlation distillation to refine the alignment between low-resolution and high-resolution images, thereby enhancing the overall quality of super-resolved images. While significant progress has been made in reference-based super-resolution, these methods require users to select reference images manually. In contrast, our approach leverages a diffusion model to adaptively generate reference images whose content is highly similar to that of the ground-truth image, and these generated reference images are then used to improve the performance of image restoration models.

III Method

In this section, we first introduce the three large model priors utilized, followed by a detailed description of the proposed framework.

III-A Large Model Prior

III-A1 Global degradation and content prior

Unlike previous methods that require fine-tuning large models or two-stage training, we obtain content and degradation embeddings in a training-free manner. We leverage prompt engineering to query a multi-modal language model, which outputs the degradation information present in the image as well as content information unrelated to the degradation. We then encode the generated text with the CLIP encoder, and the resulting embeddings serve as global degradation and content embeddings that guide the model’s training. We use the GPT-4o model [1], which has demonstrated strong performance in low-level tasks, to generate the text behind the global degradation embedding $\mathbf{e}_{d}$ and the content embedding $\mathbf{e}_{c}$ , as shown at the top of Figure 2.

III-A2 Local content prior

In addition to the global prior knowledge provided by text, we also use images generated by the diffusion model as fine-grained content priors, providing detailed texture and feature references for the image restoration model. Specifically, we feed the content text produced by the multi-modal large language model into the SDXL [27] model as the prompt and use the degradation text as the negative prompt, so that the generated image shares similar content with the ground truth while steering away from the described degradation.

III-B Model Architecture

Figure 2 illustrates the overall framework of our proposed method, comprising an image restoration network, a query-based prompt encoder, a multi-modal language model, a diffusion model, and a CLIP encoder. Given a degraded input image $\mathbf{I}\in\mathbb{R}^{\rm H\times W\times 3}$ , we first pass it and a prompt text through the multi-modal language model (MMLM) to generate a degradation text $\mathbf{T}_{d}$ and a content text $\mathbf{T}_{c}$ , respectively. These texts are then encoded by the CLIP encoder to obtain a degradation embedding $\mathbf{e}_{d}\in\mathbb{R}^{\rm N\times C}$ and a content embedding $\mathbf{e}_{c}\in\mathbb{R}^{\rm N\times C}$ . Concurrently, the input image $\mathbf{I}$ is processed by a simple image encoder built on residual blocks to obtain a degraded image representation $\mathbf{I}_{d}\in\mathbb{R}^{\rm C}$ .

We feed the degradation embedding $\mathbf{e}_{d}$ and the degraded image representation $\mathbf{I}_{d}$ into the query-based prompt encoder to obtain the refined degradation representation $\mathbf{Z}_{d}\in\mathbb{R}^{\rm N\times C}$ . Concurrently, we input the content text $\mathbf{T}_{c}$ to the diffusion model to synthesize a high-quality reference image $\mathbf{I}_{r}\in\mathbb{R}^{\rm H\times W\times 3}$ . Finally, the backbone restoration network takes the refined degradation representation $\mathbf{Z}_{d}$ , the content embedding $\mathbf{e}_{c}$ , and the reference image $\mathbf{I}_{r}$ as conditions to restore the output image $\mathbf{Y}\in\mathbb{R}^{\rm H\times W\times 3}$ from the degraded input $\mathbf{I}$ .

Our framework integrates global degradation priors $\mathbf{T}_{d}$ and scene-aware content priors $\mathbf{T}_{c}$ extracted from the multi-modal language model (MMLM), as well as fine-grained reference priors $\mathbf{I}_{r}$ generated by the diffusion model. This integrated approach enables robust multiple-in-one image restoration, leveraging complementary information from the language and diffusion models to tackle a variety of image degradations.

III-C Key Components

III-C1 Query-based Prompt Encoder

The degradation embedding $\mathbf{e}_{d}$ extracted directly from the CLIP encoder cannot be directly applied as a prior for the image restoration network due to two reasons: 1) The CLIP encoder lacks awareness of specific degradation details such as rain streaks and noise distribution, providing only global classification knowledge. 2) The textual description generated by the multi-modal language model may not be entirely reliable. Therefore, we design a query-based prompt encoder to refine $\mathbf{e}_{d}$ into a more fine-grained degradation representation $\mathbf{Z}_{d}$ that can effectively guide the restoration network, while incorporating degradation information from the image itself. Specifically, given a learnable query embedding $\mathbf{E}_{p}\in\mathbb{R}^{\rm\hat{N}\times C}$ , the degradation text embedding $\mathbf{e}_{d}$ from CLIP, and the degraded image representation $\mathbf{I}_{d}$ , the query-based prompt encoder computes the refined degradation representation $\mathbf{Z}_{d}$ . In detail, $\mathbf{E}_{p}$ attends to itself via $\textrm{SA}(.)$ to obtain $\mathbf{E}_{p}^{\prime}$ , which is projected to queries $\mathbf{Q}_{E_{p}}$ . Then, cross-attention is performed between $\mathbf{Q}_{E_{p}}$ and keys/values from $\mathbf{e}_{d}$ to obtain $\mathbf{Z}_{\text{text}}$ encoding degradation information from text, and with $\mathbf{I}_{d}$ to obtain $\mathbf{Z}_{\text{image}}$ encoding image degradation information as:

	$\displaystyle\mathbf{E}_{p}^{\prime}=\textrm{SA}(\mathbf{E}_{p}),$		(1)
	$\displaystyle\mathbf{Q}_{E_{p}}=\mathbf{E}_{p}^{\prime}W_{qp},$		(2)
	$\displaystyle\mathbf{K}_{e_{d}},\mathbf{V}_{e_{d}}=\mathbf{e}_{d}W_{kd},\mathbf{e}_{d}W_{vd},$		(3)
	$\displaystyle\mathbf{K}_{I_{d}},\mathbf{V}_{I_{d}}=\mathbf{I}_{d}W_{ki},\mathbf{I}_{d}W_{vi},$		(4)

where $\textrm{SA}(.)$ and $\textrm{CA}(.)$ denote self-attention and cross-attention, respectively. Finally, $\mathbf{Z}_{\text{text}}$ and $\mathbf{Z}_{\text{image}}$ are fused and processed by a feed-forward network (FFN) [11] to yield the refined degradation representation $\mathbf{Z}_{d}$ as:

	$\displaystyle\mathbf{Z}_{\text{text}}=\textrm{CA}(\mathbf{Q}_{E_{p}},\mathbf{K}_{e_{d}},\mathbf{V}_{e_{d}}),$		(5)
	$\displaystyle\mathbf{Z}_{\text{image}}=\textrm{CA}(\mathbf{Q}_{E_{p}},\mathbf{K}_{I_{d}},\mathbf{V}_{I_{d}}),$		(6)
	$\displaystyle\mathbf{Z}_{d}=\textrm{FFN}(\mathbf{Z}_{\text{text}}+\mathbf{Z}_{\text{image}}).$		(7)

The representation $\mathbf{Z}_{d}$ combines information from the text prior and the image features, providing the restoration network with a better degradation representation.
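The data flow of Eqs. (1)-(7) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: single-head scaled dot-product attention, a two-layer ReLU FFN, the absence of normalization layers, the random weights, and all function and dictionary-key names are assumptions made for illustration, and $\mathbf{I}_{d}$ is treated as a single token of shape $1\times C$.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head attention; the 1/sqrt(C) scaling is an assumption."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def query_prompt_encoder(E_p, e_d, I_d, W):
    # Eq. (1): the learnable queries attend to themselves.
    E_p_prime = attention(E_p @ W["sa_q"], E_p @ W["sa_k"], E_p @ W["sa_v"])
    Q = E_p_prime @ W["qp"]                    # Eq. (2)
    K_d, V_d = e_d @ W["kd"], e_d @ W["vd"]    # Eq. (3): keys/values from text
    K_i, V_i = I_d @ W["ki"], I_d @ W["vi"]    # Eq. (4): keys/values from image
    Z_text = attention(Q, K_d, V_d)            # Eq. (5)
    Z_image = attention(Q, K_i, V_i)           # Eq. (6)
    # Eq. (7): fuse by addition, then a hypothetical two-layer ReLU FFN.
    hidden = np.maximum((Z_text + Z_image) @ W["ffn_in"], 0.0)
    return hidden @ W["ffn_out"]

rng = np.random.default_rng(0)
N_hat, N, C = 4, 6, 8  # toy sizes, not the paper's settings
names = ["sa_q", "sa_k", "sa_v", "qp", "kd", "vd", "ki", "vi", "ffn_in", "ffn_out"]
W = {name: 0.1 * rng.standard_normal((C, C)) for name in names}
E_p = rng.standard_normal((N_hat, C))  # learnable queries
e_d = rng.standard_normal((N, C))      # stand-in for the CLIP degradation embedding
I_d = rng.standard_normal((1, C))      # stand-in for the image representation
Z_d = query_prompt_encoder(E_p, e_d, I_d, W)
print(Z_d.shape)
```

In this sketch the output has one row per query token, so the number of rows of `Z_d` follows the length of `E_p`.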

III-C2 Degradation-Aware Transformer Block

In the encoder part of our image restoration model, we employ degradation-aware transformer blocks to inject degradation information and enable dynamic feature processing based on the degradation type. Specifically, each degradation-aware transformer block consists of three components: transposed self-attention [43], a gated feed-forward network [43], and a degradation embedding adapter, as shown in Figure 3. Given the input feature map $\mathbf{F_{i}}$ and degradation embedding $\mathbf{Z}_{d}$ , the operations are defined as follows:

	$\displaystyle\mathcal{G_{A}},\mathcal{G_{F}},\gamma_{a},\beta_{a},\gamma_{f},\beta_{f}=\textrm{DEA}(\mathbf{Z}_{d}),$		(8)
	$\displaystyle\tilde{F}_{i}=\mathcal{G_{A}}\odot\textrm{TSA}(\gamma_{a}\odot F_{i}+\beta_{a})+F_{i},$		(9)
	$\displaystyle\hat{F}_{i}=\mathcal{G_{F}}\odot\textrm{GFN}(\gamma_{f}\odot\tilde{F}_{i}+\beta_{f})+\tilde{F}_{i},$		(10)

where the transposed self-attention $\textrm{TSA}(.)$ captures long-range dependencies in $F_{i}$ and the gated feed-forward network $\textrm{GFN}(.)$ refines local features. The degradation embedding adapter $\textrm{DEA}(.)$ projects $\mathbf{Z}_{d}$ to the same channel dimension as $\hat{F}_{i}$ and then generates the degradation-aware parameters. Specifically, the adapter generates these parameters as follows:

	$\displaystyle\tilde{Z}_{d}=\text{SiLU}(W_{\text{adapt}}Z_{d}),$		(11)
	$\displaystyle E=W_{\text{linear}}\tilde{Z}_{d},$		(12)
	$\displaystyle(\mathcal{G_{A}},\mathcal{G_{F}},\gamma_{a},\beta_{a},\gamma_{f},\beta_{f})=\text{split}(E,6),$		(13)

where $\gamma$ , $\beta$ , and $\mathcal{G}$ are the scale, shift, and gate parameters modulated by $\mathbf{Z}_{d}$ , and $\text{split}(.)$ is the split operator along the channel dimension. By integrating $\mathbf{Z}_{d}$ into the transformer blocks, the model can dynamically adapt its feature processing to the specific degradation representation, enabling effective restoration of diverse degradations with a single model.
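The scale, shift, and gate modulation of Eqs. (8)-(13) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the token tensor $\mathbf{Z}_{d}$ is mean-pooled into a single vector before projection (the text does not specify this step), `np.tanh` is a stand-in for the real TSA and GFN sub-layers, and all names and sizes are hypothetical.

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def degradation_adapter(Z_d, W_adapt, W_linear):
    z = Z_d.mean(axis=0)         # assumption: pool the N tokens into one vector
    z = silu(W_adapt @ z)        # Eq. (11): project to the feature channel size
    E = W_linear @ z             # Eq. (12): one vector holding all six parameters
    return np.split(E, 6)        # Eq. (13): G_A, G_F, gamma_a, beta_a, gamma_f, beta_f

def degradation_aware_block(F, Z_d, W_adapt, W_linear, tsa, gfn):
    g_a, g_f, gamma_a, beta_a, gamma_f, beta_f = degradation_adapter(Z_d, W_adapt, W_linear)
    F_tilde = g_a * tsa(gamma_a * F + beta_a) + F            # Eq. (9)
    return g_f * gfn(gamma_f * F_tilde + beta_f) + F_tilde   # Eq. (10)

rng = np.random.default_rng(0)
N, C, C_f, HW = 6, 8, 5, 16                 # toy sizes, not the paper's settings
F = rng.standard_normal((HW, C_f))          # flattened feature map, channels last
Z_d = rng.standard_normal((N, C))
W_adapt = rng.standard_normal((C_f, C))
W_linear = rng.standard_normal((6 * C_f, C_f))
tsa = gfn = np.tanh                         # placeholders for the real sub-layers

out = degradation_aware_block(F, Z_d, W_adapt, W_linear, tsa, gfn)
# With a zero projection every gate is zero and the block reduces to the identity.
out_zero = degradation_aware_block(F, Z_d, W_adapt, np.zeros_like(W_linear), tsa, gfn)
print(out.shape, np.allclose(out_zero, F))
```

The zero-projection case shows why gated modulation of this kind is convenient: when the gates vanish, only the residual paths remain and the input features pass through unchanged.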

III-C3 Content-aware Transformer Block

In the bottleneck of our image restoration network, we design content-aware transformer blocks that incorporate content priors to enhance restoration performance, using the content text embedding $\mathbf{e}_{c}$ as a reference. As shown in Figure 3, we first project $\mathbf{e}_{c}$ to the same dimension as the feature map $\mathbf{F}_{i}$ using a multi-layer perceptron. Then, we perform self-attention on $\mathbf{F}_{i}$ and calculate the similarity between $\mathbf{F}_{i}$ and the projected $\mathbf{e}_{c}$ . Based on this similarity, we adaptively select and integrate useful features from $\mathbf{e}_{c}$ into $\mathbf{F}_{i}$ , followed by a gated FFN for local feature processing. This operation injects global content priors from the text embedding into the network. The content-aware transformer block can be formulated as:

	$\displaystyle\tilde{\mathbf{F}}_{i}=\textrm{TSA}(\mathbf{F}_{i})+\mathbf{F}_{i},$		(14)
	$\displaystyle\hat{\mathbf{F}}_{i}=\textrm{RA}(\tilde{\mathbf{F}}_{i},\mathbf{e}_{c})+\tilde{\mathbf{F}}_{i},$		(15)
	$\displaystyle\mathbf{F}_{i+1}=\textrm{GFN}(\hat{\mathbf{F}}_{i}).$		(16)

Here we use the reference attention $\textrm{RA}(.)$ to inject the reference feature, since the token length of $\mathbf{e}_{c}$ is fixed. The integrated features $\hat{\mathbf{F}}_{i}$ are further processed by a gated FFN to produce the output $\mathbf{F}_{i+1}$ . Given the input feature $\tilde{\mathbf{F}}_{i}$ and the reference feature $\mathbf{F}_{Ref}$ (here, the projected $\mathbf{e}_{c}$ ), the operation of $\textrm{RA}(.)$ can be defined as follows:

	$\displaystyle Q=\tilde{\mathbf{F}}_{i}W_{q},$		(17)
	$\displaystyle K,V=\mathbf{F}_{Ref}W_{k},\mathbf{F}_{Ref}W_{v},$		(18)
	$\displaystyle Sim=\texttt{softmax}(QK^{T}),$		(19)
	$\displaystyle\mathbf{F}_{out}=Sim*V.$		(20)
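Eqs. (17)-(20) amount to a cross-attention in which the image features query a short, fixed-length set of reference tokens. The NumPy sketch below follows the equations literally (no scaling factor, single head); the function name, toy shapes, and random weights are assumptions for illustration, and the product in Eq. (20) is read as a matrix product.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reference_attention(F, F_ref, W_q, W_k, W_v):
    Q = F @ W_q                          # Eq. (17): queries from image features
    K, V = F_ref @ W_k, F_ref @ W_v      # Eq. (18): keys/values from the reference
    sim = softmax(Q @ K.T, axis=-1)      # Eq. (19): one weight per reference token
    return sim @ V, sim                  # Eq. (20): weighted sum of reference values

rng = np.random.default_rng(0)
HW, N, C = 16, 6, 8                      # toy sizes: pixels, reference tokens, channels
F = rng.standard_normal((HW, C))
F_ref = rng.standard_normal((N, C))      # stand-in for the projected e_c
W_q, W_k, W_v = (0.1 * rng.standard_normal((C, C)) for _ in range(3))

F_out, sim = reference_attention(F, F_ref, W_q, W_k, W_v)
print(F_out.shape, sim.shape)
```

Because the similarity matrix has shape (pixels, reference tokens), its cost grows linearly with the number of pixels when the reference length is fixed, which is consistent with the motivation given in the text.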

III-C4 Reference-based Transformer Block

In the decoder of our image restoration network, we introduce reference-based transformer blocks to integrate fine-grained reference features (see Figure 4). Specifically, we use the reference image $\mathbf{I}_{r}$ , generated by the diffusion model, as our reference. This block is designed to extract both global and local similar features from the reference image. To achieve this, we employ a hybrid approach that combines global and local attention mechanisms. The global reference attention uses a transposed cross-attention mechanism to compute the similarity between the two images along the channel dimension. In contrast, the local reference attention employs convolution to fuse similarity features along the spatial dimension. Given the input feature $\mathbf{F}_{i}$ and reference image $\mathbf{I}_{r}$ , this process can be described as follows:

	$\displaystyle\mathbf{F}_{ref}=\phi(\mathbf{I}_{r}),$		(21)
	$\displaystyle\tilde{\mathbf{F}}_{i}=\textrm{TSA}(\mathbf{F}_{i})+\mathbf{F}_{i},$		(22)
	$\displaystyle\tilde{\mathbf{F}}^{l}_{i},\tilde{\mathbf{F}}^{g}_{i}=\textrm{split}(\tilde{\mathbf{F}}_{i},2),$		(23)
	$\displaystyle\hat{\mathbf{F}}_{i}=\Theta([\textrm{LRA}(\tilde{\mathbf{F}}^{l}_{i},\mathbf{F}_{ref}),\textrm{GRA}(\tilde{\mathbf{F}}^{g}_{i},\mathbf{F}_{ref})])+\tilde{\mathbf{F}}_{i},$		(24)
	$\displaystyle\mathbf{F}_{i+1}=\textrm{GFN}(\hat{\mathbf{F}}_{i})+\hat{\mathbf{F}}_{i}.$		(25)

Here, $\phi(.)$ is the convolution operator that projects $\mathbf{I}_{r}$ to $\mathbf{F}_{ref}$ for dimension alignment. The $\textrm{TSA}(.)$ and $\textrm{split}(.)$ are the transposed self-attention and channel split operators, respectively. After generating the outputs from local reference attention $\textrm{LRA}(.)$ and global reference attention $\textrm{GRA}(.)$ , the two features are concatenated and fused through the linear projection $\Theta(.)$ . Finally, a gated forward network, $\textrm{GFN}(.)$ , is utilized to enhance the locality of the features. The $\textrm{GRA}(.)$ is the cross attention version of $\textrm{TSA}(.)$ , where $Q$ is derived from $\tilde{\mathbf{F}}_{i}$ and $K,V$ are generated from $\mathbf{F}_{ref}$ . The local reference attention can be described as below:

	$\displaystyle\mathbf{F}_{j}=\mathbf{W}_{2}\text{ReLU}(\mathbf{W}_{1}\tilde{\mathbf{F}}^{l}_{i}),$		(26)
	$\displaystyle\mathbf{F}_{k}=\mathbf{W}_{2}\text{ReLU}(\mathbf{W}_{1}\mathbf{F}_{ref}),$		(27)
	$\displaystyle Sim=\text{Softmax}(\mathbf{W}_{a}*(\mathbf{F}_{j}+\mathbf{F}_{k})),$		(28)
	$\displaystyle\mathbf{F}_{\text{agg}}=\mathbf{F}_{j}+Sim\odot\mathbf{F}_{k},$		(29)

where $\mathbf{W}_{1}$ , $\mathbf{W}_{2}$ , and $\mathbf{W}_{a}$ are convolutional filters, $*$ denotes the convolution operation, $\odot$ denotes element-wise multiplication, and ReLU and Softmax are the activation function and the softmax function, respectively.
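The local reference attention of Eqs. (26)-(29) can be sketched in NumPy under simplifying assumptions. All convolutions are modeled here as $1\times 1$ convolutions (per-pixel linear maps on a flattened, channels-last feature map), and the softmax is taken over the channel axis; the text specifies neither the kernel sizes nor the softmax axis, so both choices, along with every name below, are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_reference_attention(F_l, F_ref, W1, W2, Wa):
    # Eqs. (26)-(27): the same two filters embed the input and the reference.
    F_j = np.maximum(F_l @ W1, 0.0) @ W2
    F_k = np.maximum(F_ref @ W1, 0.0) @ W2
    # Eq. (28): similarity weights from the summed embeddings
    # (softmax over channels is an assumption).
    sim = softmax((F_j + F_k) @ Wa, axis=-1)
    # Eq. (29): add the similarity-weighted reference features.
    return F_j + sim * F_k

rng = np.random.default_rng(0)
HW, C = 16, 4                            # toy sizes: pixels, channels of the local half
F_l = rng.standard_normal((HW, C))       # local half of the split feature map
F_ref = rng.standard_normal((HW, C))     # reference features at the same resolution
W1, W2, Wa = (0.5 * rng.standard_normal((C, C)) for _ in range(3))

F_agg = local_reference_attention(F_l, F_ref, W1, W2, Wa)
# With an all-zero reference, F_k vanishes and only the embedded input remains.
F_no_ref = local_reference_attention(F_l, np.zeros_like(F_ref), W1, W2, Wa)
print(F_agg.shape)
```

Unlike the token-based reference attention, this variant requires the reference features to have the same spatial size as the input features, since the two are combined pixel by pixel.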

III-D Loss Function

Following widely adopted practice, we use the L1 norm between the output $Y$ and the ground truth $G$ as our loss function:

	$\displaystyle L=\|Y-G\|_{1}.$		(30)

TABLE I: Quantitative comparison of our method with other state-of-the-art approaches in noise-rain-lowlight settings. PSNR/SSIM values are reported. The best results are marked in bold.

Method	Denoise (BSD68), $\sigma=15$	Denoise (BSD68), $\sigma=25$	Denoise (BSD68), $\sigma=50$	Denoise (Urban100), $\sigma=15$	Denoise (Urban100), $\sigma=25$	Denoise (Urban100), $\sigma=50$	Derain (Rain100L)	Low-light (LOL)	Average
HINet	32.35/0.925	26.09/0.869	25.91/0.767	33.68/0.938	30.63/0.908	27.50/0.850	37.63/0.980	16.55/0.769	28.79/0.875
NAFNet	32.93/0.915	30.36/0.862	27.22/0.759	31.98/0.920	29.56/0.881	26.24/0.795	32.22/0.939	20.72/0.777	28.90/0.856
SwinIR	33.62/0.926	31.00/0.879	27.68/0.780	33.57/0.938	31.13/0.906	27.60/0.835	34.32/0.965	18.86/0.800	29.72/0.878
Restormer	33.67/0.924	31.07/0.876	27.86/0.782	33.46/0.934	31.09/0.904	27.80/0.837	36.55/0.974	21.49/0.822	30.37/0.881
AirNet	33.66/0.923	31.10/0.881	27.72/0.780	33.55/0.937	31.10/0.905	27.77/0.837	35.80/0.971	16.21/0.673	29.61/0.863
PromptIR	33.63/0.927	31.02/0.880	27.77/0.782	33.45/0.937	31.05/0.907	27.71/0.839	36.37/0.975	21.14/0.831	30.27/0.884
DA-CLIP	30.30/0.837	27.54/0.758	24.77/0.619	29.30/0.819	25.18/0.634	23.71/0.613	36.37/0.965	19.06/0.789	27.03/0.754
Ours	34.00/0.930	31.38/0.886	28.15/0.798	34.15/0.945	31.84/0.919	28.62/0.873	38.64/0.983	23.24/0.850	31.25/0.898

IV Experiments

IV-A Datasets and Benchmark

We evaluate our method on a multiple-in-one image restoration task comprising three representative subtasks: image deraining, image denoising, and low-light image enhancement. For image deraining, we train on the Rain1800 [40] dataset and evaluate on the 100 test images of the Rain100L [41] dataset. For denoising, we train on synthetically generated noisy images with noise levels $\sigma\in\{15,25,50\}$ from the WED [25] dataset, and evaluate on the Urban100 [13] and BSD68 [26] datasets. For low-light enhancement, we train on the LOL [37] dataset and test on its corresponding test set. During training, we sample from these three datasets uniformly at random. We compare our method against classic image restoration networks (HINet [8], NAFNet [7], SwinIR [22], Restormer [43]) and recent multiple-in-one approaches (AirNet [19], PromptIR [28], DA-CLIP [23]). We adopt PSNR and SSIM to assess model performance.
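For readers who want to reproduce the denoising setup in spirit, the following NumPy sketch shows one common way to synthesize noisy inputs and compute PSNR. It rests on assumptions not stated in the text: the noise is additive white Gaussian, $\sigma$ is given on the 0-255 intensity scale, images are stored in $[0,1]$, and no clipping is applied after adding noise. The function names are hypothetical, and SSIM is omitted.

```python
import numpy as np

def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise; sigma is on the 0-255 scale, img is in [0, 1]."""
    return img + rng.normal(0.0, sigma / 255.0, size=img.shape)

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    mse = np.mean((x - y) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(0)
clean = rng.random((256, 256, 3))        # random stand-in for a clean training image
for sigma in (15, 25, 50):               # the three training noise levels
    noisy = add_gaussian_noise(clean, sigma, rng)
    print(f"sigma={sigma}: input PSNR = {psnr(clean, noisy):.2f} dB")
```

The printed values give the PSNR of the unprocessed noisy input, which is a useful floor when reading the restored PSNR values in the tables.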

IV-B Implementation Details

We train our model using the PyTorch framework on a single NVIDIA RTX 3090 GPU with the Adam optimizer. During training, images are randomly cropped into 128×128 patches with a batch size of 2. The total number of training iterations is 300000. The initial learning rate is set to 2e-4 for the whole training process.

To generate the degradation text $\mathbf{T_{d}}$ and content text $\mathbf{T_{c}}$ , we use the GPT-4o multi-modal language model. To synthesize the reference images $\mathbf{I_{r}}$ , we employ the Stable Diffusion XL (SDXL) v1.0 diffusion model with 30 sampling steps. We generate all reference images and text descriptions before training our model.

To ensure a fair comparison, we retrain all baseline models using the same framework from PromptIR [28] and identical hyperparameters.

IV-C Comparison with SOTA Methods

IV-C1 Multiple-in-one restoration evaluation

In Table I, we compare our proposed LMDIR with existing state-of-the-art methods; LMDIR achieves the best PSNR and SSIM on every task. Notably, compared with PromptIR, our method improves PSNR by about 2.3 dB (38.64 dB vs. 36.37 dB) on the image deraining task. The denoising and low-light image enhancement tasks also show marked improvements. We attribute PromptIR’s less accurate restoration results to its implicit degradation feature learning. It is also worth noting that DA-CLIP requires a large volume of training data due to its reliance on diffusion models and fails to yield satisfactory results in our settings. In contrast to these methods, our approach leverages both the prior knowledge provided by the large models and the information intrinsically present in the degraded image, resulting in superior performance.
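The figures quoted from Table I can be checked with a few lines of arithmetic. The short Python sketch below copies the PSNR values of the "Ours" and PromptIR rows from Table I and assumes that the "Average" column is the unweighted mean of the eight task columns, which is consistent with the reported averages; the variable names are illustrative.

```python
# PSNR values (dB) copied from Table I, in column order:
# BSD68 (15, 25, 50), Urban100 (15, 25, 50), Derain, Low-light.
ours = [34.00, 31.38, 28.15, 34.15, 31.84, 28.62, 38.64, 23.24]
promptir = [33.63, 31.02, 27.77, 33.45, 31.05, 27.71, 36.37, 21.14]

DERAIN = 6  # index of the deraining column

derain_gain = ours[DERAIN] - promptir[DERAIN]   # gain quoted as about 2.3 dB
avg_ours = sum(ours) / len(ours)                # compare with 31.25 in Table I
avg_promptir = sum(promptir) / len(promptir)    # compare with 30.27 in Table I

print(f"deraining gain over PromptIR: {derain_gain:.2f} dB")
print(f"average PSNR, ours: {avg_ours:.2f} dB, PromptIR: {avg_promptir:.2f} dB")
print(f"average gain: {avg_ours - avg_promptir:.2f} dB")
```

Under this reading, the deraining gain is 2.27 dB, and the averaged gain over PromptIR across all eight columns is close to 1 dB.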

Visual comparisons of the different methods are shown in Figures 5 and 6. For each task, we select two representative images for comparison. For the denoising task, we set the noise level to $\sigma=50$ . As the figures show, our method achieves better restoration results than the other methods. In denoising, DA-CLIP fails to remove the noise completely, whereas PromptIR loses high-frequency details in the image. In low-light image enhancement, the colors produced by our method closely match the ground truth, while the results generated by AirNet remain dark. In image deraining, residual rain streaks are discernible in the images produced by AirNet and PromptIR, whereas our method yields the cleanest rain removal.

IV-C2 Model generalization performance

We further evaluate the generalization of the multiple-in-one restoration models on out-of-distribution (OOD) data to assess their practical performance in real-world applications. Specifically, we analyze the impact of unseen noise levels and rain streak intensities. For the denoising task, we select two noise levels, $\sigma$ = 60 and $\sigma$ = 75. For image deraining, we test on the Rain100H [41] and Test100 [41] datasets, both of which differ substantially from Rain100L and feature heavier rainfall. The results are presented in Table II and Figure 7. The performance of both AirNet and DA-CLIP, whose degradation knowledge rests solely on limited classification knowledge, drops significantly on OOD data. In contrast, the implicit degradation representation of PromptIR adapts to OOD data to some degree, so PromptIR outperforms AirNet on the unseen noise levels and on average. Our proposed method, which combines the prior knowledge of large-scale models with the information inherent in degraded images, achieves the best performance on OOD data.
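As an illustration of how such unseen noise levels can be synthesized, the sketch below adds zero-mean Gaussian noise to a toy image. It assumes the common convention that $\sigma$ is the standard deviation of additive white Gaussian noise on the 0-255 intensity scale; this section does not state the exact procedure, and the function name and the flat gray test image are hypothetical.

```python
import random
from typing import List, Sequence


def add_gaussian_noise(pixels: Sequence[float], sigma: float,
                       rng: random.Random, clip: bool = True) -> List[float]:
    """Add zero-mean Gaussian noise with standard deviation sigma (0-255 scale)."""
    noisy = [p + rng.gauss(0.0, sigma) for p in pixels]
    if clip:
        # Keep values inside the valid 8-bit intensity range.
        noisy = [min(255.0, max(0.0, v)) for v in noisy]
    return noisy


rng = random.Random(0)      # fixed seed so the sketch is repeatable
gray = [128.0] * 20000      # flat mid-gray toy "image" (assumed data)
for sigma in (60, 75):      # the unseen noise levels evaluated in Table II
    noisy = add_gaussian_noise(gray, sigma, rng)
    print(sigma, min(noisy), max(noisy))
```

At these noise levels a noticeable share of pixels saturates at 0 or 255 after clipping, which is one reason such inputs are harder than the noise levels seen in training.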

TABLE II: Performance on unseen noise levels ($\sigma$ = 60, 75) and severe rain conditions from the Rain100H and Test100 datasets. PSNR/SSIM values are reported. The best results are marked in bold.

Method	Denoise(BSD68) $\sigma$ =60	Denoise(BSD68) $\sigma$ =75	Denoise(Urban100) $\sigma$ =60	Denoise(Urban100) $\sigma$ =75	Derain(Rain100H)	Derain(Test100)	Average
AirNet	26.11/0.715	20.87/0.421	26.38/0.782	21.04/0.495	15.13/0.508	21.92/0.698	21.91/0.603
PromptIR	26.72/0.746	23.75/0.569	26.57/0.802	23.85/0.655	13.60/0.416	21.91/0.692	22.73/0.647
DA-CLIP	22.18/0.454	19.92/0.301	22.21/0.540	19.65/0.419	16.17/0.509	21.71/0.674	20.31/0.483
Ours	27.24/0.761	24.96/0.625	27.87/0.825	25.28/0.693	17.51/0.552	22.12/0.701	24.16/0.693
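
The Average column of Table II can be reproduced from the per-setting scores. The sketch below assumes the average is the unweighted mean over the six test settings, rounded to two decimals for PSNR and three for SSIM; the values are copied from the table, while the variable and function names are illustrative.

```python
from statistics import mean
from typing import List, Tuple

# (PSNR, SSIM) per setting, copied from Table II in column order:
# BSD68 sigma=60, BSD68 sigma=75, Urban100 sigma=60, Urban100 sigma=75,
# Rain100H, Test100.
TABLE_II = {
    "AirNet":   [(26.11, 0.715), (20.87, 0.421), (26.38, 0.782),
                 (21.04, 0.495), (15.13, 0.508), (21.92, 0.698)],
    "PromptIR": [(26.72, 0.746), (23.75, 0.569), (26.57, 0.802),
                 (23.85, 0.655), (13.60, 0.416), (21.91, 0.692)],
    "DA-CLIP":  [(22.18, 0.454), (19.92, 0.301), (22.21, 0.540),
                 (19.65, 0.419), (16.17, 0.509), (21.71, 0.674)],
    "Ours":     [(27.24, 0.761), (24.96, 0.625), (27.87, 0.825),
                 (25.28, 0.693), (17.51, 0.552), (22.12, 0.701)],
}


def average_row(scores: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Unweighted mean PSNR and SSIM over all settings of one method."""
    psnr_avg = round(mean(p for p, _ in scores), 2)
    ssim_avg = round(mean(s for _, s in scores), 3)
    return psnr_avg, ssim_avg


for method, scores in TABLE_II.items():
    print(method, average_row(scores))
```

Under this assumption the computed means match the reported Average column for all four methods.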

TABLE III: Ablation Experiment Results Evaluated with PSNR/SSIM Values. Best results are marked in bold.

Config	degradation	content	reference	Denoise(BSD68) $\sigma$ =15	Denoise(BSD68) $\sigma$ =25	Denoise(BSD68) $\sigma$ =50	Denoise(Urban100) $\sigma$ =15	Denoise(Urban100) $\sigma$ =25	Denoise(Urban100) $\sigma$ =50	Derain	low light
(I)	✕	✕	✕	33.67/0.924	31.07/0.879	27.86/0.782	33.46/0.934	31.11/0.904	27.80/0.837	36.55/0.974	21.49/0.822
(II)	✓	✕	✕	33.73/0.927	31.45/0.887	27.89/0.786	33.61/0.938	32.03/0.919	27.90/0.843	36.84/0.986	21.94/0.829
(III)	✓	✓	✕	33.84/0.928	31.22/0.882	28.04/0.792	33.77/0.940	31.74/0.915	28.15/0.866	38.06/0.981	22.17/0.836
Ours	✓	✓	✓	33.99/0.930	31.36/0.885	28.13/0.798	34.03/0.944	31.84/0.919	28.62/0.873	38.64/0.983	23.34/0.850

IV-D Visualization of Reference Images

We showcase the reference images used by our model in Figure 8. The visualization includes the degraded image, the ground truth image, the reference image generated by the diffusion model, and the content description generated by the MLLM. The content description aptly captures the semantic information of the image. Moreover, the generated reference image aligns semantically with the ground truth and can therefore provide local details.

IV-E On the Effectiveness of User Instructions

Our proposed model not only generates degradation priors automatically but also accepts manual user instructions to guide the restoration process, as illustrated in Figure 9. We provide the model with images affected by mixed degradations: the first row shows an image with low light and noise, and the second row shows an image with rain streaks and noise. For both samples, we first manually provide a degradation description for noise to obtain a result with that one degradation removed. We then provide a degradation prior for low light or rain to obtain the final reconstructed result.
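The two-step procedure amounts to chaining restoration passes, each conditioned on one user-supplied degradation description. The toy sketch below only illustrates this control flow: `restore_step` is a hypothetical stand-in for a forward pass of the model, and an "image" is represented simply by the set of degradations it still contains.

```python
from typing import Callable, FrozenSet, List, Sequence

# Toy stand-in: an "image" is the set of degradations it still contains.
Image = FrozenSet[str]


def restore_step(image: Image, degradation_prompt: str) -> Image:
    """Hypothetical placeholder for one restoration pass conditioned on a prompt."""
    return image - {degradation_prompt}


def restore_sequentially(image: Image, prompts: Sequence[str],
                         restore: Callable[[Image, str], Image] = restore_step
                         ) -> List[Image]:
    """Apply one restoration pass per prompt and keep every intermediate result."""
    outputs = []
    current = image
    for prompt in prompts:
        current = restore(current, prompt)
        outputs.append(current)
    return outputs


# Second row of Figure 9: rain streaks plus noise, noise handled first.
stages = restore_sequentially(frozenset({"noise", "rain"}), ["noise", "rain"])
print(stages)
```

Keeping the intermediate outputs mirrors the figure, which shows the result after the first instruction as well as the final reconstruction.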

IV-F Ablation Experiment

To fully evaluate the efficacy of our proposed modules, we conduct a series of ablation studies in four configurations. Starting from a baseline without any prior, we progressively integrate the degradation prior, the content prior, and the fine-grained image prior, which isolates the contribution of each module to the overall performance.
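Since the priors are added one at a time, the contribution of each can be read off as the PSNR difference between consecutive configurations. The sketch below does this for the deraining and low-light columns of Table III; the values are copied from the table, and the names are illustrative.

```python
from typing import Dict

# PSNR (dB) per configuration, copied from the Derain and low light
# columns of Table III.
ABLATION_PSNR = {
    "derain":    {"(I)": 36.55, "(II)": 36.84, "(III)": 38.06, "Ours": 38.64},
    "low light": {"(I)": 21.49, "(II)": 21.94, "(III)": 22.17, "Ours": 23.34},
}

# Prior that each configuration adds on top of the previous one.
ADDED_PRIOR = {"(II)": "degradation", "(III)": "content", "Ours": "reference"}


def incremental_gains(scores: Dict[str, float]) -> Dict[str, float]:
    """PSNR gain contributed by each newly added prior."""
    configs = list(scores)  # dicts keep insertion order
    return {
        ADDED_PRIOR[cur]: round(scores[cur] - scores[prev], 2)
        for prev, cur in zip(configs, configs[1:])
    }


for task, scores in ABLATION_PSNR.items():
    print(task, incremental_gains(scores))
```

On these two columns, the content prior contributes the largest step for deraining, while the reference prior contributes the largest step for low-light enhancement.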

IV-F1 Baseline

We establish a baseline model based on Restormer [43] that does not rely on any prior information. Leveraging the inherent capacity of self-attention for input-adaptive feature extraction, the baseline shows a preliminary ability to handle diverse degradations, as shown in the first row of Table III.

IV-F2 Effectiveness of degradation prior

Our query-based prompt encoder is designed to provide the model with a comprehensive degradation prior. In the second configuration of the ablation study, we replace the transformer block in the encoder of the baseline model with our degradation-aware transformer block, providing the network with degradation-specific information. As shown in the second row of Table III, introducing this degradation knowledge improves performance on multiple tasks, demonstrating the efficacy of the query-based prompt encoder in supplying degradation knowledge.

IV-F3 Effectiveness of textual content prior

Beyond generating degradation descriptions, our MLLM also produces content descriptions from the visual input. These descriptions serve as a source of global prior information, strengthening the model’s scene understanding. Building upon Model (II), we replace the transformer block in the bottleneck of the UNet with our content-aware transformer block. The results in the third row of Table III show that, after integrating this contextual information, the model improves across various tasks, affirming the role of content priors in enhancing overall performance.

IV-F4 Effectiveness of local image prior

Building upon Model (III), we incorporate fine-grained image priors into our network by replacing the Transformer block in the decoder with our reference-based Transformer block. We use an MLLM to generate contextually rich descriptions that serve as prompts for SDXL [27]. This process yields high-fidelity, degradation-free images that retain the same semantic content as the input image. Integrating such high-quality images offers detailed guidance to the model. As shown in the fourth row of Table III, the integration of this fine-grained prior knowledge improves the overall performance of the model. Furthermore, we visualize the similarity map between the CLIP features of the ground truth image and the reference image on the Test100 dataset. As illustrated in Fig. 12, the reference image is highly similar to the ground truth in the CLIP feature space. This observation provides strong evidence for the efficacy of our generated reference image and the associated content description text.
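One plausible way to build such a similarity map is to compare the two feature grids location by location with cosine similarity. The sketch below assumes that per-location feature vectors have already been extracted for both images; the paper does not specify the exact computation, and the tiny 2x2 grids of 3-dimensional vectors are toy data rather than real CLIP features.

```python
import math
from typing import List, Sequence

FeatureGrid = List[List[Sequence[float]]]  # H x W grid of feature vectors


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_map(gt_features: FeatureGrid,
                   ref_features: FeatureGrid) -> List[List[float]]:
    """Cosine similarity at every spatial location of two same-sized grids."""
    return [
        [cosine_similarity(g, r) for g, r in zip(gt_row, ref_row)]
        for gt_row, ref_row in zip(gt_features, ref_features)
    ]


# Toy features (assumed): the reference agrees with the ground truth
# everywhere except the bottom-right location.
gt = [[(1.0, 2.0, 3.0), (0.0, 1.0, 0.0)],
      [(2.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
ref = [[(2.0, 4.0, 6.0), (0.0, 3.0, 0.0)],
       [(5.0, 0.0, 0.0), (0.0, 1.0, 0.0)]]
for row in similarity_map(gt, ref):
    print([round(v, 3) for v in row])
```

Cosine similarity ignores feature magnitude, so locations where the reference differs only in scale still score close to 1, whereas semantically unrelated locations score near 0.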

IV-F5 Effectiveness of query-based prompt encoder

Our query-based prompt encoder is motivated by the observation that neither text embeddings nor image embeddings alone can fully capture the information necessary to discriminate between different degradations, as illustrated in Fig. 10. We employ t-SNE [36] to visualize the distributions of $e_{d}$ and $I_{d}$ . The results clearly demonstrate that both $I_{d}$ and $e_{d}$ individually fail to effectively distinguish between various degradations. However, after processing through our query-based prompt encoder, the boundaries between degradation types become distinctly delineated. This transformation provides strong evidence for the efficacy of our proposed query-based prompt encoder.

IV-G Visualization of feature maps

To underscore the efficacy of our LMDIR approach, we present a visual representation of the feature maps produced by the key components of our proposed architecture in Figure 11. This illustration demonstrates the functionality of our designed blocks. Upon the integration of the global degradation knowledge, the feature map exhibits a pronounced emphasis on the degradation details, highlighting the model’s ability to comprehend the image impairments. Incorporating the content information from the scene descriptions further enhances the model’s capacity to discern the primary subject within the image. Moreover, the incorporation of the fine-grained reference image priors imparts a heightened level of sharpness and clarity to the feature representation, demonstrating the complementary benefits of the multi-modal priors utilized in our LMDIR framework. These visualizations underscore the efficacy of our proposed approach in leveraging the synergistic combination of the MMLMs’ generic knowledge and the diffusion models’ generative capabilities to enable robust and versatile image restoration, overcoming the limitations of specialized models in dynamic degradation scenarios.

V Conclusion

In this work, we propose a novel multiple-in-one image restoration framework, termed LMDIR, which capitalizes on the wealth of prior knowledge offered by MLLMs and diffusion models. To integrate this information, we design a query-based prompt encoder, a degradation-aware transformer block, a content-aware transformer block, and a reference-based transformer block. Extensive experiments across a diverse range of datasets demonstrate that our method not only surpasses existing state-of-the-art techniques but also generalizes well to out-of-distribution datasets, showing the benefit of applying large-model priors to low-level vision tasks.

References

[1] Hello, gpt-4. https://openai.com/index/hello-gpt-4/, 2024. Accessed: 2024-06-24.
[2] S. D. Babacan, R. Molina, and A. K. Katsaggelos. Variational bayesian blind deconvolution using a total variation prior. IEEE Transactions on Image Processing, 18(1):12–26, 2009.
[3] M. Brack, F. Friedrich, K. Kornmeier, L. Tsaban, P. Schramowski, K. Kersting, and A. Passos. Ledits++: Limitless image editing using text-to-image models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8861–8870, 2024.
[4] T. Brooks, A. Holynski, and A. A. Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
[5] H. Chen, J. Gu, Y. Liu, S. A. Magid, C. Dong, Q. Wang, H. Pfister, and L. Zhu. Masked image training for generalizable deep image denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1692–1703, 2023.
[6] H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, C. Xu, and W. Gao. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12299–12310, 2021.
[7] L. Chen, X. Chu, X. Zhang, and J. Sun. Simple baselines for image restoration. In European conference on computer vision, pages 17–33. Springer, 2022.
[8] L. Chen, X. Lu, J. Zhang, X. Chu, and C. Chen. Hinet: Half instance normalization network for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 182–192, 2021.
[9] W.-T. Chen, H.-Y. Fang, C.-L. Hsieh, C.-C. Tsai, I. Chen, J.-J. Ding, S.-Y. Kuo, et al. All snow removed: Single image desnowing algorithm using hierarchical dual-tree complex wavelet representation and contradict channel loss. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4196–4205, 2021.
[10] K. Crowson, S. Biderman, D. Kornis, D. Stander, E. Hallahan, L. Castricato, and E. Raff. Vqgan-clip: Open domain image generation and editing with natural language guidance. In European Conference on Computer Vision, pages 88–105. Springer, 2022.
[11] M. Geva, R. Schuster, J. Berant, and O. Levy. Transformer feed-forward layers are key-value memories. arXiv preprint arXiv:2012.14913, 2020.
[12] C. Guo, C. Li, J. Guo, C. C. Loy, J. Hou, S. Kwong, and R. Cong. Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1780–1789, 2020.
[13] J.-B. Huang, A. Singh, and N. Ahuja. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5197–5206, 2015.
[14] K. Jiang, Z. Wang, P. Yi, C. Chen, B. Huang, Y. Luo, J. Ma, and J. Jiang. Multi-scale progressive fusion network for single image deraining. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8346–8355, 2020.
[15] Y. Jiang, K. C. Chan, X. Wang, C. C. Loy, and Z. Liu. Robust reference-based super-resolution via c2-matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2103–2112, 2021.
[16] B. Kawar, S. Zada, O. Lang, O. Tov, H. Chang, T. Dekel, I. Mosseri, and M. Irani. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6007–6017, 2023.
[17] X. Kong, C. Dong, and L. Zhang. Towards effective multiple-in-one image restoration: A sequential and prompt learning strategy. arXiv preprint arXiv:2401.03379, 2024.
[18] H. Lee, U. Ullah, J.-S. Lee, B. Jeong, and H.-C. Choi. A brief survey of text driven image generation and maniulation. In 2021 IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), pages 1–4. IEEE, 2021.
[19] B. Li, X. Liu, P. Hu, Z. Wu, J. Lv, and X. Peng. All-in-one image restoration for unknown corruption. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17452–17462, 2022.
[20] R. Li, R. T. Tan, and L.-F. Cheong. All in one bad weather removal using architectural search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3175–3185, 2020.
[21] Y. Li, Y. Zhang, R. Timofte, L. Van Gool, Z. Tu, K. Du, H. Wang, H. Chen, W. Li, X. Wang, et al. Ntire 2023 challenge on image denoising: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1904–1920, 2023.
[22] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1833–1844, 2021.
[23] Z. Luo, F. K. Gustafsson, Z. Zhao, J. Sjölund, and T. B. Schön. Controlling vision-language models for universal image restoration. arXiv preprint arXiv:2310.01018, 2023.
[24] J. Ma, T. Cheng, G. Wang, Q. Zhang, X. Wang, and L. Zhang. Prores: Exploring degradation-aware visual prompt for universal image restoration. arXiv preprint arXiv:2306.13653, 2023.
[25] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, volume 2, pages 416–423. IEEE, 2001.
[26] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int’l Conf. Computer Vision, volume 2, pages 416–423, July 2001.
[27] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
[28] V. Potlapalli, S. W. Zamir, S. Khan, and F. S. Khan. Promptir: Prompting for all-in-one blind image restoration. arXiv preprint arXiv:2306.13090, 2023.
[29] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[30] D. Ren, W. Zuo, Q. Hu, P. Zhu, and D. Meng. Progressive image deraining networks: A better and simpler baseline. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3937–3946, 2019.
[31] W. Ren, X. Cao, J. Pan, X. Guo, W. Zuo, and M.-H. Yang. Image deblurring via enhanced low-rank prior. IEEE Transactions on Image Processing, 25(7):3426–3437, 2016.
[32] W. Ren, S. Liu, L. Ma, Q. Xu, X. Xu, X. Cao, J. Du, and M.-H. Yang. Low-light image enhancement via a deep hybrid network. IEEE Transactions on Image Processing, 28(9):4364–4375, 2019.
[33] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
[34] S. Sheynin, A. Polyak, U. Singer, Y. Kirstain, A. Zohar, O. Ashual, D. Parikh, and Y. Taigman. Emu edit: Precise image editing via recognition and generation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024.
[35] X. Tao, H. Gao, X. Shen, J. Wang, and J. Jia. Scale-recurrent network for deep image deblurring. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8174–8182, 2018.
[36] L. Van der Maaten and G. Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
[37] C. Wei, W. Wang, W. Yang, and J. Liu. Deep retinex decomposition for low-light enhancement. arXiv preprint arXiv:1808.04560, 2018.
[38] H. Wu, Z. Zhang, E. Zhang, C. Chen, L. Liao, A. Wang, C. Li, W. Sun, Q. Yan, G. Zhai, et al. Q-bench: A benchmark for general-purpose foundation models on low-level vision. arXiv preprint arXiv:2309.14181, 2023.
[39] F. Yang, H. Yang, J. Fu, H. Lu, and B. Guo. Learning texture transformer network for image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5791–5800, 2020.
[40] W. Yang, R. T. Tan, J. Feng, Z. Guo, S. Yan, and J. Liu. Joint rain detection and removal from a single image with contextualized deep networks. IEEE transactions on pattern analysis and machine intelligence, 42(6):1377–1393, 2019.
[41] W. Yang, R. T. Tan, J. Feng, J. Liu, Z. Guo, and S. Yan. Deep joint rain detection and removal from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1357–1366, 2017.
[42] S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, and E. Chen. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549, 2023.
[43] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, and M.-H. Yang. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5728–5739, 2022.
[44] Z. Zhang, Z. Wang, Z. Lin, and H. Qi. Image super-resolution by neural texture transfer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7982–7991, 2019.
[45] H. Zheng, M. Ji, H. Wang, Y. Liu, and L. Fang. Crossnet: An end-to-end reference-based super resolution network using cross-scale warping. In Proceedings of the European conference on computer vision (ECCV), pages 88–104, 2018.
[46] K. Zhou, J. Yang, C. C. Loy, and Z. Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16816–16825, 2022.