UCIP: A Universal Framework for Compressed Image Super-Resolution using Dynamic Prompt

Xin Li†¹ (ORCID 0000-0002-6352-6523), Bingchen Li†¹ (ORCID 0009-0001-9990-7790), Yeying Jin² (ORCID 0000-0001-7818-9534), Cuiling Lan³ (ORCID 0000-0001-9145-9957), Hanxin Zhu¹ (ORCID 0009-0006-3524-0364), Yulin Ren¹ (ORCID 0009-0006-4815-7973), Zhibo Chen¹ (ORCID 0000-0002-8525-5066)

¹ University of Science and Technology of China
² National University of Singapore
³ Microsoft Research Asia

{xin.li, chenzhibo}@ustc.edu.cn, {lbc31415926, hanxinzhu, renyulin}@mail.ustc.edu.cn, [email protected], [email protected]

Abstract

Compressed Image Super-resolution (CSR) aims to simultaneously super-resolve compressed images and tackle the challenging hybrid distortions caused by compression. However, existing works on CSR usually focus on a single compression codec, i.e., JPEG, ignoring the diverse traditional and learning-based codecs used in practice, e.g., HEVC, VVC, HIFIC, etc. In this work, we propose the first universal CSR framework, dubbed UCIP, with dynamic prompt learning, intending to jointly support the CSR distortions of any compression codec/mode. In particular, an efficient dynamic prompt strategy is proposed to mine the content/spatial-aware task-adaptive contextual information for the universal CSR task, using only a small number of prompts with spatial size $1\times 1$. To simplify contextual information mining, we introduce a novel MLP-like backbone for UCIP by adapting the Active Token Mixer (ATM) to CSR tasks for the first time, where global information modeling is performed only in the horizontal and vertical directions with offset prediction. We also build an all-in-one benchmark dataset for the CSR task by collecting datasets with 6 popular and diverse traditional and learning-based codecs, including JPEG, HEVC, VVC, HIFIC, etc., resulting in 23 common degradations. Extensive experiments show the consistent and excellent performance of our UCIP on universal CSR tasks. The project can be found at https://lixinustc.github.io/UCIP.github.io

Keywords: Dynamic Prompt · Universal Compressed Image SR · MLP-like framework

† Equal contribution.

1 Introduction

In recent years, we have witnessed the significant development of Deep Neural Networks (DNNs) in image super-resolution (SR) [26, 63, 12, 24, 62, 71, 60, 31, 65, 50], where the image is degraded with low-resolution artifacts. However, in practical scenarios, due to the limitations of storage and bandwidth, collected images are also inevitably compressed with traditional image codecs, such as JPEG [59] and BPG [49]. Accordingly, compressed image super-resolution (CSR) has been proposed as an advanced task that better meets the requirements of industry and daily life. In general, the low-quality images in CSR are jointly degraded by compression artifacts, e.g., blocking artifacts and ringing effects, and by low-resolution artifacts. This severe and heterogeneous degradation poses more challenges and higher requirements for CSR backbones. Moreover, in real applications, compression codecs usually differ across platforms, which calls for a universal CSR model.

Some pioneering works [26, 10, 24, 60] attempt to remove this hard degradation by improving the representation ability of the network. The representative strategy is to design the CSR backbone with the Transformer, which profits from the self-attention module. For instance, Swin2SR [10] introduces the enhanced Swin Transformer [37, 36] (i.e., SwinV2) to boost the restoration capability of the CSR backbone. HST [24] utilizes a hierarchical backbone to excavate multi-scale representations for CSR. Although transformer-based backbones have shown strong recovery capability in CSR, the high computational cost of the transformer hinders its application and training optimization [34, 10]. Recently, multi-layer perceptron (MLP) models have demonstrated their potential to achieve a trade-off between computational cost and global dependency modeling in classification [8, 33, 64, 55, 54], benefiting from efficient and effective token-mixing strategies. Inspired by this, MAXIM [56], the first MLP-based framework in image processing, was proposed, where the image tokens interact in global and local manners with a multi-axis MLP. However, the above works focus only on single distortion removal and thus lack enough universality for CSR tasks.

In this work, we propose the first universal framework, dubbed UCIP, for CSR tasks with our dynamic prompt strategy based on an MLP-like module. It is noteworthy that the optimal contextual information obtained with the CSR network tends to vary with the content/spatial position and degradation type, which entails a content-aware task-adaptive contextual information modeling capability. To achieve this, existing prompt-based IR methods [45, 32, 29, 23] have attempted to set multiple prompts with image size, which lacks adaptability to various input sizes and leads to more computational cost. In contrast, our dynamic prompt strategy not only achieves content-aware task-adaptive modulation but also has wider applicability. Concretely, we propose the Dynamic Prompt generation Module (DPM), where a group of prompts with the size of $1\times 1\times C_{p}$ is set and $C_{p}$ is the channel dimension. Then spatial-wise composable coefficients $H\times W\times C_{p}$ are generated from the distorted images, which guide the cooperation of these prompt bases to form the dynamic prompt with image size, thereby possessing the content/spatial- and task-adaptive modulation capability.

Based on the powerful DPM, we can achieve the universal CSR framework by incorporating it into existing CSR backbones. However, in the commonly used Transformer backbone, contextual information modeling is achieved with the costly attention module, where any two tokens are required to interact. In contrast, the active token mixer (ATM) [64] has been proposed for the MLP-like backbone to reduce the computational cost by implicitly achieving contextual information modeling in the horizontal and vertical directions with offset generation. However, no works have explored the potential of this backbone on low-level vision tasks. Inspired by this, we propose the dynamic prompt-guided token mixer block (PTMB) by fusing the advantages of our DPM and ATM, where our DPM guides the contextual information modeling process of the ATM by modulating the offset prediction and the token mixer. Notably, with only horizontal and vertical contextual modeling, ATM lacks sufficient local information utilization. Consequently, we add a local branch to PTMB with one $3\times 3$ convolution. Based on PTMB, our UCIP can achieve efficient and excellent universal compressed image super-resolution for different codecs/modes.

To build the benchmark dataset for universal CSR tasks, we collected the datasets with 6 representative image codecs, including 3 traditional codecs and 3 learning-based codecs. Concretely, traditional codecs consist of JPEG [59], all-intra mode of HEVC [49], and VVC [5]. For learning-based codecs, to ensure the diversity of degradations, we select 3 codecs with different optimization objectives, i.e., PSNR-oriented, SSIM-oriented, and GAN-based codecs. In this way, our database can cover the prominent compression types in recent industry and research fields. We have compared our UCIP and reproduced state-of-the-art methods on this benchmark, which showcases the superiority and robustness of our UCIP.

The contributions of this paper are listed as follows:

  • We propose the first universal framework, i.e., UCIP for the CSR tasks with our dynamic prompt strategy, intending to achieve the "all-in-one" for the CSR degradations with different codecs/modes.

  • We propose the dynamic prompt-guided token mixer block (PTMB) by fusing the advantages of our proposed dynamic prompt generation module (DPM) and revised active token mixer (ATM), as the basic block for UCIP.

  • We propose the first dataset benchmark for universal CSR tasks by collecting datasets with 6 prominent traditional and learning-based codecs, consisting of multiple compression degrees. This ensures the diversity of degradations in the benchmark dataset, thereby being reliable as the benchmark to measure different CSR methods.

  • Extensive experiments on our universal CSR benchmark dataset have revealed the effectiveness of our proposed UCIP, which outperforms the recent state-of-the-art transformer-based methods with lower computational costs.

2 Related Works

2.1 Compressed Image Super-resolution

Compressed Image Super-resolution aims to tackle complicated hybrid distortions, including compression artifacts and low-resolution artifacts [21, 67, 24, 26, 30, 14, 66, 7]. The first challenge for this task was held at AIM2022 [67], where the image is first downsampled with the bicubic operation and then compressed with a JPEG codec. To solve this hard degradation, some works [26, 10, 46, 24] seek to utilize a Transformer-based architecture as their backbone. For instance, Swin2SR [10] eliminates the training instability and the requirement for large data in CSR by incorporating the Swin Transformer V2 into SwinIR [34]. HST [24] utilizes a multi-scale information flow and a pre-training strategy [28] to enhance the restoration process with a hierarchical Swin Transformer. To further fuse the advantages of convolution and transformer, Qin et al. [46] propose a dual-branch network, which achieves consecutive interaction between the convolution branch and the transformer branch. In contrast, to achieve a trade-off between performance and computational cost, we aim to explore an efficient and effective framework for the universal CSR problem.

2.2 MLP-like Models

As an alternative to Transformers and Convolutional Neural Networks (CNNs), MLP-like models [33, 8, 54, 64, 70, 55, 52, 51, 18, 68] have attracted great attention for their concise architectures. Typically, the noticeable success of MLP-like models stems from well-designed token-mixing strategies [64]. The pioneering works, MLP-Mixer [54] and ResMLP [55], adopt two types of MLP layers, i.e., channel-mixing MLP and token-mixing MLP, which are responsible for channel and spatial information interaction, respectively. To simplify the token-mixing MLP, Hou et al. [18] and Tang et al. [51] decompose it into horizontal and vertical token-mixing MLPs. Subsequently, AS-MLP [33] introduces a two-axis token shift in different channels to achieve global token mixing. Several works also adopt hand-crafted windows to enlarge the receptive field for better spatial token mixing, e.g., WaveMLP [52] and MorphMLP [70]. However, the token-mixing strategies in the above methods are fixed and lack flexibility and adaptability to different contents. To overcome this, ATM [64] was proposed to achieve active token selection and mixing in each channel. Building on the progress of the above MLP-like models, MAXIM [56] is the first work to introduce an MLP-like model into low-level processing. However, the potential of MLP-like models is yet to be fully explored, as restoration models not only require long-range token mixing but also demand efficient local feature extraction.

2.3 Prompt Learning

In the field of Natural Language Processing (NLP), prompt learning has emerged as a pivotal technique, particularly with the advent of transformer-based pre-trained models such as GPT [6, 43] and BERT [11]. Prompt learning involves providing models with specific textual cues that guide their processing of subsequent input, which helps models adapt quickly to unseen tasks or applications. This approach has proven instrumental in directing models toward task-specific outputs without necessitating extensive retraining or fine-tuning. Following this success in NLP tasks, some researchers have adopted prompt learning in vision tasks [20, 47, 27, 2, 22, 35, 61]. Among them, PromptIR [45] is the first to explore a low-level restoration model with prompts to facilitate multi-task learning [57, 58, 25, 48, 13]. Prompts here act as a small set of learnable parameters that interact with image features during training, providing task-specific guidance. Therefore, the prompts should be as dynamic as possible to adapt to various degradation tasks and different pixel distributions.

Figure 1: Illustration of our proposed UCIP. From top to bottom: (a) The overall framework of UCIP. The LR input is first enhanced through several consecutive PTMBs, then upsampled by the HR reconstruction module. (b) The architecture of PTMB. Each PTMB utilizes the dynamic prompt generated from a DPM and several cascading PTMMs to iteratively refine distorted inputs. (c) The architecture of PTMM. PTMM takes the prompt P along with the image feature $\textbf{F}_{{\text{X}}_{i}}$ as input to adaptively generate offsets, which help the network perform content/spatial-aware task-adaptive contextual information extraction.

3 Methods

In this section, we first clarify the principle and construction of our dynamic prompt generation module in Sec. 3.1, and then describe how to achieve the basic block of our UCIP, i.e., dynamic prompt-guided token mixer block in Sec. 3.2.1. Finally, we depict the whole framework of our UCIP in Sec. 3.3.

3.1 Dynamic Prompt Generation Module

As stated in Sec. 1, the universal CSR tasks entail content/spatial- and task-adaptive modulation. An intuitive strategy is to set one prompt with the image size for each task individually or to fuse such prompts adaptively. However, this brings severe parameter costs as the number of tasks or the image size increases [45]. To mitigate this, we propose the dynamic prompt strategy and design the corresponding dynamic prompt generation module (DPM), intending to exploit only a small number of prompts of size $1\times 1\times C_{p}$ and to achieve content/spatial- and task-adaptive modulation through their cooperation. To this end, we decouple the large dynamic prompt of size $H\times W\times C_{p}$ into two smaller matrices, i.e., the coefficients $\mathbf{w_{I}}$ of size $H\times W\times D$ and $D$ basic prompts of size $1\times 1\times C_{p}$. In other words, for each spatial position $\{i,j\}$, there is one group of coefficients $w_{I}(i,j)$ that combines the $D$ basic prompts, which makes the prompt content/spatial-adaptive. To let the dynamic prompt perceive the task information, we generate the coefficients directly from the features of the input images, which makes the prompt task-adaptive and suitable for any input size.
Our implementation has two advantages: 1) no extra operations are needed to adjust the spatial size of the prompts, and thus the guidance information from the prompts is explicit and accurate; 2) our prompts have fewer parameters and are more computationally friendly than previous methods [45].

Figure 2: The architecture of DPM. To dynamically aggregate content/spatial-aware task-adaptive contextual information, we introduce a small number of basic dynamic kernels into the generation process of our prompt. Moreover, our design maintains adaptability to arbitrary input resolutions.

The overall architecture of $\operatorname{DPM}$ is shown in Fig. 2, where the learnable basic prompts $\textbf{P}_{\text{I}}\in\mathbb{R}^{D\times 1\times 1\times C_{p}}$ are set. Here, $D$ and $C_{p}$ are the number of base prompts and the channel dimension of the prompts, respectively. To generate dynamic prompt coefficients from the input features $\textbf{F}_{\text{X}}\in\mathbb{R}^{H\times W\times C}$, an MLP layer is applied to extract the degradation prior and transform the channel dimension from $C$ to the number of basic prompts $D$. Then, the $\operatorname{softmax}$ operation is exploited to generate the composable coefficients $\textbf{w}_{\text{I}}\in\mathbb{R}^{D\times H\times W\times 1}$ for the basic prompts. By inverting the above dynamic prompt decomposition, we obtain the dynamic prompt as:

$$\textbf{w}_{\text{I}}=\operatorname{Softmax}(\operatorname{MLP}(\textbf{F}_{\text{X}})),\quad\textbf{P}=\sum^{D}\left(\textbf{w}_{\text{I}}\odot\textbf{P}_{\text{I}}\right) \tag{1}$$
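The composition in Eq. (1) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the MLP is replaced by a single linear layer, the names `dynamic_prompt`, `W_mlp`, `b_mlp` and the sizes `C`, `D`, `C_p` are hypothetical, and the coefficients are stored in an $H\times W\times D$ layout.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_prompt(features, mlp_weight, mlp_bias, base_prompts):
    """Compose a dynamic prompt of shape (H, W, C_p) following Eq. (1).

    features:     (H, W, C) input features F_X
    mlp_weight:   (C, D) single linear layer standing in for the MLP (assumption)
    mlp_bias:     (D,)
    base_prompts: (D, C_p) base prompts P_I, each of spatial size 1x1
    """
    logits = features @ mlp_weight + mlp_bias   # (H, W, D)
    coeffs = softmax(logits, axis=-1)           # w_I: per-pixel weights over the D prompts
    prompt = coeffs @ base_prompts              # weighted sum over D -> (H, W, C_p)
    return coeffs, prompt

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
C, D, C_p = 8, 4, 6
W_mlp = rng.standard_normal((C, D))
b_mlp = np.zeros(D)
P_I = rng.standard_normal((D, C_p))

# The same D base prompts serve inputs of different spatial sizes.
for H, W in [(5, 7), (16, 12)]:
    F_X = rng.standard_normal((H, W, C))
    w_I, P = dynamic_prompt(F_X, W_mlp, b_mlp, P_I)
    print(P.shape, np.allclose(w_I.sum(axis=-1), 1.0))
```

Because the learnable state is only the $D\times C_{p}$ prompt table plus the coefficient layer, the parameter count in this sketch does not depend on $H$ or $W$, which is the property the text attributes to the dynamic prompt.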

3.2 Prompt-guided Token Mixer Block

3.2.1 Prompt-guided token mixer module

After obtaining the dynamic prompt, we can exploit it to guide the restoration network for universal CSR tasks. Recently, the Active Token Mixer (ATM) [64] has achieved great success in high-level vision tasks due to its well-designed token-mixing strategy. In contrast to the transformer architecture, where contextual information modeling is performed through interactions between any two tokens, ATM utilizes deformable convolution to predict the offsets of the most relevant tokens, achieving implicit contextual information modeling in the horizontal and vertical directions with offset generation.

Inspired by this, we propose the Dynamic Prompt-guided Token Mixer Module, dubbed PTMM, which exploits the dynamic prompt generated by DPM to guide the prediction of the offsets of the most informative tokens for contextual modeling. Concretely, PTMM leverages deformable convolutions and offsets to adaptively fuse tokens across the horizontal and vertical axes, regardless of the degradation type. However, as mentioned in [56], MLP-like modules exhibit diminished efficacy in extracting local relevance, which is essential for compressed super-resolution tasks. Therefore, we introduce a depth convolution around the target pixel to achieve local information extraction.

As shown in Fig. 1(b), PTMM first extracts vertical and horizontal representative offsets $\mathbf{O}^{V},\mathbf{O}^{H}$ by two sets of fully connected layers. To incorporate task-adaptive information during offset generation, we concatenate the dynamic prompt generated from DPM with the input features $\textbf{F}_{\text{X}}$ as the condition:

$$\mathbf{O}^{\{V,H\}}=\operatorname{FC}_{\{V,H\}}(\operatorname{Concat}([\textbf{F}_{\text{X}},\textbf{P}])) \tag{2}$$

Then, we use the offsets to recompose features along one certain axis into a new token $\tilde{\mathbf{x}}^{\{V,H\}}$ by the deformable convolution for information fusion (i.e., token mixer). In addition, we adopt a depth convolution to achieve local information extraction:

$$\tilde{\mathbf{x}}^{L}=\operatorname{Conv_{3\times 3}}(\textbf{F}_{\text{X}}) \tag{3}$$

After we obtain these three tokens $\tilde{\mathbf{x}}^{\{V,H,L\}}$, we adaptively mix them with learned weights, formulated as

$$\textbf{F}_{\tilde{\mathbf{x}}}=\boldsymbol{\alpha}^{V}\odot\tilde{\mathbf{x}}^{V}+\boldsymbol{\alpha}^{H}\odot\tilde{\mathbf{x}}^{H}+\boldsymbol{\alpha}^{L}\odot\tilde{\mathbf{x}}^{L} \tag{4}$$

where $\odot$ denotes element-wise multiplication. $\boldsymbol{\alpha}^{\{V,H,L\}}\in\mathbb{R}^{C}$ are learned from the summation $\tilde{\mathbf{x}}^{\Sigma}$ of $\tilde{\mathbf{x}}^{\{V,H,L\}}$ with weights $W^{\{V,H,L\}}\in\mathbb{R}^{C\times C}$, where $C$ denotes the channel dimension:

$$\left[\boldsymbol{\alpha}^{V},\boldsymbol{\alpha}^{H},\boldsymbol{\alpha}^{L}\right]=\sigma\left(\left[W^{V}\cdot\tilde{\mathbf{x}}^{\Sigma},W^{H}\cdot\tilde{\mathbf{x}}^{\Sigma},W^{L}\cdot\tilde{\mathbf{x}}^{\Sigma}\right]\right).$$

Here, $\sigma(\cdot)$ is a softmax function for normalizing each channel separately.
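The adaptive mixing in Eq. (4) can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function name `mix_branches` and all sizes are hypothetical, and the weights are computed independently for each token's $C$-dimensional vector, since the text does not specify whether $\tilde{\mathbf{x}}^{\Sigma}$ is pooled spatially first.

```python
import numpy as np

def mix_branches(x_v, x_h, x_l, w_v, w_h, w_l):
    """Fuse vertical, horizontal and local tokens as in Eq. (4).

    x_v, x_h, x_l: arrays of shape (..., C), one C-dim vector per token
    w_v, w_h, w_l: (C, C) weight matrices W^V, W^H, W^L
    """
    x_sum = x_v + x_h + x_l                                   # summation x^Sigma
    logits = np.stack([x_sum @ w_v.T, x_sum @ w_h.T, x_sum @ w_l.T])
    logits = logits - logits.max(axis=0, keepdims=True)       # numerical stability
    alphas = np.exp(logits)
    alphas = alphas / alphas.sum(axis=0, keepdims=True)       # softmax over the 3 branches, per channel
    fused = alphas[0] * x_v + alphas[1] * x_h + alphas[2] * x_l
    return alphas, fused

# Hypothetical sizes: 10 tokens with 4 channels each.
rng = np.random.default_rng(0)
n_tokens, n_channels = 10, 4
x_v, x_h, x_l = (rng.standard_normal((n_tokens, n_channels)) for _ in range(3))
w_v, w_h, w_l = (rng.standard_normal((n_channels, n_channels)) for _ in range(3))

alphas, fused = mix_branches(x_v, x_h, x_l, w_v, w_h, w_l)
print(alphas.shape, fused.shape)
```

For every token and channel the three weights are positive and sum to one, so each output channel is a convex combination of the vertical, horizontal and local branches.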

To further incorporate the task prior into our UCIP, we modulate the mixed features $\textbf{F}_{\tilde{\mathbf{x}}}$ with the aforementioned dynamic prompt P through a SPADE block [44] to obtain the output features of the PTMM, as shown in Fig. 1.

3.2.2 Discussions

The two most relevant MLP-like methods are MAXIM [56] and ActiveMLP [64]. MAXIM differs from our UCIP in that it is designed only for a specific task, and its cross-gating block and dense connections result in severe computational costs. ActiveMLP differs from our UCIP in that it is designed for classification and focuses more on global information extraction, lacking local perception. Compared with them, our UCIP introduces a simple MLP-based architecture and the dynamic prompt for low-level vision, which makes it more applicable than the above methods to universal CSR.

3.2.3 Overall pipeline

To improve the modeling capability of PTMB, we connect $N$ PTMMs in succession. It is worth noting that, to balance model performance and computational cost, we share the prompt P across all PTMMs within a single PTMB. As for the offsets, we generate new offsets every two PTMMs. The whole process of PTMB can be formulated as:

$$\textbf{P}=\operatorname{DPM}(\textbf{F}_{\text{X}},\textbf{P}_{\text{I}}),\quad\textbf{F}_{\text{X}_{i+1}}=\operatorname{PTMM}(\textbf{P},\textbf{F}_{\text{X}_{i}}) \tag{5}$$

where $\textbf{F}_{\text{X}_{i}}$ is the input feature of the $i^{th}$ PTMM.
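The control flow of Eq. (5) can be sketched in plain Python. The sketch only shows the scheduling: the prompt is computed once per PTMB and the offsets are refreshed every two PTMMs. The function names and the toy stand-ins operating on scalars are hypothetical, and refreshing at the even-indexed modules (0, 2, 4, ...) is an assumption, since the text does not say which modules trigger a refresh.

```python
def run_ptmb(features, base_prompts, dpm, make_offsets, ptmm, num_ptmm):
    """Apply num_ptmm cascaded PTMMs that share a single prompt (Eq. 5)."""
    prompt = dpm(features, base_prompts)        # computed once per PTMB
    offsets = None
    refreshed_at = []
    for i in range(num_ptmm):
        if i % 2 == 0:                          # new offsets every two PTMMs (assumed schedule)
            offsets = make_offsets(features, prompt)
            refreshed_at.append(i)
        features = ptmm(prompt, features, offsets)
    return features, refreshed_at

# Toy stand-ins on scalars, used only to exercise the control flow.
def toy_dpm(features, base_prompts):
    return sum(base_prompts) / len(base_prompts)

def toy_offsets(features, prompt):
    return features + prompt

def toy_ptmm(prompt, features, offsets):
    return features + 0.1 * offsets

out, refreshed_at = run_ptmb(1.0, [0.5, 1.5], toy_dpm, toy_offsets, toy_ptmm, num_ptmm=6)
print(refreshed_at)
```

In a real model the three callables would be the DPM, the prompt-conditioned offset layers of Eq. (2), and the PTMM itself.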

3.3 Overall Framework

As shown in Fig. 1, we build our UCIP following the popular pipeline of compressed super-resolution backbones, which is composed of shallow feature extraction, deep feature restoration, and HR reconstruction modules. Given a low-resolution input image $\textbf{X}_{\text{LR}}\in\mathbb{R}^{H\times W\times 3}$, UCIP first extracts the shallow features $\textbf{F}_{\text{X}}\in\mathbb{R}^{H\times W\times C}$ using a patch-embedding layer, where $H$ and $W$ are the spatial dimensions of the features. Then, we pass $\textbf{F}_{\text{X}}$ through several PTMBs to recursively remove the compression artifacts and generate the restored features $\textbf{F}_{\text{X}_{r}}$. Finally, following [63, 34], we use a series of convolution layers and nearest interpolation operations to obtain the final high-resolution output $\textbf{X}_{\text{HR}}$, which can be represented as:

$$\textbf{X}_{\text{HR}}=\operatorname{Conv}(\operatorname{Conv}(\operatorname{Conv}(\textbf{F}_{\text{X}}+\textbf{F}_{\text{X}_{r}})\uparrow_{\times 2})\uparrow_{\times 2}) \tag{6}$$
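The nesting order in Eq. (6) can be checked with a small NumPy sketch. To keep it short, every Conv is replaced by a 1×1 (pointwise) convolution, which is an assumption; the kernel sizes and channel widths of the actual reconstruction module are not given in the text, and all names and sizes below are hypothetical.

```python
import numpy as np

def upsample_nearest_x2(x):
    """Nearest-neighbour x2 upsampling of an (H, W, C) array."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def pointwise_conv(x, weight):
    """1x1 convolution: a per-pixel linear map from C_in to C_out channels."""
    return x @ weight

def reconstruct(f_x, f_xr, w1, w2, w3):
    """Follow Eq. (6): Conv, upsample x2, Conv, upsample x2, Conv."""
    x = pointwise_conv(f_x + f_xr, w1)                # innermost Conv on the residual sum
    x = pointwise_conv(upsample_nearest_x2(x), w2)    # first x2 stage
    x = pointwise_conv(upsample_nearest_x2(x), w3)    # second x2 stage, maps to RGB
    return x

# Hypothetical sizes for illustration.
rng = np.random.default_rng(0)
H, W, C = 6, 5, 8
f_x = rng.standard_normal((H, W, C))     # shallow features F_X
f_xr = rng.standard_normal((H, W, C))    # restored features F_{X_r}
w1 = rng.standard_normal((C, C))
w2 = rng.standard_normal((C, C))
w3 = rng.standard_normal((C, 3))

x_hr = reconstruct(f_x, f_xr, w1, w2, w3)
print(x_hr.shape)
```

Two ×2 stages give the overall ×4 factor used in the experiments, so an $H\times W$ input yields a $4H\times 4W$ output with three channels.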

3.4 Our UCSR Dataset

To facilitate current and future research in CSR, we propose the first benchmark dataset for universal CSR, dubbed the UCSR dataset, which considers not only traditional compression methods but also learning-based compression methods. We consider 6 types of compression codecs, including the 3 most representative traditional codecs, JPEG [59], HM [49], and VTM [5], and 3 open-sourced learning-based codecs, $\text{Cheng}_{\text{PSNR}}$ [9], $\text{Cheng}_{\text{SSIM}}$ [9] (abbreviated as $\text{C}_{\text{PSNR}}$ and $\text{C}_{\text{SSIM}}$ in the remainder of the paper), and HIFIC [42]. These three learning-based codecs are the PSNR-oriented and SSIM-oriented variants from [9] and the perceptual-oriented GAN-based codec from [42], respectively. To cover the prominent compression types in real scenarios, we consider four different compression qualities for each codec, except for HIFIC, since only the weights for three bitrate points are released.

To generate the training dataset, we choose the popular DF2K [1, 53], which contains 3450 high-quality images. Each image is downsampled by a scale factor of 4 using the MATLAB bicubic algorithm. Then, we compress the downsampled images with the six compression algorithms to yield the training dataset for all competing methods and our UCIP. The quality factors used for the different codecs are as follows: (i) [10, 20, 30, 40] for JPEG, where a smaller value means poorer image quality. (ii) [32, 37, 42, 47] for HM and VTM, where the value denotes the quantization parameter (QP), and a larger value means poorer quality. (iii) [1, 2, 3, 4] for $\text{C}_{\text{PSNR}}$ and $\text{C}_{\text{SSIM}}$, where a smaller value indicates poorer quality. We adopt the implementation in the popular open-sourced compression toolkit CompressAI [3]. (iv) [‘low’, ‘med’, ‘high’] for HIFIC, where ‘low’ indicates the poorest image quality. We use the PyTorch implementation [16] to compress images. All the methods are trained from scratch on our proposed benchmark. We adopt the same process to generate the evaluation datasets based on five commonly used benchmarks: Set5 [4], Set14 [69], BSD100 [40], Urban100 [19] and Manga109 [41].
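The settings listed above can be written as a small configuration table, which also confirms the count of 23 degradations stated in the abstract. The dictionary layout and the names `CODEC_QUALITIES`, `LARGER_IS_WORSE` and `worst_setting` are hypothetical; only the codec names and quality values come from the text.

```python
# Quality settings per codec, as listed in Sec. 3.4.
CODEC_QUALITIES = {
    "JPEG": [10, 20, 30, 40],          # quality factor, smaller = poorer
    "HM": [32, 37, 42, 47],            # QP, larger = poorer
    "VTM": [32, 37, 42, 47],           # QP, larger = poorer
    "C_PSNR": [1, 2, 3, 4],            # quality index, smaller = poorer
    "C_SSIM": [1, 2, 3, 4],            # quality index, smaller = poorer
    "HIFIC": ["low", "med", "high"],   # 'low' = poorest
}

# Codecs whose parameter grows as quality drops.
LARGER_IS_WORSE = {"HM", "VTM"}

def worst_setting(codec):
    """Return the setting that gives the poorest quality for a codec."""
    settings = CODEC_QUALITIES[codec]
    return settings[-1] if codec in LARGER_IS_WORSE else settings[0]

# One (codec, setting) pair per degradation type.
degradations = [(codec, q) for codec, qs in CODEC_QUALITIES.items() for q in qs]

print(len(CODEC_QUALITIES), len(degradations))
print([worst_setting(c) for c in CODEC_QUALITIES])
```

Five codecs contribute four settings each and HIFIC contributes three, which gives 5 × 4 + 3 = 23 (codec, setting) pairs.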

4 Experiments

Table 1: Quantitative comparison for compressed image super-resolution on traditional codecs. Results are tested on $\times 4$ with different compression qualities in terms of PSNR↑/SSIM↑. The best performances are in red. Notably, all compared methods are trained from scratch with our proposed UCSR dataset for fair comparisons. $\mathcal{D}$ denotes "Datasets" and $\mathcal{Q}$ the quality setting of each codec.

| $\mathcal{D}$ | Methods | JPEG [59] $\mathcal{Q}=10$ | JPEG [59] $\mathcal{Q}=20$ | JPEG [59] $\mathcal{Q}=30$ | JPEG [59] $\mathcal{Q}=40$ | HM [49] $\mathcal{Q}=47$ | HM [49] $\mathcal{Q}=42$ | HM [49] $\mathcal{Q}=37$ | HM [49] $\mathcal{Q}=32$ | VTM [5] $\mathcal{Q}=47$ | VTM [5] $\mathcal{Q}=42$ | VTM [5] $\mathcal{Q}=37$ | VTM [5] $\mathcal{Q}=32$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Set5 [4] | RRDB [63] | 24.44/0.676 | 25.93/0.729 | 26.70/0.754 | 27.22/0.769 | 22.48/0.624 | 24.48/0.690 | 26.52/0.752 | 28.05/0.794 | 22.70/0.635 | 24.84/0.706 | 26.65/0.758 | 28.10/0.797 |
| | SwinIR [34] | 24.79/0.703 | 26.25/0.747 | 27.07/0.771 | 27.59/0.783 | 22.66/0.647 | 24.54/0.703 | 26.82/0.765 | 28.53/0.809 | 22.81/0.652 | 24.97/0.716 | 26.93/0.768 | 28.72/0.813 |
| | Swin2SR [10] | 24.80/0.705 | 26.24/0.752 | 27.16/0.774 | 27.64/0.786 | 22.66/0.650 | 24.55/0.705 | 26.81/0.766 | 28.50/0.809 | 22.79/0.652 | 24.91/0.716 | 26.89/0.769 | 28.64/0.813 |
| | MAXIM [56] | 24.83/0.709 | 26.15/0.751 | 27.00/0.773 | 27.44/0.784 | 22.69/0.648 | 24.60/0.705 | 26.75/0.764 | 28.48/0.808 | 22.88/0.654 | 24.96/0.718 | 26.89/0.770 | 28.61/0.811 |
| | AIRNet [25] | 24.67/0.701 | 26.04/0.745 | 26.83/0.767 | 27.30/0.779 | 22.56/0.640 | 24.38/0.698 | 26.55/0.760 | 28.24/0.805 | 22.71/0.648 | 24.81/0.714 | 26.65/0.766 | 28.38/0.810 |
| | PromptIR [45] | 24.82/0.707 | 26.24/0.751 | 27.13/0.774 | 27.62/0.787 | 22.68/0.652 | 24.55/0.705 | 26.87/0.768 | 28.64/0.813 | 22.89/0.658 | 24.99/0.720 | 26.93/0.771 | 28.74/0.815 |
| | UCIP | 25.05/0.715 | 26.53/0.761 | 27.44/0.782 | 27.94/0.794 | 22.77/0.656 | 24.76/0.711 | 27.05/0.772 | 28.82/0.815 | 22.89/0.657 | 25.11/0.722 | 27.17/0.775 | 28.95/0.819 |
| Set14 [69] | RRDB [63] | 23.40/0.579 | 24.49/0.619 | 25.01/0.639 | 25.32/0.651 | 21.84/0.531 | 23.48/0.584 | 24.93/0.635 | 25.99/0.679 | 22.12/0.541 | 23.74/0.594 | 25.09/0.643 | 26.05/0.682 |
| | SwinIR [34] | 23.77/0.596 | 24.81/0.630 | 25.32/0.649 | 25.66/0.662 | 21.96/0.542 | 23.59/0.593 | 25.13/0.645 | 26.38/0.695 | 22.19/0.550 | 23.79/0.599 | 25.32/0.652 | 26.49/0.699 |
| | Swin2SR [10] | 23.79/0.597 | 24.84/0.631 | 25.36/0.651 | 25.68/0.663 | 21.97/0.543 | 23.59/0.594 | 25.17/0.646 | 26.42/0.697 | 22.18/0.550 | 23.77/0.600 | 25.30/0.652 | 26.48/0.700 |
| | MAXIM [56] | 23.79/0.597 | 24.83/0.632 | 25.33/0.651 | 25.66/0.663 | 22.02/0.543 | 23.60/0.593 | 25.15/0.645 | 26.39/0.694 | 22.24/0.551 | 23.83/0.601 | 25.33/0.653 | 26.48/0.699 |
| | AIRNet [25] | 23.61/0.593 | 24.64/0.629 | 25.13/0.647 | 25.43/0.659 | 21.90/0.540 | 23.47/0.591 | 24.97/0.642 | 26.18/0.691 | 22.11/0.548 | 23.68/0.598 | 25.12/0.650 | 26.24/0.696 |
| | PromptIR [45] | 23.79/0.599 | 24.84/0.634 | 25.34/0.652 | 25.67/0.664 | 21.99/0.544 | 23.53/0.594 | 25.17/0.647 | 26.44/0.697 | 22.21/0.552 | 23.78/0.601 | 25.34/0.654 | 26.50/0.701 |
UCIP 23.93/0.602 24.99/0.637 25.53/0.657 25.88/0.669 22.10/0.547 23.70/0.597 25.34/0.650 26.63/0.701 22.28/0.553 23.89/0.603 25.45/0.656 26.71/0.705
BSD100 [40] RRDB [63] 23.56/0.547 24.44/0.580 24.86/0.597 25.12/0.609 22.10/0.503 23.43/0.542 24.64/0.588 25.58/0.630 22.30/0.510 23.64/0.550 24.80/0.595 25.66/0.634
SwinIR [34] 23.79/0.557 24.62/0.587 25.04/0.604 25.31/0.616 22.17/0.510 23.45/0.548 24.74/0.596 25.80/0.643 22.34/0.516 23.66/0.555 24.91/0.603 25.92/0.649
Swin2SR [10] 23.79/0.557 24.62/0.588 25.03/0.605 25.30/0.617 22.15/0.511 23.42/0.549 24.72/0.596 25.81/0.645 22.32/0.516 23.60/0.555 24.88/0.603 25.91/0.650
MAXIM [56] 23.81/0.558 24.63/0.589 25.04/0.606 25.30/0.618 22.20/0.510 23.49/0.548 24.73/0.595 25.79/0.644 22.36/0.516 23.67/0.556 24.89/0.603 25.90/0.650
AIRNet [25] 23.73/0.555 24.55/0.586 24.95/0.603 25.22/0.615 22.13/0.509 23.42/0.547 24.68/0.594 25.71/0.642 22.31/0.515 23.62/0.554 24.83/0.602 25.81/0.648
PromptIR [45] 23.82/0.559 24.65/0.589 25.05/0.606 25.32/0.618 22.20/0.511 23.48/0.549 24.75/0.597 25.82/0.645 22.35/0.517 23.66/0.556 24.91/0.604 25.93/0.651
UCIP 23.88/0.561 24.73/0.593 25.15/0.610 25.42/0.623 22.24/0.513 23.56/0.551 24.84/0.599 25.93/0.649 22.38/0.517 23.74/0.558 24.99/0.606 26.03/0.654
Urban100 [19] RRDB [63] 21.69/0.578 22.18/0.597 22.66/0.622 22.97/0.638 20.42/0.531 21.66/0.578 22.84/0.633 23.61/0.671 20.67/0.543 21.95/0.593 23.00/0.641 23.66/0.674
SwinIR [34] 21.74/0.580 22.61/0.621 23.11/0.646 23.41/0.661 20.45/0.535 21.86/0.595 23.18/0.654 24.12/0.699 20.70/0.546 22.10/0.607 23.33/0.662 24.19/0.703
Swin2SR [10] 21.79/0.582 22.67/0.624 23.17/0.648 23.44/0.664 20.48/0.536 21.90/0.597 23.21/0.655 24.16/0.700 20.72/0.548 22.11/0.608 23.34/0.662 24.22/0.703
MAXIM [56] 21.78/0.582 22.61/0.622 23.08/0.645 23.38/0.660 20.47/0.534 21.87/0.594 23.13/0.651 24.05/0.695 20.72/0.547 22.10/0.606 23.28/0.659 24.11/0.698
AIRNet [25] 21.57/0.574 22.40/0.615 22.86/0.639 23.14/0.655 20.35/0.530 21.72/0.590 22.97/0.648 23.87/0.692 20.60/0.543 21.96/0.603 23.12/0.657 23.92/0.696
PromptIR [45] 21.81/0.587 22.65/0.626 23.12/0.649 23.42/0.664 20.50/0.539 21.89/0.598 23.17/0.656 24.13/0.701 20.73/0.550 22.11/0.609 23.32/0.663 24.18/0.704
UCIP 22.00/0.596 22.88/0.637 23.39/0.664 23.71/0.677 20.59/0.542 22.05/0.604 23.39/0.661 24.42/0.711 20.80/0.552 22.23/0.614 23.50/0.670 24.46/0.715
Manga109 [41] RRDB [63] 22.50/0.684 23.75/0.730 24.49/0.756 24.99/0.773 21.17/0.655 23.24/0.722 25.07/0.778 26.24/0.813 21.59/0.675 23.64/0.738 25.27/0.786 26.29/0.815
SwinIR [34] 23.05/0.720 24.38/0.762 25.16/0.786 25.67/0.801 21.40/0.677 23.56/0.743 25.64/0.801 27.17/0.841 21.73/0.689 23.90/0.754 25.83/0.807 27.25/0.843
Swin2SR [10] 23.09/0.720 24.40/0.762 25.18/0.786 25.69/0.801 21.42/0.677 23.58/0.743 25.62/0.799 27.11/0.839 21.75/0.690 23.90/0.753 25.78/0.804 27.19/0.841
MAXIM [56] 23.11/0.722 24.41/0.762 25.17/0.786 25.65/0.800 21.41/0.675 23.55/0.740 25.56/0.797 27.05/0.836 21.74/0.688 23.89/0.752 25.76/0.803 27.13/0.838
AIRNet [25] 22.82/0.714 24.07/0.754 24.78/0.778 25.26/0.793 21.25/0.670 23.34/0.735 25.29/0.793 26.69/0.833 21.59/0.684 23.67/0.747 25.47/0.800 26.74/0.835
PromptIR [45] 23.15/0.726 24.48/0.767 25.23/0.789 25.71/0.804 21.41/0.681 23.59/0.746 25.62/0.801 27.15/0.841 21.73/0.692 23.90/0.755 25.80/0.807 27.21/0.843
UCIP 23.36/0.734 24.77/0.775 25.58/0.798 26.11/0.813 21.54/0.683 23.79/0.750 25.94/0.808 27.61/0.848 21.82/0.693 24.06/0.759 26.08/0.812 27.68/0.850

Our objective is to develop an MLP-like model that caters to a wide range of compressed image super-resolution tasks. Thus, we evaluate our UCIP on six different CSR tasks, including three traditional compression codecs: JPEG [59], HM [49], VTM [5]; and three learning-based compression codecs: $\text{C}_{\text{PSNR}}$ [9], $\text{C}_{\text{SSIM}}$ [9], HIFIC [42].

4.0.1 Implementation details

We train our UCIP from scratch in an end-to-end manner. We employ an Adam optimizer with an initial learning rate of 3e-4. The learning rate is halved after 200k iterations, and the total number of iterations is set to 400k. The network is optimized with the L1 loss. During training, we randomly crop the degraded low-resolution images into patches of size $64\times 64$, with the corresponding $256\times 256$ patches cropped from their high-resolution counterparts. Following previous works, random horizontal and vertical flips are utilized to augment the training data. The total batch size is set to 32. For our baseline model, we use 6 PTMBs for UCIP and 6 PTMMs for each PTMB.
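The schedule and the paired cropping described above can be sketched in a few lines. This is a hedged illustration using only the Python standard library: the function names are hypothetical, the learning rate is assumed to be halved exactly once at iteration 200k, and the crop is expressed as index boxes rather than actual tensor slicing.

```python
import random

BASE_LR = 3e-4      # initial learning rate from the text
HALVE_AT = 200_000  # assumption: a single halving at this iteration
SCALE = 4           # x4 super-resolution
LR_PATCH = 64       # low-resolution patch size; HR patch is 64 * 4 = 256


def learning_rate(iteration):
    """Step schedule: full rate before HALVE_AT, half of it afterwards."""
    return BASE_LR if iteration < HALVE_AT else BASE_LR / 2


def paired_crop_boxes(lr_height, lr_width, rng=random):
    """Pick a random 64x64 LR box and the aligned 256x256 HR box.

    Boxes are (top, left, height, width). The HR box covers exactly the
    same image content as the LR box, scaled by SCALE.
    """
    top = rng.randint(0, lr_height - LR_PATCH)   # randint is inclusive
    left = rng.randint(0, lr_width - LR_PATCH)
    lr_box = (top, left, LR_PATCH, LR_PATCH)
    hr_box = (top * SCALE, left * SCALE, LR_PATCH * SCALE, LR_PATCH * SCALE)
    return lr_box, hr_box
```

Keeping the HR box an exact multiple of the LR box is what guarantees that each training pair stays pixel-aligned under the L1 loss.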

4.0.2 Training details

To ensure fair comparisons, we train all the compared methods using their officially released code on our proposed CSR training dataset with the same batch size. Performance is evaluated after the same number of training iterations.

4.1 Comparisons with State-of-the-arts

We compare UCIP with six state-of-the-art models on our CSR benchmarks, which are composed of five commonly adopted datasets: Set5 [4], Set14 [69], BSD100 [40], Urban100 [19] and Manga109 [41]. The compared models include the fully-convolutional network RRDB [63], the transformer-based image restoration model SwinIR [34] and its upgraded version Swin2SR [10], the MLP-like model MAXIM [56], and two multi-task models, AIRNet [25] and PromptIR [45]. We add an HR reconstruction module to the last three models, enabling them to perform super-resolution tasks. All compared methods are trained from scratch with our proposed UCSR dataset for fair comparisons.
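Tables 1 and 2 report PSNR/SSIM. As a reminder of how PSNR is defined, the following is a minimal sketch over flat lists of pixel values. It is not the evaluation script used for the tables: the function name is hypothetical, and details such as the color space (e.g., RGB versus luminance only) or border cropping are not specified in this section and are left out.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists.

    PSNR = 10 * log10(max_value^2 / MSE); higher is better.
    """
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because PSNR is logarithmic in the mean squared error, the sub-dB differences reported in the tables correspond to small but consistent reductions in pixel-wise error.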

Table 2: Quantitative comparison for compressed image super-resolution on learning-based codecs. Results are tested on $\times 4$ with different compression qualities in terms of PSNR$\uparrow$/SSIM$\uparrow$. The best performances are in red. Note that, as HIFIC [42] does not support some low-resolution images from the downsampled Set5 and Set14 datasets, we do not use the HIFIC codec to compress these two datasets. Notably, all compared methods are trained from scratch with our proposed UCSR dataset for fair comparisons. $\mathcal{D}$ denotes “Datasets”.
$\mathcal{D}$ Methods Params $\text{C}_{\text{PSNR}}$ [9] $\text{C}_{\text{SSIM}}$ [9] HIFIC [42]
$\mathcal{Q}=1$ $\mathcal{Q}=2$ $\mathcal{Q}=3$ $\mathcal{Q}=4$ $\mathcal{Q}=1$ $\mathcal{Q}=2$ $\mathcal{Q}=3$ $\mathcal{Q}=4$ $\mathcal{Q}=\text{`low'}$ $\mathcal{Q}=\text{`med'}$ $\mathcal{Q}=\text{`high'}$
Set5 [4] RRDB [63] 16.70M 24.54/0.698 25.40/0.725 26.14/0.746 27.37/0.781 21.19/0.591 21.91/0.621 22.94/0.654 23.69/0.684 - - -
SwinIR [34] 11.72M 24.55/0.704 25.50/0.731 26.25/0.753 27.70/0.790 21.19/0.595 21.93/0.629 22.96/0.663 23.70/0.689 - - -
Swin2SR [10] 12.05M 24.56/0.702 25.51/0.732 26.27/0.756 27.69/0.792 21.23/0.595 21.95/0.629 22.98/0.662 23.77/0.691 - - -
MAXIM [56] 26.74M 24.58/0.704 25.49/0.733 26.27/0.755 27.66/0.791 21.26/0.595 22.01/0.631 23.03/0.663 23.78/0.692 - - -
AIRNet [25] 7.76M 24.54/0.702 25.41/0.730 26.16/0.751 27.50/0.789 21.15/0.595 21.91/0.629 22.93/0.660 23.67/0.689 - - -
PromptIR [45] 35.72M 24.48/0.700 25.42/0.730 26.16/0.751 27.72/0.793 21.26/0.600 21.99/0.632 22.98/0.661 23.66/0.686 - - -
UCIP 11.42M 24.65/0.705 25.59/0.736 26.39/0.758 27.93/0.796 21.30/0.607 21.98/0.633 23.01/0.663 23.74/0.689 - - -
Set14 [69] RRDB [63] 16.70M 23.52/0.588 24.17/0.611 24.76/0.632 25.51/0.662 21.19/0.526 21.70/0.545 22.38/0.564 23.04/0.586 - - -
SwinIR [34] 11.72M 23.61/0.591 24.29/0.615 24.92/0.638 25.82/0.673 21.20/0.527 21.69/0.545 22.39/0.566 23.03/0.587 - - -
Swin2SR [10] 12.05M 23.61/0.591 24.29/0.615 24.91/0.638 25.82/0.673 21.20/0.527 21.72/0.546 22.40/0.566 23.05/0.587 - - -
MAXIM [56] 26.74M 23.61/0.592 24.31/0.615 24.94/0.639 25.82/0.673 21.23/0.527 21.75/0.546 22.42/0.565 23.10/0.588 - - -
AIRNet [25] 7.76M 23.53/0.590 24.19/0.614 24.82/0.637 25.65/0.672 21.14/0.526 21.66/0.545 22.35/0.565 23.00/0.587 - - -
PromptIR [45] 35.72M 23.62/0.592 24.30/0.616 24.95/0.639 25.85/0.675 21.20/0.528 21.73/0.547 22.43/0.566 23.08/0.588 - - -
UCIP 11.42M 23.66/0.593 24.34/0.617 25.00/0.641 25.97/0.678 21.24/0.529 21.73/0.547 22.41/0.567 23.10/0.590 - - -
BSD100 [40] RRDB [63] 16.70M 23.55/0.548 24.15/0.569 24.67/0.590 25.27/0.618 21.95/0.507 22.44/0.523 23.07/0.540 23.53/0.556 21.27/0.521 21.99/0.550 22.38/0.575
SwinIR [34] 11.72M 23.58/0.549 24.19/0.573 24.73/0.595 25.42/0.627 21.95/0.508 22.45/0.524 23.07/0.541 23.54/0.557 21.46/0.531 22.11/0.556 22.59/0.581
Swin2SR [10] 12.05M 23.57/0.550 24.19/0.573 24.71/0.595 25.39/0.627 21.95/0.507 22.44/0.524 23.07/0.541 23.53/0.557 21.44/0.531 22.12/0.557 22.55/0.582
MAXIM [56] 26.74M 23.59/0.550 24.20/0.573 24.74/0.595 25.42/0.628 21.97/0.508 22.47/0.524 23.10/0.541 23.56/0.557 21.51/0.531 22.17/0.557 22.54/0.580
AIRNet [25] 7.76M 23.56/0.549 24.16/0.572 24.70/0.595 25.36/0.627 21.95/0.507 22.44/0.524 23.05/0.541 23.51/0.557 21.44/0.530 22.11/0.556 22.51/0.579
PromptIR [45] 35.72M 23.59/0.550 24.21/0.573 24.75/0.596 25.43/0.628 21.96/0.508 22.47/0.524 23.09/0.541 23.55/0.558 21.49/0.532 22.14/0.558 22.66/0.583
UCIP 11.42M 23.61/0.551 24.24/0.575 24.77/0.597 25.49/0.630 21.98/0.508 22.48/0.525 23.12/0.543 23.59/0.560 21.94/0.534 22.19/0.559 23.39/0.587
Urban100 [19] RRDB [63] 16.70M 21.70/0.580 22.21/0.603 22.60/0.622 23.21/0.654 19.84/0.504 20.29/0.524 20.81/0.546 21.31/0.570 20.70/0.540 21.42/0.570 21.89/0.593
SwinIR [34] 11.72M 21.78/0.587 22.32/0.613 22.76/0.633 23.52/0.672 19.87/0.509 20.31/0.530 20.86/0.553 21.36/0.579 20.83/0.549 21.60/0.579 22.11/0.605
Swin2SR [10] 12.05M 21.82/0.588 22.38/0.614 22.80/0.635 23.52/0.672 19.89/0.510 20.33/0.531 20.89/0.554 21.41/0.580 20.87/0.550 21.65/0.580 22.16/0.607
MAXIM [56] 26.74M 21.80/0.587 22.34/0.612 22.77/0.633 23.49/0.670 19.90/0.509 20.35/0.530 20.89/0.553 21.40/0.578 20.87/0.549 21.63/0.579 22.14/0.605
AIRNet [25] 7.76M 21.71/0.585 22.23/0.610 22.65/0.631 23.34/0.668 19.84/0.507 20.27/0.527 20.81/0.550 21.30/0.575 20.77/0.546 21.49/0.576 21.98/0.600
PromptIR [45] 35.72M 21.82/0.589 22.35/0.614 22.79/0.635 23.53/0.673 19.92/0.511 20.36/0.532 20.91/0.555 21.42/0.581 20.88/0.552 21.64/0.583 22.19/0.610
UCIP 11.42M 21.89/0.593 22.46/0.620 22.91/0.641 23.72/0.682 19.98/0.516 20.43/0.538 21.01/0.563 21.56/0.590 20.95/0.555 21.77/0.588 22.30/0.616
Manga109 [41] RRDB [63] 16.70M 23.20/0.725 23.90/0.746 24.43/0.762 25.47/0.794 20.12/0.635 20.75/0.657 21.55/0.680 22.27/0.705 21.28/0.671 22.41/0.705 23.10/0.729
SwinIR [34] 11.72M 23.31/0.733 24.13/0.760 24.68/0.775 26.01/0.813 20.12/0.641 20.75/0.664 21.58/0.689 22.34/0.715 21.32/0.680 22.55/0.716 23.31/0.742
Swin2SR [10] 12.05M 23.33/0.733 24.08/0.757 24.68/0.774 25.91/0.809 20.15/0.641 20.79/0.664 21.62/0.689 22.38/0.715 21.36/0.679 22.61/0.715 23.37/0.742
MAXIM [56] 26.74M 23.31/0.732 24.08/0.756 24.67/0.774 25.95/0.810 20.15/0.640 20.78/0.663 21.60/0.687 22.34/0.712 21.36/0.679 22.58/0.715 23.33/0.741
AIRNet [25] 7.76M 23.17/0.729 23.89/0.752 24.47/0.770 25.66/0.807 20.08/0.637 20.70/0.660 21.49/0.684 22.22/0.709 21.33/0.676 22.42/0.711 23.18/0.736
PromptIR [45] 35.72M 23.33/0.734 24.09/0.758 24.70/0.775 25.97/0.812 20.15/0.642 20.79/0.665 21.62/0.690 22.36/0.715 21.39/0.682 22.58/0.718 23.40/0.745
UCIP 11.42M 23.43/0.737 24.24/0.762 24.88/0.781 26.29/0.819 20.19/0.645 20.85/0.669 21.73/0.696 22.53/0.723 21.41/0.684 22.70/0.723 23.55/0.750

As demonstrated in Table 1 and Table 2, our UCIP outperforms all other methods on almost all codecs and compression qualities. In particular, UCIP achieves a PSNR gain of up to 0.45dB over PromptIR [45] with only about one-third of the parameters (11.42M vs. 35.72M). Another intriguing observation is that the gains provided by UCIP become more significant as the compression ratio decreases. We attribute this to the preservation of more high-frequency information at milder compression levels. The abundance of high-frequency details further enhances the capability of PTMM to extract informative tokens globally, and thus leads to better performance.

Figure 3: Visual comparisons between UCIP and other state-of-the-art methods. To demonstrate the effectiveness of UCIP across different codecs, we display four rows of images, compressed with JPEG ($\mathcal{Q}=10$), HM ($\mathcal{Q}=32$), $\text{C}_{\text{PSNR}}$ ($\mathcal{Q}=4$) and HIFIC ($\mathcal{Q}=\text{`med'}$), respectively. We show more results in Sec. 7.

As illustrated in Fig. 3, UCIP leverages the implicit guidance of the dynamic prompt to recover more textural details while avoiding the generation of artifacts. Specifically, as observed in the first row, our model recovers the clearest texture of the monarch. In the second and final rows, images reconstructed by our method exhibit clearer edges and fewer distorted lines. In the third row, our method successfully removes compression artifacts, while other methods produce blocky and blurry outputs. We attribute these results to the generation of the dynamic prompt and the fusion of global tokens with local features.

It is noteworthy that, although we do not specifically tailor prompts to the various compression qualities within a given codec, experimental evidence suggests that our dynamic prompt not only possesses task-specific adaptability but is also capable of handling different degrees of distortion. As shown in Fig. 4, our method maintains robust image restoration capabilities across three levels of compression quality (e.g., it always recovers the straight lines on the right side of the image).

Table 3: Quantitative comparison for different tuning methods on two new codecs. Results are tested on $\times 4$ with different compression qualities in terms of PSNR$\uparrow$/SSIM$\uparrow$. The first line of each benchmark denotes the baseline model trained with our proposed UCSR dataset, which is directly evaluated on these new codecs. Note that, for both codecs, the smaller the quality factor, the poorer the image quality. The best performances are in red. Zoom in for best view.
Datasets Pre-train with prompts Add prompts in fine-tune Fine-tune which part WebP [15] ELIC [17]
$\mathcal{Q}=10$ $\mathcal{Q}=20$ $\mathcal{Q}=30$ $\mathcal{Q}=40$ $\mathcal{Q}=1$ $\mathcal{Q}=2$ $\mathcal{Q}=3$ $\mathcal{Q}=4$
Set5 - - - 26.20/0.750 27.12/0.777 27.78/0.794 28.33/0.806 22.48/0.643 23.51/0.677 24.63/0.708 25.55/0.735
only prompts 26.25/0.753 27.16/0.778 27.83/0.795 28.40/0.809 22.43/0.639 23.57/0.678 24.81/0.713 25.97/0.745
full model 26.55/0.763 27.49/0.787 28.20/0.805 28.79/0.818 22.41/0.642 23.53/0.678 24.83/0.716 26.06/0.749
only prompts 26.54/0.762 27.49/0.787 28.18/0.804 28.78/0.818 22.42/0.644 23.55/0.679 24.84/0.715 26.03/0.748
Set14 - - - 24.69/0.631 25.41/0.658 25.87/0.675 26.25/0.689 21.84/0.539 22.78/0.566 23.62/0.593 24.37/0.618
only prompts 24.76/0.634 25.43/0.659 25.91/0.678 26.29/0.693 21.86/0.540 22.81/0.567 23.69/0.595 24.58/0.625
full model 24.93/0.641 25.64/0.667 26.12/0.686 26.53/0.701 21.85/0.541 22.81/0.569 23.73/0.597 24.67/0.628
only prompts 24.97/0.642 25.65/0.668 26.14/0.686 26.55/0.701 21.84/0.540 22.81/0.568 23.73/0.597 24.69/0.629
BSD100 - - - 24.36/0.583 24.93/0.608 25.34/0.626 25.63/0.639 22.14/0.507 22.90/0.528 23.61/0.551 24.20/0.574
only prompts 24.44/0.587 25.01/0.612 25.42/0.630 25.74/0.644 22.12/0.508 22.89/0.529 23.63/0.553 24.33/0.580
full model 24.55/0.592 25.11/0.616 25.52/0.635 25.85/0.649 22.10/0.509 22.88/0.530 23.64/0.554 24.36/0.581
only prompts 24.55/0.593 25.11/0.617 25.53/0.636 25.86/0.650 22.10/0.509 22.89/0.530 23.64/0.554 24.37/0.582
Urban100 - - - 22.81/0.640 23.43/0.669 23.81/0.687 24.07/0.698 20.49/0.533 21.32/0.569 22.00/0.602 22.54/0.629
only prompts 22.72/0.634 23.30/0.661 23.67/0.679 23.94/0.691 20.48/0.532 21.31/0.568 22.05/0.601 22.70/0.632
full model 23.06/0.653 23.66/0.680 24.05/0.698 24.34/0.710 20.47/0.535 21.33/0.572 22.10/0.607 22.81/0.640
only prompts 23.15/0.656 23.74/0.683 24.13/0.701 24.41/0.713 20.54/0.538 21.40/0.575 22.18/0.610 22.88/0.643
Manga109 - - - 24.77/0.779 25.69/0.805 26.28/0.820 26.70/0.831 21.20/0.671 22.36/0.707 23.36/0.739 24.22/0.766
only prompts 24.76/0.776 25.65/0.801 26.23/0.817 26.66/0.828 21.22/0.671 22.42/0.708 23.52/0.741 24.60/0.774
full model 25.09/0.789 26.01/0.813 26.65/0.829 27.16/0.841 21.19/0.670 22.42/0.711 23.58/0.746 24.73/0.780
only prompts 25.18/0.791 26.11/0.815 26.75/0.831 27.25/0.843 21.23/0.675 22.48/0.713 23.65/0.747 24.82/0.783

4.2 Prompt Tuning for UCIP

Prompt learning can be utilized in two popular ways: (i) multi-task learning, e.g., PromptIR [45], ProRes [39], and PIP [32] in low-level vision, which requires training the whole model from scratch; (ii) prompt tuning, which requires a strong baseline model and optimizes only a small fraction of the parameters for downstream tasks. Notably, in the CSR field, no baseline model pre-trained on multiple types of compression artifacts exists, which prevented us from studying prompt tuning at the outset. We therefore build the first universal CSR framework and the corresponding dataset in the first way, following existing prompt learning works in low-level vision [45, 39, 32, 38]. However, training a model from scratch is time-consuming. To further explore the potential of our proposed UCIP in prompt tuning, we choose two unseen codecs, the traditional codec WebP [15] and the learning-based codec ELIC [17], to fine-tune UCIP. In Tab. 3, we explore four settings: i) directly evaluating the pre-trained model without fine-tuning; ii) pre-training without prompts, then adding prompts and training only the prompt parameters on new tasks; iii) pre-training without prompts, then adding prompts and fine-tuning the full model on new tasks; iv) pre-training with prompts, then fine-tuning only the prompt parameters on new tasks. All the experiments are conducted under the same settings with the same number of training iterations. As shown in Tab. 3, comparing ii) and iii), tuning only the prompts achieves performance on the ELIC codec comparable to tuning the full model. Comparing iii) and iv), tuning only the prompts on top of UCIP achieves comparable and even better performance than tuning the full model after adding prompts. The experimental results indicate that our proposed UCIP can serve as a strong baseline model in the CSR field, which will also benefit prompt tuning for new codecs in future work.
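The difference between the settings comes down to which parameters receive gradient updates. The following is a minimal sketch of that selection logic, following the "Fine-tune which part" column of Tab. 3 (no fine-tuning, only prompts, full model, only prompts). All parameter names and the convention that prompt parameters contain the substring "prompt" are hypothetical and chosen only for illustration.

```python
# Which part of the model is updated in each setting, following the
# "Fine-tune which part" column of Tab. 3.
FINETUNE_PART = {
    "i": None,             # pre-trained model evaluated directly
    "ii": "only prompts",
    "iii": "full model",
    "iv": "only prompts",
}


def trainable_parameters(param_names, setting):
    """Return the names of the parameters that are updated in a setting.

    Assumption: prompt-related parameters are identified by the substring
    'prompt' in their name; a real model may organize them differently.
    """
    part = FINETUNE_PART[setting]
    if part is None:
        return []
    if part == "full model":
        return list(param_names)
    return [name for name in param_names if "prompt" in name]


# Hypothetical parameter names for illustration only.
EXAMPLE_PARAMS = [
    "backbone.block0.weight",
    "dpm.prompt_kernels",
    "dpm.prompt_mlp.weight",
    "upsampler.weight",
]
```

In a deep learning framework, the returned names would typically be used to decide which parameters are passed to the optimizer while the rest stay frozen.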

Figure 4: Visual comparisons between UCIP and other state-of-the-art methods under different compression qualities within the HIFIC [42] codec. The qualities of HIFIC from top to bottom are ‘low’, ‘medium’, and ‘high’, respectively. We show more results in Sec. 7.

4.3 Ablation Studies

4.3.1 The effects of dynamic prompt

To validate the effectiveness of our DPM, we conduct experiments on different prompt designs. The results are shown in Table 4. Specifically, without the dynamic prompt, UCIP is unable to perform task-wise informative token selection. Moreover, the use of fixed prompts may even impair the performance of UCIP, as they can provide incorrect guidance during the token mixing process. Compared to PromptIR [45], our DPM uses far fewer parameters (0.46M vs. 9.66M) to achieve spatially adaptive modulation for each task with only a few basic dynamic prompt kernels, while achieving a PSNR gain of up to 0.22dB.

Table 4: Impacts of different prompt generation strategies. Results are reported on Manga109 [41]. FLOPs are calculated based on an input of size $1\times 3\times 64\times 64$.
Method Params(M) Flops(G) Codecs
JPEG ($\mathcal{Q}=40$) HM ($\mathcal{Q}=32$) HIFIC ($\mathcal{Q}=\text{`high'}$)
w/o Prompt - - 25.70/0.805 27.22/0.840 23.39/0.745
Fixed - - 25.62/0.799 27.10/0.839 23.05/0.723
PromptIR [45] 9.66 0.900 25.89/0.810 27.42/0.845 23.50/0.749
Ours 0.46 0.024 26.11/0.813 27.61/0.848 23.55/0.750

4.3.2 The effects of local feature extraction

As demonstrated in Sec. 3.2.1, local feature extraction is essential for the model to aggregate useful local information with the content/spatial-aware task-adaptive contextual information. To validate this point, we conduct an ablation that replaces the local convolution with an identity module. As shown in Table 5, PSNR drops by about 0.1dB without local feature extraction, which indicates that combining global tokens with local features is necessary for CSR tasks.

Table 5: Impacts of local feature extraction. Results are reported on Manga109 [41].
Codecs Methods
w/o $\operatorname{Conv}_{3\times 3}$ Ours
JPEG ($\mathcal{Q}=40$) 25.97/0.809 26.11/0.813
HM ($\mathcal{Q}=32$) 27.46/0.844 27.61/0.848
HIFIC ($\mathcal{Q}=\text{`high'}$) 23.48/0.747 23.55/0.750
Table 6: Experiments on the number of dynamic prompts. Results are reported on Manga109 [41].
Number Codecs
JPEG ($\mathcal{Q}=40$) HM ($\mathcal{Q}=32$) HIFIC ($\mathcal{Q}=\text{`high'}$)
1 25.84/0.807 27.42/0.844 23.46/0.747
2 25.88/0.808 27.42/0.845 23.48/0.748
4 25.99/0.811 27.53/0.847 23.52/0.748
8(Ours) 26.11/0.813 27.61/0.848 23.55/0.750
16 26.19/0.815 27.66/0.850 23.59/0.751

4.3.3 The effects of number of the dynamic prompt

To mine the content/spatial-aware task-adaptive contextual information for the universal CSR task, we introduce the dynamic prompt. In this part, we investigate the optimal number of dynamic prompts. As demonstrated in Table 6, when the number of dynamic prompts is small, the dynamic capacity of the prompts for spatial content interpretation and degradation handling is noticeably constrained. As the number increases, the performance gain narrows and falls below our expectations. We attribute this to inadequate weighting from the input image features, primarily due to the limited capability of a single MLP layer. To strike a balance between performance and computational efficiency, we choose 8 as the number of dynamic prompts.
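To make the role of the prompt count concrete, the following is a rough sketch of one plausible reading of the mechanism discussed here: a single linear layer maps the feature at each spatial position to weights over N basic prompts of spatial size $1\times 1$, and the position's prompt is their weighted sum. The exact DPM formulation is given in Sec. 3 and may differ; the softmax normalization, the function names, and the plain-list data layout are all assumptions made for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_prompt(feature, mlp_weight, basic_prompts):
    """Combine N basic prompts into one prompt for a single spatial position.

    feature:       C floats, the input feature at this position
    mlp_weight:    N x C matrix of a single linear layer (assumed, no bias)
    basic_prompts: N x C matrix, one 1x1 prompt vector per row
    Returns C floats: the weighted sum of the basic prompts.
    """
    logits = [sum(w * f for w, f in zip(row, feature)) for row in mlp_weight]
    weights = softmax(logits)  # assumption: weights are softmax-normalized
    channels = len(basic_prompts[0])
    return [
        sum(weights[i] * basic_prompts[i][c] for i in range(len(basic_prompts)))
        for c in range(channels)
    ]
```

Under this reading, the number of basic prompts N controls how many distinct modulation patterns can be mixed, while the single linear layer limits how finely the mixing weights can depend on the input, which is consistent with the saturation observed in Table 6.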

5 Conclusion

In this paper, we present UCIP, the first universal Compressed Image Super-resolution model, which leverages a novel dynamic prompt structure with a multi-layer perceptron (MLP)-like framework. Distinct from existing CSR works focused on a single compression codec, i.e., JPEG, UCIP effectively addresses hybrid distortions across a spectrum of codecs. By utilizing the prompt-guided token mixer block (PTMB), it dynamically identifies and refines the content/spatial-aware task-adaptive contextual information, optimizing for different tasks and distortions. Our extensive experiments on the proposed comprehensive UCSR benchmarks confirm that UCIP not only achieves state-of-the-art performance but also demonstrates remarkable versatility and applicability. In future work, we will further exploit the potential of UCIP and improve both objective and subjective performance on the UCSR benchmarks.

Acknowledgement

This work was supported in part by NSFC under Grant 623B2098, 62021001, and 62371434. This work was mainly completed before March 2024.

References

  • [1] Agustsson, E., Timofte, R.: Ntire 2017 challenge on single image super-resolution: Dataset and study. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. pp. 126–135 (2017)
  • [2] Ai, Y., Huang, H., Zhou, X., Wang, J., He, R.: Multimodal prompt perceiver: Empower adaptiveness, generalizability and fidelity for all-in-one image restoration. arXiv preprint arXiv:2312.02918 (2023)
  • [3] Bégaint, J., Racapé, F., Feltman, S., Pushparaja, A.: Compressai: a pytorch library and evaluation platform for end-to-end compression research. arXiv preprint arXiv:2011.03029 (2020)
  • [4] Bevilacqua, M., Roumy, A., Guillemot, C., Alberi-Morel, M.L.: Low-complexity single-image super-resolution based on nonnegative neighbor embedding (2012)
  • [5] Bross, B., Wang, Y.K., Ye, Y., Liu, S., Chen, J., Sullivan, G.J., Ohm, J.R.: Overview of the versatile video coding (vvc) standard and its applications. IEEE Transactions on Circuits and Systems for Video Technology 31(10), 3736–3764 (2021)
  • [6] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
  • [7] Chen, H., Wang, Y., Guo, T., Xu, C., Deng, Y., Liu, Z., Ma, S., Xu, C., Xu, C., Gao, W.: Pre-trained image processing transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12299–12310 (2021)
  • [8] Chen, S., Xie, E., Ge, C., Liang, D., Luo, P.: Cyclemlp: A mlp-like architecture for dense prediction. arXiv preprint arXiv:2107.10224 (2021)
  • [9] Cheng, Z., Sun, H., Takeuchi, M., Katto, J.: Learned image compression with discretized gaussian mixture likelihoods and attention modules. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 7939–7948 (2020)
  • [10] Conde, M.V., Choi, U.J., Burchi, M., Timofte, R.: Swin2sr: Swinv2 transformer for compressed image super-resolution and restoration. arXiv preprint arXiv:2209.11345 (2022)
  • [11] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
  • [12] Fritsche, M., Gu, S., Timofte, R.: Frequency separation for real-world super-resolution. In: 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW). pp. 3599–3608. IEEE (2019)
  • [13] Gao, H., Yang, J., Wang, N., Yang, J., Zhang, Y., Dang, D.: Prompt-based all-in-one image restoration using cnns and transformer. arXiv preprint arXiv:2309.03063 (2023)
  • [14] Gao, W., Tao, L., Zhou, L., Yang, D., Zhang, X., Guo, Z.: Low-rate image compression with super-resolution learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. pp. 154–155 (2020)
  • [15] Google: Web picture format. https://chromium.googlesource.com/webm/libweb, (2010)
  • [16] Grace Han, J.T.: high-fidelity-generative-compression. https://github.com/Justin-Tan/high-fidelity-generative-compression, (2020)
  • [17] He, D., Yang, Z., Peng, W., Ma, R., Qin, H., Wang, Y.: Elic: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5718–5727 (2022)
  • [18] Hou, Q., Jiang, Z., Yuan, L., Cheng, M.M., Yan, S., Feng, J.: Vision permutator: A permutable mlp-like architecture for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(1), 1328–1334 (2022)
  • [19] Huang, J.B., Singh, A., Ahuja, N.: Single image super-resolution from transformed self-exemplars. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5197–5206 (2015)
  • [20] Jia, M., Tang, L., Chen, B.C., Cardie, C., Belongie, S., Hariharan, B., Lim, S.N.: Visual prompt tuning. In: European Conference on Computer Vision. pp. 709–727. Springer (2022)
  • [21] Jiang, J., Zhang, K., Timofte, R.: Towards flexible blind jpeg artifacts removal. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4997–5006 (2021)
  • [22] Kong, X., Dong, C., Zhang, L.: Towards effective multiple-in-one image restoration: A sequential and prompt learning strategy. arXiv preprint arXiv:2401.03379 (2024)
  • [23] Li, B., Li, X., Lu, Y., Feng, R., Guo, M., Zhao, S., Zhang, L., Chen, Z.: Promptcir: Blind compressed image restoration with prompt learning. arXiv preprint arXiv:2404.17433 (2024)
  • [24] Li, B., Li, X., Lu, Y., Liu, S., Feng, R., Chen, Z.: Hst: Hierarchical swin transformer for compressed image super-resolution. arXiv preprint arXiv:2208.09885 (2022)
  • [25] Li, B., Liu, X., Hu, P., Wu, Z., Lv, J., Peng, X.: All-in-one image restoration for unknown corruption. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17452–17462 (2022)
  • [26] Li, H., Trocan, M., Sawan, M., Galayko, D.: Cswin2sr: Circular swin2sr for compressed image super-resolution. arXiv preprint arXiv:2301.08749 (2023)
  • [27] Li, X.L., Liang, P.: Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190 (2021)
  • [28] Li, X., Jin, X., Fu, J., Yu, X., Tong, B., Chen, Z.: Few-shot real image restoration via distortion-relation guided transfer learning. arXiv preprint arXiv:2111.13078 (2021)
  • [29] Li, X., Ren, Y., Jin, X., Lan, C., Wang, X., Zeng, W., Wang, X., Chen, Z.: Diffusion models for image restoration and enhancement–a comprehensive survey. arXiv preprint arXiv:2308.09388 (2023)
  • [30] Li, X., Shi, J., Chen, Z.: Task-driven semantic coding via reinforcement learning. IEEE Transactions on Image Processing 30, 6307–6320 (2021)
  • [31] Li, X., Sun, S., Zhang, Z., Chen, Z.: Multi-scale grouped dense network for vvc intra coding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. pp. 158–159 (2020)
  • [32] Li, Z., Lei, Y., Ma, C., Zhang, J., Shan, H.: Prompt-in-prompt learning for universal image restoration. arXiv preprint arXiv:2312.05038 (2023)
  • [33] Lian, D., Yu, Z., Sun, X., Gao, S.: As-mlp: An axial shifted mlp architecture for vision. arXiv preprint arXiv:2107.08391 (2021)
  • [34] Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., Timofte, R.: Swinir: Image restoration using swin transformer. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1833–1844 (2021)
  • [35] Liang, Z., Li, C., Zhou, S., Feng, R., Loy, C.C.: Iterative prompt learning for unsupervised backlit image enhancement. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8094–8103 (2023)
  • [36] Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., et al.: Swin transformer v2: Scaling up capacity and resolution. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12009–12019 (2022)
  • [37] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 10012–10022 (2021)
  • [38] Luo, Z., Gustafsson, F.K., Zhao, Z., Sjölund, J., Schön, T.B.: Controlling vision-language models for universal image restoration. arXiv preprint arXiv:2310.01018 (2023)
  • [39] Ma, J., Cheng, T., Wang, G., Zhang, Q., Wang, X., Zhang, L.: Prores: Exploring degradation-aware visual prompt for universal image restoration. arXiv preprint arXiv:2306.13653 (2023)
  • [40] Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001. vol. 2, pp. 416–423. IEEE (2001)
  • [41] Matsui, Y., Ito, K., Aramaki, Y., Fujimoto, A., Ogawa, T., Yamasaki, T., Aizawa, K.: Sketch-based manga retrieval using manga109 dataset. Multimedia Tools and Applications 76(20), 21811–21838 (2017)
  • [42] Mentzer, F., Toderici, G.D., Tschannen, M., Agustsson, E.: High-fidelity generative image compression. Advances in Neural Information Processing Systems 33, 11913–11924 (2020)
  • [43] OpenAI: Gpt-4 technical report (2023)
  • [44] Park, T., Liu, M.Y., Wang, T.C., Zhu, J.Y.: Semantic image synthesis with spatially-adaptive normalization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2337–2346 (2019)
  • [45] Potlapalli, V., Zamir, S.W., Khan, S., Khan, F.S.: Promptir: Prompting for all-in-one blind image restoration. arXiv preprint arXiv:2306.13090 (2023)
  • [46] Qin, X., Zhu, Y., Li, C., Wang, P., Cheng, J.: Cidbnet: a consecutively-interactive dual-branch network for jpeg compressed image super-resolution. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part II. pp. 458–474. Springer (2023)
  • [47] Sohn, K., Chang, H., Lezama, J., Polania, L., Zhang, H., Hao, Y., Essa, I., Jiang, L.: Visual prompt tuning for generative transfer learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19840–19851 (2023)
  • [48] Su, C., Yang, F., Zhang, S., Tian, Q., Davis, L.S., Gao, W.: Multi-task learning with low rank attribute embedding for person re-identification. In: Proceedings of the IEEE international conference on computer vision. pp. 3739–3747 (2015)
  • [49] Sullivan, G.J., Ohm, J.R., Han, W.J., Wiegand, T.: Overview of the high efficiency video coding (hevc) standard. IEEE Transactions on circuits and systems for video technology 22(12), 1649–1668 (2012)
  • [50] Sun, H., Li, W., Liu, J., Chen, H., Pei, R., Zou, X., Yan, Y., Yang, Y.: Coser: Bridging image and language for cognitive super-resolution. arXiv preprint arXiv:2311.16512 (2023)
  • [51] Tang, C., Zhao, Y., Wang, G., Luo, C., Xie, W., Zeng, W.: Sparse mlp for image recognition: Is self-attention really necessary? In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 36, pp. 2344–2351 (2022)
  • [52] Tang, Y., Han, K., Guo, J., Xu, C., Li, Y., Xu, C., Wang, Y.: An image patch is a wave: Phase-aware vision mlp. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10935–10944 (2022)
  • [53] Timofte, R., Agustsson, E., Van Gool, L., Yang, M.H., Zhang, L.: Ntire 2017 challenge on single image super-resolution: Methods and results. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. pp. 114–125 (2017)
  • [54] Tolstikhin, I.O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., Yung, J., Steiner, A., Keysers, D., Uszkoreit, J., et al.: Mlp-mixer: An all-mlp architecture for vision. Advances in neural information processing systems 34, 24261–24272 (2021)
  • [55] Touvron, H., Bojanowski, P., Caron, M., Cord, M., El-Nouby, A., Grave, E., Izacard, G., Joulin, A., Synnaeve, G., Verbeek, J., et al.: Resmlp: Feedforward networks for image classification with data-efficient training. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022)
  • [56] Tu, Z., Talebi, H., Zhang, H., Yang, F., Milanfar, P., Bovik, A., Li, Y.: Maxim: Multi-axis mlp for image processing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5769–5780 (2022)
  • [57] Vandenhende, S.: Multi-task learning for visual scene understanding. arXiv preprint arXiv:2203.14896 (2022)
  • [58] Vandenhende, S., Georgoulis, S., Van Gool, L.: Mti-net: Multi-scale task interaction networks for multi-task learning. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16. pp. 527–543. Springer (2020)
  • [59] Wallace, G.K.: The jpeg still picture compression standard. Communications of the ACM 34(4), 30–44 (1991)
  • [60] Wang, J., Yue, Z., Zhou, S., Chan, K.C., Loy, C.C.: Exploiting diffusion prior for real-world image super-resolution. arXiv preprint arXiv:2305.07015 (2023)
  • [61] Wang, T., Lu, W., Zhang, K., Luo, W., Kim, T.K., Lu, T., Li, H., Yang, M.H.: Promptrr: Diffusion models as prompt generators for single image reflection removal. arXiv preprint arXiv:2402.02374 (2024)
  • [62] Wang, X., Xie, L., Dong, C., Shan, Y.: Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1905–1914 (2021)
  • [63] Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., Qiao, Y., Change Loy, C.: Esrgan: Enhanced super-resolution generative adversarial networks. In: Proceedings of the European conference on computer vision (ECCV) workshops (2018)
  • [64] Wei, G., Zhang, Z., Lan, C., Lu, Y., Chen, Z.: Activemlp: An mlp-like architecture with active token mixer. arXiv preprint arXiv:2203.06108 (2022)
  • [65] Wu, R., Yang, T., Sun, L., Zhang, Z., Li, S., Zhang, L.: Seesr: Towards semantics-aware real-world image super-resolution. arXiv preprint arXiv:2311.16518 (2023)
  • [66] Wu, Y., Li, X., Zhang, Z., Jin, X., Chen, Z.: Learned block-based hybrid image compression. IEEE Transactions on Circuits and Systems for Video Technology 32(6), 3978–3990 (2021)
  • [67] Yang, R., Timofte, R., Li, X., Zhang, Q., Zhang, L., Liu, F., He, D., Li, F., Zheng, H., Yuan, W., et al.: Aim 2022 challenge on super-resolution of compressed image and video: Dataset, methods and results. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III. pp. 174–202. Springer (2023)
  • [68] Yu, T., Li, X., Cai, Y., Sun, M., Li, P.: S2-mlp: Spatial-shift mlp architecture for vision. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 297–306 (2022)
  • [69] Zeyde, R., Elad, M., Protter, M.: On single image scale-up using sparse-representations. In: International conference on curves and surfaces. pp. 711–730. Springer (2010)
  • [70] Zhang, D.J., Li, K., Chen, Y., Wang, Y., Chandra, S., Qiao, Y., Liu, L., Shou, M.Z.: Morphmlp: A self-attention free, mlp-like backbone for image and video. arXiv preprint arXiv:2111.12527 (2021)
  • [71] Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., Fu, Y.: Image super-resolution using very deep residual channel attention networks. In: Proceedings of the European conference on computer vision (ECCV). pp. 286–301 (2018)

Appendix

Section 6 illustrates the distribution of offsets from different PTMMs and codecs.

Section 7 presents more qualitative results on various compression codecs and qualities.

6 Distribution of Offsets

We investigate the learned distributions of offsets by visualizing histograms of the offsets from different Prompt guided Token Mixer Modules (PTMMs). In $\operatorname{PTMM}\_{i}\_{j}$, the indices $i$ and $j$ denote the offsets of the $j^{th}$ PTMM from the $i^{th}$ PTMB. We have the following observations: 1) As the depth increases, the learned offsets first expand to a larger range and then shrink to a smaller range. This suggests that the model learns to extract local information for the query token at shallow layers. In the middle layers, the model leverages the offsets to aggregate global information and perform better token mixing. At the last layers, the distortions contained in the image features are mostly removed, so the model again focuses on local information to refine the query tokens for reconstruction. 2) The distribution of offsets from the middle layers differs among codecs. We attribute this to the guidance from task-specific prompts. Since the distortion varies among codecs, the visualization of learned offsets indicates that our prompts are capable of providing adaptive guidance against various distortions, thus leading to better performance in CSR tasks [67, 24]. 3) The offsets expand to a wider range for learning-based codecs than for traditional codecs. We believe this is because the distortion introduced by learning-based codecs is more challenging to eliminate than that from traditional codecs, necessitating broader ranges of offsets to extract useful information for query tokens.
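As a rough illustration of this kind of analysis, the following sketch bins per-layer offsets into integer histograms and compares their spread across depth. The offsets are synthetic Gaussian samples, not values extracted from UCIP; the function names `offset_histogram` and `offset_spread`, the layer labels, and the chosen standard deviations are assumptions made only for illustration.

```python
import numpy as np


def offset_histogram(offsets, max_abs=8):
    """Count rounded offsets in the integer bins [-max_abs, max_abs].

    Offsets outside the range are clipped into the outermost bins.
    """
    rounded = np.clip(np.rint(offsets).astype(int), -max_abs, max_abs)
    # Bin edges at half-integers so that each bin is centered on an integer.
    edges = np.arange(-max_abs, max_abs + 2) - 0.5
    counts, _ = np.histogram(rounded, bins=edges)
    return counts


def offset_spread(offsets):
    """Standard deviation of the offsets, used as a simple range measure."""
    return float(np.std(offsets))


# Synthetic stand-ins for offsets collected at three depths (assumption):
# narrow at shallow layers, wide in the middle, narrow again at the end.
rng = np.random.default_rng(0)
synthetic_offsets = {
    "shallow": rng.normal(0.0, 1.0, size=1024),
    "middle": rng.normal(0.0, 4.0, size=1024),
    "deep": rng.normal(0.0, 1.0, size=1024),
}

histograms = {name: offset_histogram(v) for name, v in synthetic_offsets.items()}
spreads = {name: offset_spread(v) for name, v in synthetic_offsets.items()}

for name in synthetic_offsets:
    print(name, "spread:", round(spreads[name], 2))
```

With real data, the synthetic arrays would be replaced by the offsets predicted for the center token in each PTMM, and the histograms could be plotted per codec to compare their ranges.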

Figure 5: Histograms of learned offsets for the center token from different PTMMs. LR images are randomly sampled from Urban100 [19] and compressed by the JPEG [59] and VTM [5] codecs. Panels: (a) learned offsets for JPEG [59]; (b) learned offsets for VTM [5].
Figure 6: Histograms of learned offsets for the center token from different PTMMs. LR images are randomly sampled from Urban100 [19] and compressed by the $\text{C}_{\text{PSNR}}$ [9] and HIFIC [42] codecs. Panels: (a) learned offsets for $\text{C}_{\text{PSNR}}$ [9]; (b) learned offsets for HIFIC [42].

7 More Visual Results

We provide more visual comparisons between UCIP and state-of-the-art methods on different codecs and on different compression qualities within a single codec. UCIP shows clearer textures and fewer artifacts in the super-resolved images, indicating that our prompts and offsets are adaptive and robust against various degradations.

Figure 7: Visual comparisons between UCIP and other methods on HIFIC ($\mathcal{Q}=\text{`high'}$) from Urban100 [19]. Panels: Bicubic, RRDB [63], SwinIR [34], Swin2SR [10], MAXIM [56], AIRNet [25], PromptIR [45], UCIP, HR.
Figure 8: Visual comparisons between UCIP and other methods on HM ($\mathcal{Q}=32$) from Urban100 [19]. Panels: Bicubic, RRDB [63], SwinIR [34], Swin2SR [10], MAXIM [56], AIRNet [25], PromptIR [45], UCIP, HR.
Figure 9: Visual comparisons between UCIP and other methods on JPEG ($\mathcal{Q}=30$) from Set14 [69]. Panels: Bicubic, RRDB [63], SwinIR [34], Swin2SR [10], MAXIM [56], AIRNet [25], PromptIR [45], UCIP, HR.
Figure 10: Visual comparisons between UCIP and other state-of-the-art methods under different compression qualities within the HIFIC [42] codec. The qualities of HIFIC from top to bottom are ‘low’, ‘medium’, and ‘high’, respectively.