License: CC BY 4.0
arXiv:2404.06294v1 [eess.IV] 09 Apr 2024

Fortifying Fully Convolutional Generative Adversarial Networks for Image Super-Resolution Using Divergence Measures

Arkaprabha Basu, Kushal Bose, Sankha Subhra Mullick, Anish Chakrabarty, and Swagatam Das

Arkaprabha Basu, Kushal Bose, Sankha Subhra Mullick, Anish Chakrabarty, and Swagatam Das ([email protected]) are with the Electronics and Communication Sciences Unit (ECSU) and Statistics and Mathematics Unit (SMU), Indian Statistical Institute, Kolkata, India. Corresponding author: Swagatam Das.

Abstract

Super-Resolution (SR) is a time-hallowed image processing problem that aims to improve the quality of a Low-Resolution (LR) sample up to the standard of its High-Resolution (HR) counterpart. We aim to address this by introducing Super-Resolution Generator (SuRGe), a fully-convolutional Generative Adversarial Network (GAN)-based architecture for SR. We show that distinct convolutional features obtained at increasing depths of a GAN generator can be optimally combined by a set of learnable convex weights to improve the quality of generated SR samples. In the process, we employ the Jensen–Shannon and the Gromov-Wasserstein losses respectively between the SR-HR and LR-SR pairs of distributions to further aid the generator of SuRGe to better exploit the available information in an attempt to improve SR. Moreover, we train the discriminator of SuRGe with the Wasserstein loss with gradient penalty, to primarily prevent mode collapse. The proposed SuRGe, as an end-to-end GAN workflow tailor-made for super-resolution, offers improved performance while maintaining low inference time. The efficacy of SuRGe is substantiated by its superior performance compared to 18 state-of-the-art contenders on 10 benchmark datasets.

Index Terms:
Generative Adversarial Networks, Image Super-Resolution, Convolutional Neural Networks, Divergence Measures

I Introduction

A Low-Resolution (LR) image sacrifices information of its High-Resolution (HR) counterpart in favor of general utility, such as display or editing on smaller screens, low storage requirements, and fast transmission. Super-resolution attempts to recover the original HR copy from an LR input. However, the initial HR-to-LR transformation is commonly non-invertible and lossy [1]. Thus, recovering the HR image by estimating a Super-Resolution (SR) analog is an ill-posed problem that carries the risk of a distorted output [2].

The classical interpolation methods for super-resolution exploit only local information and are thus incapable of generating commendable SR [3]. While global image features extracted by deep convolutional networks translate to much improved performance [4, 5], limited generalizability and distorted SR remain major concerns [6].

The landscape of super-resolution techniques saw a major breakthrough with the advent of the Generative Adversarial Network (GAN) [7]. A super-resolution GAN [8] embraces the canonical two-player adversarial game between a generator $G$ and a discriminator $D$ with some minor modifications. Specifically, in a super-resolution task, $G$ attempts to map an LR input to an HR ground truth, generating an estimated SR in the process. The discriminator $D$ helps $G$ by providing adversarial feedback through distinguishing between an HR ground truth and a $G$-generated SR. While GAN-based super-resolution methods offer generalizability through their generative power, their SR outputs often lose finer details or are plagued by artifacts [9, 10].

In this paper, we propose a GAN-based super-resolution method called Super-Resolution Generator (SuRGe). In a super-resolution task, to generate a good quality SR image, it is necessary to consider both the low-level local features (for example, colors, textures, and edges) and the high-level global ones (such as individual object shapes, relative positioning of objects and background, and object orientation). As noted in [11, 12], higher-level global features are progressively captured by convolutional filters residing deeper in the network. Taking inspiration from [11, 13], in the generator $G$ of the proposed SuRGe we preserve the hierarchically complex features and dictate their flow through skip connections. However, skip connections may lead to under-utilized network capacity [14], while a potential solution like DenseNet may be challenging to train [15] with limited data. Therefore, in SuRGe, we design a generator $G$ that judiciously uses skip connections to conserve and propagate only a few selected features that are intuitively more useful for improving the network's performance on the super-resolution task (for example, carrying forward the low-level features to recover minute details after a potentially distortion-inducing up-sampling step, in a spirit similar to that of UNet [13]). To adaptively combine the features coming from different depths of the network, we introduce mixing modules that operate in a learnable fashion.

Figure 1: Visual comparison of 4x super-resolution outputs of the proposed SuRGe with SRGAN [8], BSRGAN [16], SWIN-IR [17], and LTE [18], given a low-resolution (LR) input image patch. SuRGe produces better super-resolution images with finer texture, color, and intricate details.

We further focus on the fact that, ideally, the distributions of SR and HR should be identical. Thus, a loss function like the Jensen–Shannon (JS) divergence, which explicitly encourages minimizing the dissimilarity between the respective distributions of HR and SR, helps in training the generator $G$ in SuRGe. Moreover, in the ideal case, LR and SR should also preserve structural similarities, which are consequently reflected in their corresponding distributions. However, LR and SR reside in different metric spaces with potentially distinct dimensionality. Thus, to explicitly minimize the discrepancy between the respective distributions of LR and SR, we further utilize the Gromov-Wasserstein (GW) distance [19] as an additional loss function in the generator $G$ of SuRGe. To the best of our knowledge, this is the first time the applicability of explicit divergence measures is explored in the context of GAN-based super-resolution techniques. We also employ a dynamically weighted convex combination strategy [20] for the multiple losses in $G$ of SuRGe. Furthermore, to prevent $G$ from mode collapsing, especially on the smaller training sets used in super-resolution [21], we employ the Wasserstein loss with gradient penalty (WGAN-GP) [22] to train the discriminator $D$ in SuRGe.

The primary contributions of our fully-convolutional GAN-based SR method SuRGe are as follows.

  • To the best of our knowledge, SuRGe is the first super-resolution model that introduces GW, a divergence between metric spaces of potentially different dimensions, to fuel the learning of the generator $G$. This incorporation of the LR-SR relationship directly endows SuRGe with authentic super-resolution capabilities.

  • Moving away from pre-trained model-biased perceptual similarity, SuRGe takes the JS divergence as an additional loss of the generator $G$ (alongside adversarial and GW) while the discriminator $D$ uses the gradient-penalized Wasserstein loss to improve the SR.

  • We introduce a generator $G$ in SuRGe that efficiently employs skip connections to garner semantic information from different levels of feature representation and supports their adaptive mixing in a learnable fashion.

The effect of these critical improvements is evident in the motivational example in Figure 1, where, compared to four notable contenders, the SR obtained by SuRGe is richer in finer details and most closely matches the HR. Following a brief review of the existing deep super-resolution strategies using Convolutional Neural Networks and Transformers in Section II, we detail the proposed methodology in Section III. In Section IV we show that the proposed SuRGe outperforms the current best by an average of 3.51% and 5.45%, respectively, in terms of PSNR [23] and SSIM [24] on four common benchmarks for 4x super-resolution. Further, SuRGe surpasses the state of the art by 15.19% in terms of PSNR on six complex 4x super-resolution datasets.

II Related Works

Deep super-resolution models can be Convolutional Neural Networks (CNNs), GANs, and, more recently, transformers. The CNN-based methods were the first to employ deep networks for super-resolution [6], using convolution maps through the image followed by interpolation methodologies similar to a typical convolutional autoencoder. Though innovative as an initial study and duly credited for a remarkable improvement over the traditional techniques, such CNN architectures suffer from poor generalizability and thus are domain-dependent [4]. As a remedy, WDRN [25] employs distinct wavelet features and their adaptive mixture for better super-resolution performance.

The shortcomings of CNNs can be addressed using GANs [8] with task-specific modifications. This line of research diverges into three primary avenues. First, removing normalization and introducing dense blocks in the generator [26] indeed improves the SR image quality, although at a greater computational cost. Second, dense networks can be replaced with a residual backbone [27] that utilizes skip connections. Even though such networks are considerably easier to train, their full potential may not be realized without carefully curating the skip connections and feature mixing that best aid the super-resolution task. Third, additional preprocessing such as blurring and specialized noise injection [28, 16] produces augmentations that are close to real-life scenarios and can lead to enhanced SR output. Unfortunately, this also purposefully distorts the input distribution, which may in turn sacrifice clarity and detail. Separately, [29] introduces a perceptual similarity loss that depends directly on the generalization of an external pre-trained network for optimizing the generator. In summary, such methods typically suffer from loss of minute details [16] or distorted boundaries [29]. Moreover, all of these methods perform the required up-scaling at once at the end of the network. Thus, any possible distortion during the drastic up-scaling cannot be mitigated by the network. Furthermore, the ever-improving GAN variants remain mostly unexplored in the context of super-resolution.

SAN [30] introduces channel attention in CNNs, opening the gate for transformer networks in super-resolution. In support of better performance, transformers not only adaptively mix diversely informative features through attention [17, 31] but also mitigate SR output distortions using layer normalization. In [32], the idea of cross-attention is proposed, which was later improved in [33] to mitigate the adverse impacts of uncontrolled mixing of distinct features. DAT [34] introduces a novel transformer model that aggregates features through inter-block spatial and intra-block channel attentions. In essence, it introduces the Adaptive Interaction Module (AIM) and the Spatial-Gate Feed-Forward Network (SGFN) for tailored feature aggregation at different levels. Later, SR-Former [35] attempts to improve on [34] by focusing on Permuted Self-Attention (PSA) for a more balanced approach to feature aggregation through channel and spatial attentions.

III Proposed Method

Typically, an image matrix (or vector if flattened) resides in a lower-dimensional ambient space $\mathcal{M}_1$ [36], i.e., a low-resolution (LR) image $\mathbf{x} \in \mathcal{M}_1$. Thus, a higher-resolution (HR) version of $\mathbf{x}$ exists as $\mathbf{y} \in \mathcal{M}_2$ that improves definition. The estimated form of $\mathbf{y}$ is known as the super-resolution output SR. Commonly $\mathcal{M}_1 \subset \mathbb{R}^{whc}$ and $\mathcal{M}_2 \subset \mathbb{R}^{w'h'c}$, where $w' = wr$, $h' = hr$, and $r \in \mathbb{Z}^{+}$ is the multiplicative scaling factor [37] denoting the extent of magnification from LR to HR (or SR).
A GAN-based super-resolution method, given input $\mathbf{x}$, searches for a generator $G \in \{G_{\theta} : \mathcal{M}_1 \rightarrow \mathcal{M}_2 \mid \theta \in \Theta\}$ that minimizes the discrepancy between the SR $G(\mathbf{x})$ and its HR analog $\mathbf{y}$. A discriminator $D$ guides $G$ by providing feedback through distinguishing $\mathbf{y}$ and $G(\mathbf{x})$. We further denote the distributions of LR, HR, and SR as $p_{\mathbf{x}}$, $p_{\mathbf{y}}$, and $p_{G(\mathbf{x})}$ respectively.
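
As a quick illustration of these dimensions, assume (purely for the sake of example; the sizes are not taken from the paper) an RGB LR image with $w = h = 64$, $c = 3$, and a scaling factor $r = 4$:

$$
w' = wr = 256, \qquad h' = hr = 256, \qquad whc = 64 \cdot 64 \cdot 3 = 12{,}288, \qquad w'h'c = 256 \cdot 256 \cdot 3 = 196{,}608.
$$

The ambient dimension therefore grows by a factor of $r^2 = 16$ from $\mathcal{M}_1$ to $\mathcal{M}_2$.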

Figure 2: The schematic of SuRGe shows its two main components: (a) the generator $G$ and (b) the discriminator $D$. Moreover, in (c), we detail the structure of our sub-network Repetitive Residual Block used in $G$ and $D$. $G$ takes an LR image $\mathbf{x}$ and generates a 4x up-scaled SR image $G(\mathbf{x})$. $D$ guides $G$ by distinguishing an input between HR ground truth $\mathbf{y}$ and SR $G(\mathbf{x})$. Further details on network design can be found in the Detailed architecture of the SuRGe network.

III-A The Architecture of $G$

The generator $G$ aims to recover $\mathbf{y}$ from $\mathbf{x}$ under the commonly used constraint of $r = 4$, i.e., through 4x super-resolution [6, 8, 26, 17, 32]. Unlike the popular practice [17, 32] of performing 4x up-scaling in one shot at the end, $G$ in SuRGe performs the same in two steps, i.e., a 2x up-scaling (see Figure 2) at the end of each half. This way, the right half of $G$ can mitigate the possible abrupt distortions of the feature space due to the first 2x up-scaling and reuse features from the left half (through skip connections) to recover corrupted information. We illustrate $G$ in Figure 2, highlighting the key components, and detail them individually in the following.

The initial convolution block $(C_0)$ extracts the low-level features using larger kernels with half-padding, providing two benefits. (1) Repetitive information and its variation over a larger region can be better captured [38]. (2) Possible distortions near the image boundaries can be avoided [6]. Moreover, we use parametric ReLU to allow the distinct layers to have different non-linearities for better conservation of low-level features. Further, we discard normalization to avoid information loss through regularization and scaling.

The repetitive residual generator block $(R_0)$ focuses on extracting high-level intricate features using smaller kernels. It contains $n_G$ residual blocks [11], each having two sub-blocks. The first sub-block alone uses parametric ReLU activation while both employ batch normalization to induce regularization and limit covariate shift. The outputs of the two sub-blocks are added through a skip connection and passed to the next residual block. The inter-sub-block skip connection ensures that the features after each convolution and normalization at least retain the extracted information, if not able to enrich it further.

The $n_G$-th block of $R_0$ sums the outputs of its two sub-blocks. This distorted feature space, similar to a typical ResNet, must be further stabilized before being processed in the next stage. However, the common remedy of average pooling fails in super-resolution because such a kernel is not learned and the down-scaling does not align with the task objective. Therefore, the intermediate convolution block $(I_0)$ is added to stabilize the output of $R_0$ by additional convolutions with batch normalization.

At the outputs of $C_0$ and $I_0$, we respectively have low-level and high-level features. Thus, we use the first weighted feature mixing module $F_0$ to combine these two features before up-scaling. $F_0$ performs a simple convex combination as:

$$F_0 = w^{(F_0)}_1 C_0 + w^{(F_0)}_2 I_0(R_0(C_0)), \tag{1}$$

where $w^{(F_0)}_1, w^{(F_0)}_2 > 0$ and $w^{(F_0)}_1 + w^{(F_0)}_2 = 1$. We learn both $w^{(F_0)}_1$ and $w^{(F_0)}_2$ as parameters of $G$, while the convexity constraint is ensured by passing the weights through a Softmax activation.
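
The convex mixing of Eq. (1) can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation: the names `softmax` and `convex_mix`, the use of nested lists as 2-D feature maps, and the toy values are all assumptions made here. In a real network the logits would be trainable parameters updated by back-propagation.

```python
import math

def softmax(logits):
    """Map unconstrained real numbers to positive weights that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def convex_mix(feature_maps, logits):
    """Element-wise convex combination of equally shaped 2-D feature maps."""
    weights = softmax(logits)
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [sum(w * fm[i][j] for w, fm in zip(weights, feature_maps))
         for j in range(cols)]
        for i in range(rows)
    ]

# Hypothetical stand-ins for the outputs of C_0 and I_0(R_0(C_0)).
low_level = [[1.0, 2.0], [3.0, 4.0]]
high_level = [[5.0, 6.0], [7.0, 8.0]]

# Equal logits give equal weights, so the result is the element-wise mean.
mixed = convex_mix([low_level, high_level], logits=[0.0, 0.0])
```

Because `convex_mix` accepts any number of feature maps, the same helper also covers a three-input mixing module: pass three maps and three logits.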

Figure 3: In comparison to the HR patch (a) of a butterfly image, the checkerboard pattern introduced by PixelShuffle [9] is apparent in (b). Nearest neighbor up-scaling in SuRGe generates a clean $G(\mathbf{x})$, as evident from our result in (c).

The output of $F_0$ is first stabilized with convolution and then passed to $U_0$ for 2x up-scaling. The commonly used up-scaling techniques such as PixelShuffle and transposed convolutions, though effective otherwise, are likely to distort SR as in Figure 3. This is because the overlapped kernels may introduce uneven convolution that results in higher-frequency color patterns, like a checkerboard, in the border pixels of the kernel mapping. Hence, we employ the interpolation-based nearest neighbors method [1] for up-scaling.
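
Nearest-neighbor up-scaling has no learned kernel and no overlapping receptive fields: every pixel is simply repeated. The following sketch shows the operation on a single-channel patch stored as nested lists; the function name `upscale_nearest` and the toy patch are assumptions for illustration only.

```python
def upscale_nearest(image, factor=2):
    """Repeat every pixel `factor` times along both axes."""
    out = []
    for row in image:
        # Repeat each pixel horizontally.
        wide = [pixel for pixel in row for _ in range(factor)]
        # Repeat the widened row vertically (copy so rows stay independent).
        out.extend(list(wide) for _ in range(factor))
    return out

patch = [[1, 2],
         [3, 4]]
upscaled = upscale_nearest(patch, factor=2)
# upscaled is a 4x4 patch in which each source pixel fills a 2x2 block.
```

Applying the function twice with `factor=2` yields the overall 4x magnification that the two-step design targets.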

Except for $F_1$, the rest of the right half of $G$ is identical to the left. The right or second half of $G$ starts with an initial convolution layer $C_1$ that stabilizes the $U_0$ output. In an attempt to recover any lost or corrupted information at this stage, we add the output of $C_1$ to the 2x up-sampled output of $C_0$ after passing it through a skip block $S$. The sum is then propagated through the $R_1$ and $I_1$ layers. $F_1$ takes the following three inputs: (1) The $C_0$ output, 2x up-scaled by a skip block $S$ having a structure similar to $U \circ C$, which recovers the low-level features at the end of the network. (2) The output of $I_1$. (3) The output of $C_1$. Similar to $F_0$, here also we perform a convex combination of the three inputs as follows:

$$F_1 = w^{(F_1)}_1 I_1 + w^{(F_1)}_2 C_1 + w^{(F_1)}_3 S(C_0), \tag{2}$$

where the three weights $w^{(F_1)}_1$, $w^{(F_1)}_2$, and $w^{(F_1)}_3$ are constrained to be convex similarly to their counterparts in $F_0$ and are thus learned in the same way. The output of $F_1$ is 2x up-scaled by $U_1$ and stabilized by further convolutions (without batch normalization or activation to avoid distortion) to produce the SR output $G(\mathbf{x})$.

III-B The Architecture of $D$

As demonstrated in Figure 2, $D$ has two main components: a sub-network called Repetitive Residual Discriminator Block $B$ and a classification head $H$. The structure of $B$ mostly follows $R$, as it contains $n_D$ residual blocks, each with two sub-blocks connected by an inter-sub-block skip connection. Maintaining near structural similarity between $G$ and $D$ makes it likely that the same input has close embeddings in the learned feature space. Thus, $D$ can easily identify a deviation of $G(\mathbf{x})$ from $\mathbf{y}$ and improve $G$ through more useful feedback.

There are three key differences between $R$ and $B$. (1) $B$ uses LeakyReLU activation to prevent sparse or scattered gradients [7]. (2) $B$ employs Pixel normalization [39], as batch normalization is known to cause quality issues in a super-resolution task when used in $D$ [10, 26]. (3) The number of filters and the convolution kernel size are gradually increased over the residual blocks. This not only improves the balanced capture of low-level and high-level information but also prevents over-fitting by removing bias toward a particular kernel size.

The classification head $H$ first performs an adaptive average pooling on the output of $B$. The pooled features are then flattened and passed through dense layers with LeakyReLU activation. The final dense layer maps the features to a single node and applies Sigmoid activation on the logit to find the probability of the input being HR ground truth.
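
The data flow through the head (pool, flatten, dense with LeakyReLU, single logit, Sigmoid) can be sketched as follows. Several details here are assumptions, since this section does not specify them: the pooling target is taken to be 1x1 per channel, a single hidden dense layer is used, the LeakyReLU slope is set to 0.2, and all weights are hand-picked toy values rather than learned ones.

```python
import math

def leaky_relu(v, slope=0.2):  # slope is an assumed value
    return v if v > 0 else slope * v

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def global_average_pool(feature_maps):
    """Reduce each 2-D channel to its mean (pooling to a 1x1 output)."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_maps]

def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(w_row, inputs)) + b
            for w_row, b in zip(weights, biases)]

def classification_head(feature_maps, hidden_w, hidden_b, out_w, out_b):
    pooled = global_average_pool(feature_maps)   # already flat: one value per channel
    hidden = [leaky_relu(v) for v in dense(pooled, hidden_w, hidden_b)]
    logit = dense(hidden, [out_w], [out_b])[0]   # single output node
    return sigmoid(logit)                        # probability of "HR ground truth"

# Two hypothetical 2x2 channels standing in for the output of B.
features = [[[1.0, 3.0], [5.0, 7.0]],
            [[0.0, 0.0], [0.0, 0.0]]]
prob = classification_head(
    features,
    hidden_w=[[0.5, 0.0], [-0.5, 0.0]], hidden_b=[0.0, 0.0],
    out_w=[1.0, 1.0], out_b=0.0,
)
```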

III-C Loss functions of SuRGe

SuRGe, embodying the GAN philosophy, has tailor-made losses for the generator $G$ and the discriminator $D$.

III-C1 Loss for generator $G$:

To receive guidance from $D$, $G$ utilizes a traditional adversarial loss $\mathcal{L}^{G}_{a}$ defined as:

$$\mathcal{L}^{G}_{a} = -\sum\nolimits_{\mathbf{x} \in N} \log D(G(\mathbf{x})), \tag{3}$$

where $N$ is a training batch.
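
Eq. (3) can be evaluated directly once the discriminator's probabilities for a batch of generated images are available. The sketch below assumes those probabilities are given as a plain list; the function name and the small clamping constant `eps` (added only to avoid `log(0)`) are not from the paper.

```python
import math

def generator_adversarial_loss(disc_probs, eps=1e-12):
    """Negative sum of log D(G(x)) over a batch, following Eq. (3).

    disc_probs: D(G(x)) for every x in the batch, each in [0, 1].
    eps: assumed clamp that keeps the logarithm finite.
    """
    return -sum(math.log(max(p, eps)) for p in disc_probs)

# The loss shrinks as the discriminator is increasingly fooled.
loss_fooled = generator_adversarial_loss([0.9, 0.8])
loss_caught = generator_adversarial_loss([0.1, 0.2])
```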

The classical $\mathcal{L}^{G}_{a}$, though necessary, is not sufficient for maintaining the desired perceptual quality of the SR output when deployed alone. A common remedy [8, 27] is to additionally minimize the discrepancy between HR and SR in the embedding space of a pre-trained deep network that is likely capable of expressing perceptual information. Evidently, the efficacy of such a loss is reliant on the quality and generalizability of the pre-trained embedding space [26]. However, $p_{\mathbf{y}}$ and $p_{G(\mathbf{x})}$, respectively the distributions of HR and SR, are in practice supported on the same ambient space. Hence, directly minimizing their divergence using a symmetric measure like JS motivates $G(\mathbf{x})$ to resemble $\mathbf{y}$:

$$\mathcal{L}^{G}_{\textrm{JS}} = \frac{1}{2}\mathbb{E}_{p_{\mathbf{y}}}\left[\log(p_{\mathbf{y}}) - \log\left(\frac{p_{\mathbf{y}} + p_{G(\mathbf{x})}}{2}\right)\right] + \frac{1}{2}\mathbb{E}_{p_{G(\mathbf{x})}}\left[\log(p_{G(\mathbf{x})}) - \log\left(\frac{p_{G(\mathbf{x})} + p_{\mathbf{y}}}{2}\right)\right]. \tag{4}$$
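
For intuition, Eq. (4) can be computed exactly when both distributions are discrete. The sketch below assumes that $p_{\mathbf{y}}$ and $p_{G(\mathbf{x})}$ have already been reduced to histograms over the same bins; how SuRGe estimates these distributions from image batches is not described in this section, so the histogram inputs are purely illustrative.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical intensity histograms for HR and SR batches.
p_hr = [0.1, 0.4, 0.5]
p_sr = [0.3, 0.3, 0.4]
js = js_divergence(p_hr, p_sr)
```

With natural logarithms the value lies between 0 (identical distributions) and $\log 2$ (disjoint supports), and it is symmetric in its two arguments.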

As the name suggests, at the heart of the super-resolution problem lies the task of learning a meaningful transformation $G$ that refines LR images visually. The optimization, however, is constrained by the need to preserve semantic features. Such information in a set of samples is stored not only in the coordinate entries of the vectors but also in their local geometry. As such, a generative model becomes a true SR architecture on the basis of its capacity to keep the metric measure spaces corresponding to $p_{\mathbf{x}}$ and $p_{G(\mathbf{x})}$ near-isometric. The divergence that enables penalizing the deviation from such an ideal scenario is GW. Thus, in SuRGe, we integrate the GW loss in training $G$:

$$\mathcal{L}^{G}_{\textrm{GW}} = \min_{\gamma \in \Gamma} \int \left| d_1(\mathbf{x}, \tilde{\mathbf{x}}) - d_2(\mathbf{z}, \tilde{\mathbf{z}}) \right|^{2} d\gamma(\mathbf{x}, \mathbf{z})\, d\gamma(\tilde{\mathbf{x}}, \tilde{\mathbf{z}}), \tag{5}$$

where $\Gamma$ is the set of couplings between the distributions $p_{\mathbf{x}}$ and $p_{G(\mathbf{x})}$, while $\mathbf{x}, \tilde{\mathbf{x}} \sim p_{\mathbf{x}}$ and $\mathbf{z}, \tilde{\mathbf{z}} \sim p_{G(\mathbf{x})}$. Also, $d_1, d_2$ are the metrics on the spaces $\mathcal{M}_1$ and $\mathcal{M}_2$ respectively.
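
For finite samples, the integral in Eq. (5) becomes a quadruple sum over pairwise distance matrices. The sketch below evaluates that sum for a given coupling and, for tiny equally sized samples with uniform weights, brute-forces the minimum over permutation couplings only. This restriction is an assumption made for readability: it yields an upper bound on the true minimum over all couplings, and it is not the solver used by the authors, which this section does not specify.

```python
from itertools import permutations

def gw_objective(d1, d2, coupling):
    """Sum over (i, j, k, l) of |d1[i][k] - d2[j][l]|^2 * g[i][j] * g[k][l]."""
    n, m = len(d1), len(d2)
    total = 0.0
    for i in range(n):
        for j in range(m):
            if coupling[i][j] == 0.0:
                continue  # zero-mass pairs contribute nothing
            for k in range(n):
                for l in range(m):
                    diff = d1[i][k] - d2[j][l]
                    total += diff * diff * coupling[i][j] * coupling[k][l]
    return total

def gw_over_permutations(d1, d2):
    """Minimum of the objective over permutation couplings (uniform weights)."""
    n = len(d1)
    best = float("inf")
    for perm in permutations(range(n)):
        coupling = [[1.0 / n if perm[i] == j else 0.0 for j in range(n)]
                    for i in range(n)]
        best = min(best, gw_objective(d1, d2, coupling))
    return best

# Pairwise distances of three hypothetical LR samples (points 0, 1, 3 on a line).
d_lr = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
# The same geometry with samples listed in a different order (isometric).
d_sr_iso = [[0, 3, 2], [3, 0, 1], [2, 1, 0]]
# The same samples with every distance doubled (not isometric).
d_sr_scaled = [[0, 2, 6], [2, 0, 4], [6, 4, 0]]

gw_iso = gw_over_permutations(d_lr, d_sr_iso)        # geometry preserved
gw_scaled = gw_over_permutations(d_lr, d_sr_scaled)  # geometry distorted
```

Only the distance matrices enter the computation, which is why the two spaces may have different ambient dimensions.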

Tuning a set of static weights to combine the three loss components in $\mathcal{L}^{G}$ is not only tedious but also inefficient, as static weights are oblivious to the changing training dynamics. Learning the weights as network parameters may also bias the training towards a particular component. Thus, we employ a convex combination of the three loss components where the weights are dynamically calculated [20]. Specifically, the values of the three loss components are passed through Softmax. As such, the weight dynamically assigned to a loss component depends on its current value, so that at any point in training the weights adjust to the loss values in order to prevent one component from dominating the others in the combined $\mathcal{L}^{G}$. In essence, at each iteration of training:

$$\mathcal{L}^{G} = w_{a}\mathcal{L}_{a}^{G} + w_{\textrm{JS}}\mathcal{L}_{\textrm{JS}}^{G} + w_{\textrm{GW}}\mathcal{L}_{\textrm{GW}}^{G}, \quad \text{where} \quad w_{(\cdot)} = \frac{\exp(\mathcal{L}_{(\cdot)}^{G})}{\exp(\mathcal{L}_{a}^{G}) + \exp(\mathcal{L}_{\textrm{JS}}^{G}) + \exp(\mathcal{L}_{\textrm{GW}}^{G})}. \tag{6}$$

Here, $w_{(\cdot)}$ can be $w_{a}$, $w_{\textrm{JS}}$, or $w_{\textrm{GW}}$, with $\mathcal{L}_{(\cdot)}^{G}$ correspondingly set to $\mathcal{L}_{a}^{G}$, $\mathcal{L}_{\textrm{JS}}^{G}$, or $\mathcal{L}_{\textrm{GW}}^{G}$.
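The Softmax-based weighting of equation (6) can be sketched as follows. This is a minimal illustration on plain scalars, assuming the three loss values have already been computed; the function name `dynamic_generator_loss` and the example values are hypothetical. Whether the weights are treated as constants during back-propagation is not stated in the text, so the sketch takes no position on that.

```python
import numpy as np

def dynamic_generator_loss(loss_a, loss_js, loss_gw):
    """Combine three scalar losses with Softmax weights as in equation (6)."""
    losses = np.array([loss_a, loss_js, loss_gw], dtype=float)
    shifted = losses - losses.max()          # stability shift; Softmax is invariant to it
    weights = np.exp(shifted) / np.exp(shifted).sum()
    return float(weights @ losses), weights

total, weights = dynamic_generator_loss(0.7, 0.2, 1.5)   # made-up loss values
print(weights, total)
```

The weights are non-negative and sum to one, so the combined loss always lies between the smallest and the largest component. Under equation (6), the component with the largest current value receives the largest weight.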

III-C2 Loss for discriminator $D$:

We draw inspiration from WGANs' promise of improving generation quality by deploying the Wasserstein-1 distance (WD) to distinguish between 'real' and 'fake' samples. The underlying class of critic functions, being $k$-Lipschitz continuous, additionally mitigates mode collapse [40]. However, maintaining $k$-Lipschitz continuity during training is difficult as it requires limiting the gradients of $D$. WGAN achieves this by a weight-clipping heuristic that sacrifices model complexity. As a better alternative, WGAN-GP [22] puts a constraint on the gradient itself, which can be expressed as a regularizer called the gradient penalty. Thus, $\mathcal{L}^{D}$ can be written as follows:

$$\mathcal{L}^{D} = \mathbb{E}_{p_{\mathbf{x}}} D(G(\mathbf{x})) - \mathbb{E}_{p_{\mathbf{y}}} D(\mathbf{y}) + \lambda \, \mathbb{E}\left(\|\nabla_{\hat{\mathbf{x}}} D(\hat{\mathbf{x}})\|_{2} - 1\right)^{2}, \tag{7}$$

where $\hat{\mathbf{x}} = \epsilon\mathbf{y} + (1-\epsilon)G(\mathbf{x})$, and $\epsilon \sim \textrm{Uniform}(0,1)$.
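As an illustration of equation (7), the sketch below evaluates the discriminator loss with gradient penalty for a toy critic whose input gradient is available in closed form. Everything here is an assumption made for the example: the critic $D(\mathbf{x}) = \tanh(\langle \mathbf{w}, \mathbf{x} \rangle)$, the function names, the random stand-in data, and the default $\lambda = 10$ (the value commonly used with WGAN-GP; the value used in SuRGe is not given in this section). In practice, an automatic differentiation framework would supply the gradient of the actual convolutional $D$.

```python
import numpy as np

def critic(x, w):
    """Toy critic D(x) = tanh(<w, x>) on flattened images of shape (n, dim)."""
    return np.tanh(x @ w)

def critic_grad(x, w):
    """Analytic gradient of the toy critic with respect to each input row."""
    return (1.0 - np.tanh(x @ w) ** 2)[:, None] * w[None, :]

def discriminator_loss(sr, hr, w, rng, lam=10.0):
    """E[D(G(x))] - E[D(y)] + lam * E[(||grad D(x_hat)||_2 - 1)^2], as in equation (7)."""
    eps = rng.uniform(0.0, 1.0, size=(hr.shape[0], 1))   # one epsilon per sample
    x_hat = eps * hr + (1.0 - eps) * sr                  # interpolate between HR and SR
    grad_norm = np.linalg.norm(critic_grad(x_hat, w), axis=1)
    penalty = ((grad_norm - 1.0) ** 2).mean()
    return float(critic(sr, w).mean() - critic(hr, w).mean() + lam * penalty)

rng = np.random.default_rng(0)
hr = rng.normal(size=(8, 16))               # stand-in for flattened HR patches y
sr = hr + 0.1 * rng.normal(size=(8, 16))    # stand-in for generator outputs G(x)
w = rng.normal(size=16) / 4.0               # hypothetical critic parameters
loss_d = discriminator_loss(sr, hr, w, rng)
print(loss_d)
```

The penalty term vanishes only when the gradient norm at the interpolated points equals one, which is how the soft Lipschitz constraint enters the loss without weight clipping.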

Figure 4: We extract a patch from the HR image as $\mathbf{y}$ and down-scale it to obtain the LR input $\mathbf{x}$. The input $\mathbf{x}$ is fed to $G$ to obtain the 4x SR output $G(\mathbf{x})$. $G(\mathbf{x})$ is used to compute $\mathcal{L}^{G}_{a}$ using equation (3), $G(\mathbf{x})$ and $\mathbf{y}$ together are used to compute $\mathcal{L}^{G}_{\textrm{JS}}$ using equation (4), and $\mathbf{x}$ with $G(\mathbf{x})$ gives $\mathcal{L}^{G}_{\textrm{GW}}$ using equation (5). We take the Softmax-based dynamic convex combination of $\mathcal{L}^{G}_{a}$, $\mathcal{L}^{G}_{\textrm{JS}}$, and $\mathcal{L}^{G}_{\textrm{GW}}$ as per equation (6) to find $\mathcal{L}^{G}$ and update $G$. For updating $D$, we use $\mathbf{y}$ and $G(\mathbf{x})$ to calculate $\mathcal{L}^{D}$ using equation (7).

III-D Putting it all together

The workflow of SuRGe is illustrated in Figure 4, while the algorithm is described in Algorithm 1 in Algorithm of SuRGe. We follow patch-based training [37, 8] in SuRGe. The idea is to extract an overlapping $256 \times 256$ patch of the HR ground truth as $\mathbf{y}$ and down-scale it 4x with bi-cubic interpolation to $64 \times 64$ to get the corresponding LR input $\mathbf{x}$. The training strategy of SuRGe is similar to that of a vanilla GAN [7]. Thus, $G$ and $D$ are alternately updated with the gradients of the respective losses $\mathcal{L}^{G}$ and $\mathcal{L}^{D}$.
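The construction of LR-HR training pairs described above can be sketched as follows. The patch size of 256 and the 4x factor come from the text; the stride of 128, the synthetic input image, and the function names are assumptions. To keep the example dependency-free, block averaging is used as a stand-in for the bi-cubic down-scaling used in SuRGe, so the LR values differ from what bi-cubic interpolation would give.

```python
import numpy as np

def extract_patches(hr_image, size=256, stride=128):
    """Return overlapping size x size patches from an (H, W, C) array."""
    h, w = hr_image.shape[:2]
    return [hr_image[top:top + size, left:left + size]
            for top in range(0, h - size + 1, stride)
            for left in range(0, w - size + 1, stride)]

def box_downscale(patch, factor=4):
    """Average non-overlapping factor x factor blocks (a stand-in for bi-cubic)."""
    h, w, c = patch.shape
    blocks = patch.reshape(h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(1, 3))

hr_image = np.random.default_rng(0).random((512, 512, 3))   # synthetic HR image
pairs = [(box_downscale(y), y) for y in extract_patches(hr_image)]
print(len(pairs), pairs[0][0].shape, pairs[0][1].shape)
```

Each element of `pairs` holds a $64 \times 64$ LR input $\mathbf{x}$ and its $256 \times 256$ HR target $\mathbf{y}$, which are the two inputs needed by the generator and discriminator losses.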

IV Experiments

IV-A Experimental Protocol

Following standard practice, we train SuRGe on the DIV2K [21] dataset using the 800 training examples. We test SuRGe on four popular benchmarks, namely Set5 [41], Set14 [42], BSD100 [43], and Urban100 [4], along with six additional datasets, viz. Kitti2012, Kitti2015 [44], Middlebury [45], PIRM [46], OST300 [47], and MANGA109 [48]. The details of the network architecture, the datasets, the pre-processing of input, and the hyper-parameter choices and their tuning with grid search are provided in Detailed architecture of the SuRGe network, Description of datasets, and Network architecture selection by grid search. We use PSNR [23] and SSIM [24] to measure the performance of all the methods, both of which are described in Metrics. The code for SuRGe is currently provided as a supplementary archived file to this article; once accepted for publication, the code-base will be uploaded to a public GitHub repository for ease of access and result reproduction.
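For reference, PSNR can be computed as in the minimal sketch below, which uses the standard definition $10 \log_{10}(\mathrm{MAX}^{2}/\mathrm{MSE})$. The function name `psnr`, the 8-bit peak value of 255, and the toy images are assumptions. Protocol details such as the color channel used or border cropping are not stated in this section and are therefore not modeled.

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0.0:
        return float("inf")               # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

hr = np.full((8, 8), 100, dtype=np.uint8)
sr = np.full((8, 8), 101, dtype=np.uint8)   # every pixel off by one grey level
value = psnr(hr, sr)
print(round(value, 2))                      # 48.13
```

Because the scale is logarithmic, a gain of 1 dB corresponds to roughly a 21% reduction in mean squared error.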

TABLE I: Ablation study of SuRGe on the BSD100 dataset in terms of PSNR and SSIM. The gradual improvement in performance with the progressive addition of key components through seven intermediate models $V_{0}$-$V_{6}$ validates their importance in SuRGe.
Model $D_{\textrm{VGG}}$ $D_{\textrm{RES}}$ $D_{3\textrm{K}}$ $D_{\textrm{IK}}$ $\mathcal{L}^{G}_{p}$ $\mathcal{L}^{G}_{\textrm{JS}}$ $\mathcal{L}^{G}_{\textrm{GW}}$ DS $G_{\textrm{CC}}$ $G_{F_{0,1}}$ $\mathcal{W}^{G}_{t}$ $\mathcal{W}^{G}_{l}$ $\mathcal{W}^{G}_{dw}$ PSNR$^{1}$ SSIM$^{1}$
$V_{0}$ 29.61 0.76
$V_{1}$ 30.04 0.81
$V_{2}$ 30.14 0.81
$V_{3}$ RN 29.85 0.83
$V_{4}$ FI 30.99 0.84
$V_{5}$ RN 30.16 0.83
$V_{6}$ FI 31.29 0.86
SuRGe FI 31.52 0.87
  • $D_{\textrm{VGG}}$: $D$ with VGG-type backbone. $D_{\textrm{RES}}$: $D$ with ResNet-type network. $D_{3\textrm{K}}$: Kernel size is set to 3 in $D$. $D_{\textrm{IK}}$: $D$ with incremental kernel size. $\mathcal{L}^{G}_{p}$: Perceptual similarity calculated with ResNet-50 is used as a loss. $\mathcal{L}^{G}_{\textrm{JS}}$: Jensen-Shannon divergence calculated between the SR and HR image batches. $\mathcal{L}^{G}_{\textrm{GW}}$: Gromov-Wasserstein loss calculated between the LR and SR image batches. DS: Space used for divergence measures, which can be ResNet-50 (RN) or Flattened Image (FI). $G_{\textrm{CC}}$: During combination, the feature from the main path is taken in its entirety while the skip connection weights are manually tuned. $G_{F_{0,1}}$: $G$ using the $F_{0,1}$ in SuRGe.
$\mathcal{W}^{G}_{t}$: Summing $\mathcal{L}^{G}_{a}$, $\mathcal{L}^{G}_{\textrm{JS}}$, and $\mathcal{L}^{G}_{\textrm{GW}}$ in the generator loss $\mathcal{L}^{G}$. $\mathcal{W}^{G}_{l}$: Mixing $\mathcal{L}^{G}_{a}$, $\mathcal{L}^{G}_{\textrm{JS}}$, and $\mathcal{L}^{G}_{\textrm{GW}}$ in the generator loss $\mathcal{L}^{G}$ with weights learned as parameters of the network.
$\mathcal{W}^{G}_{dw}$: Mixing $\mathcal{L}^{G}_{a}$, $\mathcal{L}^{G}_{\textrm{JS}}$, and $\mathcal{L}^{G}_{\textrm{GW}}$ with the dynamic weighting in the generator loss $\mathcal{L}^{G}$. $^{1}$: Increment indicates improvement.

Figure 5: We show the generated SR patch for the butterfly (Set5) test instance along with the metrics (on top, as PSNR/SSIM) at intervals of 50 training epochs of SuRGe. The gradual improvement in the SR output of SuRGe with the progress of training is apparent.

IV-B Ablation study

We start with an ablation study of the five critical components in SuRGe, namely the choice of backbone in $D$, the kernel size for convolution in $D$, the choice of loss functions in $G$, the combination strategy of $F_{0,1}$ in $G$, and the loss function in $D$. Table I shows that over the seven intermediate models on the BSD100 dataset, the performance gradually improves in terms of PSNR and SSIM with better choices for the components. The best performance is achieved when all the components act in harmony, validating their importance in SuRGe. Moreover, previous uses of GW [49] argued in favor of representations obtained from a pre-trained deep network to limit the dimensions and improve stability. However, the particular task of super-resolution may benefit from flattened images, as that mitigates the risk of unregulated information alteration through feature extraction. We empirically confirm this through a comparison of $V_{3}$ and $V_{5}$, which pass image features extracted from a pre-trained ResNet-50, against $V_{4}$ and $V_{6}$, which feed flattened images to $\mathcal{L}^{G}_{\textrm{JS}}$ and $\mathcal{L}^{G}_{\textrm{GW}}$. Our experiment shows that using flattened images gives a performance boost compared to the ResNet-50 embedding space: the PSNR rises from 29.85 ($V_{3}$) and 30.16 ($V_{5}$) to 30.99 ($V_{4}$) and 31.29 ($V_{6}$). As an extension, in Figure 5, we further show that the SR output generated by SuRGe progressively improves over training.

TABLE II: Performance comparison of SuRGe in terms of PSNR and SSIM on four benchmarks against notable competitors.
Method Strategy$^{2}$ Set5$^{1}$ Set14$^{1}$ BSD100$^{1}$ Urban100$^{1}$
PSNR$^{3}$ SSIM$^{3}$ PSNR SSIM PSNR SSIM PSNR SSIM
SRCNN CNN 30.49 0.86 27.50 0.75 26.91 0.71 24.53 0.72
SelfExSR CNN -- -- -- -- 26.80 0.71 24.67 0.73
DBPN-RES-MR64-3 CNN 32.65 0.90 29.03 0.79 27.82 0.74 27.08 0.81
SRGAN GAN 29.40 0.85 26.02 0.74 23.16 0.67 -- --
ProSR-L GAN -- -- 28.94 -- 27.68 -- 26.74 --
ESRGAN GAN 32.73 0.90 28.99 0.79 27.85 0.75 27.03 0.82
RankSRGAN GAN -- -- 26.57 0.65 25.57 0.65 -- --
Beby-GAN GAN 27.82 0.80 26.96 0.73 25.81 0.68 25.72 0.77
Gram-GAN GAN 27.97 0.80 26.96 0.77 26.32 0.74 25.89 0.77
SAN CNNA 32.70 0.90 29.05 0.79 27.86 0.75 27.23 0.82
WRAN CNNA 28.60 0.90 28.60 0.79 27.71 0.74 26.74 0.80
SwinIR TRAN 32.92 0.90 29.09 0.79 27.92 0.74 27.45 0.82
SwinIR+ TRAN 32.93 0.90 29.15 0.79 27.95 0.75 27.56 0.83
SwinFIR TRAN 33.20 0.91 29.36 0.79 28.03 0.75 28.12 0.84
LTE TRAN 32.81 -- 29.06 -- 27.86 -- 27.24 --
HAT-L TRAN 33.30 0.90 29.47 0.80 28.09 0.76 28.60 0.85
DAT+ TRAN 33.15 0.91 29.29 0.80 28.03 0.75 27.99 0.84
SRFormer+ TRAN 33.09 0.91 29.19 0.80 28.00 0.75 27.85 0.84
SuRGe (Ours) GAN 33.07 0.91 30.21 0.83 31.52 0.87 30.11 0.90
$^{1}$ Boldfaced: best, Underlined: second best. $^{2}$ CNNA: CNN+Attention, TRAN: Transformer.
$^{3}$ Increment indicates improvement.
TABLE III: Performance comparison of SuRGe in terms of PSNR and SSIM on six additional test datasets.
Dataset Method PSNR ($\uparrow$) SSIM ($\uparrow$)
PIRM ESRGAN+ 24.15 --
RankSRGAN 25.62 --
SuRGe (Ours) 31.92 0.90
OST300 ESRGAN+ 23.80 --
SuRGe (Ours) 31.01 0.86
Kitti2012 NAFSSR-L 27.12 0.82
SwinFIR 26.83 0.81
SuRGe (Ours) 32.31 0.88
Kitti2015 NAFSSR-L 26.96 0.82
SwinFIR 26.00 0.80
SuRGe (Ours) 31.12 0.89
Middlebury NAFSSR-L 30.20 0.85
SwinFIR 30.01 0.86
SuRGe (Ours) 35.72 0.93
MANGA109 SRCNN 27.66 0.86
DBPN-RES-MR64-3 31.74 0.92
HAT-L 33.09 0.93
HAT 32.87 0.93
SwinFIR 32.83 0.93
SwinIR+ 32.22 0.92
SAN 31.66 0.92
SuRGe (Ours) 34.17 0.95
  • The best result is boldfaced, while the second best is underlined.

IV-C Quantitative performance of SuRGe

In Table II, we exhibit the efficacy of the proposed SuRGe on the Set5, Set14, BSD100, and Urban100 benchmarks in terms of PSNR and SSIM. For comparison, we select 18 state-of-the-art methods from 4 groups. (1) CNN-based: SRCNN [6], SelfExSR [4], and DBPN-RES-MR64-3 [50]. (2) GAN-based: SRGAN [8], ProSR-L [37], ESRGAN [26], RankSRGAN [51], Beby-GAN [29], and Gram-GAN [52]. (3) CNN with attention-based: SAN [30] and WRAN [53]. (4) Transformer-based: SwinIR and SwinIR+ [17], LTE [18], SwinFIR [31], HAT-L [33], DAT+ [34], and SRFormer+ [35]. Table II shows that on all the datasets except Set5, SuRGe performs better in terms of both indices, attesting to its consistency. Over the best competitor, SuRGe improves PSNR by 1.89 dB and SSIM by 0.063 (6.3 percentage points on a 0-100 scale) on average across Set14, BSD100, and Urban100. On Set5, even though SuRGe achieves the best SSIM jointly with the transformer-based SwinFIR, DAT+, and SRFormer+, its PSNR is slightly lower than that of four competitors (HAT-L, SwinFIR, DAT+, and SRFormer+). This may be attributed to the exceptionally small LR images in Set5. Such an LR image packs more intricate details into fewer pixels. Consequently, the SR output of SuRGe retains most of the high-level visual similarity to attain a commendable SSIM, while the loss of some details is apparent from the slightly lower PSNR.
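The averaged margins quoted above can be re-derived from Table II. The short sketch below takes, for each of Set14, BSD100, and Urban100, the SuRGe entry and the best competing entry (attained by HAT-L in each case) and averages the differences. The variable and function names are illustrative.

```python
# (SuRGe, best competitor) entries copied from Table II.
psnr_pairs = {"Set14": (30.21, 29.47), "BSD100": (31.52, 28.09), "Urban100": (30.11, 28.60)}
ssim_pairs = {"Set14": (0.83, 0.80), "BSD100": (0.87, 0.76), "Urban100": (0.90, 0.85)}

def mean_margin(pairs):
    """Average of (SuRGe - best competitor) over the datasets."""
    return sum(ours - best for ours, best in pairs.values()) / len(pairs)

psnr_gain = mean_margin(psnr_pairs)
ssim_gain = mean_margin(ssim_pairs)
print(f"PSNR: +{psnr_gain:.2f} dB, SSIM: +{ssim_gain:.3f}")   # PSNR: +1.89 dB, SSIM: +0.063
```

The PSNR margin is therefore an absolute difference in dB, while the SSIM margin of 0.063 corresponds to 6.3 points when SSIM is expressed on a 0-100 scale.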

We further evaluate the efficacy of SuRGe on Kitti2012, Kitti2015, Middlebury, PIRM, OST300, and MANGA109. We compare the performance of SuRGe in terms of PSNR and SSIM in Table III against ten notable contenders viz. SRCNN, DBPN-RES-MR64-3, ESRGAN+ [27], SAN, RankSRGAN [51], NAFSSR-L [32], SwinIR+, SwinFIR, HAT [33], and HAT-L. We see from Table III that SuRGe achieves better PSNR and SSIM on all six datasets. This establishes the power of SuRGe in consistently generating better quality SR outputs.

IV-D Qualitative comparison

We compare the visual quality of SuRGe against BSRGAN [16], SRGAN, ESRGAN [26], Real-ESRGAN [28], LTE, and SwinIR in Figure 6. From Figure 6, we can make three key observations. (1) SRGAN, LTE, and SwinIR produce comparatively blurrier SR outputs than SuRGe. (2) ESRGAN and Real-ESRGAN, though they preserve finer details, may add more distortion and noise to the SR output. This is apparent from the mustache of the baboon, the eyebrow of the child, and the nails in the comic. (3) BSRGAN provides smooth, apparently attractive SR outputs but fails to conserve details to the extent of SuRGe. Thus, SuRGe produces better and more detailed SR outputs that are closer to the HR ground truths. Additional qualitative results along with quantitative support in favor of SuRGe can be found in Additional Results.

Figure 6: A qualitative comparison of SR outputs generated by SRGAN [8], BSRGAN [16], ESRGAN [26], Real-ESRGAN [28], LTE [18], SwinIR [17], and SuRGe (Ours) on the baboon (Set14), child (Set5), and comic (Set14) samples. The SR output of SuRGe is visually more similar to the HR ground truth for almost every patch of the three images.

V Conclusion and Future Works

We propose SuRGe, a fully-convolutional GAN-based architecture that generates visually attractive 4x super-resolution images with minute details. SuRGe highlights the need for preserving diversely informative features and combining them in a learnable fashion in the super-resolution task. Further, possibly for the first time, SuRGe successfully applies divergence measures such as GW as loss functions in the super-resolution context. Moreover, SuRGe demonstrates how the choices of kernel size, normalization methods, and the location and strategy of up-scaling impact the quality of the generated SR output. Furthermore, the commendable performance of SuRGe comes with a smaller model (in comparison to GAN-based SR methods) and a low inference time, as shown in Additional Results. Although the use of divergence measures considerably improves the performance of SuRGe, such measures may also fall prey to noise present in the images, somewhat compromising robustness in the process. This may be mitigated in the future by either exploring the applicability of robust divergence measures [54] or incorporating remedial techniques like median of means [55]. Moreover, the currently proposed architecture is tailored for 4x super-resolution, as that is the most common and widely popular variant of the task. Generalizing the proposed model to $r$x super-resolution where $r$ is even will likely not pose a significant challenge, though additional correction of the up-sampled output may be required with increasing $r$. However, the case may become more complicated if $r$ is odd. A potential remedy can be in the form of a robust and adaptive up-sampling strategy that will incur minimal distortion and preferably operate in a self-correcting mode to enable a better super-resolution output.

References

  • [1] O. Rukundo and H. Cao, “Nearest neighbor value interpolation,” arXiv preprint arXiv:1211.1768, 2012.
  • [2] C.-Y. Yang, C. Ma, and M.-H. Yang, “Single-image super-resolution: A benchmark,” in European Conference on Computer Vision, 2014, pp. 372–386.
  • [3] P. S. Parsania and P. V. Virparia, “A comparative analysis of image interpolation algorithms,” International Journal of Advanced Research in Computer and Communication Engineering, vol. 5, no. 1, pp. 29–34, 2016.
  • [4] J.-B. Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2015, pp. 5197–5206.
  • [5] M. Haris, G. Shakhnarovich, and N. Ukita, “Deep back-projection networks for super-resolution,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 1664–1673.
  • [6] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2015.
  • [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.
  • [8] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.
  • [9] A. Odena, V. Dumoulin, and C. Olah, “Deconvolution and checkerboard artifacts,” Distill, 2016.
  • [10] Y. Wu and J. Johnson, “Rethinking ‘batch’ in batchnorm,” arXiv preprint arXiv:2105.07576, 2021.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in IEEE/CVF International Conference on Computer Vision, 2015, pp. 1026–1034.
  • [13] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer Assisted Intervention, 2015, pp. 234–241.
  • [14] C. Zhang, F. Rameau, S. Lee, J. Kim, P. Benz, D. M. Argaw, J.-C. Bazin, and I. S. Kweon, “Revisiting residual networks with nonlinear shortcuts.” in British Machine Vision Conference, 2019, p. 12.
  • [15] C. Zhang, P. Benz, D. M. Argaw, S. Lee, J. Kim, F. Rameau, J.-C. Bazin, and I. S. Kweon, “Resnet or densenet? introducing dense shortcuts to resnet,” in IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 3550–3559.
  • [16] J. Gu, H. Lu, W. Zuo, and C. Dong, “Blind super-resolution with iterative kernel correction,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1604–1613.
  • [17] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, “Swinir: Image restoration using swin transformer,” in IEEE/CVF International Conference on Computer Vision (Workshop), 2021, pp. 1833–1844.
  • [18] J. Lee and K. H. Jin, “Local texture estimator for implicit representation function,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1929–1938.
  • [19] F. Mémoli, “Gromov–wasserstein distances and the metric approach to object matching,” Foundations of computational mathematics, vol. 11, pp. 417–487, 2011.
  • [20] S. Datta, S. S. Mullick, A. Chakrabarty, and S. Das, “Interval bound interpolation for few-shot learning with few tasks,” in International Conference on Machine Learning, 2023, pp. 7141–7166.
  • [21] E. Agustsson and R. Timofte, “Ntire 2017 challenge on single image super-resolution: Dataset and study,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (Workshop), 2017, pp. 1122–1131.
  • [22] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [23] A. Horé and D. Ziou, “Image quality metrics: Psnr vs. ssim,” in IEEE International Conference on Pattern Recognition, 2010, pp. 2366–2369.
  • [24] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [25] J. Xin, J. Li, X. Jiang, N. Wang, H. Huang, and X. Gao, “Wavelet-based dual recursive network for image super-resolution,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 2, pp. 707–720, 2020.
  • [26] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy, “Esrgan: Enhanced super-resolution generative adversarial networks,” in European Conference on Computer Vision (Workshop), 2018, pp. 1–16.
  • [27] N. C. Rakotonirina and A. Rasoanaivo, “Esrgan+: Further improving enhanced super-resolution generative adversarial network,” in IEEE ICASSP, 2020, pp. 3637–3641.
  • [28] X. Wang, L. Xie, C. Dong, and Y. Shan, “Real-esrgan: Training real-world blind super-resolution with pure synthetic data,” in IEEE/CVF International Conference on Computer Vision, 2021, pp. 1905–1914.
  • [29] W. Li, K. Zhou, L. Qi, L. Lu, and J. Lu, “Best-buddy gans for highly detailed image super-resolution,” in AAAI Conference on Artificial Intelligence, 2022, pp. 1412–1420.
  • [30] T. Dai, J. Cai, Y. Zhang, S.-T. Xia, and L. Zhang, “Second-order attention network for magnification-arbitrary single image super-resolution,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11 065–11 074.
  • [31] D. Zhang, F. Huang, S. Liu, X. Wang, and Z. Jin, “Swinfir: Revisiting the swinir with fast fourier convolution and improved training for image super-resolution,” arXiv preprint arXiv:2208.11247, 2022.
  • [32] X. Chu, L. Chen, and W. Yu, “Nafssr: stereo image super-resolution using nafnet,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1239–1248.
  • [33] X. Chen, X. Wang, J. Zhou, Y. Qiao, and C. Dong, “Activating more pixels in image super-resolution transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 22 367–22 377.
  • [34] Z. Chen, Y. Zhang, J. Gu, L. Kong, X. Yang, and F. Yu, “Dual aggregation transformer for image super-resolution,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 12 312–12 321.
  • [35] Y. Zhou, Z. Li, C.-L. Guo, S. Bai, M.-M. Cheng, and Q. Hou, “Srformer: Permuted self-attention for single image super-resolution,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 12 780–12 791.
  • [36] P. Pope, C. Zhu, A. Abdelkader, M. Goldblum, and T. Goldstein, “The intrinsic dimension of images and its impact on learning,” in International Conference on Learning Representations, 2021.
  • [37] Y. Wang, F. Perazzi, B. McWilliams, A. Sorkine-Hornung, O. Sorkine-Hornung, and C. Schroers, “A fully progressive approach to single-image super-resolution,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (Workshop), 2018, pp. 864–873.
  • [38] S. Bianco, C. Cusano, and R. Schettini, “Color constancy using cnns,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (Workshop), 2015, pp. 81–89.
  • [39] T. Karras, T. Aila et al., “Progressive growing of gans for improved quality, stability, and variation,” arXiv preprint arXiv:1710.10196, 2017.
  • [40] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in International Conference on Machine Learning, 2017, pp. 214–223.
  • [41] M. Bevilacqua, A. Roumy, C. Guillemot, and M.-L. Alberi-Morel, “Low-complexity single-image super-resolution based on nonnegative neighbor embedding,” in British Machine Vision Conference, 2012, pp. 135.1–135.10.
  • [42] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in International Conference on Curves and Surfaces, 2012, pp. 711–730.
  • [43] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in IEEE/CVF International Conference on Computer Vision, 2001, pp. 416–423.
  • [44] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2012, pp. 3354–3361.
  • [45] D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” International journal of computer vision, vol. 47, pp. 7–42, 2002.
  • [46] Y. Blau, R. Mechrez, R. Timofte, T. Michaeli, and L. Zelnik-Manor, “The 2018 pirm challenge on perceptual image super-resolution,” in European Conference on Computer Vision (Workshop), 2018, pp. 1–22.
  • [47] X. Wang, K. Yu, C. Dong, and C. C. Loy, “Recovering realistic texture in image super-resolution by deep spatial feature transform,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 606–615.
  • [48] Y. Matsui, K. Ito et al., “Sketch-based manga retrieval using manga109 dataset,” Multimedia Tools and Applications, vol. 76, pp. 21 811–21 838, 2017.
  • [49] C. Bunne, D. Alvarez-Melis, A. Krause, and S. Jegelka, “Learning generative models across incomparable spaces,” in International Conference on Machine Learning, 2019, pp. 851–861.
  • [50] M. Haris, G. Shakhnarovich, and N. Ukita, “Deep back-projection networks for single image super-resolution,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 12, pp. 4323–4337, 2021.
  • [51] W. Zhang, Y. Liu, C. Dong, and Y. Qiao, “Ranksrgan: Generative adversarial networks with ranker for image super-resolution,” in IEEE/CVF International Conference on Computer Vision, 2019, pp. 3096–3105.
  • [52] J. Song, H. Yi, W. Xu, B. Li, and X. Li, “Gram-gan: Image super-resolution based on gram matrix and discriminator perceptual loss,” Sensors, vol. 23, no. 4, 2023.
  • [53] S. Xue, W. Qiu, F. Liu, and X. Jin, “Wavelet-based residual attention network for image super-resolution,” Neurocomputing, vol. 382, pp. 116–126, 2020.
  • [54] Y. He, A. B. Hamza, and H. Krim, “A generalized divergence measure for robust image registration,” IEEE Transactions on Signal Processing, vol. 51, no. 5, pp. 1211–1220, 2003.
  • [55] G. Lecué and M. Lerasle, “Robust machine learning by median-of-means: Theory and practice,” The Annals of Statistics, vol. 48, no. 2, pp. 906 – 931, 2020.

Detailed architecture of the SuRGe network

Refer to Figures 7 and 8 for detailed schematics of the architectures of the generator $G$ and the discriminator $D$ of SuRGe, respectively.

-A Architecture of Generator

We present the architecture of the generator $G$ of SuRGe. Taking a random LR patch from Baboon as an example input, the model constructed with $n_G = 8$ outputs an SR patch by passing it through the different blocks indicated by the legend.

Figure 7: The architecture of $G$ in SuRGe, with the individual components described in detail.

-B Architecture of Discriminator

Figure 8 shows the discriminator model $D$ of SuRGe. All blocks presented in the figure follow the naming convention discussed in Section III-B of the main paper. The discriminator $D$ uses $n_D = 4$ along with an architecture of sub-blocks that is structurally similar to that of the generator $G$ except for the presence of normalisation. The classification head $H$ is responsible for distinguishing between real (HR) and fake (SR) image samples.

Figure 8: The architecture of $D$ in SuRGe, with a detailed description of the individual components.

Algorithm of SuRGe

The following Algorithm 1 describes the workflow of SuRGe.

Algorithm 1 Super-Resolution Generator (SuRGe)

Input: $Y^{GT} = \{\mathbf{y}^{GT}_{1}, \mathbf{y}^{GT}_{2}, \cdots, \mathbf{y}^{GT}_{m}\}$: training set of full HR Ground Truth (GT) images, $N$: mini-batch size, $T$: number of epochs as a termination criterion.
Output: A trained super-resolution image generator network $G$.

1: Initialize epoch counter $t = 1$.
2: while $t \leq T$ do
3:    Initialize $Y = \phi$.
4:    for each $\mathbf{y}^{GT} \in Y^{GT}$ do
5:       $Y = Y \cup \{\mathbf{y}\}$, where $\mathbf{y}$ is a $256 \times 256$ patch, randomly extracted from $\mathbf{y}^{GT}$.
6:    end for
7:    Sample a batch $Y_N = \{\mathbf{y}_1, \mathbf{y}_2, \cdots, \mathbf{y}_N\} \subset Y$ of HR ground truth patches.
8:    Form the LR training batch $X_N = \{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N\}$, where $\mathbf{x}_i$ is formed by down-scaling $\mathbf{y}_i$ to $64 \times 64$ by bi-cubic interpolation for all $i = 1, 2, \cdots, N$.
9:    Update $G$ by gradient descent on $\mathcal{L}^{G}(X_N, Y_N, G(X_N))$.
10:    Sample the SR output batch $G(X_N) = \{G(\mathbf{x}_1), G(\mathbf{x}_2), \cdots, G(\mathbf{x}_N)\}$, where $\mathbf{x}_i \in X_N$.
11:    Update $D$ by gradient descent on $\mathcal{L}^{D}(Y_N, G(X_N))$.
12:    Increase epoch counter $t = t + 1$.
13: end while
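
The data-handling steps of Algorithm 1 (lines 3 to 8) and the alternating updates (lines 9 to 11) can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: images are nested lists of grayscale values, `box_downscale` is a simple block-averaging stand-in for the bi-cubic interpolation named in the algorithm, and `update_generator` and `update_discriminator` are hypothetical callbacks standing in for the gradient steps on $\mathcal{L}^{G}$ and $\mathcal{L}^{D}$.

```python
import random


def random_patch(image, size, rng):
    """Crop a random size x size patch from a 2-D image (list of rows)."""
    height, width = len(image), len(image[0])
    top = rng.randrange(height - size + 1)
    left = rng.randrange(width - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]


def box_downscale(patch, factor):
    """Down-scale by block averaging (a stand-in for bi-cubic interpolation)."""
    size = len(patch) // factor
    return [
        [
            sum(patch[r * factor + dr][c * factor + dc]
                for dr in range(factor) for dc in range(factor)) / factor ** 2
            for c in range(size)
        ]
        for r in range(size)
    ]


def train(hr_images, batch_size, epochs, update_generator, update_discriminator,
          patch_size=256, scale=4, seed=0):
    """Skeleton of Algorithm 1; the two update callbacks are hypothetical."""
    rng = random.Random(seed)
    for _ in range(epochs):                                     # lines 2, 12
        patches = [random_patch(img, patch_size, rng)           # lines 3-6
                   for img in hr_images]
        hr_batch = rng.sample(patches, batch_size)              # line 7
        lr_batch = [box_downscale(p, scale) for p in hr_batch]  # line 8
        sr_batch = update_generator(lr_batch, hr_batch)         # lines 9-10
        update_discriminator(hr_batch, sr_batch)                # line 11
```

With the default `patch_size=256` and `scale=4`, each LR patch is $64 \times 64$, matching line 8 of the algorithm. The sketch follows the algorithm literally in drawing one mini-batch per epoch.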

Description of datasets

The following Table IV provides the details of the DIV2K training dataset [21] and the 10 benchmark testing datasets.

TABLE IV: Details of datasets

| Dataset | Number of samples | Average ground truth resolution | Remark |
|---|---|---|---|
| DIV2K [21] | 800 | 1971 × 1435 | Only the training split is used to train SuRGe. |
| Set5 [41] | 5 | 78 × 84 | Benchmark, contains 5 samples in total. |
| Set14 [42] | 14 | 112 × 101 | Benchmark, contains 14 samples in total. |
| BSD100 [43] | 100 | 110 × 89 | Benchmark, contains 100 samples in total. |
| Urban100 [4] | 100 | 246 × 199 | Benchmark, contains 100 samples in total. |
| PIRM [46] | 100 | 155 × 119 | Contains 100 samples from the validation set. |
| KITTI2012 [44] | 20 | 300 × 96 | The same 20 samples are taken as in [32]. |
| KITTI2015 [44] | 20 | 300 × 96 | The same 20 samples are taken as in [32]. |
| Middlebury [45] | 10 | 438 × 312 | The same 10 samples are taken as in [32]. |
| OST300 [47] | 300 | 160 × 123 | Contains 300 samples in total. |
| MANGA109 [48] | 109 | 207 × 300 | Contains 109 samples in total. |

Metrics

We use two metrics to quantify and compare the performance of the super-resolution methods.

-C Peak Signal to Noise Ratio (PSNR)

PSNR [23] quantifies the noise present in an image compared to a reference. In the case of super-resolution, if the ground truth is $\mathbf{y}$ and the SR output generated by $G$ from an LR input $\mathbf{x}$ is $G(\mathbf{x})$, then the PSNR ($\rho$) between the two is defined as:

$$\rho(\mathbf{y}, G(\mathbf{x})) = 20\log_{10}\left[\frac{\max\{\mathbf{y}\}}{\frac{1}{wh}||\mathbf{y} - G(\mathbf{x})||^{2}_{2}}\right], \tag{8}$$

where $w$ and $h$ respectively denote the width and height of $\mathbf{y}$ or $G(\mathbf{x})$. Evidently, in equation (8) we want to decrease $||\mathbf{y} - G(\mathbf{x})||^{2}_{2}$ so that the SR output matches the HR ground truth. In other words, a higher PSNR indicates a better-quality SR output.
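
As a rough illustration of how PSNR can be computed, the sketch below works on images given as nested lists of pixel values. It is an assumption-laden example rather than the evaluation code used in the paper: it uses the conventional form $20\log_{10}(\max\{\mathbf{y}\}/\sqrt{\mathrm{MSE}})$, i.e. with the square root of the mean squared error in the denominator, whereas equation (8) as printed shows the mean squared error itself. The function name `psnr` and the optional `peak` argument are hypothetical.

```python
import math


def psnr(reference, output, peak=None):
    """PSNR in dB between two equally sized images given as lists of rows."""
    ref = [v for row in reference for v in row]
    out = [v for row in output for v in row]
    if len(ref) != len(out):
        raise ValueError("images must have the same number of pixels")
    # Mean squared error, i.e. (1 / wh) * ||y - G(x)||_2^2.
    mse = sum((a - b) ** 2 for a, b in zip(ref, out)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    if peak is None:
        peak = max(ref)  # max{y} in equation (8)
    return 20 * math.log10(peak / math.sqrt(mse))
```

Reducing the squared error between the two images raises the returned value, which is the behaviour described above.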

-D Structural Similarity Index (SSIM)

SSIM [24] is another commonly used metric to compare an SR output to a reference HR ground truth. Similar to PSNR, given an SR output $G(\mathbf{x})$ and an HR ground truth $\mathbf{y}$, the SSIM $\lambda$ between the two is calculated as follows:

$$\lambda(\mathbf{y}, G(\mathbf{x})) = \frac{\left(2\mu_{\mathbf{y}}\mu_{G(\mathbf{x})} + c_{1}\right)\left(2\,Cov(\mathbf{y}, G(\mathbf{x})) + c_{2}\right)}{\left(\mu_{\mathbf{y}}^{2} + \mu^{2}_{G(\mathbf{x})} + c_{1}\right)\left(\sigma_{\mathbf{y}}^{2} + \sigma^{2}_{G(\mathbf{x})} + c_{2}\right)}, \tag{9}$$

where $\mu_{\mathbf{y}}$ and $\mu_{G(\mathbf{x})}$ are respectively the mean pixel values of the HR ground truth and the SR output, $\sigma_{\mathbf{y}}^{2}$ and $\sigma^{2}_{G(\mathbf{x})}$ denote the variances of the pixel values in $\mathbf{y}$ and $G(\mathbf{x})$, and $Cov(\mathbf{y}, G(\mathbf{x}))$ is the covariance between the pixel values of the two images. The two constants $c_{1}$ and $c_{2}$ ensure that the denominator is non-zero and are usually kept to small values. A higher SSIM indicates that the two images are more perceptually similar.
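
Equation (9) can be evaluated directly from the pixel statistics, as in the following sketch. Two assumptions are made that the text above does not specify: the constants follow the common choice $c_1 = (0.01L)^2$ and $c_2 = (0.03L)^2$ from [24], with $L$ the dynamic range of the pixel values, and the statistics are computed once over the whole image, whereas practical implementations usually average the index over local windows. The name `ssim_global` is hypothetical.

```python
def ssim_global(reference, output, dynamic_range=255.0):
    """Equation (9) evaluated once over whole images (lists of rows)."""
    y = [v for row in reference for v in row]
    g = [v for row in output for v in row]
    if len(y) != len(g):
        raise ValueError("images must have the same number of pixels")
    n = len(y)
    mu_y, mu_g = sum(y) / n, sum(g) / n
    var_y = sum((v - mu_y) ** 2 for v in y) / n
    var_g = sum((v - mu_g) ** 2 for v in g) / n
    cov = sum((a - mu_y) * (b - mu_g) for a, b in zip(y, g)) / n
    # Assumed stabilising constants, following the common choice in [24].
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mu_y * mu_g + c1) * (2 * cov + c2)
    denominator = (mu_y ** 2 + mu_g ** 2 + c1) * (var_y + var_g + c2)
    return numerator / denominator
```

For two identical images the numerator and denominator coincide, so the index is 1; any mismatch in mean, variance, or covariance lowers it.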

Network architecture selection by grid search

The grid search spaces and the final selected networks, along with the hyper-parameter settings, for the generator $G$ and the discriminator $D$ in SuRGe are detailed in Tables V and VI, respectively.

TABLE V: Grid search space with the selected $G$ architecture and hyperparameter settings of SuRGe.

| Block | Parameters | Grid search space | Final network |
|---|---|---|---|
| $C_0, C_1$ | Kernel size ($C_0, C_1$) | {7, 9, 11} | 9 |
| | No. of convolutional filters ($C_0$) | {32, 64} | 64 |
| | No. of convolutional filters ($C_1$) | {64, 128} | 128 |
| | Stride ($C_0, C_1$) | 1 | 1 |
| | Padding ($C_0, C_1$) | {1, 2, 4} | 4 |
| | Activation ($C_0, C_1$) | ReLU, LeakyReLU, PReLU | PReLU |
| | Normalization ($C_0, C_1$) | True, False | False |
| | Normalization technique ($C_0, C_1$) | -- | -- |
| | No. of convolution layers ($C_0, C_1$) | {1, 2} | 1 |
| $R_0, R_1$ | Inter-sub-blocks skip connection ($R_0, R_1$) | True, False | True |
| | No. of convolutions in sub-block ($R_0, R_1$) | {1, 2} | 1 |
| | No. of convolution filters ($R_0$) | {32, 64} | 64 |
| | No. of convolution filters ($R_1$) | {64, 128} | 128 |
| | Kernel size ($R_0, R_1$) | {3, 5} | 3 |
| | (Stride, Padding) ($R_0, R_1$) | (1, 1) | (1, 1) |
| | Activation presence in sub-blocks ($R_0, R_1$) | First, Second, Both | First |
| | Normalization in sub-blocks ($R_0, R_1$) | First, Second, Both | Both |
| | Activation technique ($R_0, R_1$) | ReLU, LeakyReLU, PReLU | PReLU |
| | Normalization technique ($R_0, R_1$) | BatchNorm, PixelNorm | BatchNorm |
| $I_0, I_1$ | Kernel size ($I_0, I_1$) | {3, 5} | 3 |
| | No. of convolutional filters ($I_0$) | {32, 64} | 64 |
| | No. of convolutional filters ($I_1$) | {64, 128} | 128 |
| | (Stride, Padding) ($I_0, I_1$) | (1, 1) | (1, 1) |
| | Activation ($I_0, I_1$) | ReLU, LeakyReLU, PReLU | PReLU |
| | Normalization ($I_0, I_1$) | True, False | True |
| | Normalization technique ($I_0, I_1$) | BatchNorm, PixelNorm | BatchNorm |
| | No. of convolution layers ($I_0, I_1$) | {1, 2} | 1 |
| $U_0, U_1$ | Initial convolution present | True, False | True |
| | Kernel size of convolution ($U_0, U_1$) | {3, 5} | 3 |
| | No. of convolutional filters ($U_0$) | {32, 64} | 64 |
| | No. of convolutional filters ($U_1$) | {64, 128} | 128 |
| | (Stride, Padding) ($U_0, U_1$) | (1, 1) | (1, 1) |
| | Activation ($U_0, U_1$) | ReLU, LeakyReLU, PReLU | PReLU |
| | Normalization ($U_0, U_1$) | True, False | False |
| | Normalization technique ($U_0, U_1$) | -- | -- |
| | No. of convolution layers ($U_0, U_1$) | {1, 2} | 1 |
| | Up-scaling technique ($U_0, U_1$) | Bi-cubic, PixelShuffle, NN* | NN |
| $S$ | Structure similar to | $U_0 \circ C_1$ | $U_0 \circ C_1$ |
| Optimizer | Learning rate | {0.0001, 0.00025, 0.0005} | 0.0001 |
| | Adam ($\beta_1, \beta_2$) | (0.9, 0.99) | (0.9, 0.99) |

  • NN*: Nearest Neighbour method.
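
A grid search over such a space amounts to enumerating the Cartesian product of the options and keeping the best-scoring configuration. The sketch below illustrates this for a handful of the $C_0, C_1$ and optimizer options of Table V. It is a generic illustration: the paper does not state the selection criterion or whether the blocks were searched jointly, so `evaluate` is a hypothetical scoring callback (for instance, a validation PSNR), and the dictionary keys are illustrative names.

```python
from itertools import product

# A few of the generator options listed in Table V (keys are illustrative).
SEARCH_SPACE = {
    "kernel_size": [7, 9, 11],
    "padding": [1, 2, 4],
    "activation": ["ReLU", "LeakyReLU", "PReLU"],
    "learning_rate": [0.0001, 0.00025, 0.0005],
}


def grid(space):
    """Yield every combination of the listed options as a dict."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(space, evaluate):
    """Return the configuration with the highest score from `evaluate`."""
    return max(grid(space), key=evaluate)
```

This four-parameter subset already yields $3^4 = 81$ configurations. Note also that the selected kernel size 9 with padding 4 and stride 1 leaves the spatial size unchanged, since the output size is $\text{in} + 2 \cdot 4 - 9 + 1 = \text{in}$.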

TABLE VI: Grid search space with the selected $D$ architecture and hyperparameter settings of SuRGe.

| Block | Parameters | Grid search space | Final network |
|---|---|---|---|
| $C$ | Kernel size | {3, 5} | 3 |
| | No. of convolutional filters | {32, 64} | 64 |
| | (Stride, Padding) | (1, 1) | (1, 1) |
| | Activation | ReLU, LeakyReLU, PReLU | LeakyReLU |
| | Leakiness of LeakyReLU | {0.1, 0.2} | 0.2 |
| | Normalization | True, False | False |
| | Normalization technique | -- | -- |
| | No. of convolution layers | {1, 2} | 1 |
| $B$ | Inter-sub-blocks skip ($B$) | True, False | True |
| | No. of $B$ blocks | {1, 2, 3, 4} | 4 |
| | No. of conv in sub ($B_0 - B_3$) | {1, 2} | 1 |
| | No. of conv filters ($B_0 - B_3$) | (64, 128, 256, 512) | (64, 128, 256, 512) |
| | Kernel size ($B_0 - B_3$) | (3, 5, 7, 9) | (3, 5, 7, 9) |
| | (Stride, Padding) ($B$) | (1, 1) | (1, 1) |
| | Activation in sub ($B$) | First, Second, Both | First |
| | Normalization in sub ($B$) | First, Second, Both | Both |
| | Activation technique ($B$) | ReLU, LeakyReLU, PReLU | LeakyReLU |
| | Leakiness of LeakyReLU ($B$) | {0.1, 0.2} | 0.2 |
| | Normalization technique ($B$) | BatchNorm, PixelNorm | PixelNorm |
| $H$ | Pooling strategy | avgPool, adaptiveAvgPool, maxPool | adaptiveAvgPool |
| | Output size of adaptiveAvgPool | {4, 6, 8} | 6 |
| | No. of dense layers | {1, 2, 3} | 2 |
| | No. of dense nodes | {(512, 1), (1024, 1), (2048, 1)} | (1024, 1) |
| | Activation | ReLU, LeakyReLU, PReLU | LeakyReLU |
| | Leakiness of LeakyReLU | {0.1, 0.2} | 0.2 |
| | Normalization | True, False | True |
| | Normalization technique | -- | -- |
| Gradient Penalty | Weight $\gamma$ | {1, 10, 100} | 10 |
| Optimizer | Learning rate | {0.0001, 0.00025, 0.0005} | 0.0001 |
| | Adam ($\beta_1, \beta_2$) | (0, 0.9) | (0, 0.9) |

Additional Results

-E Inference time of SuRGe compared to recent GAN and Transformer-based contenders

To confirm that the improvements made in SuRGe preserve a low inference time, we compare its inference time (in seconds) against that of three recent GAN and transformer models, viz. BSRGAN, SwinFIR, and LTE, in the same computing setup. All algorithms are executed on a computing system with a single AMD Ryzen 9 3900X 12-core processor, one NVIDIA RTX 3090 24GB GPU, and a total of 64GB of DDR4 memory. Table VII shows that, on the three datasets Set5, Set14, and BSD100, the proposed SuRGe provides good-quality SR at the lowest average inference time.

TABLE VII: Average inference time in seconds for super-resolution of a single image. The best result is boldfaced and the second best is underlined.

| Model | Set5 | Set14 | BSD100 |
|---|---|---|---|
| BSRGAN | 0.059 | 0.095 | 0.077 |
| LTE | 0.080 | 0.132 | 0.099 |
| SwinFIR | 1.032 | 1.886 | 1.389 |
| SuRGe (Ours) | 0.055 | 0.089 | 0.067 |
| Speed-up of SuRGe from the current best model | 1.07x | 1.07x | 1.15x |
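
The speed-up row of Table VII can be reproduced from the timings by dividing the fastest competing time on each dataset by the time of SuRGe. The short sketch below does this with the reported values; the names `TIMES` and `speed_up` are illustrative only.

```python
# Average inference times in seconds, copied from Table VII.
TIMES = {
    "BSRGAN": {"Set5": 0.059, "Set14": 0.095, "BSD100": 0.077},
    "LTE": {"Set5": 0.080, "Set14": 0.132, "BSD100": 0.099},
    "SwinFIR": {"Set5": 1.032, "Set14": 1.886, "BSD100": 1.389},
    "SuRGe": {"Set5": 0.055, "Set14": 0.089, "BSD100": 0.067},
}


def speed_up(times, ours="SuRGe"):
    """Ratio of the fastest competitor's time to ours, per dataset."""
    result = {}
    for dataset, our_time in times[ours].items():
        best_other = min(t[dataset] for name, t in times.items() if name != ours)
        result[dataset] = round(best_other / our_time, 2)
    return result
```

On all three datasets the fastest competitor is BSRGAN, so the ratios are 0.059/0.055, 0.095/0.089, and 0.077/0.067.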

-F A comparison of performance vs. number of parameters for SuRGe and its notable GAN-based contenders

In Table VIII we provide a comparative study of performance (averaged over the four benchmark datasets, viz. Set5 [41], Set14 [42], BSD100 [43], and URBAN100 [4], in terms of PSNR and SSIM) vs. the number of parameters (in millions) for GAN-based super-resolution techniques, namely the proposed SuRGe, PROSR-L [37], SRGAN [8], ESRGAN [26], BeByGAN [29], and Rank-SRGAN [51]. We see from Table VIII that even though SuRGe uses about $2\times$ the parameters of PROSR-L, it improves the PSNR by about 2.55 percentage points (pp) on average. Moreover, SuRGe and SRGAN use nearly the same number of parameters, while the proposed method outperforms the contender by about 4pp in PSNR and 13pp in SSIM. The rest of the competing methods use about 1.5x-2x the parameters of SuRGe while achieving a lower PSNR and SSIM on average. Thus, SuRGe can be considered a comparatively lightweight GAN-based method that demonstrates high performance while keeping the number of parameters limited.

TABLE VIII: Comparison of performance vs. the number of parameters for GAN-based approaches.

| GAN-based Technique | No. of Parameters (in millions) | Average PSNR (↑) | Average SSIM (↑) |
|---|---|---|---|
| PROSR-L [37] | 13 (≈ 0.5×) | 27.78 (−4.33) | -- |
| SRGAN [8] | 25.1 (≈ 1×) | 26.19 (−5.92) | 0.75 (−0.14) |
| ESRGAN [26] | 40 (≈ 1.5×) | 29.15 (−2.96) | 0.81 (−0.08) |
| BeByGAN [29] | 40 (≈ 1.5×) | 26.57 (−5.54) | 0.74 (−0.15) |
| Rank-SRGAN [51] | 53 (≈ 2×) | 26.07 (−6.04) | 0.65 (−0.24) |
| SuRGe (Ours) | 25.7 | 32.11 | 0.89 |
  • Green indicates the number of parameters/performance of the contender is better than SuRGe. Red indicates the number of parameters/performance of the contender is worse than SuRGe.

  • The ratio of the number of parameters and the difference in performance is measured considering SuRGe as a reference.
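
The bracketed values in Table VIII follow from the raw entries by taking SuRGe as the reference. A small sketch of that arithmetic is given below; rounding the parameter ratio to the nearest 0.5 is an assumption about how the "≈" values were obtained, and the names used are illustrative.

```python
# Reference row and contender rows, copied from Table VIII.
SURGE = {"params": 25.7, "psnr": 32.11, "ssim": 0.89}
CONTENDERS = {
    "PROSR-L": {"params": 13, "psnr": 27.78, "ssim": None},
    "SRGAN": {"params": 25.1, "psnr": 26.19, "ssim": 0.75},
    "ESRGAN": {"params": 40, "psnr": 29.15, "ssim": 0.81},
    "BeByGAN": {"params": 40, "psnr": 26.57, "ssim": 0.74},
    "Rank-SRGAN": {"params": 53, "psnr": 26.07, "ssim": 0.65},
}


def relative_to_reference(entry, reference):
    """Return (parameter ratio to nearest 0.5, PSNR gap, SSIM gap)."""
    ratio = round(2 * entry["params"] / reference["params"]) / 2
    psnr_gap = round(entry["psnr"] - reference["psnr"], 2)
    ssim_gap = None
    if entry["ssim"] is not None:  # SSIM is not reported for PROSR-L
        ssim_gap = round(entry["ssim"] - reference["ssim"], 2)
    return ratio, psnr_gap, ssim_gap
```

A negative gap means the contender scores below SuRGe on that metric, and a ratio above 1 means it uses more parameters.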

-G Simultaneous demonstration of qualitative and quantitative performances of SuRGe

Additionally, Figure 9 shows a qualitative performance comparison of SuRGe with five notable contenders, namely SRGAN [8], BSRGAN [16], Real-ESRGAN [28], LTE [18], and SWIN-IR [17], on five test images, viz. Cars (PIRM), Statues (BSD100), Balloons (PIRM), Horses (BSD100), and Lioness (OST300). For a simultaneous quantitative comparison, we present the SSIM and PSNR values of the six algorithms on the five samples in Table IX. Figure 9 shows that SuRGe preserves a higher degree of detail in the SR output, which is quantitatively corroborated by the improved PSNR and SSIM in Table IX. Moreover, for ease of visualization, a patch-based qualitative comparison is presented in Figures 10 and 11, enabling focused observation of a particular region of the SR output.

Figure 9: Additional qualitative performance comparison of SuRGe. The proposed SuRGe consistently produces SR outputs with finer details.

TABLE IX: Additional quantitative performance comparison of SuRGe with the five contenders and five test samples used in Figure 9. The best is boldfaced while the second-best is underlined.

| Method | Metric | Cars | Statues | Balloons | Horses | Lioness |
|---|---|---|---|---|---|---|
| SRGAN | PSNR | 29.52 | 28.83 | 29.77 | 29.00 | 30.54 |
| | SSIM | 0.57 | 0.46 | 0.65 | 0.59 | 0.71 |
| BSRGAN | PSNR | 30.29 | 29.29 | 30.38 | 31.05 | 30.67 |
| | SSIM | 0.53 | 0.49 | 0.61 | 0.63 | 0.70 |
| Real-ESRGAN | PSNR | 30.38 | 29.10 | 30.32 | 30.92 | 31.08 |
| | SSIM | 0.58 | 0.48 | 0.65 | 0.62 | 0.70 |
| LTE | PSNR | 31.56 | 29.61 | 30.51 | 31.83 | 32.07 |
| | SSIM | 0.71 | 0.58 | 0.79 | 0.70 | 0.79 |
| SWIN-IR | PSNR | 31.69 | 29.65 | 31.66 | 31.93 | 32.13 |
| | SSIM | 0.72 | 0.59 | 0.81 | 0.72 | 0.79 |
| SuRGe (ours) | PSNR | 34.21 | 32.79 | 32.31 | 33.13 | 34.74 |
| | SSIM | 0.82 | 0.79 | 0.86 | 0.86 | 0.91 |

Figure 10: We take three test samples, namely Cars (PIRM), Balloons (PIRM), and Statues (BSD100), and compare SuRGe with the contenders SRGAN [8], BSRGAN [16], Real-ESRGAN [28], LTE [18], and SWIN-IR [17]. A patch-based visual comparison helps in clearly observing the greater amount of minute detail (sunlight reflection on the car headlight, eye details of the balloon toys, stone texture of the statue) preserved by SuRGe in the SR output.

Figure 11: In continuation of Figure 10, we take two more test samples, namely Lioness (OST300) and Horses (BSD100), and compare the SR outputs of the same six methods. Here also, we see that SuRGe retains finer details such as the fur of the lioness and her cubs, or the mane of the horses.