Fortifying Fully Convolutional Generative Adversarial Networks for Image Super-Resolution Using Divergence Measures

Arkaprabha Basu, Kushal Bose, Sankha Subhra Mullick, Anish Chakrabarty, and Swagatam Das

Arkaprabha Basu, Kushal Bose, Sankha Subhra Mullick, Anish Chakrabarty, and Swagatam Das ([email protected]) are with the Electronics and Communication Sciences Unit (ECSU) and Statistics and Mathematics Unit (SMU), Indian Statistical Institute, Kolkata, India. Corresponding author: Swagatam Das.

Abstract

Super-Resolution (SR) is a time-hallowed image processing problem that aims to improve the quality of a Low-Resolution (LR) sample up to the standard of its High-Resolution (HR) counterpart. We aim to address this by introducing Super-Resolution Generator (SuRGe), a fully-convolutional Generative Adversarial Network (GAN)-based architecture for SR. We show that distinct convolutional features obtained at increasing depths of a GAN generator can be optimally combined by a set of learnable convex weights to improve the quality of generated SR samples. In the process, we employ the Jensen–Shannon and the Gromov-Wasserstein losses respectively between the SR-HR and LR-SR pairs of distributions to further aid the generator of SuRGe to better exploit the available information in an attempt to improve SR. Moreover, we train the discriminator of SuRGe with the Wasserstein loss with gradient penalty, to primarily prevent mode collapse. The proposed SuRGe, as an end-to-end GAN workflow tailor-made for super-resolution, offers improved performance while maintaining low inference time. The efficacy of SuRGe is substantiated by its superior performance compared to 18 state-of-the-art contenders on 10 benchmark datasets.

Index Terms:

Generative Adversarial Networks, Image Super-Resolution, Convolutional Neural Networks, Divergence Measures

I Introduction

A Low Resolution (LR) image sacrifices information of its High Resolution (HR) counterpart in favor of general utility such as displaying or editing on smaller screens, low storage requirement, and fast transmission. Super-resolution attempts to recover the original HR copy from an LR input. However, the initial HR to LR transformation is commonly non-invertible and lossy [1]. Thus, recovering the HR by estimating a Super Resolution (SR) analog is an ill-posed problem that carries the risk of a distorted output [2].

The classical interpolation methods for super-resolution only exploit local information and are thus incapable of generating commendable SR [3]. While global image features extracted by deep convolutional networks translate to a much improved performance [4, 5], limited generalizability and distorted SR still remain major concerns [6].

The landscape of super-resolution techniques had a major breakthrough with the advent of the Generative Adversarial Network (GAN) [7]. A super-resolution GAN [8] embraces the canonical two-player adversarial game between a generator $G$ and a discriminator $D$ with some minor modifications. Specifically, in a super-resolution task, $G$ attempts to map an LR input to an HR ground truth, generating an estimated SR in the process. The discriminator $D$ helps $G$ by providing adversarial feedback through distinguishing between an HR ground truth and a $G$-generated SR. While GAN-based super-resolution methods offer generalizability through their generative power, their SR outputs often lose finer details or are plagued by artifacts [9, 10].

In this paper, we propose a GAN-based super-resolution method called Super-Resolution Generator (SuRGe). In a super-resolution task, to generate a good quality SR image, it is necessary to consider both the low-level local features (for example, colors, textures, edges, etc.) and the high-level global ones (such as individual object shapes, relative positioning of objects and background, object orientation, etc.). As noted in [11, 12], higher-level global features are progressively captured by convolutional filters residing deeper in the network. Taking inspiration from [11, 13], in the generator $G$ of the proposed SuRGe we preserve the hierarchically complex features and dictate their flow through skip connections. However, skip connections may lead to under-utilized network capacity [14], while a potential solution like DenseNet may be challenging to train [15] with limited data. Therefore, in SuRGe, we design a generator $G$ that judiciously uses the skip connections to conserve and propagate only a few selected features that are intuitively more useful for improving the network's performance on the super-resolution task (for example, carrying forward the low-level features to recover minute details after a potentially distortion-inducing up-sampling step, in a spirit similar to that of UNet [13]). To adaptively combine the features coming from different depths of the network, we introduce mixing modules that operate in a learnable fashion.

Figure 1: Visual comparison of 4x super-resolution outputs of the proposed SuRGe with SRGAN [8], BSRGAN [16], SWIN-IR [17], and LTE [18], given a low-resolution (LR) input image patch. SuRGe produces better super-resolution images with finer texture, color, and intricate details.

We further focus on the fact that ideally the distributions of SR and HR should be identical. Thus, a loss function like the Jensen–Shannon (JS) divergence, which explicitly encourages minimizing the dissimilarity between the respective distributions of HR and SR, helps in training the generator $G$ in SuRGe. Moreover, in the ideal case, LR and SR should also preserve structural similarities, which consequently get reflected in their corresponding distributions. However, LR and SR reside in different metric spaces with potentially distinct dimensionality. Thus, to explicitly minimize the discrepancy between the respective distributions of LR and SR, we further utilize the Gromov-Wasserstein (GW) distance [19] as an additional loss function in the generator $G$ of SuRGe. To our knowledge, this is the first time the applicability of explicit divergence measures is explored in the context of GAN-based super-resolution techniques. We also employ a dynamically weighted convex combination strategy [20] for the multiple losses in $G$ of SuRGe. Furthermore, to prevent $G$ from mode collapse, especially on the smaller training sets used in super-resolution [21], we employ the Wasserstein loss with gradient penalty (WGAN-GP) [22] to train the discriminator $D$ in SuRGe.

The primary contributions of our fully-convolutional GAN-based SR method SuRGe are as follows.

• To the best of our knowledge, SuRGe is the first super-resolution model that introduces GW, a divergence between metric spaces of potentially different dimensions, to fuel the learning of generator $G$. This incorporation of the LR-SR relationship directly endows SuRGe with authentic super-resolution capabilities.

• Moving away from pre-trained model-biased perceptual similarity, SuRGe takes JS divergence as an additional loss of generator $G$ (alongside adversarial and GW) while discriminator $D$ uses the gradient-penalized Wasserstein loss to improve the SR.

• We introduce a generator $G$ in SuRGe that efficiently employs skip connections to garner semantic information from different levels of feature representation and supports their adaptive mixing in a learnable fashion.

The effect of these critical improvements is evident in the motivational example in Figure 1, where, compared to four notable contenders, the SR obtained by SuRGe is richer in finer details and most closely matches the HR. Following a brief review of the existing deep super-resolution strategies using Convolutional Neural Networks and Transformers in Section II, we detail the proposed methodology in Section III. In Section IV we show that the proposed SuRGe outperforms the current best by an average of 3.51% and 5.45% respectively in terms of PSNR [23] and SSIM [24] on four common benchmarks for 4x super-resolution. Further, SuRGe surpasses the state of the art by 15.19% in terms of PSNR on six complex 4x super-resolution datasets.

II Related Works

Deep super-resolution models can be Convolutional Neural Networks (CNNs), GANs, and, more recently, transformers. The CNN-based methods were the first to employ deep networks for super-resolution [6], applying convolution maps over the image followed by interpolation, similar to a typical convolutional autoencoder. Though innovative as an initial study and duly credited for a remarkable improvement over the traditional techniques, such CNN architectures suffer from poor generalizability and are thus domain-dependent [4]. As a remedy, WDRN [25] employs distinct wavelet features and their adaptive mixture for better super-resolution performance.

The shortcomings of CNNs can be addressed using GANs [8] with task-specific modifications. This line of research branches into three primary avenues. First, removing normalization and introducing dense blocks in the generator [26] indeed improves the SR image quality, although at a greater computational cost. Second, dense networks can be replaced with a residual backbone [27] utilizing skip connections. Even though such networks are considerably easier to train, their full potential may not be realized without carefully curating the skip connections and the feature mixing that best aid the super-resolution task. Third, additional preprocessing such as blurring and specialized noise injection [28, 16] produces augmentations that are close to real-life scenarios and can lead to an enhanced SR output. Unfortunately, this also purposefully distorts the input distribution, which may in turn sacrifice clarity and details. Separately, perceptual similarity is considered in [29], which directly depends upon the generalization of an external pre-trained network for optimizing the generator. In summary, such methods typically suffer from loss of minute details [16] or distorted boundaries [29]. Moreover, all of these methods perform the required up-scaling at once at the end of the network. Thus, any possible distortion during the drastic up-scaling cannot be mitigated by the network. Furthermore, the ever-improving GAN variants remain mostly unexplored in the context of super-resolution.

SAN [30] introduces channel attention in CNNs, opening the gate for transformer networks in super-resolution. In support of better performance, transformers not only manage to adaptively mix diversely informative features through attention [17, 31] but also mitigate SR output distortions using layer normalization. In [32], the idea of cross-attention is proposed, which was later improved in [33] to mitigate the adverse impacts of uncontrolled mixing of distinct features. DAT [34] introduces a novel transformer model that aggregates features through inter-block spatial and intra-block channel attentions. In essence, it introduces the Adaptive Interaction Module (AIM) and the Spatial-Gate Feed-Forward Network (SGFN) for tailored feature aggregation at different levels. Later, SR-Former [35] attempts to improve on [34] by focusing on Permuted Self Attention (PSA) for a more balanced approach towards feature aggregation through channel and spatial attentions.

III Proposed Method

Typically, an image matrix (or vector, if flattened) resides in a lower-dimensional ambient space $\mathcal{M}_{1}$ [36], i.e., a low-resolution (LR) image $\mathbf{x}\in\mathcal{M}_{1}$ . A higher-resolution (HR) version of $\mathbf{x}$ with improved definition exists as $\mathbf{y}\in\mathcal{M}_{2}$ . The estimated form of $\mathbf{y}$ is known as the super-resolution output SR. Commonly $\mathcal{M}_{1}\subset\mathbb{R}^{whc}$ and $\mathcal{M}_{2}\subset\mathbb{R}^{w^{\prime}h^{\prime}c}$ , where $w^{\prime}=wr$ , $h^{\prime}=hr$ , and $r\in\mathbb{Z}^{+}$ is the multiplicative scaling factor [37] denoting the extent of magnification from LR to HR (or SR). Given an input $\mathbf{x}$ , a GAN-based super-resolution method searches for a generator $G\in\{G_{\theta}:\mathcal{M}_{1}\rightarrow\mathcal{M}_{2}|\theta\in\Theta\}$ that minimizes the discrepancy between the SR $G(\mathbf{x})$ and its HR analog $\mathbf{y}$ . A discriminator $D$ guides $G$ by providing feedback through distinguishing $\mathbf{y}$ from $G(\mathbf{x})$ . We further denote the distributions of LR, HR, and SR as $p_{\mathbf{x}}$ , $p_{\mathbf{y}}$ , and $p_{G(\mathbf{x})}$ respectively.

III-A The Architecture of $G$

The generator $G$ aims to recover $\mathbf{y}$ from $\mathbf{x}$ under the commonly used constraint of $r=4$ i.e. through 4x super-resolution [6, 8, 26, 17, 32]. Unlike popular practice [17, 32] of performing 4x up-scaling in one shot at the end, $G$ in SuRGe performs the same in two steps i.e. a 2x up-scaling (see Figure 2) at the end of each half. This way, the right half of $G$ can mitigate the possible abrupt distortions of the feature space due to the first 2x up-scaling and reuse features from the left half (through skip connections) to recover corrupted information. We demonstrate $G$ in Figure 2 highlighting the key components while detailing them individually in the following.

The initial convolution block $(C_{0})$ extracts the low-level features using larger kernels with half-padding, providing two benefits. (1) Repetitive information and their variation over a larger region can be better captured [38]. (2) Possible distortions near the image boundaries can be avoided [6]. Moreover, we use parametric ReLU to allow the distinct layers to have different non-linearity for better conservation of low-level features. Further, we discard normalization to avoid information loss through regularization and scaling.

The repetitive residual generator block $(R_{0})$ focuses on extracting high-level intricate features using smaller kernels. It contains $n_{G}$ residual blocks [11], each having two sub-blocks. Only the first sub-block uses parametric ReLU activation, while both employ batch normalization to induce regularization and limit covariate shift. The outputs of the two sub-blocks are added through a skip connection and passed to the next residual block. The inter-sub-block skip connection ensures that the features after each convolution and normalization at least retain the extracted information, even if they cannot enrich it further.

The $n_{G}$ -th block of $R_{0}$ sums the outputs of its two sub-blocks. This distorted feature space, similar to that of a typical ResNet, must be further stabilized before being processed in the next stage. However, the common remedy of average pooling fails in super-resolution because such a kernel is not learned and the down-scaling runs against the task objective. Therefore, the intermediate convolution block $(I_{0})$ is added to stabilize the output of $R_{0}$ by additional convolutions with batch normalization.

At the outputs of $C_{0}$ and $I_{0}$ , we respectively have low and high level features. Thus, we use the first weighted feature mixing module $F_{0}$ to combine these two features before up-scaling. $F_{0}$ performs a simple convex combination as:

$$F_{0}=w^{(F_{0})}_{1}C_{0}+w^{(F_{0})}_{2}I_{0}(R_{0}(C_{0})), \tag{1}$$

where $w^{(F_{0})}_{1},w^{(F_{0})}_{2}>0$ and $w^{(F_{0})}_{1}+w^{(F_{0})}_{2}=1$ . We learn both of $w^{(F_{0})}_{1}$ and $w^{(F_{0})}_{2}$ as parameters of $G$ while the convexity constraint is ensured by passing the weights through a Softmax activation.
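The convex mixing in Eq. (1) can be illustrated with a minimal NumPy sketch. It assumes the learnable parameters are stored as unconstrained logits (one scalar per input) and that the inputs are equally shaped feature maps. The names `mix_features` and `logits` are hypothetical, and the random arrays only stand in for the outputs of $C_{0}$ and $I_{0}$; this is not the authors' implementation.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

def mix_features(features, logits):
    """Convex combination of equally shaped feature maps.

    `features` is a list of arrays with identical shape; `logits` holds one
    unconstrained scalar per feature map (hypothetical parameterization).
    """
    weights = softmax(np.asarray(logits, dtype=np.float64))
    mixed = sum(w * f for w, f in zip(weights, features))
    return mixed, weights

# Toy stand-ins for the outputs of C_0 and I_0(R_0(C_0)): (channels, height, width).
rng = np.random.default_rng(0)
c0_out = rng.normal(size=(4, 8, 8))
i0_out = rng.normal(size=(4, 8, 8))

f0_out, w = mix_features([c0_out, i0_out], logits=[0.3, -0.1])
```

Because the Softmax output is strictly positive and sums to one, the convexity constraint holds for any value of the logits. The same function accepts a list of three inputs, which is the shape of the three-way mixing used later in the generator.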

The output of $F_{0}$ is first stabilized with convolution and then passed to $U_{0}$ for 2x up-scaling. The commonly used up-scaling techniques such as PixelShuffle and transposed convolutions, though effective otherwise, are likely to distort SR as in Figure 3. This is because the overlapped kernels may introduce uneven convolution that results in higher frequency color patterns like a checkerboard in the border pixels of the kernel mapping. Hence, we employ the interpolation-based nearest neighbors method [1] for up-scaling.
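Nearest-neighbor up-scaling can be sketched in a few lines, under the assumption of a (channels, height, width) array layout. The function name `upscale_nearest` is hypothetical. Each value is simply replicated into a `factor` x `factor` block, so no overlapping learned kernel is involved.

```python
import numpy as np

def upscale_nearest(feature_map, factor=2):
    """Nearest-neighbor up-scaling of a (channels, height, width) array."""
    up = np.repeat(feature_map, factor, axis=1)   # duplicate every row
    return np.repeat(up, factor, axis=2)          # duplicate every column

x = np.arange(4, dtype=np.float64).reshape(1, 2, 2)
y = upscale_nearest(x, factor=2)   # shape (1, 4, 4)
```

The output contains only values already present in the input, which is one way to see why this operation cannot by itself create new high-frequency patterns.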

Except for $F_{1}$ , the rest of the right half of $G$ is identical to the left. The right or second half of $G$ starts with an initial convolution layer $C_{1}$ that stabilizes the $U_{0}$ output. In an attempt to recover any lost or corrupted information at this stage, we add the output of $C_{1}$ to the 2x up-sampled output of $C_{0}$ , obtained by passing it through a skip block $S$ . The sum is then propagated through the $R_{1}$ and $I_{1}$ layers. $F_{1}$ takes the following three inputs: (1) The $C_{0}$ output 2x up-scaled by the skip block $S$ , which has a structure similar to $U\circ C$ and recovers the low-level features at the end of the network. (2) The output of $I_{1}$ . (3) The output of $C_{1}$ . Similar to $F_{0}$ , here we also perform a convex combination of the three inputs as follows:

$$F_{1}=w^{(F_{1})}_{1}I_{1}+w^{(F_{1})}_{2}C_{1}+w^{(F_{1})}_{3}S(C_{0}), \tag{2}$$

where the three weights $w^{(F_{1})}_{1}$ , $w^{(F_{1})}_{2}$ , and $w^{(F_{1})}_{3}$ are constrained with convexity similarly to their counterparts in $F_{0}$ and are thus learned in the same way. The output of $F_{1}$ is 2x up-scaled by $U_{1}$ , and stabilized by further convolutions (without batch normalization or activation to avoid distortion) to produce the SR output $G(\mathbf{x})$ .

III-B The Architecture of $D$

As demonstrated in Figure 2, $D$ has two main components: a sub-network called Repetitive Residual Discriminator Block $B$ and a classification head $H$ . The structure of $B$ mostly follows $R$ as it contains $n_{D}$ residual blocks, each with two sub-blocks connected by an inter-sub-block skip connection. Maintaining near structural similarity between $G$ and $D$ enables the same input to likely have close embeddings in the learned feature space. Thus, $D$ can easily identify a deviation of $G(\mathbf{x})$ from $\mathbf{y}$ and improve $G$ through a more useful feedback.

There are three key differences between $R$ and $B$ . (1) $B$ uses LeakyReLU activation to prevent sparse or scattered gradients [7]. (2) $B$ employs Pixel normalization [39], as batch normalization is known to cause quality issues in a super-resolution task when used in $D$ [10, 26]. (3) The number of filters and the convolution kernel size are gradually increased over the residual blocks. This not only improves the balanced capture of low-level and high-level information but also prevents over-fitting by removing bias to a particular kernel size.

The classification head $H$ first performs an adaptive average pooling on the output of $B$ . The pooled features are then flattened and passed through dense layers with LeakyReLU activation. The final dense layer maps the features to a single node and applies Sigmoid activation on the logit to find the probability of the input being HR ground truth.

III-C Loss functions of SuRGe

SuRGe, embodying the GAN philosophy, has tailor-made losses for the generator $G$ and the discriminator $D$ .

III-C1 Loss for generator $G$ :

To receive guidance from $D$ , $G$ utilizes a traditional adversarial loss $\mathcal{L}^{G}_{a}$ defined as:

$$\mathcal{L}^{G}_{a}=-\sum\nolimits_{\mathbf{x}\in N}\log{D(G(\mathbf{x}))}, \tag{3}$$

where $N$ is a training batch.

The classical $\mathcal{L}^{G}_{a}$ , though necessary, is not sufficient for maintaining the desired perceptual quality of the SR output, when deployed alone. A common remedy [8, 27] is to additionally minimize the discrepancy between HR and SR in the embedding space of a pre-trained deep network that is likely capable of expressing perceptual information. Evidently, the efficacy of such a loss is reliant on the quality and generalizability of the pre-trained embedding space [26]. However, $p_{\mathbf{y}}$ and $p_{G(\mathbf{x})}$ , respectively the distributions of HR and SR, in practice are supported on the same ambient space. Hence, directly minimizing their divergence using a symmetric measure like JS motivates $G(\mathbf{x})$ to resemble $\mathbf{y}$ :

$$\mathcal{L}^{G}_{\textrm{JS}}=\frac{1}{2}\mathbb{E}_{p_{\mathbf{y}}}\left[\log(p_{\mathbf{y}})-\log\left(\frac{p_{\mathbf{y}}+p_{G(\mathbf{x})}}{2}\right)\right]+\frac{1}{2}\mathbb{E}_{p_{G(\mathbf{x})}}\left[\log(p_{G(\mathbf{x})})-\log\left(\frac{p_{G(\mathbf{x})}+p_{\mathbf{y}}}{2}\right)\right]. \tag{4}$$
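As a worked illustration of the JS divergence, the following sketch evaluates it for two small discrete distributions. This is an assumption made for illustration only: the text here does not specify how the densities $p_{\mathbf{y}}$ and $p_{G(\mathbf{x})}$ are estimated from an image batch, so the toy histograms `p_hr` and `p_sr` and the function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]   # mixture (p + q) / 2
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p_hr = [0.1, 0.4, 0.5]   # toy "HR" histogram
p_sr = [0.2, 0.3, 0.5]   # toy "SR" histogram
js = js_divergence(p_hr, p_sr)
```

The divergence is symmetric in its arguments, equals zero only when the two distributions coincide, and is bounded above by $\log 2$ (attained for distributions with disjoint support), which keeps this loss term on a bounded scale.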

As the name suggests, at the heart of the super-resolution problem lies the task of learning a meaningful transformation $G$ that refines LR images visually. The optimization, however, is constrained by the need to preserve semantic features. Such information in a set of samples is stored not only in the coordinate entries of the vectors but also in their local geometry. As such, a generative model becomes a true SR architecture on the basis of its capacity to keep the metric measure spaces corresponding to $p_{\mathbf{x}}$ and $p_{G(\mathbf{x})}$ near-isometric. The divergence that enables penalizing the deviation from such an ideal scenario is GW. Thus in SuRGe, we integrate the GW loss in training $G$ :

$$\mathcal{L}^{G}_{\textrm{GW}}=\min_{\gamma\in\Gamma}\int|d_{1}(\mathbf{x},\mathbf{\tilde{x}})-d_{2}(\mathbf{z},\mathbf{\tilde{z}})|^{2}\,d\gamma(\mathbf{x},\mathbf{z})\,d\gamma(\mathbf{\tilde{x}},\mathbf{\tilde{z}}), \tag{5}$$

where $\Gamma$ is the set of couplings between distributions $p_{\mathbf{x}}$ and $p_{G(\mathbf{x})}$ , while $\mathbf{x},\mathbf{\tilde{x}}\sim p_{\mathbf{x}}$ , and $\mathbf{z},\mathbf{\tilde{z}}\sim p_{G(\mathbf{x})}$ . Also, $d_{1},d_{2}$ are the metrics on the spaces $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$ respectively.
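To make the GW objective concrete, the sketch below evaluates the integrand of Eq. (5) for one fixed coupling on finite samples. It deliberately omits the minimization over $\Gamma$, so the returned value is only an upper bound on the GW loss; a full implementation would need an optimal-transport solver. Euclidean metrics, the function names, and the toy point clouds are all assumptions made for illustration.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance matrix of a set of row vectors."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def gw_cost(d1, d2, coupling):
    """Value of the GW objective for one fixed coupling (no minimization).

    d1: (n, n) distances in the first space, d2: (m, m) distances in the
    second space, coupling: (n, m) joint probability matrix.
    """
    # diff[i, k, j, l] = |d1[i, k] - d2[j, l]|^2
    diff = (d1[:, :, None, None] - d2[None, None, :, :]) ** 2
    return np.einsum("ikjl,ij,kl->", diff, coupling, coupling)

rng = np.random.default_rng(0)
lr = rng.normal(size=(5, 2))                  # 5 toy "LR" samples in R^2
sr_iso = np.hstack([lr, np.zeros((5, 1))])    # isometric copy living in R^3
sr_scaled = 2.0 * sr_iso                      # all pairwise distances doubled

identity = np.eye(5) / 5                      # couples sample i with sample i
cost_iso = gw_cost(pairwise_distances(lr), pairwise_distances(sr_iso), identity)
cost_scaled = gw_cost(pairwise_distances(lr), pairwise_distances(sr_scaled), identity)
```

The two point clouds have different dimensionality, yet the cost is well defined because only intra-space distances are compared. The isometric copy yields a zero cost, while the copy with distorted pairwise distances yields a positive one, which is the deviation from near-isometry that the loss penalizes.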

Tuning a set of static weights to combine the three loss components in $\mathcal{L}^{G}$ is not only tedious but also inefficient, as static weights are oblivious to dynamic training situations. Learning the weights as network parameters may also bias the training towards a particular component. Thus, we employ a convex combination of the three loss components where the weights are dynamically calculated [20]. Specifically, the values of the three loss components are passed through Softmax. As such, the weight dynamically assigned to a loss component depends on its current value, so that at any point in training the weights adjust to the loss values to prevent one component from dominating the others in the combined $\mathcal{L}^{G}$ . In essence, at each iteration of training:

$$\mathcal{L}^{G}=w_{a}\mathcal{L}_{a}^{G}+w_{\textrm{JS}}\mathcal{L}_{\textrm{JS}}^{G}+w_{\textrm{GW}}\mathcal{L}_{\textrm{GW}}^{G},\;\text{where}\;w_{(\cdot)}=\frac{\exp(\mathcal{L}_{(\cdot)}^{G})}{\exp(\mathcal{L}_{a}^{G})+\exp(\mathcal{L}_{\textrm{JS}}^{G})+\exp(\mathcal{L}_{\textrm{GW}}^{G})}. \tag{6}$$

Here $w_{(\cdot)}$ stands for $w_{a}$ , $w_{\textrm{JS}}$ , or $w_{\textrm{GW}}$ , with $\mathcal{L}_{(\cdot)}^{G}$ correspondingly set to $\mathcal{L}_{a}^{G}$ , $\mathcal{L}_{\textrm{JS}}^{G}$ , or $\mathcal{L}_{\textrm{GW}}^{G}$ .
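The dynamic weighting of Eq. (6) amounts to a Softmax over the current scalar loss values. The sketch below uses plain floats and hypothetical function names; whether gradients are propagated through the weights during training is not stated in the text and is left out of the sketch.

```python
import math

def dynamic_loss_weights(loss_values):
    """Softmax over the current loss values, following Eq. (6)."""
    peak = max(loss_values)                        # subtract max for stability
    exps = [math.exp(v - peak) for v in loss_values]
    total = sum(exps)
    return [e / total for e in exps]

def combined_generator_loss(l_a, l_js, l_gw):
    """Convex combination of the three generator loss components."""
    losses = [l_a, l_js, l_gw]
    weights = dynamic_loss_weights(losses)
    return sum(w * l for w, l in zip(weights, losses)), weights

# Illustrative loss values for one training iteration (made-up numbers).
total, (w_a, w_js, w_gw) = combined_generator_loss(0.9, 0.2, 0.5)
```

The weights are recomputed from scratch at every iteration, the component with the currently largest value receives the largest weight, and the combined loss always lies between the smallest and the largest component.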

III-C2 Loss for discriminator $D$ :

We draw inspiration from the promise of WGANs to improve generation quality by deploying the Wasserstein- $1$ distance (WD) to distinguish between 'real' and 'fake' samples. The underlying class of critic functions, being $k$ -Lipschitz continuous, additionally mitigates mode collapse [40]. However, maintaining $k$ -Lipschitz continuity during training is difficult as it requires limiting the gradients of $D$ . WGAN achieves this through a weight-clipping heuristic that sacrifices model complexity. As a better alternative, WGAN-GP puts a constraint on the gradient itself, which can be expressed as a regularizer called gradient penalty. Thus, $\mathcal{L}^{D}$ can be written as follows:

$$\mathcal{L}^{D}=\mathbb{E}_{p_{\mathbf{x}}}D(G(\mathbf{x}))-\mathbb{E}_{p_{\mathbf{y}}}D(\mathbf{y})+\lambda\mathbb{E}_{\hat{\mathbf{x}}}\left[(||\nabla_{\hat{\mathbf{x}}}D(\hat{\mathbf{x}})||_{2}-1)^{2}\right], \tag{7}$$

where $\hat{\mathbf{x}}=\epsilon\mathbf{y}+(1-\epsilon)G(\mathbf{x})$ , and $\epsilon\sim\textrm{Uniform}(0,1)$ .
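The structure of Eq. (7) can be illustrated without an automatic-differentiation library by assuming a toy linear critic $D(\mathbf{x})=\langle\mathbf{w},\mathbf{x}\rangle$, whose gradient with respect to its input is $\mathbf{w}$ everywhere. This critic, the function names, and the value `lam=10.0` are assumptions for illustration; the actual discriminator is a deep network whose gradient must be obtained by back-propagation at each interpolate $\hat{\mathbf{x}}$.

```python
import numpy as np

def critic(x, w):
    """Toy linear critic D(x) = <w, x>; its gradient w.r.t. x is simply w."""
    return x @ w

def wgan_gp_loss(real, fake, w, lam=10.0, rng=None):
    """Critic loss of Eq. (7) for the toy linear critic."""
    rng = np.random.default_rng(0) if rng is None else rng
    eps = rng.uniform(0.0, 1.0, size=(real.shape[0], 1))
    x_hat = eps * real + (1.0 - eps) * fake        # random interpolates
    # For a linear critic the gradient at every x_hat equals w; a deep
    # critic would need autograd evaluated at x_hat here.
    grad = np.broadcast_to(w, x_hat.shape)
    grad_norm = np.linalg.norm(grad, axis=1)
    penalty = np.mean((grad_norm - 1.0) ** 2)
    wasserstein = critic(fake, w).mean() - critic(real, w).mean()
    return wasserstein + lam * penalty, penalty

rng = np.random.default_rng(1)
real = rng.normal(size=(6, 3)) + 1.0               # toy "HR" batch
fake = rng.normal(size=(6, 3))                     # toy "SR" batch

loss_unit, pen_unit = wgan_gp_loss(real, fake, np.array([1.0, 0.0, 0.0]))
loss_steep, pen_steep = wgan_gp_loss(real, fake, np.array([2.0, 0.0, 0.0]))
```

A critic with unit gradient norm incurs no penalty, whereas doubling the slope is penalized by $(2-1)^{2}=1$ before scaling by $\lambda$. This is how the regularizer pushes the critic towards the Lipschitz constraint without clipping weights.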

III-D Putting it all together

The workflow of SuRGe is illustrated in Figure 4, while the procedure is described in Algorithm 1 (see "Algorithm of SuRGe"). We follow a patch-based training [37, 8] in SuRGe. The idea is to extract overlapping $256\times 256$ patches of the HR ground truth as $\mathbf{y}$ and down-scale each of them 4x with bi-cubic interpolation to $64\times 64$ to get the corresponding LR as $\mathbf{x}$ . The training strategy of SuRGe is similar to that of a vanilla GAN [7]. Thus, $G$ and $D$ are alternately updated with the gradients of the respective losses $\mathcal{L}^{G}$ and $\mathcal{L}^{D}$ .
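The patch-based data preparation can be sketched as follows. Two details are assumptions rather than facts from the text: the stride of 128 pixels (the amount of overlap is not stated), and the use of simple block averaging as a dependency-free stand-in for bi-cubic down-scaling. The function names are hypothetical.

```python
import numpy as np

def extract_patches(image, size=256, stride=128):
    """Overlapping (size x size) patches from an (H, W, C) image."""
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            patches.append(image[top:top + size, left:left + size])
    return patches

def block_average_downscale(patch, factor=4):
    """Stand-in for bi-cubic down-scaling: average each factor x factor block."""
    h, w, c = patch.shape
    blocks = patch.reshape(h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(1, 3))

hr_image = np.random.default_rng(0).uniform(size=(512, 384, 3))  # toy HR image
hr_patches = extract_patches(hr_image)                            # y: 256 x 256
lr_patches = [block_average_downscale(p) for p in hr_patches]     # x: 64 x 64
```

Each HR patch and its down-scaled counterpart form one $(\mathbf{x},\mathbf{y})$ training pair. Swapping the stand-in for a true bi-cubic resampler from an image library would only change `block_average_downscale`.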

IV Experiments

IV-A Experimental Protocol

Following standard practice, we train SuRGe on the DIV2K [21] dataset using the 800 training examples. We test SuRGe on four popular benchmarks, namely Set5 [41], Set14 [42], BSD100 [43], and Urban100 [4], along with six additional datasets viz. Kitti2012, Kitti2015 [44], Middlebury [45], PIRM [46], OST300 [47], and MANGA109 [48]. The details of the network architecture, the datasets and pre-processing of input, and the hyper-parameter choices and their tuning with grid search are provided respectively in "Detailed architecture of the SuRGe network", "Description of datasets", and "Network architecture selection by grid search". We use PSNR [23] and SSIM [24] to measure the performance of all the methods, both of which are described in "Metrics". The code for SuRGe is currently provided as a supplementary archived file to this article. Once the article is accepted for publication, the code-base will be uploaded to a public GitHub repository for ease of access and result reproduction.
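For reference, PSNR follows the standard definition $10\log_{10}(\mathrm{MAX}^{2}/\mathrm{MSE})$. The sketch below assumes 8-bit images (`max_value=255.0`) and computes the error over all pixels. It does not include evaluation conventions such as border cropping or luminance-only computation, since those protocol details are not given in this section.

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally shaped images."""
    ref = reference.astype(np.float64)   # avoid uint8 overflow in the difference
    est = estimate.astype(np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")              # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

hr = np.full((8, 8), 100, dtype=np.uint8)   # toy HR ground truth
sr = hr.copy()
sr[0, 0] = 110                              # one pixel off by 10 gray levels
score = psnr(hr, sr)
```

Because PSNR is logarithmic in the mean squared error, a gain of roughly 3 dB corresponds to halving the MSE, which gives a sense of scale for the differences reported in the tables.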

TABLE I: Ablation study of SuRGe on BSD100 dataset in terms of PSNR and SSIM. The gradual improvement in performance with the progressive addition of key components through seven intermediate models $V_{0}$-$V_{6}$ validates their importance in SuRGe.

Model	$D_{\textrm{VGG}}$	$D_{\textrm{RES}}$	$D_{3\textrm{K}}$	$D_{\textrm{IK}}$	$\mathcal{L}^{G}_{p}$	$\mathcal{L}^{G}_{\textrm{JS}}$	$\mathcal{L}^{G}_{\textrm{GW}}$	DS	$G_{\textrm{CC}}$	$G_{F_{0,1}}$	$\mathcal{W}^{G}_{t}$	$\mathcal{W}^{G}_{l}$	$\mathcal{W}^{G}_{dw}$	PSNR ${}^{1}$	SSIM ${}^{1}$
$V_{0}$	✓		✓		✓				✓		✓			$29.61$	$0.76$
$V_{1}$		✓	✓		✓				✓		✓			$30.04$	$0.81$
$V_{2}$		✓		✓	✓				✓		✓			$30.14$	$0.81$
$V_{3}$		✓		✓		✓		RN		✓	✓			$29.85$	$0.83$
$V_{4}$		✓		✓		✓		FI		✓	✓			$30.99$	$0.84$
$V_{5}$		✓		✓		✓	✓	RN		✓		✓		$30.16$	$0.83$
$V_{6}$		✓		✓		✓	✓	FI		✓		✓		$31.29$	$0.86$
SuRGe		✓		✓		✓	✓	FI		✓			✓	31.52	0.87

$D_{\textrm{VGG}}$ : $D$ with VGG-type backbone. $D_{\textrm{RES}}$ : $D$ with ResNet-type network. $D_{3\textrm{K}}$ : Kernel size is set to 3 in $D$ . $D_{\textrm{IK}}$ : $D$ with incremental kernel size. $\mathcal{L}^{G}_{p}$ : Perceptual similarity calculated with ResNet-50 is used as a loss. $\mathcal{L}^{G}_{\textrm{JS}}$ : Jensen-Shannon divergence calculated between SR and HR image batches. $\mathcal{L}^{G}_{\textrm{GW}}$ : Gromov-Wasserstein loss calculated between LR and SR image batches. DS: Space used for divergence measures, either ResNet-50 (RN) or Flattened Image (FI). $G_{\textrm{CC}}$ : During combination, the feature from the main path is taken in its entirety while the skip connection weights are manually tuned. $G_{F_{0,1}}$ : $G$ using the $F_{0,1}$ of SuRGe. $\mathcal{W}^{G}_{t}$ : Summing $\mathcal{L}^{G}_{a}$ , $\mathcal{L}^{G}_{\textrm{JS}}$ , $\mathcal{L}^{G}_{\textrm{GW}}$ in the generator loss $\mathcal{L}^{G}$ . $\mathcal{W}^{G}_{l}$ : Mixing $\mathcal{L}^{G}_{a}$ , $\mathcal{L}^{G}_{\textrm{JS}}$ , $\mathcal{L}^{G}_{\textrm{GW}}$ in the generator loss $\mathcal{L}^{G}$ with weights learned as parameters of the network. $\mathcal{W}^{G}_{dw}$ : Mixing $\mathcal{L}^{G}_{a}$ , $\mathcal{L}^{G}_{\textrm{JS}}$ , $\mathcal{L}^{G}_{\textrm{GW}}$ with the dynamic weighting in the generator loss $\mathcal{L}^{G}$ . ${}^{1}$ : Higher values indicate better performance.

IV-B Ablation study

We start with an ablation study of the five critical components in SuRGe, namely the choice of backbone in $D$ , the kernel size for convolution in $D$ , the choice of loss functions in $G$ , the combination strategy of $F_{0,1}$ in $G$ , and the loss function in $D$ . Table I shows that over the seven intermediate models on the BSD100 dataset, the performance gradually improves in terms of PSNR and SSIM with better choices for the components. The best performance is achieved when all the components act in harmony, validating their importance in SuRGe. Moreover, previous uses of GW [49] argued in favor of representations obtained from a pre-trained deep network to limit the dimensions and improve stability. However, the particular task of super-resolution may benefit from flattened images, as they mitigate the risk of unregulated information alteration through feature extraction. We empirically confirm this by comparing $V_{3}$ and $V_{5}$ , which pass image features extracted from a pre-trained ResNet-50, against $V_{4}$ and $V_{6}$ , which feed flattened images to $\mathcal{L}^{G}_{\textrm{JS}}$ and $\mathcal{L}^{G}_{\textrm{GW}}$ . Our experiment shows that using flattened images gives a performance boost compared to the ResNet-50 embedding space. As an extension, in Figure 5, we further show that the SR output generated by SuRGe progressively improves over training.

TABLE II: Performance comparison of SuRGe in terms of PSNR and SSIM on four benchmarks against notable competitors.

${}^{1}$ Boldfaced: best, Underlined: second best. ${}^{2}$ CNNA: CNN+Attention, TRAN: Transformer.
Method	Strategy ${}^{2}$	Set5 PSNR ${}^{1,3}$	Set5 SSIM ${}^{1,3}$	Set14 PSNR	Set14 SSIM	BSD100 PSNR	BSD100 SSIM	Urban100 PSNR	Urban100 SSIM
SRCNN	CNN	30.49	0.86	27.50	0.75	26.91	0.71	24.53	0.72
SelfExSR	CNN	--	--	--	--	26.80	0.71	24.67	0.73
DBPN-RES-MR64-3	CNN	32.65	0.90	29.03	0.79	27.82	0.74	27.08	0.81
SRGAN	GAN	29.40	0.85	26.02	0.74	23.16	0.67	--	--
ProSR-L	GAN	--	--	28.94	--	27.68	--	26.74	--
ESRGAN	GAN	32.73	0.90	28.99	0.79	27.85	0.75	27.03	0.82
RankSRGAN	GAN	--	--	26.57	0.65	25.57	0.65	--	--
Beby-GAN	GAN	27.82	0.80	26.96	0.73	25.81	0.68	25.72	0.77
Gram-GAN	GAN	27.97	0.80	26.96	0.77	26.32	0.74	25.89	0.77
SAN	CNNA	32.70	0.90	29.05	0.79	27.86	0.75	27.23	0.82
WRAN	CNNA	28.60	0.90	28.60	0.79	27.71	0.74	26.74	0.80
SwinIR	TRAN	32.92	0.90	29.09	0.79	27.92	0.74	27.45	0.82
SwinIR+	TRAN	32.93	0.90	29.15	0.79	27.95	0.75	27.56	0.83
SwinFIR	TRAN	33.20	0.91	29.36	0.79	28.03	0.75	28.12	0.84
LTE	TRAN	32.81	--	29.06	--	27.86	--	27.24	--
HAT-L	TRAN	33.30	0.90	29.47	0.80	28.09	0.76	28.60	0.85
DAT+	TRAN	33.15	0.91	29.29	0.80	28.03	0.75	27.99	0.84
SRFormer+	TRAN	33.09	0.91	29.19	0.80	28.00	0.75	27.85	0.84
SuRGe (Ours)	GAN	33.07	0.91	30.21	0.83	31.52	0.87	30.11	0.90
${}^{3}$ Higher values indicate better performance.

TABLE III: Performance comparison of SuRGe in terms of PSNR and SSIM on six additional test datasets.

Dataset	Method	PSNR( $\uparrow$ )	SSIM( $\uparrow$ )
PIRM	ESRGAN+	24.15	--
	RankSRGAN	25.62	--
	SuRGe (Ours)	31.92	0.90
OST300	ESRGAN+	23.80	--
	SuRGe (Ours)	31.01	0.86
Kitti2012	NAFSSR-L	27.12	0.82
	SwinFIR	26.83	0.81
	SuRGe (Ours)	32.31	0.88
Kitti2015	NAFSSR-L	26.96	0.82
	SwinFIR	26.00	0.80
	SuRGe (Ours)	31.12	0.89
Middlebury	NAFSSR-L	30.20	0.85
	SwinFIR	30.01	0.86
	SuRGe (Ours)	35.72	0.93
MANGA109	SRCNN	27.66	0.86
	DBPN-RES-MR64-3	31.74	0.92
	HAT-L	33.09	0.93
	HAT	32.87	0.93
	SwinFIR	32.83	0.93
	SwinIR+	32.22	0.92
	SAN	31.66	0.92
	SuRGe (Ours)	34.17	0.95

The best result is boldfaced, while the second best is underlined.

IV-C Quantitative performance of SuRGe

In Table II we exhibit the efficacy of the proposed SuRGe on the Set5, Set14, BSD100, and Urban100 benchmarks in terms of PSNR and SSIM. For comparison, we select 18 state-of-the-art methods from 4 groups. (1) CNN-based: SRCNN [6], SelfExSR [4], and DBPN-RES-MR64-3 [50]. (2) GAN-based: SRGAN [8], ProSR-L [37], ESRGAN [26], RankSRGAN [51], Beby-GAN [29], and Gram-GAN [52]. (3) CNN with attention-based: SAN [30] and WRAN [53]. (4) Transformer-based: SwinIR and SwinIR+ [17], LTE [18], SwinFIR [31], HAT-L [33], DAT+ [34], and SRFormer+ [35]. Table II shows that, except on Set5, SuRGe performs better on all the datasets in terms of both indices, attesting to its consistency. On average over Set14, BSD100, and Urban100, SuRGe improves on the best competing PSNR by 1.89 dB and on the best competing SSIM by 0.063 (6.3 percentage points). On Set5, even though SuRGe achieves the best SSIM jointly with the transformer-based SwinFIR, DAT+, and SRFormer+, its PSNR is slightly lower than that of four competitors (HAT-L, SwinFIR, DAT+, and SRFormer+). This may be attributed to the exceptionally small LR images in Set5. Such an LR image packs more intricate details into fewer pixels. Consequently, the SR output of SuRGe retains most of the high-level visual similarity to attain a commendable SSIM, while the loss of some details is apparent from the slightly lower PSNR.
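The average gains quoted above can be reproduced from Table II with a few lines of arithmetic. The sketch takes, for each dataset, the SuRGe entry and the best competing entry per metric as read off the table; this reading of how the averages are formed is an assumption that happens to match the reported figures.

```python
# (PSNR, SSIM) of SuRGe and the best competing value per metric, from Table II.
surge = {"Set14": (30.21, 0.83), "BSD100": (31.52, 0.87), "Urban100": (30.11, 0.90)}
best_other = {"Set14": (29.47, 0.80), "BSD100": (28.09, 0.76), "Urban100": (28.60, 0.85)}

psnr_gains = [surge[d][0] - best_other[d][0] for d in surge]
ssim_gains = [surge[d][1] - best_other[d][1] for d in surge]

avg_psnr_gain = sum(psnr_gains) / len(psnr_gains)   # in dB
avg_ssim_gain = sum(ssim_gains) / len(ssim_gains)   # on the 0-1 SSIM scale
```

The per-dataset PSNR gains are 0.74, 3.43, and 1.51 dB, and the SSIM gains are 0.03, 0.11, and 0.05, so most of the average improvement comes from BSD100.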

We further evaluate the efficacy of SuRGe on Kitti2012, Kitti2015, Middlebury, PIRM, OST300, and MANGA109. We compare the performance of SuRGe in terms of PSNR and SSIM in Table III against ten notable contenders viz. SRCNN, DBPN-RES-MR64-3, ESRGAN+ [27], SAN, RankSRGAN [51], NAFSSR-L [32], SwinIR+, SwinFIR, HAT [33], and HAT-L. We see from Table III that SuRGe achieves better PSNR and SSIM on all six datasets. This establishes the power of SuRGe in consistently generating better quality SR outputs.

IV-D Qualitative comparison

We compare the visual quality of SuRGe against BSRGAN [16], SRGAN, ESRGAN [26], Real-ESRGAN [28], LTE, and SwinIR in Figure 6. From Figure 6, we can make three key observations. (1) SRGAN, LTE, and SwinIR produce comparatively blurrier SR outputs than SuRGe. (2) ESRGAN and Real-ESRGAN, though they preserve finer details, may add more distortion and noise to the SR output. This is apparent from the mustache of the baboon, the eyebrow of the child, and the nails of the comic. (3) BSRGAN provides smooth, apparently attractive SR outputs but fails to conserve details to the extent of SuRGe. Thus, SuRGe produces better and more detailed SR outputs that are closer to the HR ground truths. Additional qualitative results along with quantitative support in favor of SuRGe can be found in Additional Results.

V Conclusion and Future Works

We propose SuRGe, a fully-convolutional GAN-based architecture that generates visually attractive $4\times$ super-resolution images with minute details. SuRGe highlights the need for preserving diversely informative features and combining them in a learnable fashion in the super-resolution task. Further, possibly for the first time, SuRGe successfully applies divergence measures such as GW as loss functions in the super-resolution context. Moreover, SuRGe demonstrates how the choices of kernel size, normalization methods, and the location and strategy of up-scaling impact the quality of the generated SR output. Furthermore, the commendable performance of SuRGe comes with a comparatively small model (relative to several GAN-based SR methods) and a low inference time, as shown in Additional Results. Although the divergence measures considerably improve the performance of SuRGe, they may also fall prey to noise present in the images, somewhat compromising robustness in the process. This may be mitigated in the future by either exploring the applicability of robust divergence measures [54] or incorporating remedial techniques like median of means [55]. Moreover, the currently proposed architecture is tailored for $4\times$ super-resolution, as that is the most common and widely popular variant of the task. Generalizing the proposed model to $r\times$ super-resolution where $r$ is even will likely not pose a significant challenge, though additional correction of the up-sampled output may be required with increasing $r$. However, the case may become more complicated if $r$ is odd. A potential remedy can be a robust and adaptive up-sampling strategy that incurs minimal distortion and preferably operates in a self-correcting mode to enable a better super-resolution output.

References

[1] O. Rukundo and H. Cao, “Nearest neighbor value interpolation,” arXiv preprint arXiv:1211.1768, 2012.
[2] C.-Y. Yang, C. Ma, and M.-H. Yang, “Single-image super-resolution: A benchmark,” in European Conference on Computer Vision, 2014, pp. 372–386.
[3] P. S. Parsania and P. V. Virparia, “A comparative analysis of image interpolation algorithms,” International Journal of Advanced Research in Computer and Communication Engineering, vol. 5, no. 1, pp. 29–34, 2016.
[4] J.-B. Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2015, pp. 5197–5206.
[5] M. Haris, G. Shakhnarovich, and N. Ukita, “Deep back-projection networks for super-resolution,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 1664–1673.
[6] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2015.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.
[8] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.
[9] A. Odena, V. Dumoulin, and C. Olah, “Deconvolution and checkerboard artifacts,” Distill, 2016.
[10] Y. Wu and J. Johnson, “Rethinking “batch” in batchnorm,” arXiv preprint arXiv:2105.07576, 2021.
[11] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[12] ——, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in IEEE/CVF International Conference on Computer Vision, 2015, pp. 1026–1034.
[13] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer Assisted Intervention, 2015, pp. 234–241.
[14] C. Zhang, F. Rameau, S. Lee, J. Kim, P. Benz, D. M. Argaw, J.-C. Bazin, and I. S. Kweon, “Revisiting residual networks with nonlinear shortcuts.” in British Machine Vision Conference, 2019, p. 12.
[15] C. Zhang, P. Benz, D. M. Argaw, S. Lee, J. Kim, F. Rameau, J.-C. Bazin, and I. S. Kweon, “Resnet or densenet? introducing dense shortcuts to resnet,” in IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 3550–3559.
[16] J. Gu, H. Lu, W. Zuo, and C. Dong, “Blind super-resolution with iterative kernel correction,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1604–1613.
[17] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, “Swinir: Image restoration using swin transformer,” in IEEE/CVF International Conference on Computer Vision (Workshop), 2021, pp. 1833–1844.
[18] J. Lee and K. H. Jin, “Local texture estimator for implicit representation function,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1929–1938.
[19] F. Mémoli, “Gromov–wasserstein distances and the metric approach to object matching,” Foundations of computational mathematics, vol. 11, pp. 417–487, 2011.
[20] S. Datta, S. S. Mullick, A. Chakrabarty, and S. Das, “Interval bound interpolation for few-shot learning with few tasks,” in International Conference on Machine Learning, 2023, pp. 7141–7166.
[21] E. Agustsson and R. Timofte, “Ntire 2017 challenge on single image super-resolution: Dataset and study,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (Workshop), 2017, pp. 1122–1131.
[22] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[23] A. Horé and D. Ziou, “Image quality metrics: Psnr vs. ssim,” in IEEE International Conference on Pattern Recognition, 2010, pp. 2366–2369.
[24] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[25] J. Xin, J. Li, X. Jiang, N. Wang, H. Huang, and X. Gao, “Wavelet-based dual recursive network for image super-resolution,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 2, pp. 707–720, 2020.
[26] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy, “Esrgan: Enhanced super-resolution generative adversarial networks,” in European Conference on Computer Vision (Workshop), 2018, pp. 1–16.
[27] N. C. Rakotonirina and A. Rasoanaivo, “Esrgan+: Further improving enhanced super-resolution generative adversarial network,” in IEEE ICASSP, 2020, pp. 3637–3641.
[28] X. Wang, L. Xie, C. Dong, and Y. Shan, “Real-esrgan: Training real-world blind super-resolution with pure synthetic data,” in IEEE/CVF International Conference on Computer Vision, 2021, pp. 1905–1914.
[29] W. Li, K. Zhou, L. Qi, L. Lu, and J. Lu, “Best-buddy gans for highly detailed image super-resolution,” in AAAI Conference on Artificial Intelligence, 2022, pp. 1412–1420.
[30] T. Dai, J. Cai, Y. Zhang, S.-T. Xia, and L. Zhang, “Second-order attention network for magnification-arbitrary single image super-resolution,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11 065–11 074.
[31] D. Zhang, F. Huang, S. Liu, X. Wang, and Z. Jin, “Swinfir: Revisiting the swinir with fast fourier convolution and improved training for image super-resolution,” arXiv preprint arXiv:2208.11247, 2022.
[32] X. Chu, L. Chen, and W. Yu, “Nafssr: stereo image super-resolution using nafnet,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1239–1248.
[33] X. Chen, X. Wang, J. Zhou, Y. Qiao, and C. Dong, “Activating more pixels in image super-resolution transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 22 367–22 377.
[34] Z. Chen, Y. Zhang, J. Gu, L. Kong, X. Yang, and F. Yu, “Dual aggregation transformer for image super-resolution,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 12 312–12 321.
[35] Y. Zhou, Z. Li, C.-L. Guo, S. Bai, M.-M. Cheng, and Q. Hou, “Srformer: Permuted self-attention for single image super-resolution,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 12 780–12 791.
[36] P. Pope, C. Zhu, A. Abdelkader, M. Goldblum, and T. Goldstein, “The intrinsic dimension of images and its impact on learning,” in International Conference on Learning Representations, 2021.
[37] Y. Wang, F. Perazzi, B. McWilliams, A. Sorkine-Hornung, O. Sorkine-Hornung, and C. Schroers, “A fully progressive approach to single-image super-resolution,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (Workshop), 2018, pp. 864–873.
[38] S. Bianco, C. Cusano, and R. Schettini, “Color constancy using cnns,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (Workshop), 2015, pp. 81–89.
[39] T. Karras, T. Aila et al., “Progressive growing of gans for improved quality, stability, and variation,” arXiv preprint arXiv:1710.10196, 2017.
[40] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in International Conference on Machine Learning, 2017, pp. 214–223.
[41] M. Bevilacqua, A. Roumy, C. Guillemot, and M.-L. Alberi-Morel, “Low-complexity single-image super-resolution based on nonnegative neighbor embedding,” in British Machine Vision Conference, 2012, pp. 135.1–135.10.
[42] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in International Conference on Curves and Surfaces, 2012, pp. 711–730.
[43] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in IEEE/CVF International Conference on Computer Vision, 2001, pp. 416–423.
[44] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2012, pp. 3354–3361.
[45] D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” International journal of computer vision, vol. 47, pp. 7–42, 2002.
[46] Y. Blau, R. Mechrez, R. Timofte, T. Michaeli, and L. Zelnik-Manor, “The 2018 pirm challenge on perceptual image super-resolution,” in European Conference on Computer Vision (Workshop), 2018, pp. 1–22.
[47] X. Wang, K. Yu, C. Dong, and C. C. Loy, “Recovering realistic texture in image super-resolution by deep spatial feature transform,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 606–615.
[48] Y. Matsui, K. Ito et al., “Sketch-based manga retrieval using manga109 dataset,” Multimedia Tools and Applications, vol. 76, pp. 21 811–21 838, 2017.
[49] C. Bunne, D. Alvarez-Melis, A. Krause, and S. Jegelka, “Learning generative models across incomparable spaces,” in International Conference on Machine Learning, 2019, pp. 851–861.
[50] M. Haris, G. Shakhnarovich, and N. Ukita, “Deep back-projection networks for single image super-resolution,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 12, pp. 4323–4337, 2021.
[51] W. Zhang, Y. Liu, C. Dong, and Y. Qiao, “Ranksrgan: Generative adversarial networks with ranker for image super-resolution,” in IEEE/CVF International Conference on Computer Vision, 2019, pp. 3096–3105.
[52] J. Song, H. Yi, W. Xu, B. Li, and X. Li, “Gram-gan: Image super-resolution based on gram matrix and discriminator perceptual loss,” Sensors, vol. 23, no. 4, 2023.
[53] S. Xue, W. Qiu, F. Liu, and X. Jin, “Wavelet-based residual attention network for image super-resolution,” Neurocomputing, vol. 382, pp. 116–126, 2020.
[54] Y. He, A. B. Hamza, and H. Krim, “A generalized divergence measure for robust image registration,” IEEE Transactions on Signal Processing, vol. 51, no. 5, pp. 1211–1220, 2003.
[55] G. Lecué and M. Lerasle, “Robust machine learning by median-of-means: Theory and practice,” The Annals of Statistics, vol. 48, no. 2, pp. 906 – 931, 2020.

Detailed architecture of the SuRGe network

Refer to Figures 7 and 8 for detailed schematics of the architectures of the Generator $G$ and the Discriminator $D$ in SuRGe, respectively.

-A Architecture of Generator

We present the architecture of the Generator $G$ of SuRGe in Figure 7. Taking a random LR patch from Baboon as an example input, the model constructed with $n_{G}=8$ outputs an SR patch by passing it through the different blocks indicated by the legend.

-B Architecture of Discriminator

We present the discriminator model $D$ of SuRGe in Figure 8. All blocks presented in the figure follow the naming convention discussed in Section III-B of the main paper. The Discriminator $D$ uses $n_{D}=4$ along with an architecture of sub-blocks that is structurally similar to that of the generator $G$, except for the normalisation. The classification head $H$ is responsible for distinguishing between real (HR) and fake (SR) image samples.

Algorithm of SuRGe

The following Algorithm 1 describes the workflow of SuRGe.

Algorithm 1 Super-Resolution Generator (SuRGe)

Input: $Y^{GT}=\{\mathbf{y}^{GT}_{1},\mathbf{y}^{GT}_{2},\cdots,\mathbf{y}^{GT}_{m}\}$ : training set of full HR Ground Truth (GT) images, $N$ : mini-batch size, $T$ : Number of epochs as a termination criterion.
Output: A trained super-resolution image generator network $G$ .

1: Initialize epoch counter $t=1$.
2: while $t\leq T$ do
3: Initialize $Y=\phi$.
4: for each $\mathbf{y}^{GT}\in Y^{GT}$ do
5: $Y=Y\cup\{\mathbf{y}\}$, where $\mathbf{y}$ is a $256\times 256$ patch, randomly extracted from $\mathbf{y}^{GT}$.
6: end for
7: Sample $Y_{N}=\{\mathbf{y}_{1},\mathbf{y}_{2},\cdots,\mathbf{y}_{N}\}\subset Y$, a batch of HR ground truth patches.
8: Form $X_{N}=\{\mathbf{x}_{1},\mathbf{x}_{2},\cdots,\mathbf{x}_{N}\}$, the LR training batch, where $\mathbf{x}_{i}$ is formed by down-scaling $\mathbf{y}_{i}$ to $64\times 64$ by bi-cubic interpolation for all $i=1,2,\cdots,N$.
9: Update $G$ by gradient descent on $\mathcal{L}^{G}(X_{N},Y_{N},G(X_{N}))$.
10: Sample SR output batch $G(X_{N})=\{G(\mathbf{x}_{1}),G(\mathbf{x}_{2}),\cdots,G(\mathbf{x}_{N})\}$ where $\mathbf{x}_{i}\in X_{N}$.
11: Update $D$ by gradient descent on $\mathcal{L}^{D}(Y_{N},G(X_{N}))$.
12: Increase epoch counter $t=t+1$.
13: end while
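The control flow of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative skeleton: images are plain 2-D lists, and `downscale`, `update_g`, `generate`, and `update_d` are hypothetical callables standing in for bi-cubic down-scaling, the two gradient steps, and the generator forward pass. None of these names come from the SuRGe implementation.

```python
import random


def random_crop(image, size):
    """Cut a size x size patch at a uniformly random location.

    `image` is a 2-D list of pixel rows (a stand-in for a real image array).
    """
    height, width = len(image), len(image[0])
    top = random.randint(0, height - size)
    left = random.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]


def train_surge(hr_images, downscale, update_g, generate, update_d,
                batch_size, epochs, patch_size=256):
    """Outer loop of Algorithm 1; all learning steps are injected hooks.

    The four callables are hypothetical placeholders, not SuRGe code.
    """
    for _ in range(epochs):                                  # steps 1-2, 12-13
        # Steps 3-6: one fresh random HR patch per ground-truth image.
        patches = [random_crop(img, patch_size) for img in hr_images]
        y_batch = random.sample(patches, batch_size)         # step 7
        x_batch = [downscale(y) for y in y_batch]            # step 8
        update_g(x_batch, y_batch)                           # step 9
        sr_batch = [generate(x) for x in x_batch]            # step 10
        update_d(y_batch, sr_batch)                          # step 11
```

Because the patches are re-drawn in every epoch, the generator sees a different random crop of each training image each time, which is the behaviour steps 3 to 6 describe.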

Description of datasets

The following Table IV provides the details of the DIV2K training dataset [21] and the 10 benchmark testing datasets.

TABLE IV: Details of datasets

Dataset	Number of samples	Average ground truth resolution	Remark
DIV2K [21]	$800$	$1971\times 1435$	Only the training split is used to train SuRGe.
Set5 [41]	$5$	$78\times 84$	Benchmark, contains 5 samples in total.
Set14 [42]	$14$	$112\times 101$	Benchmark, contains 14 samples in total.
BSD100 [43]	$100$	$110\times 89$	Benchmark, contains 100 samples in total.
Urban100 [4]	$100$	$246\times 199$	Benchmark, contains 100 samples in total.
PIRM [46]	$100$	$155\times 119$	Contains 100 samples from the validation set.
KITTI2012 [44]	$20$	$300\times 96$	The same 20 samples are taken as in [32].
KITTI2015 [44]	$20$	$300\times 96$	The same 20 samples are taken as in [32].
Middlebury [45]	$10$	$438\times 312$	The same 10 samples are taken as in [32].
OST300 [47]	$300$	$160\times 123$	Contains 300 samples in total.
MANGA109 [48]	$109$	$207\times 300$	Contains 109 samples in total.

Metrics

We use two metrics to quantify and compare the performance of the super-resolution methods.

-C Peak Signal to Noise Ratio (PSNR)

PSNR [23] is a measure to quantify the noise present in an image compared to a reference. In case of super-resolution, if the ground truth is $\mathbf{y}$ and the SR output generated by $G$ from a LR input $\mathbf{x}$ is $G(\mathbf{x})$ , then the PSNR ( $\rho$ ) between the two is defined as:

\rho(\mathbf{y},G(\mathbf{x}))=20\log_{10}\left[\frac{\max\{\mathbf{y}\}}{\sqrt{\frac{1}{wh}||\mathbf{y}-G(\mathbf{x})||^{2}_{2}}}\right],

(8)

where $w$ and $h$ respectively denote the width and height of $\mathbf{y}$ or $G(\mathbf{x})$. Evidently, in equation (8) we want to decrease $||\mathbf{y}-G(\mathbf{x})||^{2}_{2}$ so that the SR output matches the HR ground truth. In other words, a higher PSNR indicates a better quality SR.
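As an illustration, the following is a minimal Python sketch of PSNR in its standard form, where the peak value is divided by the root of the mean squared error inside the logarithm. Images are assumed to be single-channel 2-D lists, and the function name `psnr` and its optional `peak` argument are choices made for this sketch. When `peak` is omitted, the maximum of the reference image is used, following $\max\{\mathbf{y}\}$ in the text.

```python
import math


def psnr(reference, estimate, peak=None):
    """PSNR in dB between two equally sized single-channel images (2-D lists)."""
    ref = [p for row in reference for p in row]
    est = [p for row in estimate for p in row]
    if len(ref) != len(est):
        raise ValueError("images must have the same number of pixels")
    # Mean squared error, i.e. (1 / wh) * ||y - G(x)||_2^2.
    mse = sum((a - b) ** 2 for a, b in zip(ref, est)) / len(ref)
    if mse == 0:
        return math.inf  # identical images: no noise at all
    if peak is None:
        peak = max(ref)  # max{y}; pass 255 explicitly for 8-bit images
    return 20.0 * math.log10(peak / math.sqrt(mse))
```

For example, an estimate that is off by 5 grey levels at every pixel of an 8-bit image gives `20 * log10(255 / 5)`, roughly 34.15 dB, and halving that error raises the score by about 6 dB.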

-D Structural Similarity Index (SSIM)

SSIM [24] is another commonly used metric to compare an SR output to a reference HR ground truth. Similar to PSNR, given a SR output $G(\mathbf{x})$ and a HR ground truth $\mathbf{y}$ the SSIM $\lambda$ between the two is calculated as follows:

\lambda(\mathbf{y},G(\mathbf{x}))=\frac{\left(2\mu_{\mathbf{y}}\mu_{G(\mathbf{x})}+c_{1}\right)\left(2Cov(\mathbf{y},G(\mathbf{x}))+c_{2}\right)}{\left(\mu_{\mathbf{y}}^{2}+\mu^{2}_{G\left(\mathbf{x}\right)}+c_{1}\right)\left(\sigma_{\mathbf{y}}^{2}+\sigma^{2}_{G\left(\mathbf{x}\right)}+c_{2}\right)},

(9)

where, $\mu_{\mathbf{y}}$ and $\mu_{G(\mathbf{x})}$ are respectively the mean pixel values of HR ground truth and SR output, $\sigma_{\mathbf{y}}^{2}$ and $\sigma^{2}_{G\left(\mathbf{x}\right)}$ denote the variance of pixel values in $\mathbf{y}$ and $G(\mathbf{x})$ , while $Cov(\mathbf{y},G(\mathbf{x}))$ is the covariance between the pixel values of the two images. The two constants $c_{1}$ and $c_{2}$ ensure the non-zero property for the denominator and are usually kept to small values. A higher SSIM indicates the two images are more perceptually similar.
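Equation (9) can be evaluated directly from global image statistics, as in the hedged sketch below. Two assumptions are made here that the text does not state: the constants are set to the commonly used $c_{1}=(0.01L)^{2}$ and $c_{2}=(0.03L)^{2}$ with dynamic range $L=255$, and the statistics are computed once over the whole image, whereas SSIM implementations typically average the same expression over local windows. The name `ssim_global` is hypothetical.

```python
def ssim_global(reference, estimate, dynamic_range=255.0):
    """Evaluate the SSIM expression of equation (9) over whole images.

    Images are single-channel 2-D lists of equal size. The constants
    c1 and c2 follow a common convention and are an assumption here.
    """
    y = [p for row in reference for p in row]
    g = [p for row in estimate for p in row]
    if len(y) != len(g):
        raise ValueError("images must have the same number of pixels")
    n = len(y)
    mu_y, mu_g = sum(y) / n, sum(g) / n
    var_y = sum((a - mu_y) ** 2 for a in y) / n
    var_g = sum((b - mu_g) ** 2 for b in g) / n
    cov = sum((a - mu_y) * (b - mu_g) for a, b in zip(y, g)) / n
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mu_y * mu_g + c1) * (2 * cov + c2)
    denominator = (mu_y ** 2 + mu_g ** 2 + c1) * (var_y + var_g + c2)
    return numerator / denominator
```

Identical images score 1, while an image and its photographic negative have a strongly negative covariance and therefore a negative score.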

Network architecture selection by grid search

The grid search spaces and the final selected networks along with hyper-parameter settings for the generator $G$ and the discriminator $D$ in SuRGe are respectively detailed in Tables V and VI.

TABLE V: Grid search space with the selected $G$ architecture and hyperparameter settings of SuRGe.

Block	Parameters	Grid search space	Final network
$C_{0},C_{1}$	Kernel size ( $C_{0},C_{1}$ )	$\{7,9,11\}$	9
	No. of convolutional filters ( $C_{0}$ )	$\{32,64\}$	64
	No. of convolutional filters ( $C_{1}$ )	$\{64,128\}$	128
	Stride ( $C_{0},C_{1}$ )	1	1
	Padding ( $C_{0},C_{1}$ )	$\{1,2,4\}$	4
	Activation ( $C_{0},C_{1}$ )	ReLU, LeakyReLU, PReLU	PReLU
	Normalization ( $C_{0},C_{1}$ )	True, False	False
	Normalization technique ( $C_{0},C_{1}$ )	--	--
	No. of convolution layers ( $C_{0},C_{1}$ )	$\{1,2\}$	1
$R_{0},R_{1}$	Inter-sub-blocks skip connection ( $R_{0},R_{1}$ )	True, False	True
	No. of convolutions in sub-block ( $R_{0},R_{1}$ )	$\{1,2\}$	1
	No. of convolution filters ( $R_{0}$ )	$\{32,64\}$	64
	No. of convolution filters ( $R_{1}$ )	$\{64,128\}$	128
	Kernel size ( $R_{0},R_{1}$ )	$\{3,5\}$	3
	(Stride, Padding) ( $R_{0},R_{1}$ )	(1, 1)	(1, 1)
	Activation presence in sub-blocks ( $R_{0},R_{1}$ )	First, Second, Both	First
	Normalization in sub-blocks ( $R_{0},R_{1}$ )	First, Second, Both	Both
	Activation technique ( $R_{0},R_{1}$ )	ReLU, LeakyReLU, PReLU	PReLU
	Normalization technique ( $R_{0},R_{1}$ )	BatchNorm, PixelNorm	BatchNorm
$I_{0},I_{1}$	Kernel size ( $I_{0},I_{1}$ )	$\{3,5\}$	3
	No. of convolutional filters ( $I_{0}$ )	$\{32,64\}$	64
	No. of convolutional filters ( $I_{1}$ )	$\{64,128\}$	128
	(Stride, Padding) ( $I_{0},I_{1}$ )	(1, 1)	(1, 1)
	Activation ( $I_{0},I_{1}$ )	ReLU, LeakyReLU, PReLU	PReLU
	Normalization ( $I_{0},I_{1}$ )	True, False	True
	Normalization technique ( $I_{0},I_{1}$ )	BatchNorm, PixelNorm	BatchNorm
	No. of convolution layers ( $I_{0},I_{1}$ )	$\{1,2\}$	1
$U_{0},U_{1}$	Initial convolution present	True, False	True
	Kernel size of convolution ( $U_{0},U_{1}$ )	$\{3,5\}$	3
	No. of convolutional filters ( $U_{0}$ )	$\{32,64\}$	64
	No. of convolutional filters ( $U_{1}$ )	$\{64,128\}$	128
	(Stride, Padding) ( $U_{0},U_{1}$ )	(1, 1)	(1, 1)
	Activation ( $U_{0},U_{1}$ )	ReLU, LeakyReLU, PReLU	PReLU
	Normalization ( $U_{0},U_{1}$ )	True, False	False
	Normalization technique ( $U_{0},U_{1}$ )	--	--
	No. of convolution layers ( $U_{0},U_{1}$ )	$\{1,2\}$	1
	Up-scaling technique ( $U_{0},U_{1}$ )	Bi-cubic, PixelShuffle, NN ${}^{*}$	NN
$S$	Structure similar to	$U_{0}\circ C_{1}$	$U_{0}\circ C_{1}$
Optimizer (Adam)	Learning rate	$\{0.0001,0.00025,0.0005\}$	0.0001
	( $\beta_{1},\beta_{2}$ )	(0.9, 0.99)	(0.9, 0.99)

• NN ${}^{*}$ : Nearest Neighbour method.
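The kernel, stride, and padding values selected in Table V can be sanity-checked with the standard convolution output-size formula $\lfloor (n+2p-k)/s\rfloor+1$. The sketch below assumes a $64\times 64$ input, matching the LR patches of Algorithm 1, and only illustrates that the selected combinations preserve spatial size. It is not a statement about why the grid search preferred them, and the helper name `conv_output_size` is hypothetical.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


LR_SIZE = 64  # side of the LR training patches in Algorithm 1

# (kernel, stride, padding) of the selected generator blocks in Table V.
selected = {"C": (9, 1, 4), "R": (3, 1, 1), "I": (3, 1, 1), "U": (3, 1, 1)}
sizes = {name: conv_output_size(LR_SIZE, k, s, p)
         for name, (k, s, p) in selected.items()}

# The other padding values in the grid for the 9x9 kernel of C_0 and C_1.
other_paddings = {p: conv_output_size(LR_SIZE, 9, 1, p) for p in (1, 2)}
```

With stride 1, a $k\times k$ kernel keeps the feature map size only when the padding equals $(k-1)/2$, which holds for both the $9\times 9$ kernel with padding 4 and the $3\times 3$ kernels with padding 1. The smaller paddings in the grid would shrink a 64-pixel side to 58 or 60.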

TABLE VI: Grid search space with the selected $D$ architecture and hyperparameter settings of SuRGe.

Block	Parameters	Grid search space	Final network
$C$	Kernel size	$\{3,5\}$	3
	No. of convolutional filters	$\{32,64\}$	64
	(Stride, Padding)	(1, 1)	(1, 1)
	Activation	ReLU, LeakyReLU, PReLU	LeakyReLU
	Leakiness of LeakyReLU	$\{0.1,0.2\}$	0.2
	Normalization	True, False	False
	Normalization technique	--	--
	No. of convolution layers	$\{1,2\}$	1
$B$	Inter-sub-blocks skip ( $B$ )	True, False	True
	No. of $B$ blocks	$\{1,2,3,4\}$	4
	No. of conv in sub ( $B_{0}-B_{3}$ )	$\{1,2\}$	1
	No. of conv filters ( $B_{0}-B_{3}$ )	$(64,128,256,512)$	$(64,128,256,512)$
	Kernel size ( $B_{0}-B_{3}$ )	(3, 5, 7, 9)	(3, 5, 7, 9)
	(Stride, Padding) ( $B$ )	(1, 1)	(1, 1)
	Activation in sub ( $B$ )	First, Second, Both	First
	Normalization in sub ( $B$ )	First, Second, Both	Both
	Activation technique ( $B$ )	ReLU, LeakyReLU, PReLU	LeakyReLU
	Leakiness of LeakyReLU ( $B$ )	$\{0.1,0.2\}$	0.2
	Normalization technique ( $B$ )	BatchNorm, PixelNorm	PixelNorm
$H$	Pooling strategy	avgPool, adaptiveAvgPool, maxPool	adaptiveAvgPool
	Output size of adaptiveAvgPool	$\{4,6,8\}$	6
	No. of dense layers	$\{1,2,3\}$	2
	No. of dense nodes	$\{(512,1),(1024,1),(2048,1)\}$	(1024, 1)
	Activation	ReLU, LeakyReLU, PReLU	LeakyReLU
	Leakiness of LeakyReLU	$\{0.1,0.2\}$	0.2
	Normalization	True, False	True
	Normalization technique	--	--
Gradient Penalty	Weight $\gamma$	$\{1,10,100\}$	10
Optimizer (Adam)	Learning rate	$\{0.0001,0.00025,0.0005\}$	0.0001
	( $\beta_{1},\beta_{2}$ )	(0, 0.9)	(0, 0.9)

Additional Results

-E Inference time of SuRGe compared to recent GAN and Transformer-based contenders

To confirm that the several improvements made in SuRGe maintain a low inference time, we compare its inference time in seconds against that of three recent GAN and transformer models, viz. BSRGAN, SwinFIR, and LTE, in the same computing setup. All algorithms are executed on a computing system with a single AMD Ryzen 9 3900X 12-core processor, one NVIDIA RTX 3090 24GB GPU, and a total of 64GB DDR4 memory. Table VII shows that SuRGe attains the lowest average inference time on each of the three datasets considered, namely Set5, Set14, and BSD100, while providing good quality SR.

TABLE VII: Average inference time in seconds for super-resolution of a single image. The best result is boldfaced and the second best is underlined.

Model	Set5	Set14	BSD100
BSRGAN	0.059	0.095	0.077
LTE	0.080	0.132	0.099
SwinFIR	1.032	1.886	1.389
SuRGe (Ours)	0.055	0.089	0.067
Speed-up of SuRGe over the fastest contender (BSRGAN)	1.07x	1.07x	1.15x
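The speed-up row of Table VII can be reproduced from the reported timings. The short sketch below takes, for each dataset, the fastest contender and divides its time by that of SuRGe. The dictionary layout and the function name `speed_up` are illustrative choices, and the numbers are copied from the table.

```python
# Average inference time in seconds per image, copied from Table VII.
times = {
    "BSRGAN": {"Set5": 0.059, "Set14": 0.095, "BSD100": 0.077},
    "LTE": {"Set5": 0.080, "Set14": 0.132, "BSD100": 0.099},
    "SwinFIR": {"Set5": 1.032, "Set14": 1.886, "BSD100": 1.389},
    "SuRGe": {"Set5": 0.055, "Set14": 0.089, "BSD100": 0.067},
}


def speed_up(times, ours="SuRGe"):
    """Ratio of the fastest contender's time to ours, per dataset."""
    result = {}
    for dataset, our_time in times[ours].items():
        fastest_other = min(t[dataset] for model, t in times.items()
                            if model != ours)
        result[dataset] = round(fastest_other / our_time, 2)
    return result


ratios = speed_up(times)
```

In all three columns the fastest contender is BSRGAN, so the ratios are 0.059/0.055, 0.095/0.089, and 0.077/0.067.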

-F A comparison of performance vs. number of parameters for SuRGe and its notable GAN-based contenders

In Table VIII we provide a comparative study of performance (averaged over the four benchmark datasets, viz. Set5 [41], Set14 [42], BSD100 [43], and Urban100 [4], in terms of PSNR and SSIM) vs. the number of parameters (in millions) for GAN-based super-resolution techniques, namely the proposed SuRGe, PROSR-L [37], SRGAN [8], ESRGAN [26], BeByGAN [29], and Rank-SRGAN [51]. We see from Table VIII that even though SuRGe uses about $2\times$ the parameters of PROSR-L, it improves the PSNR by about 2.55 percentage points (pp) on average. Moreover, SuRGe and SRGAN use a nearly equal number of parameters, while the proposed method outperforms the contender by about 4pp in PSNR and 13pp in SSIM. The rest of the competing methods use about 1.5x-2x the parameters of SuRGe while achieving a lower PSNR and SSIM on average. Thus, SuRGe can be considered a comparatively lightweight GAN-based method that demonstrates high performance while limiting the number of parameters.

TABLE VIII: Comparison of performance vs. the number of parameters for GAN-based approaches.

GAN-based Technique	No. of Parameters	Average Performance
	(in millions)	PSNR( $\uparrow$ )	SSIM( $\uparrow$ )
PROSR-L [37]	13 ( $\approx 0.5\times$ )	27.78 ( $-4.33$ )	--
SRGAN [8]	25.1 ( $\approx 1\times$ )	26.19 ( $-5.92$ )	0.75 ( $-0.14$ )
ESRGAN [26]	40 ( $\approx 1.5\times$ )	29.15 ( $-2.96$ )	0.81 ( $-0.08$ )
BeByGAN [29]	40 ( $\approx 1.5\times$ )	26.57 ( $-5.54$ )	0.74 ( $-0.15$ )
Rank-SRGAN [51]	53 ( $\approx 2\times$ )	26.07 ( $-6.04$ )	0.65 ( $-0.24$ )
SuRGe (Ours)	25.7	32.11	0.89

• Green indicates the number of parameters/performance of the contender is better than SuRGe. Red indicates the number of parameters/performance of the contender is worse than SuRGe.
• The ratio of the number of parameters and the difference in performance are measured considering SuRGe as a reference.
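The parenthesised entries of Table VIII follow from simple arithmetic against the SuRGe row, as the sketch below illustrates. The values are copied from the table. Rounding the parameter ratio to the nearest half is an assumption made here to mirror the $\approx 0.5\times$, $1\times$, $1.5\times$, and $2\times$ labels, and all names in the code are hypothetical.

```python
# Reference row of Table VIII (parameters in millions).
SURGE = {"params": 25.7, "psnr": 32.11, "ssim": 0.89}

# Contender rows of Table VIII; PROSR-L has no reported SSIM.
contenders = {
    "PROSR-L": {"params": 13.0, "psnr": 27.78, "ssim": None},
    "SRGAN": {"params": 25.1, "psnr": 26.19, "ssim": 0.75},
    "ESRGAN": {"params": 40.0, "psnr": 29.15, "ssim": 0.81},
    "BeByGAN": {"params": 40.0, "psnr": 26.57, "ssim": 0.74},
    "Rank-SRGAN": {"params": 53.0, "psnr": 26.07, "ssim": 0.65},
}


def relative_to_surge(entry):
    """Return (parameter ratio, PSNR difference, SSIM difference)."""
    # Ratio rounded to the nearest 0.5 (an assumption of this sketch).
    ratio = round(2 * entry["params"] / SURGE["params"]) / 2
    d_psnr = round(entry["psnr"] - SURGE["psnr"], 2)
    d_ssim = (None if entry["ssim"] is None
              else round(entry["ssim"] - SURGE["ssim"], 2))
    return ratio, d_psnr, d_ssim


summary = {name: relative_to_surge(e) for name, e in contenders.items()}
```

A negative difference means the contender scores below SuRGe, and a ratio above 1 means it uses more parameters.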

-G Simultaneous demonstration of qualitative and quantitative performances of SuRGe

Additionally, Figure 9 shows a qualitative performance comparison of SuRGe with five notable contenders, namely SRGAN [8], BSRGAN [16], Real-ESRGAN [28], LTE [18], and SwinIR [17], on five test images, viz. Cars (PIRM), Statues (BSD100), Balloons (PIRM), Horses (BSD100), and Lioness (OST300). For a simultaneous quantitative comparison, we present the SSIM and PSNR values of the six algorithms on the five samples in Table IX. Figure 9 demonstrates how SuRGe maintains a higher degree of detail in the SR output, which is further attested quantitatively by the improved PSNR and SSIM in Table IX. Moreover, for ease of visualization, a patch-based qualitative comparison is also presented in Figures 10 and 11, enabling focused observation of a particular region of the SR output.

TABLE IX: Additional quantitative performance comparison of SuRGe with the five contenders on the five test samples used in Figure 9. The best is boldfaced while the second-best is underlined.

Method	Metric	Cars	Statues	Balloons	Horses	Lioness
SRGAN	PSNR	29.52	28.83	29.77	29.00	30.54
SRGAN	SSIM	0.57	0.46	0.65	0.59	0.71
BSRGAN	PSNR	30.29	29.29	30.38	31.05	30.67
BSRGAN	SSIM	0.53	0.49	0.61	0.63	0.70
Real-ESRGAN	PSNR	30.38	29.10	30.32	30.92	31.08
Real-ESRGAN	SSIM	0.58	0.48	0.65	0.62	0.70
LTE	PSNR	31.56	29.61	30.51	31.83	32.07
LTE	SSIM	0.71	0.58	0.79	0.70	0.79
SwinIR	PSNR	31.69	29.65	31.66	31.93	32.13
SwinIR	SSIM	0.72	0.59	0.81	0.72	0.79
SuRGe (ours)	PSNR	34.21	32.79	32.31	33.13	34.74
SuRGe (ours)	SSIM	0.82	0.79	0.86	0.86	0.91
