Towards arbitrary-scale histopathology image super-resolution: An efficient dual-branch framework via implicit self-texture enhancement

Minghong Duan Linhao Qu Zhiwei Yang Manning Wang Chenxi Zhang [email protected] Zhijian Song [email protected] Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, Shanghai 200032, China Shanghai Key Laboratory of Medical Image Computing and Computer Assisted Intervention, Shanghai 200032, China Academy for Engineering and Technology, Fudan University, Shanghai 200433, China
Abstract

High-quality whole-slide scanning is expensive, complex, and time-consuming, thus limiting the acquisition and utilization of high-resolution histopathology images in daily clinical work. Deep learning-based single-image super-resolution (SISR) techniques are an effective way to solve this problem. However, the existing SISR models applied to pathological images work only at fixed integer scaling factors, which limits their applicability. Though methods based on implicit neural representation (INR) have shown promising results in arbitrary-scale super-resolution (SR) of natural images, applying them directly to pathological images is inadequate because pathological images have unique fine-grained textures that differ from those of natural images. Thus, we propose an Implicit Self-Texture Enhancement-based dual-branch framework (ISTE) for arbitrary-scale SR of pathological images to address this challenge. ISTE contains a feature aggregation branch and a texture learning branch. We employ the feature aggregation branch to enhance the learning of the features’ relevance in the local region while utilizing the texture learning branch to enhance the learning of high-frequency texture details. Then, we design a two-stage texture enhancement strategy to fuse the features from the two branches to obtain the SR images, where the first stage is feature-based texture enhancement, and the second stage is spatial-domain-based texture enhancement. Experiments on publicly available datasets, including TMA, HistoSR, and TCGA lung cancer datasets, demonstrate that ISTE outperforms existing fixed-scale and arbitrary-scale SR algorithms at multiple scaling factors and helps to improve downstream task performance. To the best of our knowledge, this is the first work to achieve arbitrary-scale SR in pathological images.

Keywords: super-resolution, histopathology images, implicit neural representation

1 Introduction

High-resolution (HR) pathology whole slide images (WSIs) contain rich cellular morphology and pathological patterns, and they are the gold standard for clinical diagnosis and the basis for automated pathology image analysis tasks, including segmentation, detection, and classification [1, 2, 3, 4]. However, the acquisition and utilization of digital WSIs remain limited in the daily clinical workflow [5, 4]. On the one hand, HR digital WSIs are typically obtained through sophisticated and costly whole-slide scanning equipment, which is often difficult to access in remote and underserved regions. On the other hand, acquiring HR digital WSIs involves using dedicated micro-cameras within the whole slide scanner to capture image fragments from different local regions of the specimen, which are then stitched together to form a complete image depicting the entire specimen. Such a digital process is highly time-consuming [5, 4]. Furthermore, HR digital WSIs are very large, often reaching gigapixels, which places additional demands on clinical funding support, professional training, ample data storage, and efficient data management [6, 2]. Therefore, if it is possible to scan low-resolution (LR) pathological images with cheaper devices while designing algorithms that can produce WSIs maintaining similar image quality, the digitization process could be accelerated, and the clinical application of automated techniques to analyze pathological images could be promoted [5, 4, 7].

Super-resolution (SR) algorithms based on deep learning can accurately map a single LR image to an HR image [8, 9, 10]. Recently, deep learning-based methods have been widely applied in pathological image SR. Most approaches construct a large dataset of LR-HR image pairs to train neural networks in an end-to-end manner. The trained neural networks can generate HR pathological images with input LR pathological images. For example, Mukherjee et al. [10] utilized a convolutional neural network with an upsampling layer to generate HR images. Chen et al. [11] proposed a spatial wavelet dual-stream network to perform the SR image generation. As shown in Fig.1(a), although these methods demonstrate commendable performance, they can only be trained and tested at a fixed integer scale, and the network needs to be retrained at a specific scale if other scaling factors are needed. However, in clinical pathological diagnosis, doctors usually need to continuously zoom in and out of sections at different scaling factors, so the applicability of these models is greatly limited. Unfortunately, to our knowledge, there are currently no models that can achieve arbitrary-scale SR for pathological images.

Recently, inspired by implicit neural representation (INR) [12, 13, 14], some studies have pioneered arbitrary-scale SR for natural images. For example, LIIF [15] represents 2D images as latent code through an encoder and maps the input coordinates and corresponding latent variables to RGB values through the decoding function based on a multilayer perceptron (MLP), enabling image SR at arbitrary scales. As shown in Fig.1(b), although these methods can be directly applied to pathological images, they do not consider the texture characteristics of pathological images and can only achieve sub-optimal performance. As shown in Fig.1(d), pathological images contain a large amount of fine-grained cell morphology and repetition, unlike natural images. Better reconstructing the special image texture at arbitrary scales is essential in pathological image SR.

Motivated by the observation above, we propose an efficient dual-branch framework based on implicit self-texture enhancement (ISTE) for arbitrary-scale SR of pathological images to better deal with its special texture. Fig.1(c) briefly illustrates the overall framework of ISTE. Specifically, ISTE contains a feature aggregation branch and a texture learning branch. In the feature aggregation branch, we propose the local feature interactor (LFI) module to enhance the interaction of features in the local region; in the texture learning branch, we propose the texture learner (TL) to enhance the learning of high-frequency texture information. After that, we design a two-stage texture enhancement strategy for these two branches, where the first stage is feature-based texture enhancement, and the second stage is spatial domain-based texture enhancement. As shown in Fig.2, considering that pathological images contain many similar cell morphologies and periodic texture patterns, we assume that these similar regions can assist each other in reconstruction in the feature space, so we design the self-texture fusion (STF) module to accomplish feature-based texture enhancement. The main idea is to retrieve the texture information from the texture learning branch and transfer it to the feature aggregation branch for information fusion and enhancement. For spatial domain texture enhancement, we decode the features of the two branches into RGB values in the spatial domain using the local pixel decoder (LPD) and the local texture decoder (LTD), respectively, and perform information fusion in the spatial domain. These two decoders are based on implicit neural networks [15], thus enabling image SR at arbitrary scales. Extensive experiments on three public datasets have shown that ISTE performs better than existing fixed-scale and arbitrary-scale algorithms at multiple scales and helps to improve downstream task performance. 
To the best of our knowledge, this is the first work to achieve arbitrary-scale SR in pathological images. Overall, the contributions of this paper are as follows:

  1. We propose an efficient dual-branch framework based on implicit self-texture enhancement (ISTE) for arbitrary-scale SR of pathological images. ISTE recovers the texture details of the image through feature-based texture enhancement and spatial domain-based texture enhancement. To the best of our knowledge, it represents the first attempt to achieve arbitrary-scale SR in pathological images;

  2. The proposed ISTE achieves state-of-the-art performance at various scaling factors on three public datasets, and we demonstrate the effectiveness of the proposed texture enhancement strategy through a series of ablation experiments;

  3. The pathological images reconstructed by ISTE are shown to be usable in two downstream WSI analysis tasks, gland segmentation and malignancy classification, and the performance of these two tasks can be improved by utilizing the reconstructed HR images.

Figure 1: Motivation of our ISTE. (a) All existing studies for pathological images can only achieve fixed integer-scale SR and need to retrain the model to achieve different scaling factors; (b) Existing natural image SR algorithms based on implicit neural networks (exemplified by LIIF [15]) perform SR directly in the spatial domain, and all lack attention and enhancement of image texture information; (c) ISTE is an efficient dual-branch framework based on implicit self-texture enhancement for arbitrary-scale pathological image SR. ISTE further enhances its performance through feature-based and spatial-domain-based texture enhancement; (d) We use the Canny operator to extract texture from natural and pathological images, respectively. It can be seen that, in contrast to natural images, pathological images contain a large amount of fine-grained cell morphology and arrangement information and tend to have richer texture information.

2 Related Works

2.1 Deep learning-based super-resolution methods for natural images

Single-image super-resolution (SISR) refers to recovering an HR image from an LR image or an LR image sequence, which is a classical low-level computer vision task with a wide range of applications. Deep neural networks can achieve accurate mapping from LR images to HR images due to their powerful fitting ability. Thus, they have become the mainstream approach in current SR studies. Numerous methods based on convolutional neural networks (CNNs) have been proposed for natural image SR, including SRCNN [16], EDSR [9], RDN [17], and RCAN [18]. To further improve the performance of SR, some methods utilized residual modules [19, 20], densely connected modules [21, 22], and other blocks [23, 24] for the design of the CNNs. Subsequently, a series of attention-based SR methods have emerged, such as channel attention [25, 18], self-attention (IPT [26], SwinIR [27], HAT [28]), and non-local attention [29, 30]. However, these methods can only be trained and tested at a fixed integer scale, and the networks need to be retrained for new scaling factors.

In recent years, implicit neural representation (INR) has been proposed as a continuous data representation for various tasks in computer vision. INR uses a neural network (usually a coordinate-based MLP) to establish a mapping between coordinates and their signal values, which allows continuous and efficient modeling of 2D image signals. For example, Chen et al. [15] first used INR in the SR algorithm and proposed the local implicit image function (LIIF) for arbitrary-scale SR. Lee et al. [31] proposed the local texture estimator (LTE), which transforms coordinates into Fourier domain information to enhance the representation of the local implicit function. Although these methods can be directly applied to pathological images for continuous magnification, they fail to recover the special textures in pathological images effectively.
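To make the idea of a coordinate-based representation concrete, the following minimal sketch defines a toy, untrained MLP that maps normalized coordinates to RGB values and samples it on two grids of different sizes. It is not the LIIF or LTE implementation: the class `CoordinateMLP`, the helper `make_coords`, the hidden width, and the random weights are all illustrative assumptions.

```python
import numpy as np

def make_coords(h, w):
    """Pixel-centre coordinates of an h x w grid, normalised to [-1, 1]."""
    ys = -1.0 + (2.0 * np.arange(h) + 1.0) / h
    xs = -1.0 + (2.0 * np.arange(w) + 1.0) / w
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([yy, xx], axis=-1)  # shape (h, w, 2)

class CoordinateMLP:
    """Toy coordinate-based MLP: (y, x) -> RGB. Weights are random and untrained."""

    def __init__(self, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(size=(2, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(size=(hidden, 3)) / np.sqrt(hidden)
        self.b2 = np.zeros(3)

    def __call__(self, coords):
        hidden = np.maximum(coords @ self.w1 + self.b1, 0.0)  # ReLU layer
        return hidden @ self.w2 + self.b2                     # RGB per coordinate

inr = CoordinateMLP()
low = inr(make_coords(8, 8))     # the function sampled on an 8 x 8 grid
high = inr(make_coords(20, 20))  # the same function sampled at a 2.5x denser grid
```

Because the signal is a function of continuous coordinates, the output resolution is set only by the query grid, which is what allows non-integer scaling factors.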

2.2 Deep learning-based super-resolution methods for pathological images

In recent years, deep learning-based SR algorithms have been widely used in pathological images to improve imaging resolution [32, 10, 33, 11, 34, 8, 35, 36]. Upadhyay et al. [32] developed a generative adversarial network that considered pathological image SR and surgical smoke removal tasks at the same time. Mukherjee et al. [10] implemented SR image generation using a CNN and up-sampling layer and augmented the outputs using the K-nearest neighbor algorithm. Chen et al. [11] accomplished the SR task through a spatial wavelet dual-stream network incorporating a refine context fusion module. Li et al. [8] utilized a generative adversarial network based on a multi-scale CNN for SR image generation and introduced a curriculum learning training strategy. Wu et al. [35] added a branch for magnification classification to the SR network and improved SR performance through multi-task learning. These studies demonstrate the promise of using SR to improve pathological image resolution in low-resource settings. However, they still have some limitations. For instance, they restrict training and testing to specific scaling factors, and the resultant SR outputs still exhibit scope for refinement. We attribute this primarily to a lack of adequate consideration for the unique textural characteristics of pathological images. In this paper, we introduce ISTE as a solution to overcome these challenges, aiming to achieve arbitrary-scale SR of pathological images with high quality.

Figure 2: Workflow of our ISTE. The LR image $X_{LR}$ is input into the encoder to get the pre-extracted feature map $F_{LR}$ first. In the feature aggregation branch, we input the feature $F_{LR}$ into the local feature interactor and a convolutional layer to obtain $F_{LFIC}$. In the texture learning branch, we input the feature $F_{LR}$ into the texture learner to obtain the texture feature $F_{TL}$. Then the feature maps from the two branches are input to the self-texture fusion module to accomplish feature-based enhancement. Finally, the enhanced feature $F_{STF}$ output from the STF module and the texture feature $F_{TL}$ output from the texture learner are decoded into RGB values respectively, and added up to accomplish spatial domain-based texture enhancement.

3 Method

3.1 Problem formulation and framework overview

Given a set of $N$ pairs of corresponding LR images and HR images $\left\{X_{LR}^{i}, Y_{HR}^{i}\right\}_{i=1}^{N}$, the objective is to find the optimal parameters $\hat{\theta}$ of the SR model $F_{\theta}$:

$$\hat{\theta}=\arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} L\left(F_{\theta}\left(X_{LR}^{i}\right), Y_{HR}^{i}\right) \tag{1}$$

where $X_{LR}^{i}$ is an LR image and $Y_{HR}^{i}$ is its corresponding ground truth (GT), and $L$ is the L1 loss function to measure the difference between the ground-truth and the generated HR images. Fig.2 shows the overall framework of our proposed ISTE. We first utilize SwinIR [27] to perform feature pre-extraction on the input LR image $X_{LR}$ and then input the pre-extracted feature $F_{LR}$ into the upper feature aggregation branch and lower texture learning branch of ISTE, respectively. In the feature aggregation branch, we input the feature $F_{LR}$ into the local feature interactor (LFI) to enhance the interaction of features in the local region and obtain feature $F_{LFI}$. In the texture learning branch, we input the image feature $F_{LR}$ into the texture learner (TL) to enhance the learning of high-frequency information and extract the feature $F_{TL}$. Then we design a two-stage texture enhancement strategy for these two branches, where the first stage is feature-based texture enhancement, and the second stage is spatial domain-based texture enhancement.
In the first stage, we design the self-texture fusion (STF) module to leverage the interaction of similar regions of the pathological images in the feature space, thereby accomplishing feature-based texture enhancement to assist in reconstruction. In the second stage, we decode the $F_{STF}$ from the STF module to obtain the image $I_{LPD}$ through the local pixel decoder (LPD). Simultaneously, we decode the $F_{TL}$ from the TL module to obtain the image $I_{LTD}$ through the local texture decoder (LTD). Subsequently, we perform spatial summation of $I_{LTD}$ and $I_{LPD}$, obtaining the final reconstructed HR image $I_{Pred}$. The primary purpose of the second stage is to fully utilize the features $F_{TL}$ learned by the texture learner and decode them into the spatial domain for texture enhancement.
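As an illustration of Eq. (1) and of the second-stage summation, the sketch below evaluates the average L1 loss of a model over LR-HR pairs and forms a prediction as the pixel-wise sum of two decoded images. The function names and the nearest-neighbor stand-in for $F_{\theta}$ are hypothetical and serve only to make the example runnable; they are not part of ISTE.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error between a prediction and its ground truth."""
    return float(np.mean(np.abs(pred - target)))

def empirical_risk(model, pairs):
    """Average L1 loss over (LR, HR) pairs, the quantity minimised in Eq. (1)."""
    return sum(l1_loss(model(x_lr), y_hr) for x_lr, y_hr in pairs) / len(pairs)

def fuse_spatial(i_lpd, i_ltd):
    """Second-stage enhancement: pixel-wise sum of the two decoded images."""
    return i_lpd + i_ltd

def toy_model(x_lr, scale=2):
    """Hypothetical stand-in for F_theta: nearest-neighbour upsampling."""
    return np.repeat(np.repeat(x_lr, scale, axis=0), scale, axis=1)

rng = np.random.default_rng(0)
lr_images = [rng.random((4, 4, 3)) for _ in range(3)]
# Synthetic ground truth: the upsampled LR image shifted by a constant 0.1.
pairs = [(x, toy_model(x) + 0.1) for x in lr_images]
risk = empirical_risk(toy_model, pairs)  # close to 0.1 by construction
i_pred = fuse_spatial(toy_model(lr_images[0]), np.zeros((8, 8, 3)))
```

In training, the parameters $\theta$ would be updated by gradient descent on this risk; the sketch only evaluates it.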

3.2 Local feature interactor

We propose the LFI to enhance the interaction of features within local regions, thereby capturing the correlation of features within local regions. As shown in Fig.3, the size of the feature map $F_{LR}$ is $h \times w \times 64$, and we denote each vector of $F_{LR}$ as $F_{LR}^{j}\ (j=1,2,\ldots,h \times w)$. The LFI first assigns a window of size $3 \times 3$ to each vector of $F_{LR}$, and the eight neighboring vectors in the window around $F_{LR}^{j}$ form a set $F_{N}^{j}=\left\{F_{N_{i}}^{j} \mid i=3,4,\ldots,10\right\}$. The average pooling result of the vectors within a window is denoted as $F_{P}^{j}$.
The feature map $F_{LFI}$ output by the LFI is calculated through self-attention so that each point on the feature map incorporates local features while paying more attention to itself. We denote each vector of $F_{LFI}$ as $F_{LFI}^{j}\ (j=1,2,\ldots,h \times w)$, and it is calculated through Eq.(2).

$$F_{LFI}^{j}=\sum_{i=1}^{10} \frac{\exp\left(\left(Q_{LR}^{j}\right)^{T} K_{i}^{j}\right)}{\sqrt{d}\, \sum_{i=1}^{10} \exp\left(\left(Q_{LR}^{j}\right)^{T} K_{i}^{j}\right)} V_{i}^{j} \tag{2}$$

where $Q_{LR}^{j}$ is the query mapped linearly from $F_{LR}^{j}$, $K_{1}^{j}$ is the key mapped linearly from $F_{LR}^{j}$, $V_{1}^{j}$ is the value mapped linearly from $F_{LR}^{j}$, $K_{2}^{j}$ is the key mapped linearly from $F_{P}^{j}$, $V_{2}^{j}$ is the value mapped linearly from $F_{P}^{j}$, $\left\{K_{i}^{j} \mid i=3,4,\ldots,10\right\}$ is the key mapped linearly from $F_{N}^{j}$, $\left\{V_{i}^{j} \mid i=3,4,\ldots,10\right\}$ is the value mapped linearly from $F_{N}^{j}$, and $d$ is the dimension of these vectors. The parameters used by each window are shared in the self-attention calculation.
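The windowed attention of Eq. (2) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions the text does not fix: borders are handled by edge replication, the pooled vector averages all nine vectors of the window, a single projection matrix per role (`w_q`, `w_k`, `w_v`) is shared by all ten tokens, and the factor $\sqrt{d}$ is applied in the denominator exactly as Eq. (2) is written.

```python
import numpy as np

def local_feature_interactor(f_lr, w_q, w_k, w_v):
    """Sketch of Eq. (2): each position attends to itself, its 3x3 window
    average, and its eight neighbours (ten tokens in total)."""
    h, w, _ = f_lr.shape
    d = w_q.shape[1]
    # Assumption: replicate border vectors so every position has a full window.
    padded = np.pad(f_lr, ((1, 1), (1, 1), (0, 0)), mode="edge")
    shifts = [(dy, dx) for dy in range(3) for dx in range(3)]
    window = np.stack([padded[dy:dy + h, dx:dx + w] for dy, dx in shifts], axis=2)
    centre = window[:, :, 4]                   # F_LR^j (token i = 1)
    pooled = window.mean(axis=2)               # F_P^j  (token i = 2)
    neighbours = np.delete(window, 4, axis=2)  # F_N^j  (tokens i = 3..10)
    tokens = np.concatenate(
        [centre[:, :, None], pooled[:, :, None], neighbours], axis=2
    )                                          # shape (h, w, 10, c)

    q = centre @ w_q                           # (h, w, d)
    k = tokens @ w_k                           # (h, w, 10, d)
    v = tokens @ w_v                           # (h, w, 10, d)
    logits = np.einsum("hwd,hwnd->hwn", q, k)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability only
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    weights /= np.sqrt(d)                      # sqrt(d) placed as in Eq. (2)
    return np.einsum("hwn,hwnd->hwd", weights, v)

rng = np.random.default_rng(0)
f_lr = rng.normal(size=(6, 6, 64))
w_q, w_k, w_v = (0.1 * rng.normal(size=(64, 64)) for _ in range(3))
f_lfi = local_feature_interactor(f_lr, w_q, w_k, w_v)  # shape (6, 6, 64)
```

Sharing the three projection matrices across all windows mirrors the statement that the parameters used by each window are shared.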

Figure 3: Local feature interactor.

3.3 Texture learner

Inspired by LTE [31], we propose the TL for learning high-frequency texture information in pathological images. We employ sine activation to effectively enhance implicit neural representations for learning high-frequency details in images, thereby mitigating spectral bias issues stemming from the ReLU activation functions [12]. Specifically, we normalize the value of 2D pixel coordinate $\left(X^{\prime}, Y^{\prime}\right)=\left\{\left(x_{i}^{\prime}, y_{j}^{\prime}\right) \mid i=1,2,\ldots,mw,\ j=1,2,\ldots,mh\right\}$ in the continuous HR image domain and the value of 2D pixel coordinate $(X, Y)=\left\{\left(x_{i}, y_{j}\right) \mid i=1,2,\ldots,mw,\ j=1,2,\ldots,mh\right\}$ nearest to $\left(X^{\prime}, Y^{\prime}\right)$ in the continuous LR image domain between -1 and 1, and the Local Grid is defined as $\left(X^{\prime}-X, Y^{\prime}-Y\right)$. Since each pixel coordinate of the HR image has a corresponding coordinate in the LR image grid that is closest to it, the number of both the HR and LR image coordinates is equal to $mh \times mw$, where $m$ represents the scale factor. As shown in Fig.4(a), the TL module firstly outputs three feature maps $F_{Amp} \in \mathbb{R}^{h \times w \times 256}$, $F_{FreqX} \in \mathbb{R}^{h \times w \times 256}$ and $F_{FreqY} \in \mathbb{R}^{h \times w \times 256}$ through three $3 \times 3$ convolutional kernels respectively, and predicts the feature maps $Amp \in \mathbb{R}^{mh \times mw \times 256}$, $FreqX \in \mathbb{R}^{mh \times mw \times 256}$ and $FreqY \in \mathbb{R}^{mh \times mw \times 256}$ corresponding to each pixel coordinate of the HR image through nearest-neighbor interpolation. Then we use linear projection based on an MLP and Sigmoid activation function to map $(2/mw, 2/mh)$ to a 256-dimensional feature vector $Phase$ to simulate the effect of texture fragment offset when the image scaling factor changes.
The output of the TL module is calculated by Eq.(3):

$$F_{TL}=Amp \otimes \operatorname{Sin}\left(FreqX \odot\left(X^{\prime}-X\right)+FreqY \odot\left(Y^{\prime}-Y\right)+Phase\right) \tag{3}$$

where $\otimes$ represents element-wise multiplication and $\odot$ represents inner product operation.
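A minimal NumPy sketch of Eq. (3) is given below. It is an illustration under stated assumptions rather than the authors' code: the three $3 \times 3$ convolutions are omitted and their outputs are passed in directly, the phase projection is taken to be a single linear layer followed by a sigmoid (`w_phase` and `b_phase` are hypothetical parameters), the output size is rounded to the nearest integer for non-integer scales, and the products of the frequency maps with the scalar local-grid offsets are implemented by broadcasting.

```python
import numpy as np

def pixel_centres(n):
    """Centres of n pixels along one axis, normalised to [-1, 1]."""
    return -1.0 + (2.0 * np.arange(n) + 1.0) / n

def texture_learner(f_amp, f_freq_x, f_freq_y, scale, w_phase, b_phase):
    """Sketch of Eq. (3) for LR feature maps of shape (h, w, C)."""
    h, w, _ = f_amp.shape
    out_h, out_w = int(round(h * scale)), int(round(w * scale))
    y_hr, x_hr = pixel_centres(out_h), pixel_centres(out_w)
    # Index of the LR pixel whose centre is nearest to each HR coordinate.
    iy = np.clip(np.floor((y_hr + 1.0) / 2.0 * h).astype(int), 0, h - 1)
    ix = np.clip(np.floor((x_hr + 1.0) / 2.0 * w).astype(int), 0, w - 1)
    # Local grid: HR coordinate minus its nearest LR coordinate.
    dy = (y_hr - pixel_centres(h)[iy])[:, None, None]  # Y' - Y
    dx = (x_hr - pixel_centres(w)[ix])[None, :, None]  # X' - X
    # Nearest-neighbour sampling of the three maps on the HR grid.
    amp = f_amp[iy][:, ix]
    freq_x = f_freq_x[iy][:, ix]
    freq_y = f_freq_y[iy][:, ix]
    # Phase from the HR cell size (2/mw, 2/mh); assumed linear layer + sigmoid.
    cell = np.array([2.0 / out_w, 2.0 / out_h])
    phase = 1.0 / (1.0 + np.exp(-(cell @ w_phase + b_phase)))
    return amp * np.sin(freq_x * dx + freq_y * dy + phase)

rng = np.random.default_rng(0)
h, w, channels = 4, 4, 256
maps = [rng.normal(size=(h, w, channels)) for _ in range(3)]
w_phase = rng.normal(size=(2, channels))
b_phase = np.zeros(channels)
f_tl = texture_learner(*maps, scale=2.5, w_phase=w_phase, b_phase=b_phase)
```

Since the scale enters only through the query grid and the cell size, the same parameters can be evaluated at any scaling factor, including non-integer ones such as 2.5.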

Figure 4: (a) Texture learner; (b) Self-texture fusion module; (c) Coordinate diagram of $F_{STF}$ and $F_{TL}$ for the local pixel decoder and local texture decoder.

3.4 Self-texture fusion module for feature-based enhancement

Inspired by SRNTT [37] and T2Net [38], we propose the cross-attention-based STF module, whose main idea is to globally retrieve the texture features in $F_{TL}$ that are most similar to $F_{LFIC}$ and fuse the retrieved features into $F_{LFIC}$, thus completing the feature-based texture enhancement. As shown in Fig.4(b), we use the features sampled from $F_{LFIC}$ by nearest-neighbor interpolation as the query ($Q$) and use $F_{TL}$ as the key ($K$) and value ($V$) of the cross-attention module. To retrieve the texture features that are most relevant to the pixel feature $F_{LFIC}$, we first compute the similarity matrix $R$ of $Q$ and $K$, where each element $r_{i,j}$ of $R$ is computed according to Eq.(4), $q_i$ represents an element of $Q$, and $k_j$ represents an element of $K$. Then we obtain the coordinate index matrix $T$, which records for each $q_i$ the position in $K$ with the highest similarity.
An element in $T$ is $t_i=\arg\max_j(r_{i,j})$, and $t_i$ represents the position coordinates of the texture feature $k_j$ in $F_{TL}$ with the highest similarity to $q_i$. According to the coordinate index matrix $T$, we pick from $V$ the feature vector $a_i$ with the highest similarity to each element in $Q$ to obtain the retrieved texture feature $A$, which can be represented by $a_i=v_{t_i}$, where $a_i$ is an element in $A$ and $v_{t_i}$ represents the element at the $t_i$-th position in $V$.
To fuse the retrieved texture feature $A$ with the feature $F_{LFIC}$, we first concatenate $F_{LFIC}$ with $A$ and obtain the aggregated feature $Z$ through an MLP, that is, $Z=MLP(Concat(F_{LFIC},A))$. Finally, we calculate the soft attention map $S$, where an element $s_i$ in $S$ represents the confidence of the corresponding element $a_i$ in the retrieved texture feature $A$, and $s_i=\max_j(r_{i,j})$. $F_{STF}$ is calculated as below:

$$r_{i,j}=\left\langle\frac{q_i}{\|q_i\|},\frac{k_j}{\|k_j\|}\right\rangle \tag{4}$$
$$F_{STF}=F_{LFIC}\oplus Z\otimes S \tag{5}$$

where $\langle\cdot\rangle$ represents the inner product operation, $\|\cdot\|$ represents the vector norm, $\oplus$ represents element-wise summation, and $\otimes$ represents element-wise multiplication.
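The retrieval and fusion steps of Eqs.(4) and (5) can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: features are plain lists of floats, the norm is taken to be Euclidean, and `mlp` is a hypothetical stand-in for the learned MLP. The function name `retrieve_and_fuse` is likewise an assumption.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def retrieve_and_fuse(Q, K, V, F_lfic, mlp):
    """Hard retrieval (Eq. 4, argmax) followed by soft-attention fusion (Eq. 5).

    Q, F_lfic: lists of query / pixel feature vectors, one per query position.
    K, V: lists of key / value vectors taken from the texture feature F_TL.
    mlp: hypothetical stand-in for the learned MLP; it maps a concatenated
         vector to a vector as long as one row of F_lfic.
    """
    Qn = [normalize(q) for q in Q]
    Kn = [normalize(k) for k in K]
    T, S, F_stf = [], [], []
    for i, q in enumerate(Qn):
        # r_{i,j}: inner product of the normalized query with every normalized key.
        r = [sum(a * b for a, b in zip(q, k)) for k in Kn]
        t_i = max(range(len(r)), key=r.__getitem__)  # index of the best match
        s_i = r[t_i]                                 # confidence of that match
        a_i = V[t_i]                                 # retrieved texture feature
        z_i = mlp(F_lfic[i] + a_i)                   # list + list concatenates
        # F_STF = F_LFIC (+) Z (x) S, applied element-wise.
        F_stf.append([f + z * s_i for f, z in zip(F_lfic[i], z_i)])
        T.append(t_i)
        S.append(s_i)
    return F_stf, T, S
```

Because $s_i$ scales the fused feature $Z$, a query whose best match is only weakly similar receives a correspondingly weak texture contribution, which is the role the soft attention map plays in Eq.(5).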

3.5 Spatial domain-based enhancement

In the spatial domain-based texture enhancement, we decode the texture feature $F_{TL}$ directly into the spatial-domain texture information $I_{LTD}$ and add it to $I_{LPD}$, which the LPD reconstructs from $F_{STF}$, to obtain the final output $I_{Pred}$. First, we utilize the LPD to decode the feature $F_{STF}$ into the RGB value $I_{LPD}$. We parameterize the LPD as an MLP $f_\theta$. As shown in Fig.4(c), $u_t$ denotes the coordinates of $F_{LR}$ and $x_q$ denotes the coordinates of $F_{STF}$ and $F_{TL}$. We use $u_t$ ($t\in\{00,01,10,11\}$) to denote the upper-left, upper-right, lower-left, and lower-right neighboring coordinates of an arbitrary point $x_q$, respectively.
The RGB value at the coordinate $x_q$ in the HR image decoded by the LPD can be represented by Eq.(6), where $c$ contains two elements, $2/mh$ and $2/mw$, representing the size of each pixel in $I_{LPD}$. Similarly, we calculate the RGB values of the texture information $I_{LTD}$ at coordinate $x_q$ via Eq.(7), where the LTD is parameterized as an MLP $g_\varphi$ with network parameters $\varphi$. We then add $I_{LTD}$ to $I_{LPD}$ via Eq.(8) for spatial domain texture enhancement to obtain the prediction result $I_{Pred}$.
$S_t$ ($t\in\{00,01,10,11\}$) is the area of the rectangular region between $x_q$ and $u_t$, and the weights are normalized by $S=\sum_{t\in\{00,01,10,11\}}S_t$.

$$I_{LPD}=\sum_{t\in\{00,01,10,11\}}\frac{S_t}{S}\cdot f_\theta\left(F_{STF},x_q-u_t,c\right) \tag{6}$$
$$I_{LTD}=\sum_{t\in\{00,01,10,11\}}\frac{S_t}{S}\cdot g_\varphi\left(F_{TL}\right) \tag{7}$$
$$I_{Pred}=I_{LPD}+I_{LTD} \tag{8}$$
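The area-weighted sums of Eqs.(6) to (8) can be illustrated with a small sketch. It rests on several assumptions that are not in the text: the decoders are passed in as plain callables (`lpd`, `ltd`), coordinates are `(y, x)` tuples, and a small `EPS` guards against a zero total area. By default the code follows the text literally and weights corner $u_t$ by the area between $x_q$ and $u_t$. The optional `swap_diagonal` flag instead uses the area opposite each corner, as LIIF-style local ensembles do, so that nearer corners receive larger weights. Which of the two the authors use is not stated here.

```python
EPS = 1e-9  # assumption: avoids division by zero when x_q lies on a corner
OPPOSITE = {"00": "11", "01": "10", "10": "01", "11": "00"}

def area_weights(xq, corners, swap_diagonal=False):
    """Return the normalized weights S_t / S for the four corners around xq."""
    areas = {t: abs(xq[0] - u[0]) * abs(xq[1] - u[1]) + EPS
             for t, u in corners.items()}
    if swap_diagonal:
        areas = {t: areas[OPPOSITE[t]] for t in areas}
    total = sum(areas.values())
    return {t: a / total for t, a in areas.items()}

def ensemble(xq, corners, decode, swap_diagonal=False):
    """Weighted sum over corners of decode(t, x_q - u_t), as in Eqs. (6)-(7)."""
    w = area_weights(xq, corners, swap_diagonal)
    out = [0.0, 0.0, 0.0]
    for t, u in corners.items():
        offset = (xq[0] - u[0], xq[1] - u[1])  # relative coordinate x_q - u_t
        rgb = decode(t, offset)
        out = [o + w[t] * c for o, c in zip(out, rgb)]
    return out

def predict(xq, corners, lpd, ltd, swap_diagonal=False):
    """I_Pred = I_LPD + I_LTD (Eq. 8) at a single query coordinate."""
    i_lpd = ensemble(xq, corners, lpd, swap_diagonal)
    i_ltd = ensemble(xq, corners, ltd, swap_diagonal)
    return [a + b for a, b in zip(i_lpd, i_ltd)]
```

Since the weights sum to one, a decoder that returns the same value at all four corners yields exactly that value, so the ensemble only changes the output where the four corner predictions disagree.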

4 Experiments

We introduce the datasets, the implementation details, and the comparison to state-of-the-art SR methods in sections 4.1, 4.2, and 4.3, respectively. Then, we conduct a series of ablation studies in section 4.4. Finally, we perform two downstream task experiments, gland segmentation, and malignancy classification, to show that the HR images reconstructed by the proposed ISTE can help improve performance on downstream tasks in section 4.5.

4.1 Datasets

4.1.1 Tissue Microarray (TMA) dataset

Following Li et al. [8], we experimented on the TMA dataset to validate our method. The TMA dataset, a widely used public dataset in pancreatic cancer research [39, 40], was scanned by an Aperio AT digital pathology scanner (Leica Biosystems, Wetzlar, Germany) at a resolution of 0.504 μm/pixel and contains 573 WSIs (average 3850×3850 pixels each). We randomly selected 460 WSIs as the training set, 57 WSIs as the validation set, and 56 WSIs as the test set.

4.1.2 Histopathology Super-Resolution (HistoSR) dataset

Following Chen et al. [11], we conducted experiments on the Histopathology Super-Resolution (HistoSR) dataset, which is built on the high-quality H&E stained WSIs of the Camelyon16 dataset. The HistoSR dataset contains HR patches of size 192×192 obtained through random cropping. The training set comprises 30,000 HR patches, while the test set consists of 5,000 HR patches.

4.1.3 TCGA Lung Cancer dataset

The TCGA lung cancer dataset comprises 1054 WSIs (average 100000×100000 pixels each) [41] from The Cancer Genome Atlas (TCGA) data center. We selected five slides from this dataset and cut them into 400 sub-images with a size of 3072×3072. We randomly selected 320 sub-images as the training set, 40 as the validation set, and 40 as the test set.

4.2 Implementation details and evaluation metrics

Following previous SR methods based on implicit neural representation [15, 31], we used patches of size $48\times 48$ as the input for training. We first randomly sampled the scaling factor $m$ from the uniform distribution $U(1,4)$ and cropped patches of size $48m\times 48m$ from the raw HR images in a batch. Following [8, 36], we resized the patches to $48\times 48$ via bicubic downsampling and applied Gaussian blur to simulate degradation, since it is difficult to acquire authentically downsampled images at arbitrary scales through scanners. The size of the Gaussian kernel was set to 1/2 of the scaling factor $m$. We sampled $48^2$ pixels from the corresponding cropped patches to form RGB-coordinate pairs. We implemented ISTE with the deep learning toolbox PyTorch and used Adam as the optimizer, setting the initial learning rate to 0.0001 and the number of epochs to 1000. We employed the structural similarity index measure (SSIM) and peak signal-to-noise ratio (PSNR) to evaluate the quality of the reconstructed HR images.
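The geometry of this sampling procedure can be sketched as follows. The sketch covers only the bookkeeping (scale, crop size, kernel size, pixel coordinates, and pixel subsampling), not the bicubic resizing or the blur itself. All function names are hypothetical. Rounding $48m$ to an integer and normalizing coordinates to $[-1,1]$ are assumptions, the latter chosen because it is consistent with the cell size $2/mh$ used in Section 3.5.

```python
import random

def sample_training_geometry(lr_size=48, low=1.0, high=4.0, rng=random):
    """Sample m ~ U(low, high) and derive the HR crop size and blur kernel size."""
    m = rng.uniform(low, high)
    hr_size = round(lr_size * m)  # side length 48m of the HR crop (rounding assumed)
    kernel = m / 2                # Gaussian kernel size: 1/2 of the scaling factor
    return m, hr_size, kernel

def pixel_centers(n):
    """Coordinates of n pixel centers along one axis, normalized to [-1, 1]."""
    return [-1 + (2 * i + 1) / n for i in range(n)]

def cell_size(m, h, w):
    """Cell c = (2/(m*h), 2/(m*w)): pixel height and width in normalized units."""
    return 2 / (m * h), 2 / (m * w)

def sample_pixel_indices(hr_size, n=48 * 48, rng=random):
    """Pick n distinct flat pixel indices from an hr_size x hr_size crop."""
    return rng.sample(range(hr_size * hr_size), n)
```

Sampling a fixed number of $48^2$ pixels per crop keeps the number of RGB-coordinate pairs constant across a batch even though the crop size varies with $m$.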

Figure 5: Visual comparison with error maps of different methods on the TMA, HistoSR, and TCGA datasets. The error map represents the absolute error value between the output result and the ground truth. The brighter the color, the greater the error.
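For reference, the absolute error map and the PSNR metric can be computed as in the following minimal sketch. It assumes single-channel images stored as nested lists and an intensity range of 0 to 255. Neither the data range nor the channel handling is specified in the text, so both are assumptions, as are the function names.

```python
import math

def abs_error_map(pred, target):
    """Per-pixel absolute error between two equally sized 2-D images."""
    return [[abs(p - t) for p, t in zip(pr, tr)] for pr, tr in zip(pred, target)]

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB; returns inf for identical images."""
    errs = [e for row in abs_error_map(pred, target) for e in row]
    mse = sum(e * e for e in errs) / len(errs)
    return math.inf if mse == 0 else 10 * math.log10(max_val ** 2 / mse)
```

Brighter entries in the error map correspond to larger absolute errors, and a lower mean squared error translates directly into a higher PSNR.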

4.3 Comparison with previous methods

We compared the performance of ISTE with state-of-the-art SR methods from both the pathological image domain (SWD-Net [11] and Li et al. [8]) and the natural image domain (Bicubic, EDSR [9], SwinIR [27], LIIF [15], and LTE [31]), where the latter two are based on implicit neural representation. For a fair comparison, the backbone used for LIIF [15] and LTE [31] is also SwinIR [27] without upsampling layers.

Table 1: Quantitative results of the proposed ISTE compared to state-of-the-art methods on the TMA, TCGA, and HistoSR datasets. Scales ×2, ×3, and ×4 are in-distribution; ×6 and ×8 are out-of-distribution.

| Dataset | Methods | ×2 PSNR↑ | ×2 SSIM↑ | ×2 P values | ×3 PSNR↑ | ×3 SSIM↑ | ×3 P values | ×4 PSNR↑ | ×4 SSIM↑ | ×4 P values | ×6 PSNR↑ | ×6 SSIM↑ | ×6 P values | ×8 PSNR↑ | ×8 SSIM↑ | ×8 P values |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TMA | Bicubic | 28.54±2.890 | 0.8931±0.0474 | <0.001/<0.001 | 25.25±2.932 | 0.7708±0.1004 | <0.001/<0.001 | 23.43±2.915 | 0.6735±0.1407 | <0.001/<0.001 | 21.50±2.868 | 0.5647±0.1839 | <0.001/<0.001 | 20.44±2.849 | 0.5123±0.2042 | <0.001/<0.001 |
| TMA | EDSR[9] | 30.54±2.792 | 0.9370±0.0272 | <0.001/<0.001 | 26.38±2.880 | 0.8228±0.0782 | <0.001/<0.001 | 24.94±2.884 | 0.7652±0.1014 | <0.001/<0.001 | - | - | - | - | - | - |
| TMA | SwinIR[27] | 31.20±2.747 | 0.9438±0.0247 | <0.001/<0.001 | 28.18±2.939 | 0.8773±0.0563 | <0.001/<0.001 | 26.26±2.954 | 0.8092±0.0868 | <0.001/<0.001 | - | - | - | - | - | - |
| TMA | Li et al.[8] | 29.50±2.754 | 0.9211±0.0334 | <0.001/<0.001 | 26.09±2.801 | 0.8207±0.0779 | <0.001/<0.001 | 24.06±2.770 | 0.7206±0.1211 | <0.001/<0.001 | - | - | - | - | - | - |
| TMA | SWD-Net[11] | 31.18±2.832 | 0.9430±0.0251 | <0.001/<0.001 | 28.06±2.946 | 0.8746±0.0574 | <0.001/<0.001 | 26.09±2.934 | 0.8024±0.0894 | <0.001/<0.001 | - | - | - | - | - | - |
| TMA | LIIF[15] | 30.76±2.562 | 0.9422±0.0253 | <0.001/<0.001 | 27.84±2.794 | 0.8745±0.0572 | <0.001/<0.001 | 25.87±2.858 | 0.7990±0.0908 | <0.001/<0.001 | 23.50±2.886 | 0.6751±0.1425 | <0.001/<0.001 | 22.05±2.874 | 0.5954±0.1741 | <0.001/<0.001 |
| TMA | LTE[31] | 31.26±2.834 | 0.9434±0.0250 | <0.001/<0.001 | 28.19±2.949 | 0.8784±0.0558 | <0.001/<0.001 | 26.22±2.975 | 0.8077±0.0875 | <0.001/<0.001 | 23.73±2.958 | 0.6806±0.1409 | <0.001/<0.001 | 22.17±2.926 | 0.5974±0.1738 | <0.001/<0.001 |
| TMA | ISTE(ours) | 31.27±2.828 | 0.9444±0.0243 | - | 28.23±2.954 | 0.8809±0.0547 | - | 26.46±2.979 | 0.8160±0.0842 | - | 23.86±2.963 | 0.6851±0.1393 | - | 22.19±2.931 | 0.5965±0.1742 | - |
| HistoSR | Bicubic | 27.43±3.322 | 0.8585±0.0496 | <0.001/<0.001 | 23.88±3.394 | 0.6999±0.0936 | <0.001/<0.001 | 22.01±3.498 | 0.5770±0.1243 | <0.001/<0.001 | 19.95±3.654 | 0.4259±0.1678 | <0.001/<0.001 | 18.89±3.683 | 0.3529±0.1898 | <0.001/<0.001 |
| HistoSR | EDSR[9] | 31.53±3.185 | 0.9407±0.0243 | <0.001/=0.001 | 27.81±3.261 | 0.8588±0.0559 | <0.001/<0.001 | 25.76±3.218 | 0.7820±0.0853 | <0.001/<0.001 | - | - | - | - | - | - |
| HistoSR | SwinIR[27] | 31.51±3.213 | 0.9397±0.0243 | <0.001/<0.001 | 27.89±3.167 | 0.8624±0.0551 | <0.001/<0.001 | 25.90±3.213 | 0.7870±0.0822 | <0.001/<0.001 | - | - | - | - | - | - |
| HistoSR | Li et al.[8] | 28.98±3.133 | 0.9024±0.0360 | <0.001/<0.001 | 25.34±3.117 | 0.7843±0.0750 | <0.001/<0.001 | 23.50±3.164 | 0.6893±0.0992 | <0.001/<0.001 | - | - | - | - | - | - |
| HistoSR | SWD-Net[11] | 31.49±3.216 | 0.9393±0.0243 | <0.001/<0.001 | 27.87±3.253 | 0.8595±0.0559 | <0.001/<0.001 | 25.78±3.268 | 0.7810±0.0841 | <0.001/<0.001 | - | - | - | - | - | - |
| HistoSR | LIIF[15] | 31.56±3.212 | 0.9399±0.0243 | <0.001/<0.001 | 28.03±3.270 | 0.8639±0.0549 | <0.001/<0.001 | 25.93±3.310 | 0.7862±0.0820 | <0.001/<0.001 | 22.94±3.498 | 0.6279±0.1195 | <0.001/<0.001 | 20.87±3.821 | 0.4889±0.1598 | <0.001/<0.001 |
| HistoSR | LTE[31] | 31.58±3.244 | 0.9403±0.0242 | <0.001/<0.001 | 28.03±3.286 | 0.8647±0.0545 | <0.001/<0.001 | 25.93±3.317 | 0.7872±0.0816 | <0.001/<0.001 | 22.95±3.500 | 0.6298±0.1192 | <0.001/<0.001 | 20.89±3.815 | 0.4909±0.1588 | <0.001/<0.001 |
| HistoSR | ISTE(ours) | 31.65±3.252 | 0.9410±0.0239 | - | 28.14±3.299 | 0.8673±0.0540 | - | 26.05±3.327 | 0.7909±0.0813 | - | 23.01±3.508 | 0.6331±0.1186 | - | 20.94±3.828 | 0.4948±0.1586 | - |
| TCGA | Bicubic | 32.98±0.962 | 0.9353±0.0127 | <0.001/<0.001 | 28.12±0.858 | 0.8070±0.0271 | <0.001/<0.001 | 25.63±0.844 | 0.6874±0.0345 | <0.001/<0.001 | 23.05±0.873 | 0.5354±0.0401 | <0.001/<0.001 | 21.64±0.913 | 0.4606±0.0438 | <0.001/<0.001 |
| TCGA | EDSR[9] | 36.14±0.962 | 0.9709±0.0063 | <0.001/<0.001 | 31.16±0.914 | 0.9010±0.0183 | <0.001/<0.001 | 28.01±0.840 | 0.8074±0.0278 | <0.001/<0.001 | - | - | - | - | - | - |
| TCGA | SwinIR[27] | 36.73±0.971 | 0.9731±0.0058 | <0.001/<0.001 | 31.77±0.895 | 0.9094±0.0167 | <0.001/<0.001 | 28.83±0.813 | 0.8258±0.0251 | <0.001/<0.001 | - | - | - | - | - | - |
| TCGA | Li et al.[8] | 34.61±0.842 | 0.9580±0.0073 | <0.001/<0.001 | 29.89±0.816 | 0.8725±0.0188 | <0.001/<0.001 | 26.57±0.769 | 0.7358±0.0280 | <0.001/<0.001 | - | - | - | - | - | - |
| TCGA | SWD-Net[11] | 36.76±0.965 | 0.9734±0.0058 | <0.001/<0.001 | 31.73±0.914 | 0.9074±0.0172 | <0.001/<0.001 | 28.85±0.864 | 0.8219±0.0260 | <0.001/<0.001 | - | - | - | - | - | - |
| TCGA | LIIF[15] | 36.92±0.957 | 0.9742±0.0055 | <0.001/<0.001 | 31.99±0.911 | 0.9110±0.0163 | <0.001/<0.001 | 29.08±0.866 | 0.8275±0.0251 | <0.001/<0.001 | 25.55±0.829 | 0.6641±0.0349 | <0.001/<0.001 | 23.72±0.859 | 0.5609±0.0398 | <0.001/<0.001 |
| TCGA | LTE[31] | 36.99±0.975 | 0.9748±0.0056 | <0.001/<0.001 | 31.98±0.908 | 0.9109±0.0164 | <0.001/<0.001 | 29.11±0.866 | 0.8280±0.0250 | <0.001/<0.001 | 25.52±0.823 | 0.6617±0.0349 | <0.001/<0.001 | 23.67±0.853 | 0.5580±0.0398 | <0.001/<0.001 |
| TCGA | ISTE(ours) | 37.76±1.034 | 0.9796±0.0050 | - | 32.06±0.914 | 0.9124±0.0163 | - | 29.19±0.867 | 0.8307±0.0247 | - | 25.61±0.821 | 0.6674±0.0342 | - | 23.76±0.856 | 0.5637±0.0395 | - |

Table 2: Quantitative results of the proposed ISTE compared to other arbitrary-scale SR methods on the TCGA dataset at non-integer scales.

| Methods | ×1.5 PSNR↑ | ×1.5 SSIM↑ | ×1.5 P values | ×2.4 PSNR↑ | ×2.4 SSIM↑ | ×2.4 P values | ×3.3 PSNR↑ | ×3.3 SSIM↑ | ×3.3 P values | ×4.2 PSNR↑ | ×4.2 SSIM↑ | ×4.2 P values | ×5.1 PSNR↑ | ×5.1 SSIM↑ | ×5.1 P values |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LIIF[15] | 42.95±0.938 | 0.9962±0.0010 | <0.001/<0.001 | 34.60±0.940 | 0.9532±0.0096 | <0.001/<0.001 | 30.08±0.858 | 0.8777±0.0197 | <0.001/<0.001 | 27.92±0.832 | 0.8018±0.0266 | <0.001/<0.001 | 26.64±0.821 | 0.7285±0.0315 | <0.001/<0.001 |
| LTE[31] | 43.34±0.951 | 0.9968±0.0009 | <0.001/<0.001 | 34.61±0.943 | 0.9532±0.0096 | <0.001/<0.001 | 30.08±0.858 | 0.8775±0.0197 | <0.001/<0.001 | 27.93±0.832 | 0.8017±0.0266 | <0.001/<0.001 | 26.62±0.814 | 0.7267±0.0316 | <0.001/<0.001 |
| ISTE(ours) | 44.46±0.895 | 0.9982±0.0006 | - | 34.91±0.985 | 0.9568±0.0094 | - | 30.14±0.859 | 0.8791±0.0196 | - | 28.02±0.834 | 0.8053±0.0263 | - | 26.71±0.815 | 0.7312±0.0309 | - |

4.3.1 Quantitative results

We compared our ISTE with competitors at five scaling factors of $\times 2$, $\times 3$, $\times 4$, $\times 6$, and $\times 8$. As shown in Table 1, our ISTE achieved the highest performance in terms of PSNR and SSIM metrics at each scaling factor on the HistoSR and TCGA datasets. Although our method's SSIM at $\times 8$ is slightly lower than that of LTE by 0.0009 on the TMA dataset, it outperforms the comparison methods in PSNR at all scaling factors and in SSIM at the other scaling factors. We evaluated the significance of the difference between our ISTE and other methods using paired Student's t-tests. $P<0.001$ was considered statistically significant. P-values smaller than 0.001 are reported as <0.001, while the specific value is given otherwise. As can be seen from the p-values in Table 1, there is a statistically significant difference with p-values smaller than 0.001 in almost all cases. To further assess the advantages of our method over other arbitrary-scale SR methods, we present comparative results in Table 2 for ISTE, LTE [31], and LIIF [15] at non-integer scaling factors. Our method demonstrates superior performance in terms of both PSNR and SSIM metrics. We also provide the Fréchet Inception Distance (FID) score to evaluate the perceptual quality of images generated by different methods in Table 3. ISTE attains the lowest FID in every setting except $\times 8$ on the TCGA dataset, indicating that the textures of images generated by our method are generally more realistic than those of other arbitrary-scale SR methods. Please refer to supplementary materials for more comparisons.
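The test statistic behind the reported P values can be sketched as follows. This is a minimal illustration of a paired Student's t-test on per-image scores, with hypothetical function and variable names. It computes only the statistic and the degrees of freedom. Converting them to a p-value requires the t-distribution, which a statistics package such as SciPy (`scipy.stats.ttest_rel`) provides.

```python
import math

def paired_t_statistic(a, b):
    """t statistic and degrees of freedom of a paired Student's t-test.

    a, b: per-image metric values (e.g. PSNR) of two methods on the same images.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance of differences
    if var == 0:
        raise ValueError("all paired differences are identical")
    return mean / math.sqrt(var / n), n - 1
```

Pairing matters here because both methods are evaluated on the same test images: the test works on per-image differences, so a small but consistent gain can be significant even when the per-dataset standard deviations in Table 1 are much larger than the gap between the means.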

Table 3: FID scores (↓) between the reconstructed images and the raw HR images.

| Dataset | Methods | ×2 | ×3 | ×4 | ×6 | ×8 |
|---|---|---|---|---|---|---|
| TCGA | LIIF[15] | 1.23 | 2.92 | 17.58 | 88.64 | 120.62 |
| TCGA | LTE[31] | 1.22 | 2.96 | 17.22 | 90.37 | 124.30 |
| TCGA | ISTE(ours) | 1.07 | 2.86 | 16.45 | 88.62 | 122.12 |
| TMA | LIIF[15] | 3.63 | 6.11 | 17.14 | 53.55 | 82.50 |
| TMA | LTE[31] | 3.15 | 5.39 | 15.40 | 53.66 | 82.74 |
| TMA | ISTE(ours) | 2.77 | 4.74 | 13.53 | 49.27 | 75.32 |
| HistoSR | LIIF[15] | 9.24 | 39.00 | 76.69 | 130.53 | 156.85 |
| HistoSR | LTE[31] | 9.54 | 39.05 | 77.06 | 130.56 | 154.28 |
| HistoSR | ISTE(ours) | 8.92 | 37.82 | 75.45 | 128.81 | 153.27 |

4.3.2 Qualitative results

Fig.5 shows the visual results and absolute error maps of different methods on the TCGA dataset at the scale of ×4, the TMA dataset at the scale of ×2, and the HistoSR dataset at the scale of ×2. Our proposed method performs better in restoring texture information, closely approximating the ground truth. The absolute error maps of our method contain more dark regions, indicating smaller errors in the reconstructed results compared to other methods. Fig.6 compares LIIF and our ISTE on an SR example at non-integer scales. It can be seen that ISTE achieves arbitrary-scale SR with clear cell structure and texture. As shown in the red box, at the scale of ×7.3, two cells are connected due to blurring in the image generated by LIIF, while they are still separated in the image generated by ISTE.

Figure 6: Comparison of LIIF (upper row) and our ISTE (lower row) at non-integer scales.

4.4 Ablation study

To validate the effectiveness of each module in our proposed method, including the LFI, TL, STF, and LTD, we designed several variant networks for ablation experiments at scaling factors of ×2, ×3, and ×4 on the TCGA dataset.

Table 4: Ablation Study on the TCGA Dataset.
Model ×2 ×3 ×4
Dual-Branch Single-Branch TL LFI STF LTD LPD PSNR↑ SSIM↑ P values PSNR↑ SSIM↑ P values PSNR↑ SSIM↑ P values
× × × × 37.45±1.041 0.9778±0.0053 <0.001/<0.001 32.02±0.910 0.9115±0.0163 <0.001/<0.001 29.14±0.866 0.8290±0.0248 <0.001/<0.001
× × × × 37.44±1.032 0.9778±0.0053 <0.001/<0.001 32.01±0.910 0.9115±0.0163 <0.001/<0.001 29.14±0.866 0.8290±0.0248 <0.001/<0.001
× × 37.63±1.041 0.9789±0.0052 <0.001/<0.001 32.04±0.912 0.9120±0.0163 <0.001/<0.001 29.17±0.867 0.8302±0.0248 <0.001/<0.001
× × 37.66±1.037 0.9791±0.0051 <0.001/<0.001 32.04±0.913 0.9121±0.0163 <0.001/<0.001 29.17±0.867 0.8301±0.0248 <0.001/<0.001
× × 37.64±1.039 0.9790±0.0051 <0.001/<0.001 32.04±0.913 0.9121±0.0163 <0.001/<0.001 29.17±0.867 0.8301±0.0248 <0.001/<0.001
× × 37.61±1.037 0.9788±0.0052 <0.001/<0.001 32.04±0.911 0.9121±0.0163 <0.001/<0.001 29.18±0.867 0.8303±0.0248 <0.001/<0.001
× 37.76±1.034 0.9796±0.0050 <0.001/<0.001 32.06±0.914 0.9124±0.0163 <0.001/<0.001 29.19±0.867 0.8307±0.0247 <0.001/<0.001

4.4.1 Evaluation of the local feature interactor

For the features $F_{LR}$ obtained from the encoder, the LFI enhances the interaction of features within local regions. To investigate the effectiveness of this module, we conducted an ablation experiment by directly removing the LFI from the ISTE framework. As shown in Table 4, all metrics are improved at all scaling factors using the LFI.

4.4.2 Evaluation of the texture learner

The TL is employed to enhance the learning of high-frequency textures in pathological images. To investigate the effectiveness of this module, we conducted an ablation experiment by replacing the module with a convolutional layer. As shown in Table 4, all metrics become worse at all scaling factors after ablating the TL. To better illustrate the role of the TL, we visualized the features input to and output from the TL, denoted as $F_{LR}$ and $F_{TL}$, respectively, in Fig.7. Compared to $F_{LR}$, the output feature map $F_{TL}$ from the TL contains richer texture information.

Figure 7: Feature map visualization for the texture learner. $F_{LR}$ represents the feature map input to the texture learner and $F_{TL}$ represents the feature map output from the texture learner.

4.4.3 Evaluation of the self-texture fusion module

The STF module globally retrieves the texture features in $F_{TL}$ that are most similar to $F_{LFIC}$ and fuses the retrieved features into $F_{LFIC}$. We designed a variant network without this module to evaluate its effectiveness. Specifically, we first take the feature $F_{LFIC}$ obtained from the feature aggregation branch of the framework and decode it directly through the LPD to obtain $I_{LPD}^{\prime}$. Then, we take the feature $F_{TL}$ obtained from the texture learning branch and decode it through the LTD to obtain $I_{LTD}^{\prime}$. We sum $I_{LPD}^{\prime}$ and $I_{LTD}^{\prime}$ to get the output of the variant network $I_{Pred}^{\prime}$. As shown in Table 4, all metrics become worse at all scaling factors after ablating the STF module.
To illustrate the effectiveness of the STF module more intuitively, we visualized the path of the STF module to retrieve texture features on the TMA dataset in Fig.8. For the LR patch during one training iteration, the starting point of the blue arrow is the position of the texture feature $F_{TL}$ retrieved by the STF module. The arrow points to the position where the feature $F_{LFIC}$ needs to be enhanced and fused with the retrieved texture feature $F_{TL}$. We visualize only a proportion of the sampling pixels for a better demonstration in Fig.8. It can be seen that the STF module can effectively use similar tissue texture segments and cellular structure features in pathological images to assist reconstruction.

Figure 8: Visualization of texture similarity retrieval for the STF module, where the blue arrow starting position indicates the position of the texture feature $F_{TL}$ retrieved by the STF module. The arrow points to the position where the pixel feature $F_{LFIC}$ needs to be enhanced and fused with the retrieved texture feature $F_{TL}$.

4.4.4 Evaluation of the texture decoder for spatial domain-based enhancement

The feature $F_{STF}$ is decoded into the pixel information $I_{LPD}$ in the spatial domain by the LPD. To accomplish spatial domain-based texture enhancement in the subsequent stage, the LTD is employed to decode the texture features acquired by the TL directly into the spatial domain texture information $I_{LTD}$, and we sum $I_{LTD}$ with $I_{LPD}$ to obtain $I_{Pred}$. To demonstrate the effectiveness of the designed spatial domain-based enhancement strategy, we removed the LTD from the ISTE framework and utilized only the pixels decoded by the LPD as the final prediction results. The results in Table 4 suggest that incorporating spatial domain-based texture enhancement leads to improved results. To better illustrate the effectiveness of the spatial domain-based enhancement, we visualized the pixel information decoded by the LPD and the texture information decoded by the LTD in the framework of ISTE in Fig.9. It can be seen that the texture information $I_{LTD}$ decoded by the LTD reveals clear outlines and texture features of the tissue cells and has more vibrant colors. This further illustrates the importance of the LTD for spatial domain-based enhancement.

Figure 9: (a) Input LR image; (b) Pixel information decoded by the LPD; (c) Texture information decoded by the LTD; (d) Output of the spatial domain-based enhancement; (e) Ground truth.

4.4.5 Evaluation of the dual-branch architecture

We designed two single-branch variant networks to evaluate the effectiveness of the proposed dual-branch architecture: (1) retaining only the TL and LTD in the ISTE framework and (2) retaining only the LFI and LPD in the ISTE framework. As shown in Table 4, the performance of the single-branch architecture is degraded compared to the dual-branch architecture.

Table 5: Quantitative evaluation results of U-Net for gland segmentation on the GlaS dataset under different experimental settings.

| Experiment | F1 (Test A) | F1 (Test B) | ObjDice (Test A) | ObjDice (Test B) | ObjHausdorff (Test A) | ObjHausdorff (Test B) |
|---|---|---|---|---|---|---|
| Bicubic | 0.71 | 0.85 | 0.83 | 0.88 | 133.73 | 109.21 |
| HR U-Net | 0.84 | 0.88 | 0.89 | 0.92 | 100.57 | 84.64 |
| SISR | 0.92 | 0.93 | 0.94 | 0.95 | 77.74 | 65.81 |
| Original high resolution | 0.95 | 0.93 | 0.96 | 0.96 | 66.70 | 61.17 |

4.5 Downstream task experiments

In this section, we experimentally demonstrate that the proposed SR method effectively enhances the performance of two downstream tasks: gland segmentation and malignancy classification. First, for the gland segmentation task, we trained and tested the state-of-the-art segmentation model U-Net [42] on the Glas dataset from the MICCAI 2015 Gland Segmentation Challenge [43]. The Glas dataset consists of a training set and two test sets, Test A and Test B. The training set contains 85 images and the corresponding labels, Test A contains 60 images and the corresponding labels, and Test B contains 20 images and the corresponding labels. We performed ×4 downsampling on HR images to generate LR images using bicubic interpolation. We compared segmentation results under the following settings: (1) Original High-Resolution: Train U-Net on the original HR GlaS dataset for segmentation of original high-resolution images; (2) SISR: Directly employing U-Net trained on the original HR GlaS dataset for segmentation of the reconstructed images generated by the SISR model. (3) HR U-Net: Train U-Net on the reconstructed images generated by the SISR model for segmentation of original HR images; (4) Bicubic: Train U-Net on LR images obtained after bicubic interpolation for segmentation of original HR images. Table 5 shows the quantitative test results, where larger values indicate better performance for the F1 score and Object Dice score, while smaller values indicate better performance for object Hausdorff distance. It can be seen that the U-Net model trained on the reconstructed images of the SISR model performs better than the UNet model trained on the LR image dataset after bicubic interpolation, showing higher F1 scores and object Dice scores, as well as lower object Hausdorff distances. 
In particular, on Test B, the results of segmenting reconstructed images with the U-Net trained on the original HR GlaS training set are close to those of segmenting the original HR images, with both reaching an F1 score of 0.93. Fig. 10 shows representative results for the different experimental setups. The U-Net trained on LR images produced the worst results: it failed to detect small glands, and its segmentations of large glands were incomplete. In contrast, the U-Net trained on the reconstructed images could outline the boundaries of large glands and detect small glands. Compared with training on LR images, training on the generated SR images improves segmentation accuracy at test time.
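To make the kind of metrics reported in Table 5 concrete, the following minimal Python sketch computes a Dice coefficient and a symmetric Hausdorff distance for two toy binary masks. It is an illustration only: the GlaS challenge metrics are defined at the object (gland) level, whereas this sketch operates on whole masks represented as sets of pixel coordinates. The function names `dice` and `hausdorff` and the toy masks are illustrative assumptions, not the evaluation code used in the experiments.

```python
from math import hypot


def dice(pred, truth):
    """Dice coefficient between two sets of foreground pixel coordinates."""
    if not pred and not truth:
        return 1.0  # two empty masks are treated as a perfect match
    return 2 * len(pred & truth) / (len(pred) + len(truth))


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""

    def directed(src, dst):
        # Largest distance from any point in src to its nearest point in dst.
        return max(
            min(hypot(p[0] - q[0], p[1] - q[1]) for q in dst) for p in src
        )

    return max(directed(a, b), directed(b, a))


# Toy masks (hypothetical): a 4x4 square and the same square shifted by one row.
truth = {(r, c) for r in range(2, 6) for c in range(2, 6)}
pred = {(r, c) for r in range(3, 7) for c in range(2, 6)}

print(dice(pred, truth))       # 0.75
print(hausdorff(pred, truth))  # 1.0
```

A higher Dice value and a lower Hausdorff distance both indicate closer agreement with the ground truth, which matches the direction of improvement described for Table 5.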

Figure 10: Qualitative evaluation of U-Net for gland segmentation on the GlaS dataset [43] with different experimental setups.

To further evaluate the contribution of the SR method to the malignancy classification task, we conducted tumor recognition on the PCam dataset [44]. The PCam dataset comprises 262,144 color images for training and 32,768 images for testing, with each image annotated with a binary label indicating the presence of metastatic tissue. We generated LR images by ×2 downsampling of the HR test set images using bicubic interpolation. ResNet-50 [45] was chosen as the classifier and trained on the original PCam dataset. In every setting, the trained ResNet-50 model was applied directly, without retraining, to a different version of the test set: (1) Original: the original HR images; (2) Low Resolution: the LR images; (3) Bicubic: the images obtained from the LR images by bicubic interpolation; (4) LIIF: the images generated by LIIF from the LR images; (5) ISTE: the images generated by our method ISTE from the LR images. Table 6 shows the improvement in diagnostic performance brought by the SR methods. By introducing additional prior knowledge, ISTE improves accuracy by 4.06 percentage points over Bicubic. These results indicate that ISTE can improve classification performance by recovering more distinctive details.
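For reference, the two metrics reported for this task can be computed from binary labels and predictions as in the minimal sketch below. It assumes the standard definitions of accuracy and F1 score with the metastatic class as the positive class; the function name `accuracy_and_f1` and the toy labels are hypothetical and do not come from the PCam experiments.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)


# Toy example (hypothetical labels): 4 positive and 4 negative patches.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(acc, f1)  # 0.75 0.75
```

Reporting F1 alongside accuracy is informative here because a classifier that rarely predicts the positive class can keep a moderate accuracy while its F1 score collapses.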

Table 6: Performance of the malignancy classification task under different SR methods using ResNet-50.
Experiment Accuracy F1 score
Original 86.17% 0.8507
Low Resolution 58.11% 0.2929
Bicubic 77.09% 0.7419
LIIF 80.54% 0.7721
ISTE (ours) 81.15% 0.7816
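The accuracy differences between settings can be checked directly from the values in Table 6. The short Python sketch below does this arithmetic; the accuracies are copied from the table, while the dictionary `table6_accuracy` and the helper `gain` are names introduced here for illustration only.

```python
# Accuracy (%) per setting, copied from Table 6.
table6_accuracy = {
    "Original": 86.17,
    "Low Resolution": 58.11,
    "Bicubic": 77.09,
    "LIIF": 80.54,
    "ISTE": 81.15,
}


def gain(method, baseline):
    """Accuracy difference in percentage points, rounded to two decimals."""
    return round(table6_accuracy[method] - table6_accuracy[baseline], 2)


print(gain("ISTE", "Bicubic"))         # 4.06
print(gain("ISTE", "LIIF"))            # 0.61
print(gain("ISTE", "Low Resolution"))  # 23.04
print(gain("Original", "ISTE"))        # 5.02
```

These differences are absolute percentage points rather than relative changes: ISTE recovers most of the accuracy lost by downsampling, while a gap of about 5 points to the original HR images remains.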

5 Conclusion

In this work, we propose ISTE, a dual-branch framework based on self-texture enhancement, which, to the best of our knowledge, is the first to achieve SR of pathology images at arbitrary magnification. ISTE consists of a feature aggregation branch and a texture learning branch. The feature aggregation branch enhances the learning of feature relevance within local regions, while the texture learning branch enhances the learning of high-frequency texture details. We then design a two-stage texture enhancement strategy to fuse the features from the two branches and obtain the SR images, where the first stage is feature-based texture enhancement and the second stage is spatial-domain-based texture enhancement. Extensive experiments on publicly available datasets show that ISTE outperforms existing fixed-scale and arbitrary-scale SR algorithms at multiple scaling factors. Further experiments show that our method improves the performance of two downstream tasks, gland segmentation and malignancy classification. In the future, we will continue to work on lightweight models and integrate the proposed SR models with existing diagnostic networks to improve diagnostic performance.

CRediT authorship contribution statement

Minghong Duan: Writing – original draft, Software, Methodology, Investigation, Conceptualization. Linhao Qu: Writing – original draft, Validation, Supervision, Methodology, Data curation, Conceptualization. Zhiwei Yang: Validation, Software, Investigation. Manning Wang: Methodology, Supervision, Validation, Writing – review & editing. Chenxi Zhang: Resources, Supervision, Validation, Writing – review & editing. Zhijian Song: Resources, Validation, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  • [1] J. R. Gilbertson, J. Ho, L. Anthony, D. M. Jukic, Y. Yagi, A. V. Parwani, Primary histologic diagnosis using automated whole slide imaging: a validation study, BMC clinical pathology 6 (2006) 1–19.
  • [2] L. Pantanowitz, P. N. Valenstein, A. J. Evans, K. J. Kaplan, J. D. Pfeifer, D. C. Wilbur, L. C. Collins, T. J. Colgan, Review of the current state of whole slide imaging in pathology, Journal of pathology informatics 2 (1) (2011) 36.
  • [3] R. S. Weinstein, M. R. Descour, C. Liang, G. Barker, K. M. Scott, L. Richter, E. A. Krupinski, A. K. Bhattacharyya, J. R. Davis, A. R. Graham, et al., An array microscope for ultrarapid virtual slide processing and telepathology. design, fabrication, and validation study, Human pathology 35 (11) (2004) 1303–1314.
  • [4] D. C. Wilbur, Digital cytology: current state of the art and prospects for the future, Acta cytologica 55 (3) (2011) 227–238.
  • [5] F. Ghaznavi, A. Evans, A. Madabhushi, M. Feldman, Digital imaging in pathology: whole-slide imaging and beyond, Annual Review of Pathology: Mechanisms of Disease 8 (2013) 331–359.
  • [6] P. S. Nielsen, J. Lindebjerg, J. Rasmussen, H. Starklint, M. Waldstrøm, B. Nielsen, Virtual microscopy: an evaluation of its validity and diagnostic performance in routine histologic diagnosis of skin tumors, Human pathology 41 (12) (2010) 1770–1776.
  • [7] A. Madabhushi, G. Lee, Image analysis and machine learning in digital pathology: Challenges and opportunities, Medical image analysis 33 (2016) 170–175.
  • [8] B. Li, A. Keikhosravi, A. G. Loeffler, K. W. Eliceiri, Single image super-resolution for whole slide image using convolutional neural networks and self-supervised color normalization, Medical Image Analysis 68 (2021) 101938.
  • [9] B. Lim, S. Son, H. Kim, S. Nah, K. Mu Lee, Enhanced deep residual networks for single image super-resolution, in: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2017, pp. 136–144.
  • [10] L. Mukherjee, A. Keikhosravi, D. Bui, K. W. Eliceiri, Convolutional neural networks for whole slide image superresolution, Biomedical optics express 9 (11) (2018) 5368–5386.
  • [11] Z. Chen, X. Guo, C. Yang, B. Ibragimov, Y. Yuan, Joint spatial-wavelet dual-stream network for super-resolution, in: Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part V 23, Springer, 2020, pp. 184–193.
  • [12] V. Sitzmann, J. Martel, A. Bergman, D. Lindell, G. Wetzstein, Implicit neural representations with periodic activation functions, Advances in neural information processing systems 33 (2020) 7462–7473.
  • [13] M. Tancik, P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. Barron, R. Ng, Fourier features let networks learn high frequency functions in low dimensional domains, Advances in Neural Information Processing Systems 33 (2020) 7537–7547.
  • [14] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, R. Ng, Nerf: Representing scenes as neural radiance fields for view synthesis, Communications of the ACM 65 (1) (2021) 99–106.
  • [15] Y. Chen, S. Liu, X. Wang, Learning continuous image representation with local implicit image function, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 8628–8638.
  • [16] C. Dong, C. C. Loy, K. He, X. Tang, Learning a deep convolutional network for image super-resolution, in: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part IV 13, Springer, 2014, pp. 184–199.
  • [17] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, Y. Fu, Residual dense network for image super-resolution, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2472–2481.
  • [18] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, Y. Fu, Image super-resolution using very deep residual channel attention networks, in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 286–301.
  • [19] L. Cavigelli, P. Hager, L. Benini, Cas-cnn: A deep convolutional neural network for image compression artifact suppression, in: 2017 International Joint Conference on Neural Networks (IJCNN), IEEE, 2017, pp. 752–759.
  • [20] J. Kim, J. K. Lee, K. M. Lee, Accurate image super-resolution using very deep convolutional networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1646–1654.
  • [21] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, C. Change Loy, Esrgan: Enhanced super-resolution generative adversarial networks, in: Proceedings of the European conference on computer vision (ECCV) workshops, 2018.
  • [22] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, Y. Fu, Residual dense network for image restoration, IEEE transactions on pattern analysis and machine intelligence 43 (7) (2020) 2480–2495.
  • [23] Y. Chen, T. Pock, Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration, IEEE transactions on pattern analysis and machine intelligence 39 (6) (2016) 1256–1272.
  • [24] X. Deng, Y. Zhang, M. Xu, S. Gu, Y. Duan, Deep coupled feedback network for joint exposure fusion and image super-resolution, IEEE Transactions on Image Processing 30 (2021) 3098–3112.
  • [25] B. Niu, W. Wen, W. Ren, X. Zhang, L. Yang, S. Wang, K. Zhang, X. Cao, H. Shen, Single image super-resolution via a holistic attention network, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII 16, Springer, 2020, pp. 191–207.
  • [26] H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, C. Xu, W. Gao, Pre-trained image processing transformer, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 12299–12310.
  • [27] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, R. Timofte, Swinir: Image restoration using swin transformer, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 1833–1844.
  • [28] X. Chen, X. Wang, J. Zhou, Y. Qiao, C. Dong, Activating more pixels in image super-resolution transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22367–22377.
  • [29] D. Liu, B. Wen, Y. Fan, C. C. Loy, T. S. Huang, Non-local recurrent network for image restoration, Advances in neural information processing systems 31 (2018).
  • [30] Y. Mei, Y. Fan, Y. Zhou, Image super-resolution with non-local sparse attention, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3517–3526.
  • [31] J. Lee, K. H. Jin, Local texture estimator for implicit representation function, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 1929–1938.
  • [32] U. Upadhyay, S. P. Awate, A mixed-supervision multilevel gan framework for image quality enhancement, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2019, pp. 556–564.
  • [33] A. Juhong, B. Li, C.-Y. Yao, C.-W. Yang, D. W. Agnew, Y. L. Lei, X. Huang, W. Piyawattanametha, Z. Qiu, Super-resolution and segmentation deep learning for breast cancer histopathology image analysis, Biomedical Optics Express 14 (1) (2023) 18–36.
  • [34] F. Shahidi, Breast cancer histopathology image super-resolution using wide-attention gan with improved wasserstein gradient penalty and perceptual loss, IEEE Access 9 (2021) 32795–32809.
  • [35] X. Wu, Z. Chen, C. Peng, X. Ye, Mmsrnet: Pathological image super-resolution by multi-task and multi-scale learning, Biomedical Signal Processing and Control 81 (2023) 104428.
  • [36] J. Ma, S. Liu, S. Cheng, R. Chen, X. Liu, L. Chen, S. Zeng, Stsrnet: Self-texture transfer super-resolution and refocusing network, IEEE Transactions on Medical Imaging 41 (2) (2021) 383–393.
  • [37] Z. Zhang, Z. Wang, Z. Lin, H. Qi, Image super-resolution by neural texture transfer, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 7982–7991.
  • [38] C.-M. Feng, Y. Yan, H. Fu, L. Chen, Y. Xu, Task transformer network for joint mri reconstruction and super-resolution, in: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part VI 24, Springer, 2021, pp. 307–317.
  • [39] C. R. Drifka, A. G. Loeffler, K. Mathewson, A. Keikhosravi, J. C. Eickhoff, Y. Liu, S. M. Weber, W. J. Kao, K. W. Eliceiri, Highly aligned stromal collagen is a negative prognostic factor following pancreatic ductal adenocarcinoma resection, Oncotarget 7 (46) (2016) 76197.
  • [40] C. R. Drifka, J. Tod, A. G. Loeffler, Y. Liu, G. J. Thomas, K. W. Eliceiri, W. J. Kao, Periductal stromal collagen topology of pancreatic ductal adenocarcinoma differs from that of normal and chronic pancreatitis, Modern Pathology 28 (11) (2015) 1470–1480.
  • [41] B. Li, Y. Li, K. W. Eliceiri, Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 14318–14328.
  • [42] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, Springer, 2015, pp. 234–241.
  • [43] K. Sirinukunwattana, J. P. Pluim, H. Chen, X. Qi, P.-A. Heng, Y. B. Guo, L. Y. Wang, B. J. Matuszewski, E. Bruni, U. Sanchez, et al., Gland segmentation in colon histology images: The glas challenge contest, Medical image analysis 35 (2017) 489–502.
  • [44] B. S. Veeling, J. Linmans, J. Winkens, T. Cohen, M. Welling, Rotation equivariant cnns for digital pathology, in: Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st International Conference, Granada, Spain, September 16-20, 2018, Proceedings, Part II 11, Springer, 2018, pp. 210–218.
  • [45] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.