1 Shandong University
2 The University of Science and Technology Beijing
3 The University of Pennsylvania
Email: [email protected]

XctDiff: Reconstruction of CT Images with Consistent Anatomical Structures from a Single Radiographic Projection Image

Qingze Bai (1), Tiange Liu (2, **), Zhi Liu (1), Yubing Tong (3), Drew Torigian (3), Jayaram Udupa (3)
Abstract

In this paper, we present XctDiff, an algorithmic framework for reconstructing CT from a single radiograph, which decomposes the reconstruction process into two easily controllable tasks: feature extraction and CT reconstruction. Specifically, we first design a progressive feature extraction strategy that is able to extract robust 3D priors from a radiograph. Then, we use the extracted prior information to guide the CT reconstruction in the latent space. Moreover, we design a homogeneous spatial codebook to further improve the reconstruction quality. The experimental results show that our proposed method achieves state-of-the-art reconstruction performance and overcomes the blurring issue. We also apply XctDiff to a self-supervised pre-training task. Its effectiveness there indicates that it has promising additional applications in medical image analysis. The code is available at: https://github.com/qingze-bai/XctDiff

Keywords: Reconstruction · Radiography · Computed Tomography

1 Introduction

Radiography and Computed Tomography (CT) are prevalent non-invasive imaging techniques in clinical medicine. They share similar imaging principles but differ in their application scenarios [3, 7]. Radiography involves low radiation exposure but provides limited 3D information, so it is mainly used for preliminary examinations such as diagnosing bone fractures and pneumonia. In contrast, CT involves higher radiation exposure but can capture more intricate structures and lesions, making it suitable for diagnosing and treating complex diseases. Given these distinctions, an intriguing question arises: can a radiograph be reconstructed into a CT image by deep learning and take over the role of CT on certain tasks?

Theoretically, a radiograph is formed when a detector receives X-rays that have been attenuated by the body and converts them into electrical signals. A CT scan, on the other hand, acquires multiple X-ray projections from different positions and then reconstructs a volume from them based on the Radon transform. Because the imaging principles are similar, converting a CT volume to a radiograph can be considered a relatively straightforward lossy compression process (e.g., digitally reconstructed radiograph (DRR) technology), while reversing this process poses a challenging single-view reconstruction problem.
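To make the "lossy compression" view concrete, the following is a minimal sketch of a forward projection under strong simplifying assumptions: parallel rays along the depth axis and Beer-Lambert attenuation. Actual DRR generation typically uses perspective ray casting through the CT volume; the function name `project_parallel` and the attenuation values are hypothetical and not taken from the paper.

```python
import math

def project_parallel(volume, spacing=1.0):
    """Collapse a volume into a 2D image by integrating along the depth axis.

    volume[k][i][j] holds a (hypothetical) linear attenuation coefficient.
    Returns the transmitted intensity fraction exp(-line integral) per pixel.
    """
    depth = len(volume)
    height = len(volume[0])
    width = len(volume[0][0])
    image = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            line_integral = sum(volume[k][i][j] for k in range(depth)) * spacing
            image[i][j] = math.exp(-line_integral)
    return image

# Two different volumes with the same projection: the depth ordering is lost.
vol_a = [[[0.2]], [[0.0]]]
vol_b = [[[0.0]], [[0.2]]]
print(project_parallel(vol_a) == project_parallel(vol_b))  # prints True
```

The example shows why the inverse direction is ill-posed: distinct volumes can map to an identical radiograph, so a reconstruction method has to rely on learned anatomical priors.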

Recently, some data-driven reconstruction algorithms have demonstrated the ability to generate 3D objects from single natural images, for example via depth estimation [11, 15] or implicit representations [10]. However, because natural and medical images differ in their imaging principles, and because medical images require modeling of internal structures, single-view reconstruction in the medical image domain does not benefit from these methods. Some studies [20] have used two orthogonal projections to reconstruct the CT image, but this requires specialized design and equipment. Others [13, 18] have employed Convolutional Neural Networks (CNNs) or Generative Adversarial Networks (GANs) to learn the mapping function from radiographs to CT scans. However, these methods suffer from severe image degradation and lack evaluation for potential applications.

Figure 1: XctDiff utilizes a progressive semantic encoder to extract 3D anatomical priors from the input radiograph. The extracted features are then used to guide CT reconstruction in latent space. Finally, the reconstructed CT feature maps are used to generate high-quality CT images with consistent anatomical structures after passing through a vector quantization encoder. Note that the radiographs used in training were synthesized using Digitally Reconstructed Radiography (DRR) technology. At the inference stage, real radiographs are converted to the synthetic style by a style transfer model.

In this paper, we propose a radiograph to computed tomography image diffusion model (XctDiff), an algorithmic framework for CT image reconstruction from a single radiograph. As depicted in Fig. 1, XctDiff consists of three main components: a perceptual compression model, a progressive semantic encoder for extracting 3D structural information from a 2D radiograph, and a conditional diffusion model (DM). We decompose the reconstruction process into two parts: feature extraction and 3D reconstruction. Specifically, we extract anatomical information from the radiograph and then use the extracted prior information to guide CT image reconstruction. We also propose a homogeneous spatial codebook to further enhance the reconstruction quality. Finally, all components are integrated and generalized to real radiographs, and the reconstructed CT images are used in a self-supervised pre-training task. Experimental results demonstrate that XctDiff achieves state-of-the-art reconstruction performance.

2 Method

2.1 3D Perceptual Compression Model

Reconstructing CT images is equivalent to learning volume structure, which presents greater computational and pattern coverage challenges than 2D images. Inspired by latent generation models [5, 12], we adopt a similar compression design adapted to 3D input. In terms of architecture, a homogeneous spatial codebook and an extra super-resolution module [14] are incorporated to improve the reconstruction quality. In terms of the loss function, we use 2D and 3D discriminators to jointly optimize the reconstruction at the slice level and the volume level.

Specifically, a given input CT $x\in\mathbb{R}^{H\times W\times D}$ is compressed by the encoder $E$ into a latent representation $z\in\mathbb{R}^{h\times w\times d\times n_{z}}$, where $h$, $w$, and $d$ denote the feature map size in the latent space and $n_{z}$ represents the dimension of the codebook entries. Each spatial representation is then quantized element-by-element to the closest codebook entry, which can be expressed as:

$$z_{q}=q(z):=\arg\min_{z_{n}\in\mathcal{Z}}\left\|z_{ijk}-z_{n}\right\| \tag{1}$$

where $z_{n}$ and $n$ denote the codebook entry and the number of entries, respectively, and $\mathcal{Z}$ is the codebook of the vector quantization encoder.
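The quantization step of Eq. (1) can be sketched as a nearest-neighbour lookup: each latent vector is replaced by the codebook entry with the smallest Euclidean distance. The snippet below is an illustrative pure-Python sketch on toy 2-dimensional vectors; the function name `quantize` and the example values are assumptions, and a practical implementation would operate on batched tensors.

```python
def quantize(latents, codebook):
    """Replace each latent vector with its nearest codebook entry.

    Returns the quantized vectors and the index of the selected entry
    for every latent vector.
    """
    quantized, indices = [], []
    for z in latents:
        # Squared Euclidean distance to every codebook entry.
        dists = [sum((a - b) ** 2 for a, b in zip(z, entry)) for entry in codebook]
        n = min(range(len(codebook)), key=dists.__getitem__)
        indices.append(n)
        quantized.append(list(codebook[n]))
    return quantized, indices

codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
latents = [[0.9, 1.2], [-0.1, 0.2], [-0.8, 1.7]]
z_q, idx = quantize(latents, codebook)
print(idx)  # prints [1, 0, 2]
```

The returned indices are what a codebook-usage statistic is computed from: if only a few indices ever appear, most entries are never selected.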

To ensure a perceptually rich variance space for generation, we adopt the heuristic strategy of the vector quantization encoder [5, 21] and use z-score normalization to transform the variance space into a homogeneous space. This avoids the smoothing produced by the encoder against the codebook and thus improves the reconstruction quality. The training objective of the codebook can be summarized as follows:

$$\mathcal{L}_{vq}=\left\|\mathcal{N}(sg[E(x)])-\mathcal{N}(z_{q})\right\|^{2}_{2}+\left\|\mathcal{N}(sg[z_{q}])-\mathcal{N}(E(x))\right\|^{2}_{2} \tag{2}$$

where $\mathcal{N}$ represents the z-score normalization and $sg[\cdot]$ denotes the stop-gradient operation. We also introduce a super-resolution module at the end of the decoder $G$ to improve the reconstruction performance. In addition, we apply a combination of a slice discriminator $D_{2d}$ and a volume discriminator $D_{3d}$ to further jointly optimize the autoencoder. The optimization objective of the discriminator can be formulated as:

$$\mathcal{L}_{gan}=\alpha[\log D_{3d}(x)+\log(1-D_{3d}(\hat{x}))]+\beta[\log D_{2d}(x_{k})+\log(1-D_{2d}(\hat{x}_{k}))] \tag{3}$$

where $x$ and $\hat{x}$ represent the ground truth and the prediction, respectively, and $x_{k}$ denotes the $k$-th slice of the CT image. Finally, the optimization function of the autoencoder is summarized as follows:

$$\mathcal{L}_{vqgan}=\lambda_{1}\mathcal{L}_{rec}+\lambda_{2}\mathcal{L}_{lpips}+\lambda_{3}\mathcal{L}_{vq}+\lambda_{4}\mathcal{L}_{gan} \tag{4}$$

where $\mathcal{L}_{rec}$ and $\mathcal{L}_{lpips}$ denote the L1 loss and the perceptual loss, respectively, and $\lambda_{i}$ are weighting hyperparameters.
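As an illustration of the normalized codebook objective in Eq. (2), the sketch below evaluates its forward value for a single latent vector. It assumes that the z-score normalization is applied per vector over the channel dimension, which the text does not specify; the helper names `z_score` and `vq_loss_value` are hypothetical. The stop-gradient only affects which argument receives gradients, so both terms share the same forward value here.

```python
import math

def z_score(vec, eps=1e-8):
    """Normalize a vector to zero mean and (approximately) unit standard deviation."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    std = math.sqrt(var) + eps
    return [(v - mean) / std for v in vec]

def vq_loss_value(encoded, quantized):
    """Forward value of the two-term objective for one latent vector.

    In an autograd framework the argument marked sg[.] would be detached in
    each term; without gradients the two terms are numerically identical.
    """
    n_e = z_score(encoded)
    n_q = z_score(quantized)
    sq_dist = sum((a - b) ** 2 for a, b in zip(n_e, n_q))
    return sq_dist + sq_dist  # codebook term + commitment term

# A latent that differs from its code only by scale and offset gives ~zero loss.
code = [1.0, 2.0, 4.0]
latent = [10.0 * c + 3.0 for c in code]
print(vq_loss_value(latent, code) < 1e-9)  # prints True
```

Under this assumption the loss compares the direction of the normalized vectors rather than their raw magnitude, which is one way to read the "homogeneous space" described above.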

Figure 2: (a) The progressive encoder first approximates the rough shape of the human body in the coronal plane only, then learns more accurate 3D anatomical representations through multiple successive convolutional layers. (b) Different styles of radiographs. (Left) Real radiographs from ChestXray2017. (Middle) Synthesized radiographs used for training and evaluation. (Right) Real radiographs converted to the synthesized style.

2.2 Extract 3D Anatomical Information

Reconstruction of a CT image from a single radiograph requires 3D prior information. We decouple the mapping between the radiograph and the corresponding CT scan into two phases: shape mapping and anatomical structure mapping. As shown in Fig. 2(a), we propose a progressive encoder $PE$ that obtains robust 3D priors by simplifying this complex mapping. Specifically, given an input radiograph $p$, we use a multilayer perceptron (MLP) to convert the multilayer stacked slices into the 3D space of the human body. The MLP operates in only one direction because a frontal projection radiograph geometrically matches a coronal slice, which makes it easier to learn the shape of the organs from this direction. This can be expressed as follows:

$$\hat{v}_{ijk}=MLP(p_{ij}^{1},\cdots,p_{ij}^{k}) \tag{5}$$

where $v$ and $p$ denote a voxel of the CT image and a pixel of the radiograph, respectively. In the anatomical structure mapping stage, we globally optimize the extracted rough features with multiple successive 3D convolutional layers to obtain more accurate 3D anatomical priors. Finally, the extracted prior information is mapped into the homogeneous space by the vector quantization encoder. Formally, this process can be summarized as:

$$y=\mathcal{N}(q(Conv(\hat{v}))) \tag{6}$$
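The shape-mapping step of Eq. (5) can be pictured as applying the same small network at every pixel, turning the channel values at location $(i, j)$ into a profile along the depth direction. The following sketch uses a single linear layer with random weights as a stand-in for the trained MLP; the sizes, the helper names `make_linear` and `lift_to_depth`, and the use of one layer are all assumptions made for illustration, and the subsequent 3D convolutions and quantization of Eq. (6) are omitted.

```python
import random

def make_linear(in_dim, out_dim, rng):
    """Random weights for one linear layer (placeholder for trained parameters)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]

def lift_to_depth(features, weights):
    """Apply the same linear map at every pixel: C channels -> D depth samples.

    features[i][j] is a list of C channel values at pixel (i, j);
    the result is indexed as volume[i][j][k], with k running along depth.
    """
    return [
        [[sum(w * c for w, c in zip(row, pixel)) for row in weights] for pixel in line]
        for line in features
    ]

rng = random.Random(0)
H, W, C, D = 4, 4, 3, 8  # hypothetical sizes
features = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
volume = lift_to_depth(features, make_linear(C, D, rng))
print(len(volume), len(volume[0]), len(volume[0][0]))  # prints 4 4 8
```

Because the map acts independently at each pixel, it only expands along one direction; mixing information across neighbouring pixels is left to the convolutional layers of the second phase.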

2.3 Prior Guided Diffusion Model

The DM [6, 16, 17] uses a UNet $\theta$ to predict the added noise $\varepsilon$ at each time step $t$ in the reverse process. To reduce the pattern coverage challenge, we apply convolution only in the image plane (i.e., a kernel size of 3×3×1). Given the anatomical priors $y$ extracted from the radiograph, the denoising encoder $\varepsilon_{\theta}(x,t,y)$ controls the direction of generation based on the prior distribution via cross-attention. The optimization function of the conditional diffusion model can be written as:

$$\mathcal{L}_{dm}=\mathbb{E}_{x,t,\varepsilon\sim\mathcal{N}(0,I)}\left(\left\|\varepsilon-\varepsilon_{\theta}(x_{t},t,y)\right\|_{2}^{2}\right) \tag{7}$$

where $x_{t}$ is the sample of the diffusion process at time step $t$, which is affected by the noise schedule $\alpha_{t}$.
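To illustrate how one training term of Eq. (7) is formed, the sketch below follows the standard closed-form noising of DDPM [6]: a clean latent is mixed with Gaussian noise according to the cumulative schedule, and the loss is the squared error between the true and predicted noise. The linear beta schedule with 1000 steps is the common DDPM default and is an assumption here, not a setting reported in this paper; the conditional UNet is replaced by a placeholder prediction.

```python
import math
import random

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t under a linear beta schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def noisy_sample(x0, t, rng):
    """Draw x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps; return (x_t, eps)."""
    a = alpha_bar(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps

def dm_loss(eps, eps_pred):
    """Squared L2 distance between the true and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

rng = random.Random(0)
x0 = [0.5, -1.0, 0.25, 2.0]          # toy latent, flattened
x_t, eps = noisy_sample(x0, t=500, rng=rng)
eps_pred = [0.0] * len(x0)           # placeholder for eps_theta(x_t, t, y)
print(dm_loss(eps, eps_pred) >= 0.0)  # prints True
```

In the full model the placeholder would be the output of the denoising network, which also receives the prior $y$ through cross-attention; averaging this term over samples, time steps, and noise draws gives the expectation in Eq. (7).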

3 Experiment

3.1 Datasets

Similar to previous works [13, 18, 20], we used digitally reconstructed radiograph (DRR) technology to translate real CT scans into corresponding radiographs. However, as shown in Fig. 2(b), a disparity exists between synthetic radiographs and real ones in terms of realism and style, which may hinder model generalization and evaluation. Thus, at inference on real radiographs, we used CycleGAN [22] to transform them into synthetic-style radiographs.

Experiments were conducted on four public datasets. The LIDC-IDRI dataset [2], comprising 1018 CT scans, was employed for training and validation of our proposed framework, with voxel resolutions resampled to [2.5, 2.5, 2.5] and volumes cropped to [128, 128, 128] cube regions. Among these, 916 cases were used for training and 102 cases were used for testing. In addition, the ChestXray2017 dataset [8], containing 5856 radiographic images, was used for style conversion between real and synthetic radiographic images. We randomly selected 1000 normal radiographs without lesions from this dataset for style transfer and validation. For the BCV [9] and MSD Spleen [1] datasets, which are abdominal segmentation datasets, 25 cases were randomly selected and used with 5-fold cross-validation, with 5 and 7 cases used for testing, respectively.

3.2 Implementation Details

All experiments were conducted on 2 RTX3090 GPUs. The XctDiff framework comprises two main training stages. In the first stage, we trained a perceptual compression model with a batch size of 1, a learning rate of 2e-4, and 80,000 iterations, followed by a progressive encoder with a batch size of 16, a learning rate of 1e-4, and 50,000 iterations. In the second stage, a conditional DM in latent space was trained with a batch size of 8, a learning rate of 1e-4, and 100,000 iterations. For the self-supervised pre-training task, we used the code library provided by Swin UNETR [19]. Each model was cross-validated on the training dataset with 5 folds and trained for 50,000 iterations.
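For quick reference, the reported training settings can be collected in a small configuration structure. The values below are those stated in the text; the dictionary layout and key names are illustrative assumptions and do not correspond to the configuration files of the released code.

```python
# Hyperparameters as reported in the text; key names are illustrative only.
TRAINING_CONFIG = {
    "perceptual_compression": {"batch_size": 1, "learning_rate": 2e-4, "iterations": 80_000},
    "progressive_encoder": {"batch_size": 16, "learning_rate": 1e-4, "iterations": 50_000},
    "latent_diffusion": {"batch_size": 8, "learning_rate": 1e-4, "iterations": 100_000},
}

def total_iterations(config):
    """Sum the optimizer iterations over all training components."""
    return sum(stage["iterations"] for stage in config.values())

print(total_iterations(TRAINING_CONFIG))  # prints 230000
```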

Table 1: Evaluation of single-view CT reconstruction quality on the LIDC-IDRI dataset. The mean (± std) is reported.

| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| ReconNet | 22.28 (±1.237) | 0.470 (±0.068) | 0.237 (±0.048) |
| X2CTCNN | 22.47 (±1.460) | 0.495 (±0.077) | 0.223 (±0.056) |
| X2CTGAN | 22.66 (±1.442) | 0.503 (±0.074) | 0.217 (±0.070) |
| XctDiff | 23.57 (±1.205) | 0.524 (±0.075) | 0.123 (±0.069) |

Figure 3: Qualitative visualization results on the LIDC-IDRI dataset. The transverse plane, sagittal plane, and coronal plane of the reconstruction results are shown.

3.3 Results

Reconstruction Result. We evaluate CT reconstruction from a single radiograph quantitatively with Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS). As shown in Table 1, XctDiff outperforms the second-best method, X2CTGAN, by +0.9 PSNR(dB) and +2.1 SSIM(%). Furthermore, as shown by the qualitative visualization in Fig. 3, our approach produces high-quality CT images with precise anatomical structures (e.g., the heart, lungs, spine, and ribs can be easily identified), which is crucial for downstream tasks. The experimental results also show that CT reconstructed from a single X-ray image has much lower accuracy than conventional CT images, yet it demonstrates potential for certain downstream tasks.
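For readers unfamiliar with the first metric, PSNR is a logarithmic function of the mean squared error between the reference and the reconstruction. The sketch below assumes intensities scaled to [0, 1] (so `data_range=1.0`), which is an assumption about preprocessing and not stated in the paper; SSIM and LPIPS require windowed statistics and a pretrained network, respectively, and are not reproduced here.

```python
import math

def psnr(reference, prediction, data_range=1.0):
    """Peak signal-to-noise ratio in dB for two equally sized, flattened images."""
    mse = sum((r - p) ** 2 for r, p in zip(reference, prediction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)

ref = [0.0, 0.5, 1.0, 0.25]
pred = [0.1, 0.4, 0.9, 0.35]   # every voxel is off by 0.1
print(round(psnr(ref, pred), 2))  # prints 20.0
```

Since the scale is logarithmic, a gain of about 0.9 dB, as between X2CTGAN and XctDiff in Table 1, corresponds to a reduction of the mean squared error by roughly 19%.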

Table 2: Ablation results on improving the structure and codebook of VQGAN.

| Group | Codebook | SR Module | Code Size | Latent Dim | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Codebook Usage ↓ | Parameters (M) | FLOPs (T) | GPU (G) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | -- | -- | 4096 | 3 | 31.95 | 0.750 | 0.060 | 1.8% | 21.4 | 1.35 | 19.50 |
| | -- | -- | 8192 | 8 | 32.86 | 0.769 | 0.054 | 1.5% | 21.5 | 1.35 | 19.50 |
| Architecture | ✓ | -- | 4096 | 3 | 32.64 | 0.781 | 0.056 | 97.4% | 21.4 | 1.35 | 19.50 |
| | -- | ✓ | 4096 | 3 | 32.23 | 0.758 | 0.056 | 1.3% | 22.0 | 1.67 | 22.90 |
| | ✓ | ✓ | 4096 | 3 | 32.74 | 0.787 | 0.049 | 98.2% | 22.0 | 1.67 | 22.90 |
| Codebook | ✓ | ✓ | 4096 | 8 | 33.35 | 0.796 | 0.048 | 96.6% | 22.0 | 1.67 | 22.90 |
| | ✓ | ✓ | 4096 | 16 | 33.01 | 0.783 | 0.051 | 55.2% | 22.0 | 1.67 | 22.90 |
| | ✓ | ✓ | 8192 | 8 | 33.43 | 0.787 | 0.049 | 98.2% | 22.1 | 1.67 | 22.90 |
| | ✓ | ✓ | 8192 | 16 | 33.34 | 0.793 | 0.055 | 24.7% | 22.1 | 1.67 | 22.90 |

Table 3: Ablation study of the XctDiff framework. PE and AE denote the progressive semantic encoder and the improved autoencoder, respectively.

| Method | PE | AE | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|---|
| XctDiff | -- | -- | 22.47 | 0.502 | 0.154 |
| | ✓ | -- | 23.26 | 0.517 | 0.134 |
| | -- | ✓ | 23.14 | 0.511 | 0.140 |
| | ✓ | ✓ | 23.57 | 0.524 | 0.123 |

Ablation Study. Image quality results under different settings and improvements are presented in Tab. 2. We use two different settings as baselines. For both, the utilization of the codebook is inefficient (only 1.8% and 1.5%, respectively), which suggests that CT images are always reconstructed from a limited number of semantic features, resulting in poor reconstruction quality. The third row demonstrates that the reconstruction quality is significantly improved by mapping the encoder output and the codebook onto an isotropic Euclidean space. In the fourth row, the incorporation of a super-resolution module also improves the reconstruction quality. When these two components are combined, the reconstruction results surpass those of the first baseline by +0.8 PSNR(dB), +3.7 SSIM(%), and -1.1 LPIPS(%). In addition, we analyzed the impact of varying the codebook size and dimension. The results show that the best performance was achieved with a dimension of 8 and a codebook size of 8192, which we use as the default setting. As shown in Tab. 3, we also ablated the progressive semantic encoder and the improved autoencoder, demonstrating their positive impact on the performance of XctDiff.
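The codebook usage figures discussed above can be understood as the fraction of codebook entries that are selected at least once when quantizing a set of latents. The sketch below computes this statistic from a list of selected indices; how the statistic was accumulated in the paper (for example, over which split) is not stated, so the function `codebook_usage` and the toy inputs are assumptions.

```python
def codebook_usage(indices, codebook_size):
    """Fraction of codebook entries selected at least once."""
    return len(set(indices)) / codebook_size

# Collapsed codebook: every latent maps to one of two entries out of four.
print(codebook_usage([0, 0, 1, 1, 1, 0], 4))  # prints 0.5
# Well-used codebook: all four entries are selected.
print(codebook_usage([3, 0, 2, 1, 1, 0], 4))  # prints 1.0
```

A value near 1.8% for a codebook of 4096 entries would mean that only about 74 entries are ever selected, which matches the interpretation that reconstructions draw on a small set of features.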

Figure 4: Dice score gaps between pre-training with reconstructed CT, pre-training with real CT, and a scratch model on the BCV and MSD datasets.

Self-supervised Pre-training. CT reconstructed from a single radiograph can facilitate the self-supervised pre-training of medical imaging models by circumventing challenges in data acquisition and privacy limitations. As shown in Fig. 4, we demonstrate the efficacy of CT reconstructed from real radiographs for the self-supervised pre-training task using two models based on different paradigms (UNet [4] and Swin UNETR [19]) on both the BCV dataset and the MSD Spleen dataset. Pre-training on the reconstructed CT dataset performs slightly below pre-training on the real CT dataset but still yields significant improvements of +3.1 and +1.8 Dice(%), respectively. For the MSD Spleen dataset, the UNet [4] model pre-trained on reconstructed CT images obtained the highest Dice score of 93.7%, an improvement of +1.7 and +0.4 over the baseline model and the model pre-trained on real CT images, respectively. In addition, the Swin UNETR pre-trained on the reconstructed CT dataset also exhibited competitive performance.
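The Dice score used to compare the pre-training strategies measures the overlap between a predicted and a reference segmentation mask. A minimal sketch on flattened binary masks is given below; the function name `dice_score` and the convention of returning 1.0 for two empty masks are assumptions made for illustration.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flattened binary masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2.0 * intersection / total

pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
print(round(dice_score(pred, target), 3))  # prints 0.667
```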

4 Conclusion

In this paper, we present XctDiff, an algorithmic framework for reconstructing CT from a single radiograph, and evaluate its reconstruction performance and the proposed improvement modules in detail. To facilitate real-world applications, we employ a style transfer method to convert real radiographs into synthetic-style ones for CT image reconstruction. Moreover, we explore its potential in downstream applications. The self-supervised pre-training task demonstrates the benefits of our approach in the field of medical image analysis.

5 Acknowledgments

The research reported in this paper is partly supported by the National Natural Science Foundation of China under grant No. 62273293, Hebei Natural Science Foundation under grant No. F2023203030, Science Research Project of Hebei Education Department under grant No. QN2024010, Shandong Provincial Natural Science Foundation under grant No. ZR2022LZH002, and partly by grant R01HL150147 from the National Institutes of Health of the United States of America.

References

  • [1] Antonelli, M., Reinke, A., Bakas, S., Farahani, K., Kopp-Schneider, A., Landman, B.A., Litjens, G., Menze, B., Ronneberger, O., Summers, R.M., et al.: The medical segmentation decathlon. Nature communications 13(1), 4128 (2022)
  • [2] Armato, S.G., Roberts, R.Y., Mcnitt-Gray, M.F., Meyer, C.R., Reeves, A.P., Mclennan, G., Engelmann, R.M., Bland, P.H., Aberle, D.R., Kazerooni, E.A.: The lung image database consortium (lidc) and image database resource initiative (idri): a completed reference database of lung nodules on ct scans. Academic Radiology 14(12), 1455–1463 (2007)
  • [3] Carmo, D., Pinheiro, G., Rodrigues, L., Abreu, T., Lotufo, R., Rittner, L.: Automated computed tomography and magnetic resonance imaging segmentation using deep learning: a beginner’s guide (2023)
  • [4] Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3d u-net: Learning dense volumetric segmentation from sparse annotation. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016. pp. 424–432. Springer International Publishing, Cham (2016)
  • [5] Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12873–12883 (2021)
  • [6] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)
  • [7] Islam, M.M., Karray, F., Alhajj, R., Zeng, J.: A review on deep learning techniques for the diagnosis of novel coronavirus (covid-19). IEEE Access 9, 30551–30572 (2021)
  • [8] Kermany, D., Zhang, K., Goldbaum, M., et al.: Labeled optical coherence tomography (oct) and chest x-ray images for classification. Mendeley data 2(2), 651 (2018)
  • [9] Landman, B., Xu, Z., Igelsias, J., Styner, M., Langerak, T., Klein, A.: Miccai multi-atlas labeling beyond the cranial vault–workshop and challenge. In: Proc. MICCAI Multi-Atlas Labeling Beyond Cranial Vault—Workshop Challenge. vol. 5, p. 12 (2015)
  • [10] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65(1), 99–106 (2021)
  • [11] Ramamonjisoa, M., Firman, M., Watson, J., Lepetit, V., Turmukhambetov, D.: Single image depth prediction with wavelet decomposition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11089–11098 (2021)
  • [12] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
  • [13] Shen, L., Zhao, W., Xing, L.: Patient-specific reconstruction of volumetric computed tomography images from a single projection view via deep learning. Nature biomedical engineering 3(11), 880–888 (2019)
  • [14] Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A.P., Bishop, R., Rueckert, D., Wang, Z.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1874–1883 (2016)
  • [15] Shu, C., Chen, Z., Chen, L., Ma, K., Wang, M., Ren, H.: Sidert: A real-time pure transformer architecture for single image depth estimation. arXiv preprint arXiv:2204.13892 (2022)
  • [16] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: International conference on machine learning. pp. 2256–2265. PMLR (2015)
  • [17] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
  • [18] Tan, Z., Li, J., Tao, H., Li, S., Hu, Y.: Xctnet: Reconstruction network of volumetric images from a single x-ray image. Computerized Medical Imaging and Graphics 98, 102067 (2022)
  • [19] Tang, Y., Yang, D., Li, W., Roth, H.R., Landman, B., Xu, D., Nath, V., Hatamizadeh, A.: Self-supervised pre-training of swin transformers for 3d medical image analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20730–20740 (2022)
  • [20] Ying, X., Guo, H., Ma, K., Wu, J., Weng, Z., Zheng, Y.: X2ct-gan: reconstructing ct from biplanar x-rays with generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10619–10628 (2019)
  • [21] Yu, J., Li, X., Koh, J.Y., Zhang, H., Pang, R., Qin, J., Ku, A., Xu, Y., Baldridge, J., Wu, Y.: Vector-quantized image modeling with improved vqgan. arXiv preprint arXiv:2110.04627 (2021)
  • [22] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE international conference on computer vision. pp. 2223–2232 (2017)