Background: Due to the high attenuation of metals, severe artifacts occur in cone beam computed tomography (CBCT). Metal segmentation in CBCT projections usually serves as a prerequisite for metal artifact reduction (MAR) algorithms.
Purpose: Truncation caused by the limited detector size leads to incomplete metal masks when threshold-based segmentation is applied in the CBCT volume. Therefore, this work pursues segmenting metal directly in the CBCT projections.
Methods: Since the generation of high-quality clinical training data is a constant challenge, this study proposes to generate simulated digital radiographs (data I) from real CT data combined with self-designed computer-aided design (CAD) implants. In addition to the simulated projections generated from 3D volumes, 2D x-ray images combined with projections of the implants serve as a complementary data set (data II) to improve network performance. In this work, SwinConvUNet, which uses shifted-window (Swin) vision transformers (ViTs) with patch merging as its encoder, is proposed for metal segmentation.
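To illustrate the first simulation pathway in spirit, the following is a minimal sketch that inserts a binary implant mask into a CT attenuation volume and forms a projection together with its metal label. It assumes a simplified parallel-beam line integral along one volume axis (the actual work uses cone-beam geometry), and the function name, attenuation value, and array shapes are hypothetical, not taken from the paper.

```python
import numpy as np

def simulate_projection_with_implant(ct_volume, implant_mask, metal_mu=1.0):
    """Insert a binary CAD implant mask into a CT attenuation volume and
    compute a simplified parallel-beam projection plus the metal label.

    ct_volume    : 3D array of linear attenuation coefficients
    implant_mask : 3D boolean array marking implant voxels (same shape)
    metal_mu     : assumed attenuation value assigned to metal voxels
    """
    volume = ct_volume.copy()
    volume[implant_mask] = metal_mu            # place the implant in the volume

    # Line integrals along one axis as a stand-in for the cone-beam projector
    projection = volume.sum(axis=0)

    # Ground-truth metal mask: any ray passing through at least one metal voxel
    metal_label = implant_mask.any(axis=0).astype(np.float32)
    return projection, metal_label

# Example usage with random data standing in for a real CT volume and CAD implant
ct = np.random.uniform(0.0, 0.05, size=(64, 128, 128))
implant = np.zeros_like(ct, dtype=bool)
implant[20:30, 50:60, 60:70] = True
proj, label = simulate_projection_with_implant(ct, implant)
```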
Results: The model's performance is evaluated on accurately labeled test data sets obtained from cadaver scans as well as on unlabeled clinical projections. When trained on data I only, the convolutional neural network (CNN) encoder-based networks UNet and TransUNet achieve only limited performance on the cadaver test data, with average dice scores of 0.821 and 0.850, respectively. After training on both data I and data II, the average dice scores of the two models increase to 0.906 and 0.919, respectively. By replacing the CNN encoder with a Swin transformer, the proposed SwinConvUNet reaches an average dice score of 0.933 on the cadaver projections when trained on data I only. Furthermore, SwinConvUNet achieves the highest average dice score of 0.953 on the cadaver projections when trained on the combined data set.
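For reference, the dice scores reported above follow the standard definition 2|A∩B|/(|A|+|B|) for binary masks. The snippet below is a minimal sketch of that metric; the function name and smoothing term are illustrative and not taken from the paper.

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks (1 = metal, 0 = background)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```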
Conclusions: Our experiments quantitatively demonstrate the effectiveness of combining the projections simulated via the two pathways for network training. Moreover, the proposed SwinConvUNet trained on the simulated projections performs state-of-the-art, robust metal segmentation, as demonstrated in experiments on cadaver and clinical data sets. With the accurate segmentations from the proposed model, MAR can be conducted even for highly truncated CBCT scans.
Keywords: data augmentation; metal artifact reduction; metal segmentation; swin vision transformer.
© 2023 The Authors. Medical Physics published by Wiley Periodicals LLC on behalf of American Association of Physicists in Medicine.