11institutetext: Biomedical Engineering, Yale University, New Haven, USA
11email: [email protected]
22institutetext: Computer Science, Yale University, New Haven, USA 33institutetext: Radiology & Biomedical Imaging, Yale University, New Haven, USA 44institutetext: Department of Internal Medicine (Cardiology), Yale School of Medicine, New Haven, USA

Adaptive Correspondence Scoring for Unsupervised Medical Image Registration

Xiaoran Zhang\orcidlink0000-0001-8918-7374 11    John C. Stendahl\orcidlink0000-0002-1568-9280 1144    Lawrence H. Staib\orcidlink0000-0002-9516-5136 1133    Albert J. Sinusas\orcidlink0000-0003-0972-9589 113344    Alex Wong\orcidlink0000-0002-3157-6016 22    James S. Duncan\orcidlink0000-0002-5167-9856 1133
Abstract

We propose an adaptive training scheme for unsupervised medical image registration. Existing methods rely on image reconstruction as the primary supervision signal. However, nuisance variables (e.g. noise and covisibility), violation of the Lambertian assumption in physical waves (e.g. ultrasound), and inconsistent image acquisition can all cause a loss of correspondence between medical images. As the unsupervised learning scheme relies on intensity constancy between images to establish correspondence for reconstruction, this introduces spurious error residuals that are not modeled by the typical training objective. To mitigate this, we propose an adaptive framework that re-weights the error residuals with a correspondence scoring map during training, preventing the parametric displacement estimator from drifting away due to noisy gradients, which leads to performance degradation. To illustrate the versatility and effectiveness of our method, we tested our framework on three representative registration architectures across three medical image datasets along with other baselines. Our adaptive framework consistently outperforms other methods both quantitatively and qualitatively. Paired t-tests show that our improvements are statistically significant. Code available at: https://voldemort108x.github.io/AdaCS/.

1 Introduction

Deformable medical image registration aims to accurately determine non-rigid correspondences through dense displacement vectors between source and target images. This process is a crucial step for medical image analysis, such as tracking disease progression for diagnosis and treatment [29, 14, 39]. Due to the impracticality of obtaining ground truth displacement, it has been a long-standing problem and has been extensively studied in the past decades [23, 31, 2, 5, 8, 22, 41, 38, 40].

Classical methods approach this challenge by solving an iterative pair-wise optimization problem between source and target images using elastic-type models [11, 23], free-form deformations with b-splines [31], and topology-preserving diffeomorphic models [2, 3]. However, these approaches are computationally expensive and time-consuming, limiting their practical utility in large-scale real-world data. Recently, learning-based approaches have been widely adopted for their speed (GPU accelerated forward pass that is hundreds of times faster in inference speed than classical methods) and state-of-the-art performance [5, 41, 22, 8, 9, 33]. Due to lack of ground truth displacement, these approaches leverage dense image or volumetric reconstruction in an unsupervised setting, or make use of segmentation masks in a weakly supervised setting. To train these approaches via gradient-based optimization, a source image is warped by the estimated displacement to reconstruct a target image. The supervision signal comes from minimizing the reconstruction error between warped source and target as the data-fidelity term, along with a regularizer based on the assumption that the object imaged is locally smooth and connected. The feasibility in minimizing this objective relies on intensity constancy between the two images during imaging.

Refer to caption
Figure 1: Existing approaches assume uniform intensity constancy and covisibility across the entire image; during training, this causes irreconcilable penalties, i.e., regions with large error residuals due to the absence of correspondence as highlighted in the red box. Our proposed approach addresses this by re-weighting error residuals with a predictive correspondence scoring map. By doing so, we get a smoother optimization when we reduce the influence of these outliers, leading to an improved performance.

However, assuming such intensity constancy uniformly across the entire image domain neglects the facts of motion ambiguity and non-uniform noise corruption in medical images. Assuming sufficiently exciting local regions, only a subset of pixels of one image can be uniquely matched or corresponded to another based on the image intensity profiles. As shown in Fig. 1, despite a displacement estimator model predicting largely correct corresponding pixels between the two images, the error residuals between the warped source and target images are still non-zero. In fact, they are dominated by regions with no correspondence; when this phenomenon occurs during the training of a displacement estimator, it results in performance degradation (driven towards an undesirable minimum) in the subsequent optimization steps due to the noisy gradients. To address this issue, we propose an unsupervised correspondence scoring framework that identifies regions with high chance of establishing correspondence and adaptively reduces the influence of error residuals from nuisance variability during training. This prevents the displacement estimator from drifting away due to large residuals caused by the lack of correspondence between input image pairs.

Our proposed scoring estimator is deployed during the unsupervised training of a displacement estimator and yields a soft scoring map. It is optimized alternatingly together with the displacement estimator to adaptively re-weight the data-fidelity term in order to mitigate the negative impact of nuisances when establishing correspondence. We introduce an unsupervised training objective for the scoring estimator to learn an accurate correspondence scoring map without additional annotation. Yet, there exists a trivial solution of predicting a score map of all zeros. Therefore, we additionally propose a regularizer for the scoring estimator to bias it away from the trivial solution, along with a momentum-guided adaptive total variation to encourage smoothness in the scoring map. To validate the effectiveness of our proposed method, we tested on three different datasets including: (1) ACDC [6], a public 2D MRI dataset, (2) CAMUS [24], a public 2D ultrasound dataset, and (3) a private 3D echocardiography dataset [1, 35]. To further show the versatility of our proposed method, we tested on three representative unsupervised image registration architectures including: (1) Voxelmorph [5], (2) Transmorph [8] and (3) Diffusemorph [22] along with other baselines. Our proposed framework can be applied in a plug-and-play manner to consistently improve existing methods. Paired t-tests show that the improvements obtained by utilizing our approach are statistically significant.

Our contributions are as follows: (1) We propose an adaptive framework that incorporates correspondence scoring for unsupervised deformable medical image registration. (2) We introduce an unsupervised correspondence scoring network to be used during the training of a displacement estimator. The scoring network learns to determine whether a given image allows for establishing correspondence by minimizing the typical image reconstruction loss with scoring and momentum-guided adaptive regularization. (3) Our proposed method consistently outperforms other baselines across three representative registration architectures over three medical image datasets with diverse modalities. The performance gain comes with no cost in memory or run time during inference, but only the deployment of our scoring estimation during training.

2 Related works

Unsupervised medical image registration. Balakrishnan et al. [5] proposed an unsupervised learning framework using U-Net as the displacement estimator. This framework imposes intensity constancy by minimizing the mean squared error between the warped source image and the target image to update the parametric displacement estimator via gradient-based optimization. A number of works have been developed upon this architecture including adding diffeomorphic regularization (Voxelmorph-diff) [10], jointly learning amortized hyperparameters (Hypermorph) [18] and learning contrast-invariant registration without acquired images (SynthMorph) [15]. Recently, Zhang et al. [40] proposed an uncertainty estimation framework that extends the widely used homoscedastic assumption in objectives to heteroscedastic assumption. Inspired by the recent advances in vision transformers [25, 12], Chen et al. [8] introduced TransUNet [7], a hybrid Transformer-CNN architecture. This design replaces encoders with Swin Transformer [25] to enhance the receptive field while preserving convolutional decoders to bolster the model’s ability to capture long-range motion. A number of extensions to this architecture have been proposed, such as substituting convolutional decoders with transformer layers [33] and incorporating multi-scale pyramids [26]. Additionally, score-based diffusion models such as DDPM [34] have shown high-quality performance in generative modeling. To leverage the advantage of DDPM, Kim et al. [22] presents a diffusion-based architecture, composed of a diffusion network and a deformation network. Recent work built upon this architecture explores adding feature and score-wise diffusion [30].

We demonstrate our proposed adaptive framework and compare it with with other related formulations as baselines on three representative architectures including: (1) Voxelmorph [5], (2) Transmorph [8] and (3) Diffusemorph [22]. We also tested our proposed framework on c-LapIRN [27] and deferred the results to the Supp. Mat. due to the page limit.

Adaptive weighting schemes. A wide range of image processing problems involve optimizing an energy function that combines a data-fidelity term and a regularization term. The relative importance between the two terms is usually weighted by a scalar, which disregards the heteroscedastic nature of error residuals [36]. To address this challenge, several adaptive weighting schemes are proposed in the spatial domain and over the course of optimization based on the local residual [17, 16, 37]. Wong et al. [36] later provides a data-driven algorithm that deals with multiple frames [36]. Zhang et al. [40] proposed a … In this paper, we selected AdaReg [37] and AdaFrame [36] as our baselines for adaptive weighting. The aforementioned methods are not learning-based; whereas, we proposed a learning-based correspondence scoring in a collaborative framework.

Aleatoric uncertainty estimation. Our proposed adaptive correspondence scoring is conceptually related to aleatoric uncertainty modeling in the Bayesian learning framework, which aims to estimate input-dependent noise inherent in the observations [21, 4, 28, 19, 32]. This can be attributed to for example motion noise or sensor noise, resulting in uncertainty which cannot be reduced even if more data were to be collected. Kendall et al. [21] proposed a maximum likelihood estimation (MLE) formulation that minimizes the negative log-likelihood (NLL) criterion using stochastic gradient descent. This approach re-weights the data-fidelity term using predictive variance estimates mediated by a regularization term after assuming noise distribution is heteroscedastic Gaussian. Seitzer et al. [32] later identifies that such formulation using inverse variance weighting will result in over-confident of variance estimates, leading to undesired undersampling. Thus, an exponentiated β𝛽\betaitalic_β term is proposed in the new loss formulation, termed β𝛽\betaitalic_β-NLL, to counteract the undersampling leading to undesired performance. In this paper, we selected NLL [21] and β𝛽\betaitalic_β-NLL [32] as baselines for aleatoric uncertainty estimation with a U-Net based variance estimator that is jointly trained with displacement estimator under each formulation.

3 Preliminaries

Let Issubscript𝐼𝑠I_{s}italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT be the source image and Itsubscript𝐼𝑡I_{t}italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT be the target image, where I:Ω[0,1]:𝐼maps-toΩ01I:\Omega\mapsto[0,1]italic_I : roman_Ω ↦ [ 0 , 1 ] is the imaging function after normalization and ΩΩ\Omegaroman_Ω is the image space. Deformable image registration aims to estimate a dense displacement vector field that characterizes the correspondence between two images for each pixel u^:Ω2:^𝑢maps-toΩsuperscript2\hat{u}:\Omega\mapsto\mathbbm{R}^{2}over^ start_ARG italic_u end_ARG : roman_Ω ↦ blackboard_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (Ω3maps-toΩsuperscript3\Omega\mapsto\mathbbm{R}^{3}roman_Ω ↦ blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT for 3D data). For ease of notation and terminology, we will use the terms images and pixels to refer to both 2D and 3D data.

Due to the lack of ground truth, intensity constancy and the smoothness assumption are imposed to constrain the parametric model fθ()subscript𝑓𝜃f_{\theta}(\cdot)italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ ) (e.g., a neural network) as displacement estimator u^=fθ(It,Is)^𝑢subscript𝑓𝜃subscript𝐼𝑡subscript𝐼𝑠\hat{u}=f_{\theta}(I_{t},I_{s})over^ start_ARG italic_u end_ARG = italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) by minimizing the following objective to update the parameters θ𝜃\thetaitalic_θ in the model

=1|Ω|xΩ[It(x)Is(x+u^(x))]2data+λu^(x)2,1Ωsubscript𝑥Ωsubscriptsuperscriptdelimited-[]subscript𝐼𝑡𝑥subscript𝐼𝑠𝑥^𝑢𝑥2subscriptdata𝜆superscriptnorm^𝑢𝑥2\mathcal{L}=\frac{1}{|\Omega|}\sum_{x\in\Omega}\underbrace{[I_{t}(x)-I_{s}(x+% \hat{u}(x))]^{2}}_{\mathcal{L}_{\text{data}}}+\lambda\|\nabla\hat{u}(x)\|^{2},caligraphic_L = divide start_ARG 1 end_ARG start_ARG | roman_Ω | end_ARG ∑ start_POSTSUBSCRIPT italic_x ∈ roman_Ω end_POSTSUBSCRIPT under⏟ start_ARG [ italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_x ) - italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_x + over^ start_ARG italic_u end_ARG ( italic_x ) ) ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT data end_POSTSUBSCRIPT end_POSTSUBSCRIPT + italic_λ ∥ ∇ over^ start_ARG italic_u end_ARG ( italic_x ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , (1)

where xΩ𝑥Ωx\in\Omegaitalic_x ∈ roman_Ω denotes a coordinate and λ𝜆\lambdaitalic_λ denotes the hyperparameter to modulate the trade-off between two terms. In this work, we explore three representative deep neural network architectures that serve as displacement estimators including (1) convolution-based (Voxelmorph [5]), (2) transformer-based (Transmorph [8]) and (3) diffusion-based (Diffusemorph [22]).

4 Methods

Refer to caption
Figure 2: Diagram of training pipeline of our proposed adaptive scoring framework. Our proposed framework first estimates displacement u^^𝑢\hat{u}over^ start_ARG italic_u end_ARG from source and target image pair (Is,It)subscript𝐼𝑠subscript𝐼𝑡(I_{s},I_{t})( italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ). We then apply spatial transform to obtain the warped source image Is(x+u^)subscript𝐼𝑠𝑥^𝑢I_{s}(x+\hat{u})italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_x + over^ start_ARG italic_u end_ARG ). Before computing error residuals, we estimate the correspondence scoring map from the target image Itsubscript𝐼𝑡I_{t}italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and then adaptively weight the error residuals for gradient-based optimization. The detailed training strategy is discussed in Algorithm 1.

4.1 Adaptive displacement estimation

To train these networks in an unsupervised fashion, one typically minimizes Eq. 1 with respect to the model parameters. However, the data term datasubscriptdata\mathcal{L}_{\text{data}}caligraphic_L start_POSTSUBSCRIPT data end_POSTSUBSCRIPT, which measures the intensity difference between two estimated correspondences, relies on intensity constancy and assumes that surfaces reflecting the physical waves (i.e., ultrasound) are largely Lambertian [20] and the acquisition techniques are consistent – in which case, one can determine unique correspondences between two images, if they are covisible. In the case where the corresponding pixel is not covisible, the solution cannot be uniquely determined and one must rely on regularization, i.e., local smoothness modeled by a diffusion regularizer. Under realistic scenarios, the Lambertian assumption is often violated and the difficulty of establishing unique correspondences is further exacerbated by the presence of noise from the sensor.

These conditions can introduce erroneous supervision signals when optimizing the parameters of the model. Suppose that one were to correctly identify the corresponding pixels between two images, the above nuisance factors would still cause the data-fidelity term of Eq. 1 to yield non-zero residuals, which induces gradients when backpropagating. Within the optimization of the weights θ𝜃\thetaitalic_θ, this update may translate to moving out of a local (possibly optimal) minima and result in performance degradations as shown in Fig. 1.

Thus, we propose an adaptive framework shown in Fig. 2 that incorporates a predictive correspondence scoring map to prevent displacement estimation (de) from being dominated by error residuals due to nuisance variability. Our method is realized as an adaptive weighting term that can be generically integrated into the conventional loss function for unsupervised training:

de=1|Ω|xΩS^(x)[It(x)Is(x+u^(x))]2+λu^(x)2.subscriptde1Ωsubscript𝑥Ω^𝑆𝑥superscriptdelimited-[]subscript𝐼𝑡𝑥subscript𝐼𝑠𝑥^𝑢𝑥2𝜆superscriptnorm^𝑢𝑥2\mathcal{L}_{\text{de}}=\frac{1}{|\Omega|}\sum_{x\in\Omega}\lfloor\hat{S}(x)% \rfloor[I_{t}(x)-I_{s}(x+\hat{u}(x))]^{2}+\lambda\|\nabla\hat{u}(x)\|^{2}.caligraphic_L start_POSTSUBSCRIPT de end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG | roman_Ω | end_ARG ∑ start_POSTSUBSCRIPT italic_x ∈ roman_Ω end_POSTSUBSCRIPT ⌊ over^ start_ARG italic_S end_ARG ( italic_x ) ⌋ [ italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_x ) - italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_x + over^ start_ARG italic_u end_ARG ( italic_x ) ) ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_λ ∥ ∇ over^ start_ARG italic_u end_ARG ( italic_x ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (2)

S^:Ω[0,1]:^𝑆maps-toΩ01\hat{S}:\Omega\mapsto[0,1]over^ start_ARG italic_S end_ARG : roman_Ω ↦ [ 0 , 1 ] is a dense predictive pixel-wise scoring map to model the confidence score of how well one can establish correspondence between the two images, where higher values indicate a lower degree of effect from the nuisance variables (i.e., covisibility, noise, etc.). Note that this is equivalent to treating S^^𝑆\hat{S}over^ start_ARG italic_S end_ARG as the degree to which nuisance variables affect establishing correspondence between source and target images, and using its complement 1S^1^𝑆1-\hat{S}1 - over^ start_ARG italic_S end_ARG to downweight pixels that lack correspondence. Floor symbol \lfloor\cdot\rfloor⌊ ⋅ ⌋ denotes stop gradient operation.

Unsupervised correspondence scoring. The scoring estimator gϕ()subscript𝑔italic-ϕg_{\phi}(\cdot)italic_g start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( ⋅ ) predicts a soft map to show the degree of correspondence existence from the target image as S^=gϕ(It)^𝑆subscript𝑔italic-ϕsubscript𝐼𝑡\hat{S}=g_{\phi}(I_{t})over^ start_ARG italic_S end_ARG = italic_g start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ). During training, the scoring estimator is optimized alternatingly with the displacement estimator fθ()subscript𝑓𝜃f_{\theta}(\cdot)italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ ) under the following unsupervised correspondence scoring (ucs) objective

ucs=1|Ω|xΩS^(x)[It(x)Is(x+u^(x))]2.subscriptucs1Ωsubscript𝑥Ω^𝑆𝑥superscriptdelimited-[]subscript𝐼𝑡𝑥subscript𝐼𝑠𝑥^𝑢𝑥2\displaystyle\mathcal{L}_{\text{ucs}}=\frac{1}{|\Omega|}\sum_{x\in\Omega}\hat{% S}(x)[I_{t}(x)-I_{s}(x+\lfloor\hat{u}(x)\rfloor)]^{2}.caligraphic_L start_POSTSUBSCRIPT ucs end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG | roman_Ω | end_ARG ∑ start_POSTSUBSCRIPT italic_x ∈ roman_Ω end_POSTSUBSCRIPT over^ start_ARG italic_S end_ARG ( italic_x ) [ italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_x ) - italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_x + ⌊ over^ start_ARG italic_u end_ARG ( italic_x ) ⌋ ) ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (3)

Eq. 3 encourages the estimator to assign a lower score to regions with higher error residuals computed using the displacement estimated by fθ()subscript𝑓𝜃f_{\theta}(\cdot)italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ ). Note that u^^𝑢\hat{u}over^ start_ARG italic_u end_ARG is detached by stop gradient operator \lfloor\cdot\rfloor⌊ ⋅ ⌋ to avoid doubly traversing the computational graph. However, minimizing Eq. 3 alone will lead to a trivial solution where all scores are zeros. Thus, proper regularization is needed to prevent the estimator from learning such a solution.

Scoring estimator regularization. Given the range of the correspondence scoring map is S^(x)[0,1]^𝑆𝑥01\hat{S}(x)\in[0,1]over^ start_ARG italic_S end_ARG ( italic_x ) ∈ [ 0 , 1 ], we design an objective to regularize the scoring map to avoid the solution of all zeros:

reg=1|Ω|xΩ[1S^(x)]2.subscriptreg1Ωsubscript𝑥Ωsuperscriptdelimited-[]1^𝑆𝑥2\displaystyle\mathcal{L}_{\text{reg}}=\frac{1}{|\Omega|}\sum_{x\in\Omega}[1-% \hat{S}(x)]^{2}.caligraphic_L start_POSTSUBSCRIPT reg end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG | roman_Ω | end_ARG ∑ start_POSTSUBSCRIPT italic_x ∈ roman_Ω end_POSTSUBSCRIPT [ 1 - over^ start_ARG italic_S end_ARG ( italic_x ) ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (4)

With the above regularizer alone, the scoring map will inevitably be non-smooth (e.g., flipping scores between pixels) around neighboring regions that the model identifies as low correspondence during training. This is not desirable since such irregularities in the scoring map will also result in irregular supervision signal for the displacement estimator (i.e., large discrepancies in the data term within a neighborhood). This creates artifacts and distortion in image warping, which will degrade resulting performance (Tab. 2). Thus, we impose a smoothness term in the predictive scoring map to preserve such characteristics.

Momentum guided adaptive smoothness regularization. To impose smoothness regularization on the estimated correspondence scoring map, we introduce a momentum-guided adaptive weighting strategy that follows the training dynamics of the displacement estimator. In the early stages of training, the displacement and scoring estimators will be inaccurate. Having strong smoothness will impede the learning by constraining scores; thus, lower degree of regularization during this phase should be imposed to allow for exploration to find suitable hypotheses. In the later stage of training, we increase degree of regularization to allow for convergence when re-weighting error residuals.

To accommodate this design, we utilize the momentum of error residuals as an indicator of the current training status, adjusting the degree of the smoothness constraint accordingly. The mean residuals at the training step T𝑇Titalic_T is given by

μT=1|Ω|xΩ[It(x)Is(x+u^T(x))]2.subscript𝜇𝑇1Ωsubscript𝑥Ωsuperscriptdelimited-[]subscript𝐼𝑡𝑥subscript𝐼𝑠𝑥subscript^𝑢𝑇𝑥2\mu_{T}=\frac{1}{|\Omega|}\sum_{x\in\Omega}[I_{t}(x)-I_{s}(x+\lfloor\hat{u}_{T% }(x)\rfloor)]^{2}.italic_μ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG | roman_Ω | end_ARG ∑ start_POSTSUBSCRIPT italic_x ∈ roman_Ω end_POSTSUBSCRIPT [ italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_x ) - italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_x + ⌊ over^ start_ARG italic_u end_ARG start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( italic_x ) ⌋ ) ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (5)

Recall that Is(x),It(x)[0,1]subscript𝐼𝑠𝑥subscript𝐼𝑡𝑥01I_{s}(x),I_{t}(x)\in[0,1]italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_x ) , italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_x ) ∈ [ 0 , 1 ] and u^Tsubscript^𝑢𝑇\lfloor\hat{u}_{T}\rfloor⌊ over^ start_ARG italic_u end_ARG start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ⌋ denotes stopping gradient on estimated displacement. Note: the mean residuals are within range μT[0,1]subscript𝜇𝑇01\mu_{T}\in[0,1]italic_μ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ∈ [ 0 , 1 ]. We then apply a cosine function that is monotonically decreasing concavely as the activation

bT=cosπ2μT.subscript𝑏𝑇𝜋2subscript𝜇𝑇b_{T}=\cos\frac{\pi}{2}\mu_{T}.italic_b start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT = roman_cos divide start_ARG italic_π end_ARG start_ARG 2 end_ARG italic_μ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT . (6)

To compute the momentum of residuals mTsubscript𝑚𝑇m_{T}italic_m start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT, we apply the exponential moving average across different time steps with decay factor γ=0.99𝛾0.99\gamma=0.99italic_γ = 0.99 and m0=0subscript𝑚00m_{0}=0italic_m start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0 as

mT=γmT1+(1γ)bT.subscript𝑚𝑇𝛾subscript𝑚𝑇11𝛾subscript𝑏𝑇m_{T}=\gamma m_{T-1}+(1-\gamma)b_{T}.italic_m start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT = italic_γ italic_m start_POSTSUBSCRIPT italic_T - 1 end_POSTSUBSCRIPT + ( 1 - italic_γ ) italic_b start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT . (7)

We use the computed momentum mT[0,1]subscript𝑚𝑇01m_{T}\in[0,1]italic_m start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ∈ [ 0 , 1 ] as the adaptive weight to reflect the training progress and use it to guide the diffusion regularizer:

smooth=mT1|Ω|xΩS^(x)2.subscriptsmoothsubscript𝑚𝑇1Ωsubscript𝑥Ωsuperscriptnorm^𝑆𝑥2\mathcal{L}_{\text{smooth}}=m_{T}\frac{1}{|\Omega|}\sum_{x\in\Omega}\|\nabla% \hat{S}(x)\|^{2}.caligraphic_L start_POSTSUBSCRIPT smooth end_POSTSUBSCRIPT = italic_m start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG | roman_Ω | end_ARG ∑ start_POSTSUBSCRIPT italic_x ∈ roman_Ω end_POSTSUBSCRIPT ∥ ∇ over^ start_ARG italic_S end_ARG ( italic_x ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (8)

By utilizing the scoring estimator during training, we prevent the displacement estimator from escaping local minima by re-weighting the error residuals with a smooth pixel-wise correspondence scoring, leading to registration performance improvement (Tab. 1). The effectiveness of momentum-based weighting is shown in Fig. 5. To exploit the characteristics of the displacement estimator and scoring estimator, a proper optimization strategy needs to be designed to ensure the effectiveness of the proposed adaptive framework as detailed below.

4.2 Optimization of proposed adaptive framework

Our loss for the scoring estimator gϕ()subscript𝑔italic-ϕg_{\phi}(\cdot)italic_g start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( ⋅ ) is a combination of Eqs. 3, 4 and 8

se=ucs+αreg+βsmoothsubscriptsesubscriptucs𝛼subscriptreg𝛽subscriptsmooth\mathcal{L}_{\text{se}}=\mathcal{L}_{\text{ucs}}+\alpha\mathcal{L}_{\text{reg}% }+\beta\mathcal{L}_{\text{smooth}}caligraphic_L start_POSTSUBSCRIPT se end_POSTSUBSCRIPT = caligraphic_L start_POSTSUBSCRIPT ucs end_POSTSUBSCRIPT + italic_α caligraphic_L start_POSTSUBSCRIPT reg end_POSTSUBSCRIPT + italic_β caligraphic_L start_POSTSUBSCRIPT smooth end_POSTSUBSCRIPT (9)

with hyperparameter α𝛼\alphaitalic_α and β𝛽\betaitalic_β to modulate the trade-off between different terms.

To optimize displacement estimator fθ()subscript𝑓𝜃f_{\theta}(\cdot)italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ ) and scoring estimator gϕ()subscript𝑔italic-ϕg_{\phi}(\cdot)italic_g start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( ⋅ ) in training, we propose a collaborative strategy summarized in Algorithm 1. Since the parameter update of the displacement estimator and scoring estimator is correlated, as defined in Eq. 2 and Eq. 9, the noise in gradients tends to get exacerbated in the early stage of training when both estimators fail to provide each other an accurate prediction in order to properly optimize. To prevent this, we propose a warm-up stage that trains fθ()subscript𝑓𝜃f_{\theta}(\cdot)italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ ) and gϕ()subscript𝑔italic-ϕg_{\phi}(\cdot)italic_g start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( ⋅ ) individually for Nwsubscript𝑁𝑤N_{w}italic_N start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT epochs to reduce the error propagation in between. After the warm-up stage, we perform alternating optimization for both estimators.

Data: Source image Issubscript𝐼𝑠I_{s}italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT and target image Itsubscript𝐼𝑡I_{t}italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
Result: Estimated displacement u^^𝑢\hat{u}over^ start_ARG italic_u end_ARG
1 Initialization, Nwsubscript𝑁𝑤N_{w}italic_N start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT: warm-up epochs, N𝑁Nitalic_N: number of epochs;
2 while epoch i𝑖iitalic_i to N𝑁Nitalic_N do
3       if i<Nw𝑖subscript𝑁𝑤i<N_{w}italic_i < italic_N start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT then
4             flag_disp = True, flag_score = False;
5            
6      else if Nwi<2Nwsubscript𝑁𝑤𝑖2subscript𝑁𝑤N_{w}\leq i<2N_{w}italic_N start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ≤ italic_i < 2 italic_N start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT then
7             flag_disp = False, flag_score = True;
8            
9      else
10             flag_disp = True, flag_score = True;
11            
12      if flag_disp then
13             u^=fθ(Is,It)^𝑢subscript𝑓𝜃subscript𝐼𝑠subscript𝐼𝑡\hat{u}=f_{\theta}(I_{s},I_{t})over^ start_ARG italic_u end_ARG = italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT );
14             if flag_score then
15                   S^=gϕ(Is)^𝑆subscript𝑔italic-ϕsubscript𝐼𝑠\hat{S}=g_{\phi}(I_{s})over^ start_ARG italic_S end_ARG = italic_g start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT );
16                  
17            else
18                   S^=𝟙^𝑆1\hat{S}=\mathbbm{1}over^ start_ARG italic_S end_ARG = blackboard_1;
19                  
20            θ=θηde(Is(x+u^),It,S^)θ𝜃𝜃𝜂subscriptdesubscript𝐼𝑠𝑥^𝑢subscript𝐼𝑡^𝑆𝜃\theta=\theta-\eta\frac{\partial\mathcal{L}_{\text{de}}(I_{s}(x+\hat{u}),I_{t}% ,\lfloor\hat{S}\rfloor)}{\partial\theta}italic_θ = italic_θ - italic_η divide start_ARG ∂ caligraphic_L start_POSTSUBSCRIPT de end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_x + over^ start_ARG italic_u end_ARG ) , italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , ⌊ over^ start_ARG italic_S end_ARG ⌋ ) end_ARG start_ARG ∂ italic_θ end_ARG;
21            
22      if flag_score then
23             u^=fθ(Is,It)^𝑢subscript𝑓𝜃subscript𝐼𝑠subscript𝐼𝑡\hat{u}=f_{\theta}(I_{s},I_{t})over^ start_ARG italic_u end_ARG = italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT );
24             S^=gϕ(Is)^𝑆subscript𝑔italic-ϕsubscript𝐼𝑠\hat{S}=g_{\phi}(I_{s})over^ start_ARG italic_S end_ARG = italic_g start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT );
25             ϕ=ϕηse(Is(x+u^),It,S^)ϕitalic-ϕitalic-ϕ𝜂subscriptsesubscript𝐼𝑠𝑥^𝑢subscript𝐼𝑡^𝑆italic-ϕ\phi=\phi-\eta\frac{\partial\mathcal{L}_{\text{se}}(I_{s}(x+\lfloor\hat{u}% \rfloor),I_{t},\hat{S})}{\partial\phi}italic_ϕ = italic_ϕ - italic_η divide start_ARG ∂ caligraphic_L start_POSTSUBSCRIPT se end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_x + ⌊ over^ start_ARG italic_u end_ARG ⌋ ) , italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , over^ start_ARG italic_S end_ARG ) end_ARG start_ARG ∂ italic_ϕ end_ARG;
26            
27      
Algorithm 1 Training loop

By training our proposed adaptive framework using the above optimization strategy (Algorithm 1), we prevent the displacement estimator from drifting due to noisy gradients induced by nuisance variables such as covisibility and noise by re-weighting the error residuals using our adaptive correspondence scoring.

5 Experiments

Datasets. We tested our proposed framework on three different cardiac datasets, including two 2D public datasets, containing two medical imaging modalities (MRI and ultrasound), and one private 3D dataset. To construct the image pair, we selected the end-diastole (ED) frame as the source image and the end-systole (ES) frame as the target image. ED to ES registration is considered long-range and most challenging in the cardiac sequence. The detailed steps of dataset preprocessing can be found in the Supp. Mat.

ACDC [6]. The ACDC dataset contains 2D human cardiac MRI from 150 patients with various cardiac conditions. We randomly selected 80 patients containing 751 image pairs for training, 20 patients containing 200 pairs for validation, and the remaining 50 patients containing 538 pairs for testing.

CAMUS [24]. The CAMUS dataset contains 2D human cardiac ultrasound images from 500 subjects. We randomly selected 600 image pairs for training, 200 pairs for validation, and another 200 pairs for testing.

Private 3D Echocardiography [35, 1]. To validate the effectiveness of our proposed method, we also tested on a private 3D echocardiography dataset and reported our results in the Supp. Mat.

Evaluation metrics. We evaluate our results quantitatively by warping myocardium segmentation in the source image with our predicted displacement u^^𝑢\hat{u}over^ start_ARG italic_u end_ARG and compute anatomical conformance in terms of (1) Dice coefficient score (DSC) (2) Hausdorff distance (HD) and (3) Average surface distance (ASD) with ground truth segmentation in the target image. Definitions of metrics can be found in the Supp. Mat.

Implementation details. All our experiments were implemented using Pytorch on NVIDIA V100/A5000 GPUs. The architecture of the scoring estimator is implemented on a U-Net backbone. Code is provided to ensure reproducibility. To show the versatility of our proposed framework, we tested on three representative unsupervised registration architectures for each dataset: (1) Voxelmorph [5], (2) Transmorph [8] and (3) Diffusemorph [22]. We also tested on c-LapIRN architecture [27] and present the quantitative results in the Supp. Mat. due to the page limit. The descriptions of baselines are summarized below and remaining details (i.e., hyperparameters) can be found in the Supp. Mat.

Adaptive weighting schemes. We present two baselines AdaReg and AdaFrame that, like us, utilize adaptive weighting schemes during training.

AdaReg [37]. We compute first the local error residual ρ=|It(x)Is(x+u^)|𝜌subscript𝐼𝑡𝑥subscript𝐼𝑠𝑥^𝑢\rho=|I_{t}(x)-I_{s}(x+\hat{u})|italic_ρ = | italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_x ) - italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_x + over^ start_ARG italic_u end_ARG ) | and the global residual σ=11|Ω|xΩ|It(x)Is(x+u^)|𝜎11Ωsubscript𝑥Ωsubscript𝐼𝑡𝑥subscript𝐼𝑠𝑥^𝑢\sigma=\frac{1}{\frac{1}{|\Omega|}\sum_{x\in\Omega}|I_{t}(x)-I_{s}(x+\hat{u})|}italic_σ = divide start_ARG 1 end_ARG start_ARG divide start_ARG 1 end_ARG start_ARG | roman_Ω | end_ARG ∑ start_POSTSUBSCRIPT italic_x ∈ roman_Ω end_POSTSUBSCRIPT | italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_x ) - italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_x + over^ start_ARG italic_u end_ARG ) | end_ARG using the displacement estimator prediction u^^𝑢\hat{u}over^ start_ARG italic_u end_ARG. We then compute the adaptive regularization weighting as α(x)=exp(cρσ)𝛼𝑥𝑐𝜌𝜎\alpha(x)=\exp(-\frac{c\rho}{\sigma})italic_α ( italic_x ) = roman_exp ( - divide start_ARG italic_c italic_ρ end_ARG start_ARG italic_σ end_ARG ), where c=50𝑐50c=50italic_c = 50. We then optimize the displacement estimator using the loss AdaReg=1|Ω|xΩ[It(x)Is(x+u^)]2+λα(x)u^2subscriptAdaReg1Ωsubscript𝑥Ωsuperscriptdelimited-[]subscript𝐼𝑡𝑥subscript𝐼𝑠𝑥^𝑢2𝜆superscriptnorm𝛼𝑥^𝑢2\mathcal{L}_{\text{AdaReg}}=\frac{1}{|\Omega|}\sum_{x\in\Omega}[I_{t}(x)-I_{s}% (x+\hat{u})]^{2}+\lambda\|\alpha(x)\nabla\hat{u}\|^{2}caligraphic_L start_POSTSUBSCRIPT AdaReg end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG | roman_Ω | end_ARG ∑ start_POSTSUBSCRIPT italic_x ∈ roman_Ω end_POSTSUBSCRIPT [ italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_x ) - italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_x + over^ start_ARG italic_u end_ARG ) ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_λ ∥ italic_α ( italic_x ) ∇ over^ start_ARG italic_u end_ARG ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT.

AdaFrame [36]. We first compute the local error residual δ=|It(x)Is(x+u^)|𝛿subscript𝐼𝑡𝑥subscript𝐼𝑠𝑥^𝑢\delta=|I_{t}(x)-I_{s}(x+\hat{u})|italic_δ = | italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_x ) - italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_x + over^ start_ARG italic_u end_ARG ) | and then normalize it with its mean μ𝜇\muitalic_μ and standard deviation σ𝜎\sigmaitalic_σ as ρ=δμσ2+ϵ𝜌𝛿𝜇superscript𝜎2italic-ϵ\rho=\frac{\delta-\mu}{\sqrt{\sigma^{2}+\epsilon}}italic_ρ = divide start_ARG italic_δ - italic_μ end_ARG start_ARG square-root start_ARG italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_ϵ end_ARG end_ARG. We then compute the adaptive weight activated by a scaled and shifted sigmoid function as α(x)=111+exp((aρb))𝛼𝑥111𝑎𝜌𝑏\alpha(x)=1-\frac{1}{1+\exp(-(a\rho-b))}italic_α ( italic_x ) = 1 - divide start_ARG 1 end_ARG start_ARG 1 + roman_exp ( - ( italic_a italic_ρ - italic_b ) ) end_ARG where a=a0μ+ϵ𝑎subscript𝑎0𝜇italic-ϵa=\frac{a_{0}}{\mu+\epsilon}italic_a = divide start_ARG italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG start_ARG italic_μ + italic_ϵ end_ARG and b=b0(1cosπμ)𝑏subscript𝑏01𝜋𝜇b=b_{0}(1-\cos\pi\mu)italic_b = italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( 1 - roman_cos italic_π italic_μ ). We choose a0=0.1subscript𝑎00.1a_{0}=0.1italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0.1 and b0=10subscript𝑏010b_{0}=10italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 10. We then optimize the displacement estimator using the loss AdaFrame=1|Ω|xΩα(x)[It(x)Is(x+u^)]2+λu^2subscriptAdaFrame1Ωsubscript𝑥Ω𝛼𝑥superscriptdelimited-[]subscript𝐼𝑡𝑥subscript𝐼𝑠𝑥^𝑢2𝜆superscriptnorm^𝑢2\mathcal{L}_{\text{AdaFrame}}=\frac{1}{|\Omega|}\sum_{x\in\Omega}\alpha(x)[I_{% t}(x)-I_{s}(x+\hat{u})]^{2}+\lambda\|\nabla\hat{u}\|^{2}caligraphic_L start_POSTSUBSCRIPT AdaFrame end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG | roman_Ω | end_ARG ∑ start_POSTSUBSCRIPT italic_x ∈ roman_Ω end_POSTSUBSCRIPT italic_α ( italic_x ) [ italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_x ) - italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_x + over^ start_ARG italic_u end_ARG ) ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_λ ∥ ∇ over^ start_ARG italic_u end_ARG ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT.

Aleatoric uncertainty estimation. We present two baselines that utilize aleatoric uncertainty estimates in terms of predictive variance to weight the objective during training adaptively. In order to obtain the predictive variance, we utilize a U-Net as variance estimator h()h(\cdot)italic_h ( ⋅ ) that takes target image Itsubscript𝐼𝑡I_{t}italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and warped source Is(x+u^)subscript𝐼𝑠𝑥^𝑢I_{s}(x+\hat{u})italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_x + over^ start_ARG italic_u end_ARG ) as input to predict the noise variance σ^I=h(It,Is(x+u^))subscript^𝜎𝐼subscript𝐼𝑡subscript𝐼𝑠𝑥^𝑢\hat{\sigma}_{I}=h(I_{t},I_{s}(x+\hat{u}))over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT = italic_h ( italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_x + over^ start_ARG italic_u end_ARG ) ). The variance estimator is trained jointly with the displacement estimator.

NLL [21]. To jointly train the displacement and variance estimators, we compute the loss as NLL=1|Ω|xΩ1σ^I2(x)[It(x)Is(x+u^)]2+logσ^I2(x)subscriptNLL1Ωsubscript𝑥Ω1superscriptsubscript^𝜎𝐼2𝑥superscriptdelimited-[]subscript𝐼𝑡𝑥subscript𝐼𝑠𝑥^𝑢2subscriptsuperscript^𝜎2𝐼𝑥\mathcal{L}_{\text{NLL}}=\frac{1}{|\Omega|}\sum_{x\in\Omega}\frac{1}{\hat{% \sigma}_{I}^{2}(x)}[I_{t}(x)-I_{s}(x+\hat{u})]^{2}+\log\hat{\sigma}^{2}_{I}(x)caligraphic_L start_POSTSUBSCRIPT NLL end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG | roman_Ω | end_ARG ∑ start_POSTSUBSCRIPT italic_x ∈ roman_Ω end_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_x ) end_ARG [ italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_x ) - italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_x + over^ start_ARG italic_u end_ARG ) ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + roman_log over^ start_ARG italic_σ end_ARG start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT ( italic_x )

β𝛽\betaitalic_β-NLL [32]. To jointly train the displacement and variance estimators under β𝛽\betaitalic_β-NLL objective, we compute the loss as β-NLL=1|Ω|xΩσ^I(x)2β(1σ^I2(x)[It(x)Is(x+u^)]2+logσ^I2(x))subscript𝛽-NLL1Ωsubscript𝑥Ωsubscript^𝜎𝐼superscript𝑥2𝛽1superscriptsubscript^𝜎𝐼2𝑥superscriptdelimited-[]subscript𝐼𝑡𝑥subscript𝐼𝑠𝑥^𝑢2subscriptsuperscript^𝜎2𝐼𝑥\mathcal{L}_{\beta\text{-NLL}}=\frac{1}{|\Omega|}\sum_{x\in\Omega}\lfloor\hat{% \sigma}_{I}(x)^{2\beta}\rfloor(\frac{1}{\hat{\sigma}_{I}^{2}(x)}[I_{t}(x)-I_{s% }(x+\hat{u})]^{2}+\log\hat{\sigma}^{2}_{I}(x))caligraphic_L start_POSTSUBSCRIPT italic_β -NLL end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG | roman_Ω | end_ARG ∑ start_POSTSUBSCRIPT italic_x ∈ roman_Ω end_POSTSUBSCRIPT ⌊ over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT ( italic_x ) start_POSTSUPERSCRIPT 2 italic_β end_POSTSUPERSCRIPT ⌋ ( divide start_ARG 1 end_ARG start_ARG over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_x ) end_ARG [ italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_x ) - italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_x + over^ start_ARG italic_u end_ARG ) ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + roman_log over^ start_ARG italic_σ end_ARG start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT ( italic_x ) ) where β=0.5𝛽0.5\beta=0.5italic_β = 0.5.

6 Results

Refer to caption Refer to caption Refer to caption
Refer to caption Refer to caption Refer to caption
Figure 3: Qualitative evaluation of our method against the second-best approach in each dataset (top two rows: ACDC [6] and bottom two rows: CAMUS [22]). Each block, delineated by black solid lines, contains source and target images with myocardium segmentation contours. The top row displays the original images, and the bottom row showcases head-to-head comparison (warped source Is(x+u^)subscript𝐼𝑠𝑥^𝑢I_{s}(x+\hat{u})italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_x + over^ start_ARG italic_u end_ARG )) between our method and the second-best method. The yellow highlights indicate the ground truth ES myocardium. Dice scores are reported in the subtitles.

Registration accuracy. We present our quantitative evaluation in Tab. 1, where our proposed method shows consistent improvement over other baselines. Surprisingly, we observe that incorporating baselines (adaptive weighting schemes and uncertainty estimation) tend to harm performance. Unlike them, our method improves over the base models. Compared to the base models (Voxelmorph [5], Transmorph [8], and Diffusemorph [22]), which are the second-best methods in each architecture, our proposed method performs better especially in terms of Dice score on each dataset. We additionally conducted a paired t-test of our proposed method to show that our consistent improvement is statistically significant as in Tab. 3.

Table 1: Comparisons on contour-based metrics. Units: DSC (%) HD (vx) ASD (vx). Our method consistently improves on registration accuracy across different architectures and datasets.

ACDC [6] CAMUS [24] DSC \uparrow HD \downarrow ASD \downarrow DSC \uparrow HD \downarrow ASD \downarrow Undeformed 47.98 7.91 2.32 66.77 10.87 2.61 Elastix [23] 77.26 4.95 1.28 80.18 10.02 1.81 CNN Voxelmorph [5] 79.48 4.79 1.27 81.50 8.72 1.74 NLL [21] 76.49 5.46 1.45 75.24 11.05 2.20 β𝛽\betaitalic_β-NLL [32] 78.74 5.07 1.33 79.75 9.39 1.93 AdaFrame [36] 66.38 5.80 1.67 77.88 10.54 1.93 AdaReg [37] 78.75 5.13 1.33 79.31 9.78 1.88 AdaCS (Ours) 80.50 4.69 1.23 81.74 8.55 1.72 Transformer Transmorph [8] 76.94 5.51 1.30 79.24 10.30 1.79 NLL [21] 73.12 7.22 1.27 75.08 11.60 1.79 β𝛽\betaitalic_β-NLL [32] 75.74 6.12 1.29 77.39 10.99 1.86 AdaFrame [36] 67.95 5.72 1.59 78.06 9.86 1.91 AdaReg [37] 76.22 5.68 1.29 78.12 10.62 1.84 AdaCS (Ours) 78.39 5.40 1.32 79.64 9.85 1.79 Diffusion Diffusemorph [22] 67.38 5.80 1.67 75.23 9.80 2.07 NLL [21] 66.24 5.84 1.73 74.78 10.62 2.15 β𝛽\betaitalic_β-NLL [32] 66.31 5.93 1.74 73.27 9.85 2.25 AdaFrame [36] 59.78 6.46 1.93 75.04 10.41 2.10 AdaReg [37] 69.41 6.25 1.78 74.36 10.66 2.21 AdaCS (Ours) 72.09 5.35 1.53 77.65 9.82 1.99

Table 2: Ablation study on loss terms. Note: Removing regsubscriptreg\mathcal{L}_{\text{reg}}caligraphic_L start_POSTSUBSCRIPT reg end_POSTSUBSCRIPT leads to a degenerate solution of all-zeros for the scoring estimator gϕ()subscript𝑔italic-ϕg_{\phi}(\cdot)italic_g start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( ⋅ ).

Loss ACDC [6] CAMUS [24] regsubscriptreg\mathcal{L}_{\text{reg}}caligraphic_L start_POSTSUBSCRIPT reg end_POSTSUBSCRIPT smoothsubscriptsmooth{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}% \pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\mathcal{L}_{\text{% smooth}}}caligraphic_L start_POSTSUBSCRIPT smooth end_POSTSUBSCRIPT DSC \uparrow HD \downarrow ASD \downarrow DSC \uparrow HD \downarrow ASD \downarrow vxm 80.24 4.64 1.23 81.58 8.89 1.74 80.50 4.69 1.23 81.74 8.55 1.72 tsm 77.84 5.41 1.33 79.58 10.17 1.81 78.39 5.40 1.32 79.64 9.85 1.79 dfm 71.62 5.56 1.58 77.32 9.71 2.00 72.09 5.35 1.53 77.65 9.82 1.99

Table 3: Paired t-test of our proposed method vs second-best method in Tab. 1 for each dataset in terms of DSC. The p-values shows that our improvement is statistically significant.

Voxelmorph [5] Transmorph [8] Diffusemorph [22] ACDC [6] p=5.85×1016𝑝5.85superscript1016p=5.85\times 10^{-16}italic_p = 5.85 × 10 start_POSTSUPERSCRIPT - 16 end_POSTSUPERSCRIPT p=0.046𝑝0.046p=0.046italic_p = 0.046 p=1.01×1041𝑝1.01superscript1041p=1.01\times 10^{-41}italic_p = 1.01 × 10 start_POSTSUPERSCRIPT - 41 end_POSTSUPERSCRIPT CAMUS [24] p=0.01𝑝0.01p=0.01italic_p = 0.01 p=0.01𝑝0.01p=0.01italic_p = 0.01 p=4.56×1041𝑝4.56superscript1041p=4.56\times 10^{-41}italic_p = 4.56 × 10 start_POSTSUPERSCRIPT - 41 end_POSTSUPERSCRIPT

To quantitatively evaluate the registration performance, we plot the warped source image along with segmentation overlayed with ground truth shown in Fig. 3. Our proposed framework is visually better across datasets with different modalities and frameworks. Both the quantitative and qualitative evaluations show that our proposed framework captures more accurate correspondence, validating the effectiveness of our proposed adaptive scoring.

Tab. 1 shows that both uncertainty weighting strategies, NLL [21] and β𝛽\betaitalic_β-NLL [32], did not yield improvements over the baseline. This observation indicates that optimization of uncertainty estimation cannot be approached with a simple joint optimization alongside image registration. Furthermore, we observe that adaptive regularization techniques, exemplified by AdaFrame [36] and AdaReg [37], were ineffective in enhancing performance, which could be attributed to the computation of adaptive weights based on statistical assumptions of error residuals contrary to our proposed unsupervised learning approach. We note a discernible decline in performance when transitioning from Voxelmorph [5], which utilizes ConvNets, to Transmorph [8] and Diffusemorph [22]. This decline underscores the challenges faced by transformer and diffusion-based models when applied to datasets comprising smaller-scale medical images. We also present some failure cases in the Supp. Mat., where the myocardium (typically challenging due to irregular volumes) is considerably thin.

Refer to caption
Figure 4: Qualitative visualization of our proposed framework in Voxelmorph architecture [5] on ACDC [6] (top row) and CAMUS [24] (bottom row) validation sets. The third column exhibits successful matching corroborated by the estimated displacement in the fourth column, but the error map in the fifth column reveals residuals. Our predicted scoring map in the sixth column identifies and prevents drift of fθ()subscript𝑓𝜃f_{\theta}(\cdot)italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ ), as demonstrated by the re-weighted error in the last column.

Results of correspondence scoring map and adaptive weighting. To qualitatively evaluate the effectiveness of our proposed correspondence scoring during training, we present Fig. 4 to show that our predicted scoring map accurately identifies the regions with low correspondence and prevents the displacement estimator fθ()subscript𝑓𝜃f_{\theta}(\cdot)italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ ) from drifting away by re-weighting the error residuals.

Refer to caption Refer to caption
Figure 5: Training dynamics of μTsubscript𝜇𝑇\mu_{T}italic_μ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT and mTsubscript𝑚𝑇m_{T}italic_m start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT. Left: Mean of error residuals μTsubscript𝜇𝑇\mu_{T}italic_μ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT in Eq. 5. Right: Adaptive momentum guided weight mTsubscript𝑚𝑇m_{T}italic_m start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT in Eq. 7.
Table 4: Effectiveness of the momentum-based weighting mTsubscript𝑚𝑇m_{T}italic_m start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT in Eq. 8.

ACDC CAMUS DSC \uparrow HD \downarrow ASD \downarrow DSC \uparrow HD \downarrow ASD \downarrow vxm w/o mTsubscript𝑚𝑇m_{T}italic_m start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT 80.00 4.71 1.24 81.27 8.86 1.76 w. mTsubscript𝑚𝑇m_{T}italic_m start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT 80.50 4.69 1.23 81.74 8.55 1.72 tsm w/o mTsubscript𝑚𝑇m_{T}italic_m start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT 77.12 5.65 1.32 79.43 9.85 1.79 w. mTsubscript𝑚𝑇m_{T}italic_m start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT 78.39 5.40 1.32 79.64 9.85 1.79 dfm w/o mTsubscript𝑚𝑇m_{T}italic_m start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT 71.62 5.56 1.58 77.28 9.61 2.00 w. mTsubscript𝑚𝑇m_{T}italic_m start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT 72.09 5.35 1.53 77.65 9.82 1.99

In Fig. 5, we illustrate the behavior of our proposed adaptive momentum-guided weight, denoted as mTsubscript𝑚𝑇m_{T}italic_m start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT (Eq. 7). This weight dynamically adapts during training, responding to the evolving characteristics of error residuals (Eq. 5). The observed evolution of our adaptive weight aligns with our design objective, progressively enhancing smoothness penalization as training toward convergence. The effectiveness of mTsubscript𝑚𝑇m_{T}italic_m start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT is also evidenced by our quantitative ablation analysis in Fig. 5, in which our proposed formulation generally achieves a better performance with the adaptive weight.

Ablation study. We present Tab. 2 for ablation studies to show the effectiveness of each loss term. Without regsubscriptreg\mathcal{L}_{\text{reg}}caligraphic_L start_POSTSUBSCRIPT reg end_POSTSUBSCRIPT and smoothsubscriptsmooth\mathcal{L}_{\text{smooth}}caligraphic_L start_POSTSUBSCRIPT smooth end_POSTSUBSCRIPT, the predicted correspondence scoring map S^^𝑆\hat{S}over^ start_ARG italic_S end_ARG will result in a degenerate solution of all zeros. By introducing regularization and smoothness constraints, the performance of our proposed framework steadily increases across various datasets and registration architectures. We also perform an ablation study comparing with robust losses including mutual information (MI) and normalized cross correlation (NCC) shown in Tab. 4 in the Supp. Mat. Our conclusions still hold as our method yields consistent performance improvements over the baselines.

Training stability and sensitivity to hyperparameters. We first show the training stability by plotting the training curves of our proposed unsupervised scoring term ucssubscriptucs\mathcal{L}_{\text{ucs}}caligraphic_L start_POSTSUBSCRIPT ucs end_POSTSUBSCRIPT in Eq. 3 and regularization term regsubscriptreg\mathcal{L}_{\text{reg}}caligraphic_L start_POSTSUBSCRIPT reg end_POSTSUBSCRIPT in Eq. 4 in the left figure of Fig. 6, demonstrating both terms can be jointly minimized. We further conduct a sensitivity study on hyperparameters α𝛼\alphaitalic_α and β𝛽\betaitalic_β in Eq. 9. While they are not very sensitive (change within \approx 1% Dice for various α,β𝛼𝛽\alpha,\betaitalic_α , italic_β), tuning more can indeed yield better results.

Refer to caption Refer to caption Refer to caption Refer to caption
Figure 6: Left: Training loss curves of ucssubscriptucs\mathcal{L}_{\text{ucs}}caligraphic_L start_POSTSUBSCRIPT ucs end_POSTSUBSCRIPT in Eq. 3 and regsubscriptreg\mathcal{L}_{\text{reg}}caligraphic_L start_POSTSUBSCRIPT reg end_POSTSUBSCRIPT in Eq. 4. Right: Sensitivity to hyperparameters of α𝛼\alphaitalic_α and β𝛽\betaitalic_β in Eq. 9 of our proposed Voxelmorph-based approach.
Table 5: Dice standard deviation, % of |Ju^|0subscript𝐽^𝑢0|J_{\hat{u}}|\leq 0| italic_J start_POSTSUBSCRIPT over^ start_ARG italic_u end_ARG end_POSTSUBSCRIPT | ≤ 0 for displacement smoothness evaluation, and training time (measured in hours) of 300 epochs.

ACDC CAMUS DSC \uparrow |Ju^|0subscript𝐽^𝑢0|J_{\hat{u}}|\leq 0| italic_J start_POSTSUBSCRIPT over^ start_ARG italic_u end_ARG end_POSTSUBSCRIPT | ≤ 0 \downarrow Ttrainsubscript𝑇trainT_{\text{train}}italic_T start_POSTSUBSCRIPT train end_POSTSUBSCRIPT DSC \uparrow |Ju^|0subscript𝐽^𝑢0|J_{\hat{u}}|\leq 0| italic_J start_POSTSUBSCRIPT over^ start_ARG italic_u end_ARG end_POSTSUBSCRIPT | ≤ 0 \downarrow Ttrainsubscript𝑇trainT_{\text{train}}italic_T start_POSTSUBSCRIPT train end_POSTSUBSCRIPT Voxelmorph 79.48 ±plus-or-minus\pm± 9.23 0.29 0.26 81.50 ±plus-or-minus\pm± 5.58 0.60 0.26 AdaCS (Ours) 80.50 ±plus-or-minus\pm± 8.58 0.22 0.43 81.74 ±plus-or-minus\pm± 5.36 0.30 0.43 Transmorph 76.94 ±plus-or-minus\pm± 8.93 0.76 0.60 79.24 ±plus-or-minus\pm± 6.06 1.41 0.59 AdaCS (Ours) 78.39 ±plus-or-minus\pm± 9.06 0.57 0.87 79.64 ±plus-or-minus\pm± 6.37 0.70 0.70 Diffusemorph 67.38 ±plus-or-minus\pm± 15.65 0.05 1.08 75.23 ±plus-or-minus\pm± 8.71 0.05 1.06 AdaCS (Ours) 72.09 ±plus-or-minus\pm± 13.60 0.06 1.86 77.65 ±plus-or-minus\pm± 7.64 0.08 1.91

Evaluation on smoothness and training time. We provide standard deviation and measure of smoothness (percentage of estimated displacement u^^𝑢\hat{u}over^ start_ARG italic_u end_ARG with a negative Jacobian determinant, i.e., |Ju^|0subscript𝐽^𝑢0|J_{\hat{u}}|\leq 0| italic_J start_POSTSUBSCRIPT over^ start_ARG italic_u end_ARG end_POSTSUBSCRIPT | ≤ 0) in Table 5. All methods have <<<1%, implying physical plausibility and smoothness. Our method produces smoother deformations in Voxelmorph and Transmorph and achieves similarly smooth deformations in Diffusemorph, with <<<0.1%.

7 Conclusion

In this paper, we propose an adaptive correspondence scoring framework for unsupervised image registration to prevent the displacement estimator from drifting away by noisy gradients caused by low correspondence due to the nuisance variables such as noise or covisibility during training. We introduce a unsupervised correspondence scoring estimation scheme with both scoring and momentum-guided adaptive regularizations to prevent the scoring estimator from a degenerate solution and ensure scoring map smoothness. We demonstrate the effectiveness of our proposed framework on three representative registration architectures and we show consistent improvement compared with other baselines across three medical image datasets with diverse modalities. Though our proposed framework is promising, the hyperparameters in the scoring estimator loss does need to be tuned for each architecture on each dataset. Nonetheless, performance is not too sensitive to hyperparameter tuning, so one does not need to tune meticulously. In the future, we aim to explore an amortized hyperparameter optimization scheme during training to reduce the computation and validate our proposed framework on clinical datasets for further impact.

Acknowledgements

This work is supported by NIH/NHLBI grant R01HL121226.

References

  • [1] Ahn, S.S., Ta, K., Thorn, S.L., Onofrey, J.A., Melvinsdottir, I.H., Lee, S., Langdon, J., Sinusas, A.J., Duncan, J.S.: Co-attention spatial transformer network for unsupervised motion tracking and cardiac strain analysis in 3d echocardiography. Medical image analysis 84, 102711 (2023)
  • [2] Ashburner, J.: A fast diffeomorphic image registration algorithm. NeuroImage 38(1), 95–113 (Oct 2007)
  • [3] Avants, B.B., Epstein, C.L., Grossman, M., Gee, J.C.: Symmetric diffeomorphic image registration with cross-correlation: Evaluating automated labeling of elderly and neurodegenerative brain. Medical Image Analysis 12(1), 26–41 (Feb 2008)
  • [4] Bae, G., Budvytis, I., Cipolla, R.: Estimating and Exploiting the Aleatoric Uncertainty in Surface Normal Estimation. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 13117–13126. IEEE, Montreal, QC, Canada (Oct 2021)
  • [5] Balakrishnan, G., Zhao, A., Sabuncu, M.R., Guttag, J., Dalca, A.V.: VoxelMorph: A Learning Framework for Deformable Medical Image Registration. IEEE Transactions on Medical Imaging 38(8), 1788–1800 (Aug 2019)
  • [6] Bernard, O., Lalande, A., Zotti, C., Cervenansky, F., Yang, X., Heng, P.A., Cetin, I., Lekadir, K., Camara, O., Gonzalez Ballester, M.A., Sanroma, G., Napel, S., Petersen, S., Tziritas, G., Grinias, E., Khened, M., Kollerathu, V.A., Krishnamurthi, G., Rohe, M.M., Pennec, X., Sermesant, M., Isensee, F., Jager, P., Maier-Hein, K.H., Full, P.M., Wolf, I., Engelhardt, S., Baumgartner, C.F., Koch, L.M., Wolterink, J.M., Isgum, I., Jang, Y., Hong, Y., Patravali, J., Jain, S., Humbert, O., Jodoin, P.M.: Deep Learning Techniques for Automatic MRI Cardiac Multi-Structures Segmentation and Diagnosis: Is the Problem Solved? IEEE Transactions on Medical Imaging 37(11), 2514–2525 (Nov 2018)
  • [7] Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., Zhou, Y.: TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation (Feb 2021), arXiv:2102.04306 [cs]
  • [8] Chen, J., Frey, E.C., He, Y., Segars, W.P., Li, Y., Du, Y.: TransMorph: Transformer for unsupervised medical image registration. Medical Image Analysis 82, 102615 (Nov 2022)
  • [9] Dalca, A.V., Balakrishnan, G., Guttag, J., Sabuncu, M.R.: Unsupervised Learning for Fast Probabilistic Diffeomorphic Registration. vol. 11070, pp. 729–738 (2018), arXiv:1805.04605 [cs]
  • [10] Dalca, A.V., Balakrishnan, G., Guttag, J., Sabuncu, M.R.: Unsupervised learning of probabilistic diffeomorphic registration for images and surfaces. Medical Image Analysis 57(1), 226–236 (Oct 2019)
  • [11] Davatzikos, C.: Spatial Transformation and Registration of Brain Images Using Elastically Deformable Models. Computer Vision and Image Understanding 66(2), 207–222 (May 1997)
  • [12] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Jun 2021), arXiv:2010.11929 [cs]
  • [13] Hering, A., et al.: Learn2reg: comprehensive multi-task medical image registration challenge, dataset and evaluation in the era of deep learning (2022)
  • [14] Hill, D.L.G., Batchelor, P.G., Holden, M., Hawkes, D.J.: Medical image registration
  • [15] Hoffmann, M., Billot, B., Greve, D.N., Iglesias, J.E., Fischl, B., Dalca, A.V.: SynthMorph: learning contrast-invariant registration without acquired images. IEEE Transactions on Medical Imaging 41(3), 543–558 (Mar 2022), arXiv:2004.10282 [cs, eess, q-bio]
  • [16] Hong, B.W., Koo, J.K., Burger, M., Soatto, S.: Adaptive Regularization of Some Inverse Problems in Image Analysis (May 2017), arXiv:1705.03350 [cs]
  • [17] Hong, B.W., Koo, J.K., Dirks, H., Burger, M.: Adaptive Regularization in Convex Composite Optimization for Variational Imaging Problems (Feb 2017), arXiv:1609.02356 [cs]
  • [18] Hoopes, A., Hoffmann, M., Fischl, B., Guttag, J., Dalca, A.V.: HyperMorph: Amortized Hyperparameter Learning for Image Registration (May 2021), arXiv:2101.01035 [cs, eess]
  • [19] Hüllermeier, E., Waegeman, W.: Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Machine Learning 110(3), 457–506 (Mar 2021)
  • [20] Keelan, R., Shimada, K., Rabin, Y.: GPU-Based Simulation of Ultrasound Imaging Artifacts for Cryosurgery Training. Technology in Cancer Research & Treatment 16(1), 5–14 (Feb 2017), publisher: SAGE Publications Inc
  • [21] Kendall, A., Gal, Y.: What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? (Oct 2017), arXiv:1703.04977 [cs]
  • [22] Kim, B., Han, I., Ye, J.C.: DiffuseMorph: Unsupervised Deformable Image Registration Using Diffusion Model. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. pp. 347–364. Lecture Notes in Computer Science, Springer Nature Switzerland, Cham (2022)
  • [23] Klein, S., Staring, M., Murphy, K., Viergever, M., Pluim, J.: elastix: A Toolbox for Intensity-Based Medical Image Registration. IEEE Transactions on Medical Imaging 29(1), 196–205 (Jan 2010)
  • [24] Leclerc, S., Smistad, E., Pedrosa, J., Ostvik, A., Cervenansky, F., Espinosa, F., Espeland, T., Berg, E.A.R., Jodoin, P.M., Grenier, T., Lartizien, C., Dhooge, J., Lovstakken, L., Bernard, O.: Deep Learning for Segmentation Using an Open Large-Scale Dataset in 2D Echocardiography. IEEE Transactions on Medical Imaging 38(9), 2198–2210 (Sep 2019)
  • [25] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. pp. 10012–10022 (2021)
  • [26] Ma, T., Dai, X., Zhang, S., Wen, Y.: PIViT: Large Deformation Image Registration with Pyramid-Iterative Vision Transformer. In: Greenspan, H., Madabhushi, A., Mousavi, P., Salcudean, S., Duncan, J., Syeda-Mahmood, T., Taylor, R. (eds.) Medical Image Computing and Computer Assisted Intervention – MICCAI 2023. pp. 602–612. Lecture Notes in Computer Science, Springer Nature Switzerland, Cham (2023)
  • [27] Mok, T.C.W., Chung, A.C.S.: Conditional Deformable Image Registration with Convolutional Neural Network. In: de Bruijne, M., Cattin, P.C., Cotin, S., Padoy, N., Speidel, S., Zheng, Y., Essert, C. (eds.) Medical Image Computing and Computer Assisted Intervention – MICCAI 2021. pp. 35–45. Lecture Notes in Computer Science, Springer International Publishing, Cham (2021)
  • [28] Monteiro, M., Le Folgoc, L., Coelho de Castro, D., Pawlowski, N., Marques, B., Kamnitsas, K., van der Wilk, M., Glocker, B.: Stochastic Segmentation Networks: Modelling Spatially Correlated Aleatoric Uncertainty. In: Advances in Neural Information Processing Systems. vol. 33, pp. 12756–12767. Curran Associates, Inc. (2020)
  • [29] Oliveira, F.P.M.: Medical Image Registration: a Review
  • [30] Qin, Y., Li, X.: FSDiffReg: Feature-wise and Score-wise Diffusion-guided Unsupervised Deformable Image Registration for Cardiac Images (Jul 2023), arXiv:2307.12035 [cs]
  • [31] Rueckert, D., Sonoda, L., Hayes, C., Hill, D., Leach, M., Hawkes, D.: Nonrigid registration using free-form deformations: application to breast MR images. IEEE Transactions on Medical Imaging 18(8), 712–721 (Aug 1999)
  • [32] Seitzer, M., Tavakoli, A., Antic, D., Martius, G.: On the Pitfalls of Heteroscedastic Uncertainty Estimation with Probabilistic Neural Networks (Apr 2022), arXiv:2203.09168 [cs, stat]
  • [33] Shi, J., He, Y., Kong, Y., Coatrieux, J.L., Shu, H., Yang, G., Li, S.: XMorpher: Full Transformer for Deformable Medical Image Registration via Cross Attention (Jun 2022), arXiv:2206.07349 [cs]
  • [34] Song, J., Meng, C., Ermon, S.: Denoising Diffusion Implicit Models (Oct 2022), arXiv:2010.02502 [cs]
  • [35] Ta, K., Ahn, S.S., Thorn, S.L., Stendahl, J.C., Zhang, X., Langdon, J., Staib, L.H., Sinusas, A.J., Duncan, J.S.: Multi-task learning for motion analysis and segmentation in 3d echocardiography. IEEE Transactions on Medical Imaging (2024)
  • [36] Wong, A., Fei, X., Hong, B.W., Soatto, S.: An Adaptive Framework for Learning Unsupervised Depth Completion. IEEE Robotics and Automation Letters 6(2), 3120–3127 (Apr 2021)
  • [37] Wong, A., Soatto, S.: Bilateral Cyclic Constraint and Adaptive Regularization for Unsupervised Monocular Depth Prediction. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5637–5646. IEEE, Long Beach, CA, USA (Jun 2019)
  • [38] Zhang, X., Dong, H., Gao, D., Zhao, X.: A comparative study for non-rigid image registration and rigid image registration. arXiv preprint arXiv:2001.03831 (2020)
  • [39] Zhang, X., Noga, M., Martin, D.G., Punithakumar, K.: Fully automated left atrium segmentation from anatomical cine long-axis mri sequences using deep convolutional neural network with unscented kalman filter. Medical image analysis 68, 101916 (2021)
  • [40] Zhang, X., Pak, D.H., Ahn, S.S., Li, X., You, C., Staib, L., Sinusas, A.J., Wong, A., Duncan, J.S.: Heteroscedastic uncertainty estimation for probabilistic unsupervised registration of noisy medical images. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer (2024)
  • [41] Zhang, X., You, C., Ahn, S., Zhuang, J., Staib, L., Duncan, J.: Learning Correspondences of Cardiac Motion from Images Using Biomechanics-Informed Modeling. In: Camara, O., Puyol-Antón, E., Qin, C., Sermesant, M., Suinesiaputra, A., Wang, S., Young, A. (eds.) Statistical Atlases and Computational Models of the Heart. Regular and CMRxMotion Challenge Papers. pp. 13–25. Lecture Notes in Computer Science, Springer Nature Switzerland, Cham (2022)

Supplementary Materials

Appendix 0.A Dataset details

0.A.1 ACDC [6]

The ACDC dataset, publicly accessible, comprises 2D cardiac MRI scans from 150 patients, with 100 subjects allocated for training and 50 for testing. Each sequence includes frames at end-diastole (ED) and end-systole (ES), along with corresponding myocardium labels. Our training set involves 80 randomly selected patients, the validation set consists of 20 patients, and the testing set comprises 50 patients. Extracting ED and ES image pairs is done in a slice-by-slice manner from the 2D longitudinal stacks. We perform a center crop for each slice pair, resulting in dimensions of 128×128128128128\times 128128 × 128 with respect to the myocardium centroid in the ED frame. This process yields a total of 751 2D image pairs for training, 200 pairs for validation, and an additional 538 pairs for testing.

0.A.2 CAMUS [24]

The CAMUS dataset, available to the public, comprises 2D cardiac ultrasound images from 500 individuals. Each individual contributes two distinct images: one for a 2-chamber view and another for a 4-chamber view. For every image, both end-diastole (ED) and end-systole (ES) frames, along with myocardium segmentation labels, are provided. We crop each image pair to 128×128128128128\times 128128 × 128, and through random selection, we use 300 subjects for training, 100 subjects for validation, and 100 subjects for testing. This process results in a total of 600 2D image pairs for training, 200 pairs for validation, and an additional 200 pairs for testing.

0.A.3 Private 3D Echo

The private 3D echo dataset contains 99 cardiac ultrasound scans with 8 sequences from synthetic ultrasound, 40 sequences from in vivo canine, and another 51 sequences from in vivo porcine. The details of the acquisition are omitted to preserve anonymity in the review process. ED and ES frames are manually identified and myocardium segmentation labels are provided for each sequence by experienced radiologists. Each 3D image is resized to 64×64×6464646464\times 64\times 6464 × 64 × 64 during training. We randomly selected 60 3D pairs for training, 19 pairs for validation, and another 20 pairs for testing. During testing, the estimated displacement is resized and rescaled to the original volume dimension and we compute anatomical scores of warped and target myocardium volumes afterwards.

Appendix 0.B Implementation details and hyperparameters

We trained all our models using NVIDIA V100/A5000 GPUs with 16/24 GB memory. Training the ACDC/CAMUS datasets requires approximately 1 hour for 300 epochs with a batch size of 8, whereas the private 3D Echo dataset demands around 6 hours for 150 epochs with a batch size of 4. On average, performing inference on each 2D image pair is completed in approximately 0.11 seconds, and each 3D Echo pair takes about 1 second. Both displacement and scoring estimators are trained using the Adam optimizer with a learning rate 1e41superscript𝑒41e^{-4}1 italic_e start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT. We chose β=1.5𝛽1.5\beta=1.5italic_β = 1.5 and α=0.02𝛼0.02\alpha=0.02italic_α = 0.02 for the ACDC dataset and α=0.04𝛼0.04\alpha=0.04italic_α = 0.04 for the CAMUS dataset as shown in Fig. 6 of the main paper.

Appendix 0.C Additional results on private 3D Echo

We present our qualitative result on the private 3D Echo dataset across three architectures with Fig. 7, where we show that our proposed framework achieves a better registration accuracy evidenced by better matching with the ground truth (yellow overlayed), smoother contour edges and locally consistent myocardial regions. We further show our quantitative results with Tab. 6, where our proposed framework outperforms other baselines on Voxelmorph and Diffusemorph architectures with comparable performance with the vanilla version on Transmorph architecture. This might be due to that the testing size of our private 3D Echo dataset is small with only 20 cases, which we aim to further evaluate on larger in vivo animal datasets.

Refer to caption Refer to caption Refer to caption
Figure 7: Quantitative evaluation of our method against the second-best approach on our private 3D Echo dataset. Cross-sectional slices are extracted from the 3D volumes for visualization. Each block, delineated by black solid lines, features source and target images with myocardium segmentation contours. The top row displays the original images, and the bottom row showcases our method’s results (warped source Is(x+u^)subscript𝐼𝑠𝑥^𝑢I_{s}(x+\hat{u})italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_x + over^ start_ARG italic_u end_ARG )) alongside the second-best method. The yellow background indicates the ground truth ES myocardium. Dice scores are reported in the subtitles.
Refer to caption
Figure 8: Qualitative visualization of our proposed framework in Voxelmorph architecture [5] on our private 3D Echo validation sets. Cross-sectional slices are extracted from the 3D volumes for visualization. The third column exhibits successful matching, but the error map in the fourth column reveals residuals. Our predicted scoring map in the fifth column identifies and prevents drift of fθ()subscript𝑓𝜃f_{\theta}(\cdot)italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ ), as demonstrated by the re-weighted error in the last column.
Table 6: Contour-based metrics compared against baselines on our private 3D Echo dataset. DSC (%), HD (vx), ASD (vx).

Private 3D Echo DSC \uparrow HD \downarrow ASD \downarrow CNN Voxelmorph [5] 77.23 16.27 2.14 NLL [21] 75.58 18.02 3.11 β𝛽\betaitalic_β-NLL [32] 76.80 16.78 2.88 AdaFrame [36] 76.66 16.25 2.11 AdaReg [37] 76.28 17.41 2.17 AdaCS (Ours) 77.33 16.36 2.13 Transformer Transmorph [8] 74.89 17.45 2.14 NLL [21] 70.56 19.60 2.00 β𝛽\betaitalic_β-NLL [32] 71.98 17.96 2.15 AdaFrame [36] 75.57 17.30 2.18 AdaReg [37] 74.39 18.14 2.13 AdaCS (Ours) 75.54 17.30 2.16 Diffusion Diffusemorph [22] 73.56 16.51 2.40 NLL [21] 70.36 17.81 2.50 β𝛽\betaitalic_β-NLL [32] 69.38 17.13 2.54 AdaFrame [36] 71.47 16.57 2.39 AdaReg [37] 73.32 17.75 2.38 AdaCS (Ours) 73.91 16.74 2.39

We additionally show the qualitative result of the adaptive scoring map on 3D private Echo with Fig. 8 during training. Cross-sectional slices are extracted from the 3D volumes for visualization. We show that our proposed scoring map in the fifth column is able to identify regions with loss of correspondence (e.g. by comparing regions in the first two columns) and adaptive re-weight the error residuals to prevent the displacement estimator from being driven away by the spurious error residuals, leading to performance improvement, consistent to our finding in Sec. 5 from the main paper, where we discuss results of our correspondence scoring map and adaptive weighting (see Fig. 4 in the main paper).

Appendix 0.D Additional results on public 3D dataset

We trained on the OASIS (Learn2Reg 2021) training set with 414 brain 3D MR scans, using corrected images aligned to the template space. Each image was preprocessed to a 160×\times×190×\times×224 1mm isotropic grid. We tested on validation set (skull stripped) and reported results in the Table below, where we consistently improved the baselines in terms of Dice across different anatomies. For the cLapIRN experiments, we finetune the model instead of training from scratch given its computational burden [27].

Table 7: Results on OASIS 3D Dataset [13]. Dice scores (%) are reported for cortex (CX), Subcortical-Gray-Matter (SGM), White-Matter (WM), Cerebrospinal fluid (CSF), and their average.

CX SGM WM CSF AVG Voxelmorph 80.89 68.45 87.35 34.49 67.80 Voxelmorph + AdaCS 83.30 71.52 86.97 35.50 69.32 cLapIRN 88.50 77.90 89.53 49.70 76.41 cLapIRN + AdaCS 89.50 79.85 88.41 49.72 76.87

Appendix 0.E Additional results on c-LapIRN architecture

We compared our method with Elastix and our Voxelmorph-based approach outperforms both as detailed in the left table. Notably, Elastix takes \approx45s per image pair, making it over 400 times slower than learning-based methods (\approx0.11s per pair). To evaluate the effectiveness of our proposed framework on registration approaches that use multiple deformation steps, we initially trained vanilla c-LapIRN (Mok et al., [27]) with NCC loss, but subsequently switched to MSE loss due to it being more stable and performant. Our proposed framework further improved c-LapIRN without needing meticulous hyperparameter tuning, demonstrating its applicability as shown in Tab. 8.

Table 8: Contour-based metrics against baselines on ACDC and CAMUS datasets.

ACDC [6] CAMUS [24] DSC \uparrow HD \downarrow ASD \downarrow DSC \uparrow HD \downarrow ASD \downarrow Undeformed 47.98 7.91 2.32 66.77 10.87 2.61 Elastix 77.26 4.95 1.28 80.18 10.02 1.81 vanilla c-LapIRN 54.46 7.33 2.12 68.06 11.53 2.56 c-LapIRN + MSE 70.29 6.41 1.44 73.68 11.93 2.05 c-LapIRN + AdaCS 70.38 6.25 1.43 75.09 11.42 1.96 Voxelmorph 79.48 4.79 1.27 81.50 8.72 1.74 Voxelmorph + AdaCS 80.50 4.69 1.23 81.74 8.55 1.72

Appendix 0.F Loss ablations

We conduct a study of robust losses including NCC, MI, and Tukey’s biweight loss (TBL, with c=4.6851𝑐4.6851c=4.6851italic_c = 4.6851) in Tab. 9 where our framework consistently improves across various settings.

Table 9: Comparison with other robust loss functions (NCC, MI, TBL).

ACDC CAMUS DSC \uparrow HD \downarrow ASD \downarrow DSC \uparrow HD \downarrow ASD \downarrow vxm NCC 78.55 4.94 1.29 77.01 10.23 1.89 MI 78.04 5.25 1.35 78.18 9.83 1.99 TBL 79.31 4.64 1.23 81.18 8.91 1.72 MAE 78.27 5.36 1.43 78.59 10.23 1.97 MSE 79.48 4.79 1.27 81.50 8.72 1.74 AdaCS 80.50 4.69 1.23 81.74 8.55 1.72 tsm NCC 73.77 6.64 1.12 73.03 11.87 1.70 MI 73.57 6.57 1.11 74.83 11.94 1.83 TBL 78.23 5.11 1.27 79.12 9.75 1.84 MAE 74.30 6.36 1.28 75.96 11.35 1.89 MSE 76.94 5.51 1.30 79.24 10.30 1.79 AdaCS 78.39 5.40 1.32 79.64 9.85 1.79 dfm NCC 70.25 5.29 1.58 75.67 10.75 2.06 MI 71.16 5.40 1.56 76.19 10.09 2.16 TBL 69.12 5.73 1.63 76.05 9.54 2.06 MAE 66.30 5.75 1.71 77.30 10.36 2.09 MSE 67.38 5.80 1.67 75.23 9.80 2.07 AdaCS 72.09 5.35 1.53 77.65 9.82 1.99

Appendix 0.G Altenative formulation

We present an analysis of an alternative formulation by taking both reconstructed target It^^subscript𝐼𝑡\hat{I_{t}}over^ start_ARG italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG and target Itsubscript𝐼𝑡I_{t}italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT using the same hyperparameter settings. Our proposed formulation (Itsubscript𝐼𝑡I_{t}italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT only) yields generally better performance while being more efficient (e.g. only takes one image), especially in the ACDC dataset as shown in Tab. 10.

Table 10: Comparison with an alternative formulation.

ACDC [6] CAMUS [24] DSC \uparrow HD \downarrow ASD \downarrow DSC \uparrow HD \downarrow ASD \downarrow vxm Itsubscript𝐼𝑡I_{t}italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and It^^subscript𝐼𝑡\hat{I_{t}}over^ start_ARG italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG 79.78 4.77 1.25 81.36 8.92 1.76 Itsubscript𝐼𝑡I_{t}italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT (Ours) 80.50 4.69 1.23 81.74 8.55 1.72 tsm Itsubscript𝐼𝑡I_{t}italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and It^^subscript𝐼𝑡\hat{I_{t}}over^ start_ARG italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG 76.40 5.56 1.35 79.67 9.88 1.79 Itsubscript𝐼𝑡I_{t}italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT (Ours) 78.39 5.40 1.32 79.64 9.85 1.79 dfm Itsubscript𝐼𝑡I_{t}italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and It^^subscript𝐼𝑡\hat{I_{t}}over^ start_ARG italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG 69.05 5.71 1.63 77.16 9.53 2.01 Itsubscript𝐼𝑡I_{t}italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT (Ours) 72.09 5.35 1.53 77.65 9.82 1.99

Appendix 0.H Limitation

We also present several failure cases when the volume of the myocardium is considerably thin, as shown in Fig. 9. This could stem from the complexity of necessitating precise predictions of displacement for unsupervised methods, a topic we intend to explore in future research. Additionally, though our methods are not super sensitive to hyperparameters as shown in Fig. 6 in the main paper, we still need to perform grid search of each α𝛼\alphaitalic_α and β𝛽\betaitalic_β for each dataset and architecture. We will explore a more efficient strategy with an amortized hyperparameter optimization in the future.

Refer to caption Refer to caption Refer to caption
Figure 9: Examples of failure cases on ACDC dataset when myocardium volume is considerably thin.

Appendix 0.I Additional results for visualization

To further illustrate the effectiveness of our proposed method, we present additional qualitative results compared with baselines in Fig. 10.

Refer to caption Refer to caption Refer to caption
Refer to caption Refer to caption Refer to caption
Figure 10: Additional results for visualization of our method against the second-best approach in each dataset (top two rows: ACDC [6] and bottom two rows: CAMUS [22]). Each block, delineated by black solid lines, contains source and target images with myocardium segmentation contours. The top row displays the original images, and the bottom row showcases head-to-head comparison (warped source Is(x+u^)subscript𝐼𝑠𝑥^𝑢I_{s}(x+\hat{u})italic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_x + over^ start_ARG italic_u end_ARG )) between method and the second-best method. The yellow highlights indicates the ground truth ES myocardium. Dice scores are reported in the subtitles.