¹¹institutetext: Biomedical Engineering, Yale University, New Haven, USA
¹¹email: [email protected] ²²institutetext: Computer Science, Yale University, New Haven, USA ³³institutetext: Radiology & Biomedical Imaging, Yale University, New Haven, USA ⁴⁴institutetext: Department of Internal Medicine (Cardiology), Yale School of Medicine, New Haven, USA

Adaptive Correspondence Scoring for Unsupervised Medical Image Registration

Xiaoran Zhang\orcidlink0000-0001-8918-7374 11 John C. Stendahl\orcidlink0000-0002-1568-9280 1144 Lawrence H. Staib\orcidlink0000-0002-9516-5136 1133 Albert J. Sinusas\orcidlink0000-0003-0972-9589 113344 Alex Wong\orcidlink0000-0002-3157-6016 22 James S. Duncan\orcidlink0000-0002-5167-9856 1133

Abstract

We propose an adaptive training scheme for unsupervised medical image registration. Existing methods rely on image reconstruction as the primary supervision signal. However, nuisance variables (e.g. noise and covisibility), violation of the Lambertian assumption in physical waves (e.g. ultrasound), and inconsistent image acquisition can all cause a loss of correspondence between medical images. As the unsupervised learning scheme relies on intensity constancy between images to establish correspondence for reconstruction, this introduces spurious error residuals that are not modeled by the typical training objective. To mitigate this, we propose an adaptive framework that re-weights the error residuals with a correspondence scoring map during training, preventing the parametric displacement estimator from drifting away due to noisy gradients, which leads to performance degradation. To illustrate the versatility and effectiveness of our method, we tested our framework on three representative registration architectures across three medical image datasets along with other baselines. Our adaptive framework consistently outperforms other methods both quantitatively and qualitatively. Paired t-tests show that our improvements are statistically significant. Code available at: https://voldemort108x.github.io/AdaCS/.

1 Introduction

Deformable medical image registration aims to accurately determine non-rigid correspondences through dense displacement vectors between source and target images. This process is a crucial step for medical image analysis, such as tracking disease progression for diagnosis and treatment [29, 14, 39]. Due to the impracticality of obtaining ground truth displacement, it has been a long-standing problem and has been extensively studied in the past decades [23, 31, 2, 5, 8, 22, 41, 38, 40].

Classical methods approach this challenge by solving an iterative pair-wise optimization problem between source and target images using elastic-type models [11, 23], free-form deformations with b-splines [31], and topology-preserving diffeomorphic models [2, 3]. However, these approaches are computationally expensive and time-consuming, limiting their practical utility in large-scale real-world data. Recently, learning-based approaches have been widely adopted for their speed (GPU accelerated forward pass that is hundreds of times faster in inference speed than classical methods) and state-of-the-art performance [5, 41, 22, 8, 9, 33]. Due to lack of ground truth displacement, these approaches leverage dense image or volumetric reconstruction in an unsupervised setting, or make use of segmentation masks in a weakly supervised setting. To train these approaches via gradient-based optimization, a source image is warped by the estimated displacement to reconstruct a target image. The supervision signal comes from minimizing the reconstruction error between warped source and target as the data-fidelity term, along with a regularizer based on the assumption that the object imaged is locally smooth and connected. The feasibility in minimizing this objective relies on intensity constancy between the two images during imaging.

Refer to caption — Figure 1: Existing approaches assume uniform intensity constancy and covisibility across the entire image; during training, this causes irreconcilable penalties, i.e., regions with large error residuals due to the absence of correspondence as highlighted in the red box. Our proposed approach addresses this by re-weighting error residuals with a predictive correspondence scoring map. By doing so, we get a smoother optimization when we reduce the influence of these outliers, leading to an improved performance.

However, assuming such intensity constancy uniformly across the entire image domain neglects the facts of motion ambiguity and non-uniform noise corruption in medical images. Assuming sufficiently exciting local regions, only a subset of pixels of one image can be uniquely matched or corresponded to another based on the image intensity profiles. As shown in Fig. 1, despite a displacement estimator model predicting largely correct corresponding pixels between the two images, the error residuals between the warped source and target images are still non-zero. In fact, they are dominated by regions with no correspondence; when this phenomenon occurs during the training of a displacement estimator, it results in performance degradation (driven towards an undesirable minimum) in the subsequent optimization steps due to the noisy gradients. To address this issue, we propose an unsupervised correspondence scoring framework that identifies regions with high chance of establishing correspondence and adaptively reduces the influence of error residuals from nuisance variability during training. This prevents the displacement estimator from drifting away due to large residuals caused by the lack of correspondence between input image pairs.

Our proposed scoring estimator is deployed during the unsupervised training of a displacement estimator and yields a soft scoring map. It is optimized alternatingly together with the displacement estimator to adaptively re-weight the data-fidelity term in order to mitigate the negative impact of nuisances when establishing correspondence. We introduce an unsupervised training objective for the scoring estimator to learn an accurate correspondence scoring map without additional annotation. Yet, there exists a trivial solution of predicting a score map of all zeros. Therefore, we additionally propose a regularizer for the scoring estimator to bias it away from the trivial solution, along with a momentum-guided adaptive total variation to encourage smoothness in the scoring map. To validate the effectiveness of our proposed method, we tested on three different datasets including: (1) ACDC [6], a public 2D MRI dataset, (2) CAMUS [24], a public 2D ultrasound dataset, and (3) a private 3D echocardiography dataset [1, 35]. To further show the versatility of our proposed method, we tested on three representative unsupervised image registration architectures including: (1) Voxelmorph [5], (2) Transmorph [8] and (3) Diffusemorph [22] along with other baselines. Our proposed framework can be applied in a plug-and-play manner to consistently improve existing methods. Paired t-tests show that the improvements obtained by utilizing our approach are statistically significant.

Our contributions are as follows: (1) We propose an adaptive framework that incorporates correspondence scoring for unsupervised deformable medical image registration. (2) We introduce an unsupervised correspondence scoring network to be used during the training of a displacement estimator. The scoring network learns to determine whether a given image allows for establishing correspondence by minimizing the typical image reconstruction loss with scoring and momentum-guided adaptive regularization. (3) Our proposed method consistently outperforms other baselines across three representative registration architectures over three medical image datasets with diverse modalities. The performance gain comes with no cost in memory or run time during inference, but only the deployment of our scoring estimation during training.

2 Related works

Unsupervised medical image registration. Balakrishnan et al. [5] proposed an unsupervised learning framework using U-Net as the displacement estimator. This framework imposes intensity constancy by minimizing the mean squared error between the warped source image and the target image to update the parametric displacement estimator via gradient-based optimization. A number of works have been developed upon this architecture including adding diffeomorphic regularization (Voxelmorph-diff) [10], jointly learning amortized hyperparameters (Hypermorph) [18] and learning contrast-invariant registration without acquired images (SynthMorph) [15]. Recently, Zhang et al. [40] proposed an uncertainty estimation framework that extends the widely used homoscedastic assumption in objectives to heteroscedastic assumption. Inspired by the recent advances in vision transformers [25, 12], Chen et al. [8] introduced TransUNet [7], a hybrid Transformer-CNN architecture. This design replaces encoders with Swin Transformer [25] to enhance the receptive field while preserving convolutional decoders to bolster the model’s ability to capture long-range motion. A number of extensions to this architecture have been proposed, such as substituting convolutional decoders with transformer layers [33] and incorporating multi-scale pyramids [26]. Additionally, score-based diffusion models such as DDPM [34] have shown high-quality performance in generative modeling. To leverage the advantage of DDPM, Kim et al. [22] presents a diffusion-based architecture, composed of a diffusion network and a deformation network. Recent work built upon this architecture explores adding feature and score-wise diffusion [30].

We demonstrate our proposed adaptive framework and compare it with with other related formulations as baselines on three representative architectures including: (1) Voxelmorph [5], (2) Transmorph [8] and (3) Diffusemorph [22]. We also tested our proposed framework on c-LapIRN [27] and deferred the results to the Supp. Mat. due to the page limit.

Adaptive weighting schemes. A wide range of image processing problems involve optimizing an energy function that combines a data-fidelity term and a regularization term. The relative importance between the two terms is usually weighted by a scalar, which disregards the heteroscedastic nature of error residuals [36]. To address this challenge, several adaptive weighting schemes are proposed in the spatial domain and over the course of optimization based on the local residual [17, 16, 37]. Wong et al. [36] later provides a data-driven algorithm that deals with multiple frames [36]. Zhang et al. [40] proposed a … In this paper, we selected AdaReg [37] and AdaFrame [36] as our baselines for adaptive weighting. The aforementioned methods are not learning-based; whereas, we proposed a learning-based correspondence scoring in a collaborative framework.

Aleatoric uncertainty estimation. Our proposed adaptive correspondence scoring is conceptually related to aleatoric uncertainty modeling in the Bayesian learning framework, which aims to estimate input-dependent noise inherent in the observations [21, 4, 28, 19, 32]. This can be attributed to for example motion noise or sensor noise, resulting in uncertainty which cannot be reduced even if more data were to be collected. Kendall et al. [21] proposed a maximum likelihood estimation (MLE) formulation that minimizes the negative log-likelihood (NLL) criterion using stochastic gradient descent. This approach re-weights the data-fidelity term using predictive variance estimates mediated by a regularization term after assuming noise distribution is heteroscedastic Gaussian. Seitzer et al. [32] later identifies that such formulation using inverse variance weighting will result in over-confident of variance estimates, leading to undesired undersampling. Thus, an exponentiated $\beta$ term is proposed in the new loss formulation, termed $\beta$ -NLL, to counteract the undersampling leading to undesired performance. In this paper, we selected NLL [21] and $\beta$ -NLL [32] as baselines for aleatoric uncertainty estimation with a U-Net based variance estimator that is jointly trained with displacement estimator under each formulation.

3 Preliminaries

Let $I_{s}$ be the source image and $I_{t}$ be the target image, where $I:\Omega\mapsto[0,1]$ is the imaging function after normalization and $\Omega$ is the image space. Deformable image registration aims to estimate a dense displacement vector field that characterizes the correspondence between two images for each pixel $\hat{u}:\Omega\mapsto\mathbbm{R}^{2}$ ( $\Omega\mapsto\mathbbm{R}^{3}$ for 3D data). For ease of notation and terminology, we will use the terms images and pixels to refer to both 2D and 3D data.

Due to the lack of ground truth, intensity constancy and the smoothness assumption are imposed to constrain the parametric model $f_{\theta}(\cdot)$ (e.g., a neural network) as displacement estimator $\hat{u}=f_{\theta}(I_{t},I_{s})$ by minimizing the following objective to update the parameters $\theta$ in the model

\mathcal{L}=\frac{1}{|\Omega|}\sum_{x\in\Omega}\underbrace{[I_{t}(x)-I_{s}(x+% \hat{u}(x))]^{2}}_{\mathcal{L}_{\text{data}}}+\lambda\|\nabla\hat{u}(x)\|^{2},

(1)

where $x\in\Omega$ denotes a coordinate and $\lambda$ denotes the hyperparameter to modulate the trade-off between two terms. In this work, we explore three representative deep neural network architectures that serve as displacement estimators including (1) convolution-based (Voxelmorph [5]), (2) transformer-based (Transmorph [8]) and (3) diffusion-based (Diffusemorph [22]).

4 Methods

4.1 Adaptive displacement estimation

To train these networks in an unsupervised fashion, one typically minimizes Eq. 1 with respect to the model parameters. However, the data term $\mathcal{L}_{\text{data}}$ , which measures the intensity difference between two estimated correspondences, relies on intensity constancy and assumes that surfaces reflecting the physical waves (i.e., ultrasound) are largely Lambertian [20] and the acquisition techniques are consistent – in which case, one can determine unique correspondences between two images, if they are covisible. In the case where the corresponding pixel is not covisible, the solution cannot be uniquely determined and one must rely on regularization, i.e., local smoothness modeled by a diffusion regularizer. Under realistic scenarios, the Lambertian assumption is often violated and the difficulty of establishing unique correspondences is further exacerbated by the presence of noise from the sensor.

These conditions can introduce erroneous supervision signals when optimizing the parameters of the model. Suppose that one were to correctly identify the corresponding pixels between two images, the above nuisance factors would still cause the data-fidelity term of Eq. 1 to yield non-zero residuals, which induces gradients when backpropagating. Within the optimization of the weights $\theta$ , this update may translate to moving out of a local (possibly optimal) minima and result in performance degradations as shown in Fig. 1.

Thus, we propose an adaptive framework shown in Fig. 2 that incorporates a predictive correspondence scoring map to prevent displacement estimation (de) from being dominated by error residuals due to nuisance variability. Our method is realized as an adaptive weighting term that can be generically integrated into the conventional loss function for unsupervised training:

\mathcal{L}_{\text{de}}=\frac{1}{|\Omega|}\sum_{x\in\Omega}\lfloor\hat{S}(x)% \rfloor[I_{t}(x)-I_{s}(x+\hat{u}(x))]^{2}+\lambda\|\nabla\hat{u}(x)\|^{2}.

(2)

$\hat{S}:\Omega\mapsto[0,1]$ is a dense predictive pixel-wise scoring map to model the confidence score of how well one can establish correspondence between the two images, where higher values indicate a lower degree of effect from the nuisance variables (i.e., covisibility, noise, etc.). Note that this is equivalent to treating $\hat{S}$ as the degree to which nuisance variables affect establishing correspondence between source and target images, and using its complement $1-\hat{S}$ to downweight pixels that lack correspondence. Floor symbol $\lfloor\cdot\rfloor$ denotes stop gradient operation.

Unsupervised correspondence scoring. The scoring estimator $g_{\phi}(\cdot)$ predicts a soft map to show the degree of correspondence existence from the target image as $\hat{S}=g_{\phi}(I_{t})$ . During training, the scoring estimator is optimized alternatingly with the displacement estimator $f_{\theta}(\cdot)$ under the following unsupervised correspondence scoring (ucs) objective

\displaystyle\mathcal{L}_{\text{ucs}}=\frac{1}{|\Omega|}\sum_{x\in\Omega}\hat{% S}(x)[I_{t}(x)-I_{s}(x+\lfloor\hat{u}(x)\rfloor)]^{2}.

(3)

Eq. 3 encourages the estimator to assign a lower score to regions with higher error residuals computed using the displacement estimated by $f_{\theta}(\cdot)$ . Note that $\hat{u}$ is detached by stop gradient operator $\lfloor\cdot\rfloor$ to avoid doubly traversing the computational graph. However, minimizing Eq. 3 alone will lead to a trivial solution where all scores are zeros. Thus, proper regularization is needed to prevent the estimator from learning such a solution.

Scoring estimator regularization. Given the range of the correspondence scoring map is $\hat{S}(x)\in[0,1]$ , we design an objective to regularize the scoring map to avoid the solution of all zeros:

\displaystyle\mathcal{L}_{\text{reg}}=\frac{1}{|\Omega|}\sum_{x\in\Omega}[1-% \hat{S}(x)]^{2}.

(4)

With the above regularizer alone, the scoring map will inevitably be non-smooth (e.g., flipping scores between pixels) around neighboring regions that the model identifies as low correspondence during training. This is not desirable since such irregularities in the scoring map will also result in irregular supervision signal for the displacement estimator (i.e., large discrepancies in the data term within a neighborhood). This creates artifacts and distortion in image warping, which will degrade resulting performance (Tab. 2). Thus, we impose a smoothness term in the predictive scoring map to preserve such characteristics.

Momentum guided adaptive smoothness regularization. To impose smoothness regularization on the estimated correspondence scoring map, we introduce a momentum-guided adaptive weighting strategy that follows the training dynamics of the displacement estimator. In the early stages of training, the displacement and scoring estimators will be inaccurate. Having strong smoothness will impede the learning by constraining scores; thus, lower degree of regularization during this phase should be imposed to allow for exploration to find suitable hypotheses. In the later stage of training, we increase degree of regularization to allow for convergence when re-weighting error residuals.

To accommodate this design, we utilize the momentum of error residuals as an indicator of the current training status, adjusting the degree of the smoothness constraint accordingly. The mean residuals at the training step $T$ is given by

\mu_{T}=\frac{1}{|\Omega|}\sum_{x\in\Omega}[I_{t}(x)-I_{s}(x+\lfloor\hat{u}_{T% }(x)\rfloor)]^{2}.

(5)

Recall that $I_{s}(x),I_{t}(x)\in[0,1]$ and $\lfloor\hat{u}_{T}\rfloor$ denotes stopping gradient on estimated displacement. Note: the mean residuals are within range $\mu_{T}\in[0,1]$ . We then apply a cosine function that is monotonically decreasing concavely as the activation

b_{T}=\cos\frac{\pi}{2}\mu_{T}.

(6)

To compute the momentum of residuals $m_{T}$ , we apply the exponential moving average across different time steps with decay factor $\gamma=0.99$ and $m_{0}=0$ as

m_{T}=\gamma m_{T-1}+(1-\gamma)b_{T}.

(7)

We use the computed momentum $m_{T}\in[0,1]$ as the adaptive weight to reflect the training progress and use it to guide the diffusion regularizer:

\mathcal{L}_{\text{smooth}}=m_{T}\frac{1}{|\Omega|}\sum_{x\in\Omega}\|\nabla% \hat{S}(x)\|^{2}.

(8)

By utilizing the scoring estimator during training, we prevent the displacement estimator from escaping local minima by re-weighting the error residuals with a smooth pixel-wise correspondence scoring, leading to registration performance improvement (Tab. 1). The effectiveness of momentum-based weighting is shown in Fig. 5. To exploit the characteristics of the displacement estimator and scoring estimator, a proper optimization strategy needs to be designed to ensure the effectiveness of the proposed adaptive framework as detailed below.

4.2 Optimization of proposed adaptive framework

Our loss for the scoring estimator $g_{\phi}(\cdot)$ is a combination of Eqs. 3, 4 and 8

\mathcal{L}_{\text{se}}=\mathcal{L}_{\text{ucs}}+\alpha\mathcal{L}_{\text{reg}% }+\beta\mathcal{L}_{\text{smooth}}

(9)

with hyperparameter $\alpha$ and $\beta$ to modulate the trade-off between different terms.

To optimize displacement estimator $f_{\theta}(\cdot)$ and scoring estimator $g_{\phi}(\cdot)$ in training, we propose a collaborative strategy summarized in Algorithm 1. Since the parameter update of the displacement estimator and scoring estimator is correlated, as defined in Eq. 2 and Eq. 9, the noise in gradients tends to get exacerbated in the early stage of training when both estimators fail to provide each other an accurate prediction in order to properly optimize. To prevent this, we propose a warm-up stage that trains $f_{\theta}(\cdot)$ and $g_{\phi}(\cdot)$ individually for $N_{w}$ epochs to reduce the error propagation in between. After the warm-up stage, we perform alternating optimization for both estimators.

Data: Source image

I_{s}

and target image

I_{t}

Result: Estimated displacement

\hat{u}

1 Initialization,

N_{w}

: warm-up epochs,

N

: number of epochs;

2 while epoch $i$ to $N$ do

3 if $i<N_{w}$ then

4 flag_disp = True, flag_score = False;

6 else if $N_{w}\leq i<2N_{w}$ then

7 flag_disp = False, flag_score = True;

9 else

10 flag_disp = True, flag_score = True;

12 if flag_disp then

\hat{u}=f_{\theta}(I_{s},I_{t})

;

14 if flag_score then

\hat{S}=g_{\phi}(I_{s})

;

17 else

\hat{S}=\mathbbm{1}

;

\theta=\theta-\eta\frac{\partial\mathcal{L}_{\text{de}}(I_{s}(x+\hat{u}),I_{t}% ,\lfloor\hat{S}\rfloor)}{\partial\theta}

;

22 if flag_score then

\hat{u}=f_{\theta}(I_{s},I_{t})

;

\hat{S}=g_{\phi}(I_{s})

;

\phi=\phi-\eta\frac{\partial\mathcal{L}_{\text{se}}(I_{s}(x+\lfloor\hat{u}% \rfloor),I_{t},\hat{S})}{\partial\phi}

;

Algorithm 1 Training loop

By training our proposed adaptive framework using the above optimization strategy (Algorithm 1), we prevent the displacement estimator from drifting due to noisy gradients induced by nuisance variables such as covisibility and noise by re-weighting the error residuals using our adaptive correspondence scoring.

5 Experiments

Datasets. We tested our proposed framework on three different cardiac datasets, including two 2D public datasets, containing two medical imaging modalities (MRI and ultrasound), and one private 3D dataset. To construct the image pair, we selected the end-diastole (ED) frame as the source image and the end-systole (ES) frame as the target image. ED to ES registration is considered long-range and most challenging in the cardiac sequence. The detailed steps of dataset preprocessing can be found in the Supp. Mat.

ACDC [6]. The ACDC dataset contains 2D human cardiac MRI from 150 patients with various cardiac conditions. We randomly selected 80 patients containing 751 image pairs for training, 20 patients containing 200 pairs for validation, and the remaining 50 patients containing 538 pairs for testing.

CAMUS [24]. The CAMUS dataset contains 2D human cardiac ultrasound images from 500 subjects. We randomly selected 600 image pairs for training, 200 pairs for validation, and another 200 pairs for testing.

Private 3D Echocardiography [35, 1]. To validate the effectiveness of our proposed method, we also tested on a private 3D echocardiography dataset and reported our results in the Supp. Mat.

Evaluation metrics. We evaluate our results quantitatively by warping myocardium segmentation in the source image with our predicted displacement $\hat{u}$ and compute anatomical conformance in terms of (1) Dice coefficient score (DSC) (2) Hausdorff distance (HD) and (3) Average surface distance (ASD) with ground truth segmentation in the target image. Definitions of metrics can be found in the Supp. Mat.

Implementation details. All our experiments were implemented using Pytorch on NVIDIA V100/A5000 GPUs. The architecture of the scoring estimator is implemented on a U-Net backbone. Code is provided to ensure reproducibility. To show the versatility of our proposed framework, we tested on three representative unsupervised registration architectures for each dataset: (1) Voxelmorph [5], (2) Transmorph [8] and (3) Diffusemorph [22]. We also tested on c-LapIRN architecture [27] and present the quantitative results in the Supp. Mat. due to the page limit. The descriptions of baselines are summarized below and remaining details (i.e., hyperparameters) can be found in the Supp. Mat.

Adaptive weighting schemes. We present two baselines AdaReg and AdaFrame that, like us, utilize adaptive weighting schemes during training.

AdaReg [37]. We compute first the local error residual $\rho=|I_{t}(x)-I_{s}(x+\hat{u})|$ and the global residual $\sigma=\frac{1}{\frac{1}{|\Omega|}\sum_{x\in\Omega}|I_{t}(x)-I_{s}(x+\hat{u})|}$ using the displacement estimator prediction $\hat{u}$ . We then compute the adaptive regularization weighting as $\alpha(x)=\exp(-\frac{c\rho}{\sigma})$ , where $c=50$ . We then optimize the displacement estimator using the loss $\mathcal{L}_{\text{AdaReg}}=\frac{1}{|\Omega|}\sum_{x\in\Omega}[I_{t}(x)-I_{s}% (x+\hat{u})]^{2}+\lambda\|\alpha(x)\nabla\hat{u}\|^{2}$ .

AdaFrame [36]. We first compute the local error residual $\delta=|I_{t}(x)-I_{s}(x+\hat{u})|$ and then normalize it with its mean $\mu$ and standard deviation $\sigma$ as $\rho=\frac{\delta-\mu}{\sqrt{\sigma^{2}+\epsilon}}$ . We then compute the adaptive weight activated by a scaled and shifted sigmoid function as $\alpha(x)=1-\frac{1}{1+\exp(-(a\rho-b))}$ where $a=\frac{a_{0}}{\mu+\epsilon}$ and $b=b_{0}(1-\cos\pi\mu)$ . We choose $a_{0}=0.1$ and $b_{0}=10$ . We then optimize the displacement estimator using the loss $\mathcal{L}_{\text{AdaFrame}}=\frac{1}{|\Omega|}\sum_{x\in\Omega}\alpha(x)[I_{% t}(x)-I_{s}(x+\hat{u})]^{2}+\lambda\|\nabla\hat{u}\|^{2}$ .

Aleatoric uncertainty estimation. We present two baselines that utilize aleatoric uncertainty estimates in terms of predictive variance to weight the objective during training adaptively. In order to obtain the predictive variance, we utilize a U-Net as variance estimator $h(\cdot)$ that takes target image $I_{t}$ and warped source $I_{s}(x+\hat{u})$ as input to predict the noise variance $\hat{\sigma}_{I}=h(I_{t},I_{s}(x+\hat{u}))$ . The variance estimator is trained jointly with the displacement estimator.

NLL [21]. To jointly train the displacement and variance estimators, we compute the loss as $\mathcal{L}_{\text{NLL}}=\frac{1}{|\Omega|}\sum_{x\in\Omega}\frac{1}{\hat{% \sigma}_{I}^{2}(x)}[I_{t}(x)-I_{s}(x+\hat{u})]^{2}+\log\hat{\sigma}^{2}_{I}(x)$

$\beta$ -NLL [32]. To jointly train the displacement and variance estimators under $\beta$ -NLL objective, we compute the loss as $\mathcal{L}_{\beta\text{-NLL}}=\frac{1}{|\Omega|}\sum_{x\in\Omega}\lfloor\hat{% \sigma}_{I}(x)^{2\beta}\rfloor(\frac{1}{\hat{\sigma}_{I}^{2}(x)}[I_{t}(x)-I_{s% }(x+\hat{u})]^{2}+\log\hat{\sigma}^{2}_{I}(x))$ where $\beta=0.5$ .

6 Results

Registration accuracy. We present our quantitative evaluation in Tab. 1, where our proposed method shows consistent improvement over other baselines. Surprisingly, we observe that incorporating baselines (adaptive weighting schemes and uncertainty estimation) tend to harm performance. Unlike them, our method improves over the base models. Compared to the base models (Voxelmorph [5], Transmorph [8], and Diffusemorph [22]), which are the second-best methods in each architecture, our proposed method performs better especially in terms of Dice score on each dataset. We additionally conducted a paired t-test of our proposed method to show that our consistent improvement is statistically significant as in Tab. 3.

Table 1: Comparisons on contour-based metrics. Units: DSC (%) HD (vx) ASD (vx). Our method consistently improves on registration accuracy across different architectures and datasets.

ACDC [6] CAMUS [24] DSC $\uparrow$ HD $\downarrow$ ASD $\downarrow$ DSC $\uparrow$ HD $\downarrow$ ASD $\downarrow$ Undeformed 47.98 7.91 2.32 66.77 10.87 2.61 Elastix [23] 77.26 4.95 1.28 80.18 10.02 1.81 CNN Voxelmorph [5] 79.48 4.79 1.27 81.50 8.72 1.74 NLL [21] 76.49 5.46 1.45 75.24 11.05 2.20 $\beta$ -NLL [32] 78.74 5.07 1.33 79.75 9.39 1.93 AdaFrame [36] 66.38 5.80 1.67 77.88 10.54 1.93 AdaReg [37] 78.75 5.13 1.33 79.31 9.78 1.88 AdaCS (Ours) 80.50 4.69 1.23 81.74 8.55 1.72 Transformer Transmorph [8] 76.94 5.51 1.30 79.24 10.30 1.79 NLL [21] 73.12 7.22 1.27 75.08 11.60 1.79 $\beta$ -NLL [32] 75.74 6.12 1.29 77.39 10.99 1.86 AdaFrame [36] 67.95 5.72 1.59 78.06 9.86 1.91 AdaReg [37] 76.22 5.68 1.29 78.12 10.62 1.84 AdaCS (Ours) 78.39 5.40 1.32 79.64 9.85 1.79 Diffusion Diffusemorph [22] 67.38 5.80 1.67 75.23 9.80 2.07 NLL [21] 66.24 5.84 1.73 74.78 10.62 2.15 $\beta$ -NLL [32] 66.31 5.93 1.74 73.27 9.85 2.25 AdaFrame [36] 59.78 6.46 1.93 75.04 10.41 2.10 AdaReg [37] 69.41 6.25 1.78 74.36 10.66 2.21 AdaCS (Ours) 72.09 5.35 1.53 77.65 9.82 1.99

Table 2: Ablation study on loss terms. Note: Removing

\mathcal{L}_{\text{reg}}

leads to a degenerate solution of all-zeros for the scoring estimator

g_{\phi}(\cdot)

Loss ACDC [6] CAMUS [24] $\mathcal{L}_{\text{reg}}$ ${\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}% \pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\mathcal{L}_{\text{% smooth}}}$ DSC $\uparrow$ HD $\downarrow$ ASD $\downarrow$ DSC $\uparrow$ HD $\downarrow$ ASD $\downarrow$ vxm ✓ ✗ 80.24 4.64 1.23 81.58 8.89 1.74 ✓ ✓ 80.50 4.69 1.23 81.74 8.55 1.72 tsm ✓ ✗ 77.84 5.41 1.33 79.58 10.17 1.81 ✓ ✓ 78.39 5.40 1.32 79.64 9.85 1.79 dfm ✓ ✗ 71.62 5.56 1.58 77.32 9.71 2.00 ✓ ✓ 72.09 5.35 1.53 77.65 9.82 1.99

Table 3: Paired t-test of our proposed method vs second-best method in Tab. 1 for each dataset in terms of DSC. The p-values shows that our improvement is statistically significant.

Voxelmorph [5] Transmorph [8] Diffusemorph [22] ACDC [6] $p=5.85\times 10^{-16}$ $p=0.046$ $p=1.01\times 10^{-41}$ CAMUS [24] $p=0.01$ $p=0.01$ $p=4.56\times 10^{-41}$

To quantitatively evaluate the registration performance, we plot the warped source image along with segmentation overlayed with ground truth shown in Fig. 3. Our proposed framework is visually better across datasets with different modalities and frameworks. Both the quantitative and qualitative evaluations show that our proposed framework captures more accurate correspondence, validating the effectiveness of our proposed adaptive scoring.

Tab. 1 shows that both uncertainty weighting strategies, NLL [21] and $\beta$ -NLL [32], did not yield improvements over the baseline. This observation indicates that optimization of uncertainty estimation cannot be approached with a simple joint optimization alongside image registration. Furthermore, we observe that adaptive regularization techniques, exemplified by AdaFrame [36] and AdaReg [37], were ineffective in enhancing performance, which could be attributed to the computation of adaptive weights based on statistical assumptions of error residuals contrary to our proposed unsupervised learning approach. We note a discernible decline in performance when transitioning from Voxelmorph [5], which utilizes ConvNets, to Transmorph [8] and Diffusemorph [22]. This decline underscores the challenges faced by transformer and diffusion-based models when applied to datasets comprising smaller-scale medical images. We also present some failure cases in the Supp. Mat., where the myocardium (typically challenging due to irregular volumes) is considerably thin.

Results of correspondence scoring map and adaptive weighting. To qualitatively evaluate the effectiveness of our proposed correspondence scoring during training, we present Fig. 4 to show that our predicted scoring map accurately identifies the regions with low correspondence and prevents the displacement estimator $f_{\theta}(\cdot)$ from drifting away by re-weighting the error residuals.

In Fig. 5, we illustrate the behavior of our proposed adaptive momentum-guided weight, denoted as $m_{T}$ (Eq. 7). This weight dynamically adapts during training, responding to the evolving characteristics of error residuals (Eq. 5). The observed evolution of our adaptive weight aligns with our design objective, progressively enhancing smoothness penalization as training toward convergence. The effectiveness of $m_{T}$ is also evidenced by our quantitative ablation analysis in Fig. 5, in which our proposed formulation generally achieves a better performance with the adaptive weight.

Ablation study. We present Tab. 2 for ablation studies to show the effectiveness of each loss term. Without $\mathcal{L}_{\text{reg}}$ and $\mathcal{L}_{\text{smooth}}$ , the predicted correspondence scoring map $\hat{S}$ will result in a degenerate solution of all zeros. By introducing regularization and smoothness constraints, the performance of our proposed framework steadily increases across various datasets and registration architectures. We also perform an ablation study comparing with robust losses including mutual information (MI) and normalized cross correlation (NCC) shown in Tab. 4 in the Supp. Mat. Our conclusions still hold as our method yields consistent performance improvements over the baselines.

Training stability and sensitivity to hyperparameters. We first show the training stability by plotting the training curves of our proposed unsupervised scoring term $\mathcal{L}_{\text{ucs}}$ in Eq. 3 and regularization term $\mathcal{L}_{\text{reg}}$ in Eq. 4 in the left figure of Fig. 6, demonstrating both terms can be jointly minimized. We further conduct a sensitivity study on hyperparameters $\alpha$ and $\beta$ in Eq. 9. While they are not very sensitive (change within $\approx$ 1% Dice for various $\alpha,\beta$ ), tuning more can indeed yield better results.

Table 5: Dice standard deviation, % of

|J_{\hat{u}}|\leq 0

for displacement smoothness evaluation, and training time (measured in hours) of 300 epochs.

ACDC CAMUS DSC $\uparrow$ $|J_{\hat{u}}|\leq 0$ $\downarrow$ $T_{\text{train}}$ DSC $\uparrow$ $|J_{\hat{u}}|\leq 0$ $\downarrow$ $T_{\text{train}}$ Voxelmorph 79.48 $\pm$ 9.23 0.29 0.26 81.50 $\pm$ 5.58 0.60 0.26 AdaCS (Ours) 80.50 $\pm$ 8.58 0.22 0.43 81.74 $\pm$ 5.36 0.30 0.43 Transmorph 76.94 $\pm$ 8.93 0.76 0.60 79.24 $\pm$ 6.06 1.41 0.59 AdaCS (Ours) 78.39 $\pm$ 9.06 0.57 0.87 79.64 $\pm$ 6.37 0.70 0.70 Diffusemorph 67.38 $\pm$ 15.65 0.05 1.08 75.23 $\pm$ 8.71 0.05 1.06 AdaCS (Ours) 72.09 $\pm$ 13.60 0.06 1.86 77.65 $\pm$ 7.64 0.08 1.91

Evaluation on smoothness and training time. We provide standard deviation and measure of smoothness (percentage of estimated displacement $\hat{u}$ with a negative Jacobian determinant, i.e., $|J_{\hat{u}}|\leq 0$ ) in Table 5. All methods have $<$ 1%, implying physical plausibility and smoothness. Our method produces smoother deformations in Voxelmorph and Transmorph and achieves similarly smooth deformations in Diffusemorph, with $<$ 0.1%.

7 Conclusion

In this paper, we propose an adaptive correspondence scoring framework for unsupervised image registration to prevent the displacement estimator from drifting away by noisy gradients caused by low correspondence due to the nuisance variables such as noise or covisibility during training. We introduce a unsupervised correspondence scoring estimation scheme with both scoring and momentum-guided adaptive regularizations to prevent the scoring estimator from a degenerate solution and ensure scoring map smoothness. We demonstrate the effectiveness of our proposed framework on three representative registration architectures and we show consistent improvement compared with other baselines across three medical image datasets with diverse modalities. Though our proposed framework is promising, the hyperparameters in the scoring estimator loss does need to be tuned for each architecture on each dataset. Nonetheless, performance is not too sensitive to hyperparameter tuning, so one does not need to tune meticulously. In the future, we aim to explore an amortized hyperparameter optimization scheme during training to reduce the computation and validate our proposed framework on clinical datasets for further impact.

Acknowledgements

This work is supported by NIH/NHLBI grant R01HL121226.

References

[1] Ahn, S.S., Ta, K., Thorn, S.L., Onofrey, J.A., Melvinsdottir, I.H., Lee, S., Langdon, J., Sinusas, A.J., Duncan, J.S.: Co-attention spatial transformer network for unsupervised motion tracking and cardiac strain analysis in 3d echocardiography. Medical image analysis 84, 102711 (2023)
[2] Ashburner, J.: A fast diffeomorphic image registration algorithm. NeuroImage 38(1), 95–113 (Oct 2007)
[3] Avants, B.B., Epstein, C.L., Grossman, M., Gee, J.C.: Symmetric diffeomorphic image registration with cross-correlation: Evaluating automated labeling of elderly and neurodegenerative brain. Medical Image Analysis 12(1), 26–41 (Feb 2008)
[4] Bae, G., Budvytis, I., Cipolla, R.: Estimating and Exploiting the Aleatoric Uncertainty in Surface Normal Estimation. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 13117–13126. IEEE, Montreal, QC, Canada (Oct 2021)
[5] Balakrishnan, G., Zhao, A., Sabuncu, M.R., Guttag, J., Dalca, A.V.: VoxelMorph: A Learning Framework for Deformable Medical Image Registration. IEEE Transactions on Medical Imaging 38(8), 1788–1800 (Aug 2019)
[6] Bernard, O., Lalande, A., Zotti, C., Cervenansky, F., Yang, X., Heng, P.A., Cetin, I., Lekadir, K., Camara, O., Gonzalez Ballester, M.A., Sanroma, G., Napel, S., Petersen, S., Tziritas, G., Grinias, E., Khened, M., Kollerathu, V.A., Krishnamurthi, G., Rohe, M.M., Pennec, X., Sermesant, M., Isensee, F., Jager, P., Maier-Hein, K.H., Full, P.M., Wolf, I., Engelhardt, S., Baumgartner, C.F., Koch, L.M., Wolterink, J.M., Isgum, I., Jang, Y., Hong, Y., Patravali, J., Jain, S., Humbert, O., Jodoin, P.M.: Deep Learning Techniques for Automatic MRI Cardiac Multi-Structures Segmentation and Diagnosis: Is the Problem Solved? IEEE Transactions on Medical Imaging 37(11), 2514–2525 (Nov 2018)
[7] Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., Zhou, Y.: TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation (Feb 2021), arXiv:2102.04306 [cs]
[8] Chen, J., Frey, E.C., He, Y., Segars, W.P., Li, Y., Du, Y.: TransMorph: Transformer for unsupervised medical image registration. Medical Image Analysis 82, 102615 (Nov 2022)
[9] Dalca, A.V., Balakrishnan, G., Guttag, J., Sabuncu, M.R.: Unsupervised Learning for Fast Probabilistic Diffeomorphic Registration. vol. 11070, pp. 729–738 (2018), arXiv:1805.04605 [cs]
[10] Dalca, A.V., Balakrishnan, G., Guttag, J., Sabuncu, M.R.: Unsupervised learning of probabilistic diffeomorphic registration for images and surfaces. Medical Image Analysis 57(1), 226–236 (Oct 2019)
[11] Davatzikos, C.: Spatial Transformation and Registration of Brain Images Using Elastically Deformable Models. Computer Vision and Image Understanding 66(2), 207–222 (May 1997)
[12] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Jun 2021), arXiv:2010.11929 [cs]
[13] Hering, A., et al.: Learn2reg: comprehensive multi-task medical image registration challenge, dataset and evaluation in the era of deep learning (2022)
[14] Hill, D.L.G., Batchelor, P.G., Holden, M., Hawkes, D.J.: Medical image registration
[15] Hoffmann, M., Billot, B., Greve, D.N., Iglesias, J.E., Fischl, B., Dalca, A.V.: SynthMorph: learning contrast-invariant registration without acquired images. IEEE Transactions on Medical Imaging 41(3), 543–558 (Mar 2022), arXiv:2004.10282 [cs, eess, q-bio]
[16] Hong, B.W., Koo, J.K., Burger, M., Soatto, S.: Adaptive Regularization of Some Inverse Problems in Image Analysis (May 2017), arXiv:1705.03350 [cs]
[17] Hong, B.W., Koo, J.K., Dirks, H., Burger, M.: Adaptive Regularization in Convex Composite Optimization for Variational Imaging Problems (Feb 2017), arXiv:1609.02356 [cs]
[18] Hoopes, A., Hoffmann, M., Fischl, B., Guttag, J., Dalca, A.V.: HyperMorph: Amortized Hyperparameter Learning for Image Registration (May 2021), arXiv:2101.01035 [cs, eess]
[19] Hüllermeier, E., Waegeman, W.: Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Machine Learning 110(3), 457–506 (Mar 2021)
[20] Keelan, R., Shimada, K., Rabin, Y.: GPU-Based Simulation of Ultrasound Imaging Artifacts for Cryosurgery Training. Technology in Cancer Research & Treatment 16(1), 5–14 (Feb 2017), publisher: SAGE Publications Inc
[21] Kendall, A., Gal, Y.: What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? (Oct 2017), arXiv:1703.04977 [cs]
[22] Kim, B., Han, I., Ye, J.C.: DiffuseMorph: Unsupervised Deformable Image Registration Using Diffusion Model. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. pp. 347–364. Lecture Notes in Computer Science, Springer Nature Switzerland, Cham (2022)
[23] Klein, S., Staring, M., Murphy, K., Viergever, M., Pluim, J.: elastix: A Toolbox for Intensity-Based Medical Image Registration. IEEE Transactions on Medical Imaging 29(1), 196–205 (Jan 2010)
[24] Leclerc, S., Smistad, E., Pedrosa, J., Ostvik, A., Cervenansky, F., Espinosa, F., Espeland, T., Berg, E.A.R., Jodoin, P.M., Grenier, T., Lartizien, C., Dhooge, J., Lovstakken, L., Bernard, O.: Deep Learning for Segmentation Using an Open Large-Scale Dataset in 2D Echocardiography. IEEE Transactions on Medical Imaging 38(9), 2198–2210 (Sep 2019)
[25] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. pp. 10012–10022 (2021)
[26] Ma, T., Dai, X., Zhang, S., Wen, Y.: PIViT: Large Deformation Image Registration with Pyramid-Iterative Vision Transformer. In: Greenspan, H., Madabhushi, A., Mousavi, P., Salcudean, S., Duncan, J., Syeda-Mahmood, T., Taylor, R. (eds.) Medical Image Computing and Computer Assisted Intervention – MICCAI 2023. pp. 602–612. Lecture Notes in Computer Science, Springer Nature Switzerland, Cham (2023)
[27] Mok, T.C.W., Chung, A.C.S.: Conditional Deformable Image Registration with Convolutional Neural Network. In: de Bruijne, M., Cattin, P.C., Cotin, S., Padoy, N., Speidel, S., Zheng, Y., Essert, C. (eds.) Medical Image Computing and Computer Assisted Intervention – MICCAI 2021. pp. 35–45. Lecture Notes in Computer Science, Springer International Publishing, Cham (2021)
[28] Monteiro, M., Le Folgoc, L., Coelho de Castro, D., Pawlowski, N., Marques, B., Kamnitsas, K., van der Wilk, M., Glocker, B.: Stochastic Segmentation Networks: Modelling Spatially Correlated Aleatoric Uncertainty. In: Advances in Neural Information Processing Systems. vol. 33, pp. 12756–12767. Curran Associates, Inc. (2020)
[29] Oliveira, F.P.M.: Medical Image Registration: a Review
[30] Qin, Y., Li, X.: FSDiffReg: Feature-wise and Score-wise Diffusion-guided Unsupervised Deformable Image Registration for Cardiac Images (Jul 2023), arXiv:2307.12035 [cs]
[31] Rueckert, D., Sonoda, L., Hayes, C., Hill, D., Leach, M., Hawkes, D.: Nonrigid registration using free-form deformations: application to breast MR images. IEEE Transactions on Medical Imaging 18(8), 712–721 (Aug 1999)
[32] Seitzer, M., Tavakoli, A., Antic, D., Martius, G.: On the Pitfalls of Heteroscedastic Uncertainty Estimation with Probabilistic Neural Networks (Apr 2022), arXiv:2203.09168 [cs, stat]
[33] Shi, J., He, Y., Kong, Y., Coatrieux, J.L., Shu, H., Yang, G., Li, S.: XMorpher: Full Transformer for Deformable Medical Image Registration via Cross Attention (Jun 2022), arXiv:2206.07349 [cs]
[34] Song, J., Meng, C., Ermon, S.: Denoising Diffusion Implicit Models (Oct 2022), arXiv:2010.02502 [cs]
[35] Ta, K., Ahn, S.S., Thorn, S.L., Stendahl, J.C., Zhang, X., Langdon, J., Staib, L.H., Sinusas, A.J., Duncan, J.S.: Multi-task learning for motion analysis and segmentation in 3d echocardiography. IEEE Transactions on Medical Imaging (2024)
[36] Wong, A., Fei, X., Hong, B.W., Soatto, S.: An Adaptive Framework for Learning Unsupervised Depth Completion. IEEE Robotics and Automation Letters 6(2), 3120–3127 (Apr 2021)
[37] Wong, A., Soatto, S.: Bilateral Cyclic Constraint and Adaptive Regularization for Unsupervised Monocular Depth Prediction. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5637–5646. IEEE, Long Beach, CA, USA (Jun 2019)
[38] Zhang, X., Dong, H., Gao, D., Zhao, X.: A comparative study for non-rigid image registration and rigid image registration. arXiv preprint arXiv:2001.03831 (2020)
[39] Zhang, X., Noga, M., Martin, D.G., Punithakumar, K.: Fully automated left atrium segmentation from anatomical cine long-axis mri sequences using deep convolutional neural network with unscented kalman filter. Medical image analysis 68, 101916 (2021)
[40] Zhang, X., Pak, D.H., Ahn, S.S., Li, X., You, C., Staib, L., Sinusas, A.J., Wong, A., Duncan, J.S.: Heteroscedastic uncertainty estimation for probabilistic unsupervised registration of noisy medical images. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer (2024)
[41] Zhang, X., You, C., Ahn, S., Zhuang, J., Staib, L., Duncan, J.: Learning Correspondences of Cardiac Motion from Images Using Biomechanics-Informed Modeling. In: Camara, O., Puyol-Antón, E., Qin, C., Sermesant, M., Suinesiaputra, A., Wang, S., Young, A. (eds.) Statistical Atlases and Computational Models of the Heart. Regular and CMRxMotion Challenge Papers. pp. 13–25. Lecture Notes in Computer Science, Springer Nature Switzerland, Cham (2022)

Supplementary Materials

Appendix 0.A Dataset details

0.A.1 ACDC [6]

The ACDC dataset, publicly accessible, comprises 2D cardiac MRI scans from 150 patients, with 100 subjects allocated for training and 50 for testing. Each sequence includes frames at end-diastole (ED) and end-systole (ES), along with corresponding myocardium labels. Our training set involves 80 randomly selected patients, the validation set consists of 20 patients, and the testing set comprises 50 patients. Extracting ED and ES image pairs is done in a slice-by-slice manner from the 2D longitudinal stacks. We perform a center crop for each slice pair, resulting in dimensions of $128\times 128$ with respect to the myocardium centroid in the ED frame. This process yields a total of 751 2D image pairs for training, 200 pairs for validation, and an additional 538 pairs for testing.

0.A.2 CAMUS [24]

The CAMUS dataset, available to the public, comprises 2D cardiac ultrasound images from 500 individuals. Each individual contributes two distinct images: one for a 2-chamber view and another for a 4-chamber view. For every image, both end-diastole (ED) and end-systole (ES) frames, along with myocardium segmentation labels, are provided. We crop each image pair to $128\times 128$ , and through random selection, we use 300 subjects for training, 100 subjects for validation, and 100 subjects for testing. This process results in a total of 600 2D image pairs for training, 200 pairs for validation, and an additional 200 pairs for testing.

0.A.3 Private 3D Echo

The private 3D echo dataset contains 99 cardiac ultrasound scans with 8 sequences from synthetic ultrasound, 40 sequences from in vivo canine, and another 51 sequences from in vivo porcine. The details of the acquisition are omitted to preserve anonymity in the review process. ED and ES frames are manually identified and myocardium segmentation labels are provided for each sequence by experienced radiologists. Each 3D image is resized to $64\times 64\times 64$ during training. We randomly selected 60 3D pairs for training, 19 pairs for validation, and another 20 pairs for testing. During testing, the estimated displacement is resized and rescaled to the original volume dimension and we compute anatomical scores of warped and target myocardium volumes afterwards.

Appendix 0.B Implementation details and hyperparameters

We trained all our models using NVIDIA V100/A5000 GPUs with 16/24 GB memory. Training the ACDC/CAMUS datasets requires approximately 1 hour for 300 epochs with a batch size of 8, whereas the private 3D Echo dataset demands around 6 hours for 150 epochs with a batch size of 4. On average, performing inference on each 2D image pair is completed in approximately 0.11 seconds, and each 3D Echo pair takes about 1 second. Both displacement and scoring estimators are trained using the Adam optimizer with a learning rate $1e^{-4}$ . We chose $\beta=1.5$ and $\alpha=0.02$ for the ACDC dataset and $\alpha=0.04$ for the CAMUS dataset as shown in Fig. 6 of the main paper.

Appendix 0.C Additional results on private 3D Echo

We present our qualitative result on the private 3D Echo dataset across three architectures with Fig. 7, where we show that our proposed framework achieves a better registration accuracy evidenced by better matching with the ground truth (yellow overlayed), smoother contour edges and locally consistent myocardial regions. We further show our quantitative results with Tab. 6, where our proposed framework outperforms other baselines on Voxelmorph and Diffusemorph architectures with comparable performance with the vanilla version on Transmorph architecture. This might be due to that the testing size of our private 3D Echo dataset is small with only 20 cases, which we aim to further evaluate on larger in vivo animal datasets.

Table 6: Contour-based metrics compared against baselines on our private 3D Echo dataset. DSC (%), HD (vx), ASD (vx).

Private 3D Echo DSC $\uparrow$ HD $\downarrow$ ASD $\downarrow$ CNN Voxelmorph [5] 77.23 16.27 2.14 NLL [21] 75.58 18.02 3.11 $\beta$ -NLL [32] 76.80 16.78 2.88 AdaFrame [36] 76.66 16.25 2.11 AdaReg [37] 76.28 17.41 2.17 AdaCS (Ours) 77.33 16.36 2.13 Transformer Transmorph [8] 74.89 17.45 2.14 NLL [21] 70.56 19.60 2.00 $\beta$ -NLL [32] 71.98 17.96 2.15 AdaFrame [36] 75.57 17.30 2.18 AdaReg [37] 74.39 18.14 2.13 AdaCS (Ours) 75.54 17.30 2.16 Diffusion Diffusemorph [22] 73.56 16.51 2.40 NLL [21] 70.36 17.81 2.50 $\beta$ -NLL [32] 69.38 17.13 2.54 AdaFrame [36] 71.47 16.57 2.39 AdaReg [37] 73.32 17.75 2.38 AdaCS (Ours) 73.91 16.74 2.39

We additionally show the qualitative result of the adaptive scoring map on 3D private Echo with Fig. 8 during training. Cross-sectional slices are extracted from the 3D volumes for visualization. We show that our proposed scoring map in the fifth column is able to identify regions with loss of correspondence (e.g. by comparing regions in the first two columns) and adaptive re-weight the error residuals to prevent the displacement estimator from being driven away by the spurious error residuals, leading to performance improvement, consistent to our finding in Sec. 5 from the main paper, where we discuss results of our correspondence scoring map and adaptive weighting (see Fig. 4 in the main paper).

Appendix 0.D Additional results on public 3D dataset

We trained on the OASIS (Learn2Reg 2021) training set with 414 brain 3D MR scans, using corrected images aligned to the template space. Each image was preprocessed to a 160 $\times$ 190 $\times$ 224 1mm isotropic grid. We tested on validation set (skull stripped) and reported results in the Table below, where we consistently improved the baselines in terms of Dice across different anatomies. For the cLapIRN experiments, we finetune the model instead of training from scratch given its computational burden [27].

Table 7: Results on OASIS 3D Dataset [13]. Dice scores (%) are reported for cortex (CX), Subcortical-Gray-Matter (SGM), White-Matter (WM), Cerebrospinal fluid (CSF), and their average.

CX SGM WM CSF AVG Voxelmorph 80.89 68.45 87.35 34.49 67.80 Voxelmorph + AdaCS 83.30 71.52 86.97 35.50 69.32 cLapIRN 88.50 77.90 89.53 49.70 76.41 cLapIRN + AdaCS 89.50 79.85 88.41 49.72 76.87

Appendix 0.E Additional results on c-LapIRN architecture

We compared our method with Elastix and our Voxelmorph-based approach outperforms both as detailed in the left table. Notably, Elastix takes $\approx$ 45s per image pair, making it over 400 times slower than learning-based methods ( $\approx$ 0.11s per pair). To evaluate the effectiveness of our proposed framework on registration approaches that use multiple deformation steps, we initially trained vanilla c-LapIRN (Mok et al., [27]) with NCC loss, but subsequently switched to MSE loss due to it being more stable and performant. Our proposed framework further improved c-LapIRN without needing meticulous hyperparameter tuning, demonstrating its applicability as shown in Tab. 8.

Table 8: Contour-based metrics against baselines on ACDC and CAMUS datasets.

ACDC [6] CAMUS [24] DSC $\uparrow$ HD $\downarrow$ ASD $\downarrow$ DSC $\uparrow$ HD $\downarrow$ ASD $\downarrow$ Undeformed 47.98 7.91 2.32 66.77 10.87 2.61 Elastix 77.26 4.95 1.28 80.18 10.02 1.81 vanilla c-LapIRN 54.46 7.33 2.12 68.06 11.53 2.56 c-LapIRN + MSE 70.29 6.41 1.44 73.68 11.93 2.05 c-LapIRN + AdaCS 70.38 6.25 1.43 75.09 11.42 1.96 Voxelmorph 79.48 4.79 1.27 81.50 8.72 1.74 Voxelmorph + AdaCS 80.50 4.69 1.23 81.74 8.55 1.72

Appendix 0.F Loss ablations

We conduct a study of robust losses including NCC, MI, and Tukey’s biweight loss (TBL, with $c=4.6851$ ) in Tab. 9 where our framework consistently improves across various settings.

Table 9: Comparison with other robust loss functions (NCC, MI, TBL).

ACDC CAMUS DSC $\uparrow$ HD $\downarrow$ ASD $\downarrow$ DSC $\uparrow$ HD $\downarrow$ ASD $\downarrow$ vxm NCC 78.55 4.94 1.29 77.01 10.23 1.89 MI 78.04 5.25 1.35 78.18 9.83 1.99 TBL 79.31 4.64 1.23 81.18 8.91 1.72 MAE 78.27 5.36 1.43 78.59 10.23 1.97 MSE 79.48 4.79 1.27 81.50 8.72 1.74 AdaCS 80.50 4.69 1.23 81.74 8.55 1.72 tsm NCC 73.77 6.64 1.12 73.03 11.87 1.70 MI 73.57 6.57 1.11 74.83 11.94 1.83 TBL 78.23 5.11 1.27 79.12 9.75 1.84 MAE 74.30 6.36 1.28 75.96 11.35 1.89 MSE 76.94 5.51 1.30 79.24 10.30 1.79 AdaCS 78.39 5.40 1.32 79.64 9.85 1.79 dfm NCC 70.25 5.29 1.58 75.67 10.75 2.06 MI 71.16 5.40 1.56 76.19 10.09 2.16 TBL 69.12 5.73 1.63 76.05 9.54 2.06 MAE 66.30 5.75 1.71 77.30 10.36 2.09 MSE 67.38 5.80 1.67 75.23 9.80 2.07 AdaCS 72.09 5.35 1.53 77.65 9.82 1.99

Appendix 0.G Altenative formulation

We present an analysis of an alternative formulation by taking both reconstructed target $\hat{I_{t}}$ and target $I_{t}$ using the same hyperparameter settings. Our proposed formulation ( $I_{t}$ only) yields generally better performance while being more efficient (e.g. only takes one image), especially in the ACDC dataset as shown in Tab. 10.

Table 10: Comparison with an alternative formulation.

ACDC [6] CAMUS [24] DSC $\uparrow$ HD $\downarrow$ ASD $\downarrow$ DSC $\uparrow$ HD $\downarrow$ ASD $\downarrow$ vxm $I_{t}$ and $\hat{I_{t}}$ 79.78 4.77 1.25 81.36 8.92 1.76 $I_{t}$ (Ours) 80.50 4.69 1.23 81.74 8.55 1.72 tsm $I_{t}$ and $\hat{I_{t}}$ 76.40 5.56 1.35 79.67 9.88 1.79 $I_{t}$ (Ours) 78.39 5.40 1.32 79.64 9.85 1.79 dfm $I_{t}$ and $\hat{I_{t}}$ 69.05 5.71 1.63 77.16 9.53 2.01 $I_{t}$ (Ours) 72.09 5.35 1.53 77.65 9.82 1.99

Appendix 0.H Limitation

We also present several failure cases when the volume of the myocardium is considerably thin, as shown in Fig. 9. This could stem from the complexity of necessitating precise predictions of displacement for unsupervised methods, a topic we intend to explore in future research. Additionally, though our methods are not super sensitive to hyperparameters as shown in Fig. 6 in the main paper, we still need to perform grid search of each $\alpha$ and $\beta$ for each dataset and architecture. We will explore a more efficient strategy with an amortized hyperparameter optimization in the future.

Appendix 0.I Additional results for visualization

To further illustrate the effectiveness of our proposed method, we present additional qualitative results compared with baselines in Fig. 10.