11institutetext: Biomedical Engineering, Yale University, New Haven, USA
11email: [email protected]
22institutetext: Computer Science, Yale University, New Haven, USA 33institutetext: Department of Electrical and Computer Engineering, The University of British Columbia, Vancouver, Canada 44institutetext: Vector Institute, Toronto, Canada 55institutetext: Radiology & Biomedical Imaging, Yale University, New Haven, USA 66institutetext: Electrical Engineering, Yale University, New Haven, USA 77institutetext: Department of Surgery, University of Pennsylvania, Philadelphia, USA 88institutetext: Department of Internal Medicine (Cardiology), Yale University, New Haven, USA

Heteroscedastic Uncertainty Estimation Framework for Unsupervised Registration

Xiaoran Zhang(🖂)🖂{}^{(\text{\Letter})}start_FLOATSUPERSCRIPT ( 🖂 ) end_FLOATSUPERSCRIPT Equal contribution.11    Daniel H. Pak* 11    Shawn S. Ahn 1177   
Xiaoxiao Li
334455
   Chenyu You 66    Lawrence H. Staib 1155   
Albert J. Sinusas
115588
   Alex Wong 22    James S. Duncan 1155
Abstract

Deep learning methods for unsupervised registration often rely on objectives that assume a uniform noise level across the spatial domain (e.g. mean-squared error loss), but noise distributions are often heteroscedastic and input-dependent in real-world medical images. Thus, this assumption often leads to degradation in registration performance, mainly due to the undesired influence of noise-induced outliers. To mitigate this, we propose a framework for heteroscedastic image uncertainty estimation that can adaptively reduce the influence of regions with high uncertainty during unsupervised registration. The framework consists of a collaborative training strategy for the displacement and variance estimators, and a novel image fidelity weighting scheme utilizing signal-to-noise ratios. Our approach prevents the model from being driven away by spurious gradients caused by the simplified homoscedastic assumption, leading to more accurate displacement estimation. To illustrate its versatility and effectiveness, we tested our framework on two representative registration architectures across three medical image datasets. Our method consistently outperforms baselines and produces sensible uncertainty estimates. The code is publicly available at https://voldemort108x.github.io/hetero_uncertainty/.

1 Introduction

Deformable image registration solves for a dense pixel-wise displacement map that aligns one image to another. It is a key medical image analysis technique that enables disease progression monitoring and surgical guidance [8, 18, 24, 31]. Classical methods (e.g. elastic-type models [13], free-form deformation with b-splines [21], diffeomorphic models [1]) are often computationally expensive and impractical for large-scale analyses. Thus, numerous works have proposed training a neural network to accelerate the test-time prediction of the displacement map [3, 5, 9, 11, 32, 30]. Such frameworks commonly utilize an unsupervised objective by minimizing the mean-square error (MSE) between the warped and target images. By using this objective, the existing frameworks assume an additive homoscedastic Gaussian image noise, i.e. ϵ𝒩(0,σ2)similar-toitalic-ϵ𝒩0superscript𝜎2\epsilon\sim\mathcal{N}(0,\sigma^{2})italic_ϵ ∼ caligraphic_N ( 0 , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ), with a constant noise variance across the spatial domain. However, this assumption is problematic for medical images, where the noise is intrinsically heteroscedastic and input-dependent (e.g. MRI [7, 26] or ultrasound [19, 29]). The non-uniform noise variance across the image space (Fig. 1, left) is due to multiple factors including changes in anatomical structures or patient motion artifacts. The simplified homoscedastic assumption disregards such variations in noise levels and results in undesired penalization of false noise-induced outliers, causing unnatural and inaccurate deformations.

To mitigate this, we propose a probabilistic heteroscedastic noise modeling framework for unsupervised image registration (Fig. 1, right). Our method involves two modules: (1) a standard displacement estimator and (2) a variance estimator that predicts the input-specific heteroscedastic noise for each image pair. For proper utilization of the predicted noise, we introduce a novel adaptive weighting strategy based on the relative γ𝛾\gammaitalic_γ-exponentiated signal-to-noise ratio (SNR). This enables the successful convergence of our collaborative training strategy, where the two estimators improve upon each other via information exchange and loss calibration. We validated the effectiveness and versatility of our proposed framework with extensive experiments using two representative neural network architectures and three cardiac datasets. Our proposed framework can be operated in a plug-and-play manner, and provides sensible heteroscedastic uncertainty measures to reflect spatially varying noise. This work may help provide a new perspective in data-driven input noise modeling for deep learning-based unsupervised registration.

Our contributions include (1) We first analyze a naive approach of applying heteroscedastic noise modeling to unsupervised registration and identify the pitfalls of such an approach. (2) From which, we propose a probabilistic framework for data-driven estimation of heteroscedastic uncertainty for unsupervised registration. (3) We introduce an adaptive γ𝛾\gammaitalic_γ-exponentiated relative SNR weighting strategy for proper loss calibration. (4) We demonstrate the effectiveness and versatility of our method on two representative registration architectures across three datasets, demonstrating consistent statistically significant improvements over the baselines while providing sensible uncertainty maps.

2 Related works

2.1 Unsupervised image registration

Voxelmorph is a popular convolutional neural network (CNN) model for unsupervised registration that was trained with the MSE loss, which assumes homoscedastic input noise [3]. Extensions of this work include Voxelmorph-diff [6], Hypermorph [11], and Synthmorph [9] with similar assumptions. Transformers have also been explored for unsupervised registration by replacing the convolutional encoders in the Voxelmorph U-Net with swin transformer layers [5]. Transformers can increase the receptive field of the encoding branch, potentially improving the long-range modeling capability [15]. Further extensions include replacing the convolutional decoders with transformer layers [23] and introducing multi-scale pyramids [16].

In this paper, we selected Voxelmorph [3] and Transmorph [5] as the representative registration architectures and tested our proposed framework and other baselines on both across different datasets. We also compared Voxelmorph-diff [6] with a direct extension of our framework that incorporates the displacement uncertainty, in addition to our main contribution of modeling image uncertainty. The detailed analysis can be found in the Supp. Mat.

2.2 Heteroscedastic uncertainty estimation

To model the heteroscedastic uncertainty in imaging problems, Kendall et al. [12] proposed a joint mean-and-variance optimization strategy by minimizing the negative log-likelihood (NLL) under the maximum likelihood estimation (MLE) framework. This formulation has been effective in many applications, such as surface normal estimation [2] and image segmentation [17]. Seitzer et al. further advanced the formulation with β𝛽\betaitalic_β-NLL to address the undesired undersampling effects of the inverse variance weighting of the data fidelity term [22]. In this paper, we selected both NLL [12] and β𝛽\betaitalic_β-NLL [22] as our baselines for heteroscedastic uncertainty estimation.

2.3 Adaptive weighting schemes

A wide range of imaging problems is cast into optimizing an energy function, which contains a data-fidelity term and a regularization term. The relative importance between the two terms is usually tuned via hyperparameter optimization. However, such a strategy disregards the heteroscedastic nature of the error residuals [27]. Several adaptive weighting schemes have been proposed to address this issue by adjusting the weighting throughout optimization [10, 28]. Wong et al. [27] extended the formulation to multi-frame setting. In this paper, we selected AdaReg [28] and AdaFrame [27] as baselines to compare with our proposed adaptive signal-to-noise weighting scheme.

3 Analysis of a naive approach

Let I:Ω:𝐼ΩI:\Omega\rightarrow\mathbb{R}italic_I : roman_Ω → blackboard_R be an image and ΩdΩsuperscript𝑑\Omega\subseteq\mathbb{R}^{d}roman_Ω ⊆ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT its spatial domain. Unsupervised image registration aims to find the alignment between a moving image Imsubscript𝐼𝑚I_{m}italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT and a fixed image Ifsubscript𝐼𝑓I_{f}italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT, without ground truth information about warping displacement field z:Ωd:𝑧Ωsuperscript𝑑z:\Omega\rightarrow\mathbb{R}^{d}italic_z : roman_Ω → blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT. For each image pair (Im,If)subscript𝐼𝑚subscript𝐼𝑓(I_{m},I_{f})( italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ), we assume Imsubscript𝐼𝑚I_{m}italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT is a fixed parameter and Ifsubscript𝐼𝑓I_{f}italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT is a noisy observation of the warped image I^f=Imzsubscript^𝐼𝑓subscript𝐼𝑚𝑧\hat{I}_{f}=I_{m}\circ zover^ start_ARG italic_I end_ARG start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT = italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ∘ italic_z. Then, z𝑧zitalic_z is a sample from the posterior distribution p(z|If;Im)𝑝conditional𝑧subscript𝐼𝑓subscript𝐼𝑚p(z|I_{f};I_{m})italic_p ( italic_z | italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ; italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ).

With this formulation, the main task is finding the displacement posterior p(z|If;Im)𝑝conditional𝑧subscript𝐼𝑓subscript𝐼𝑚p(z|I_{f};I_{m})italic_p ( italic_z | italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ; italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) that will maximize the data likelihood p(If;Im)𝑝subscript𝐼𝑓subscript𝐼𝑚p(I_{f};I_{m})italic_p ( italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ; italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ). Since directly solving for p(z|If;Im)𝑝conditional𝑧subscript𝐼𝑓subscript𝐼𝑚p(z|I_{f};I_{m})italic_p ( italic_z | italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ; italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) is intractable, we apply the variational approach to approximate it using a neural network qθ(z|If;Im)subscript𝑞𝜃conditional𝑧subscript𝐼𝑓subscript𝐼𝑚q_{\theta}(z|I_{f};I_{m})italic_q start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_z | italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ; italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ). To minimize the discrepancy between approximate and real posterior distributions, we compute the KL-divergence and arrive at the negative evidence lower bound:

KL(qθ(z|If;Im)p(z|If;Im))=KL(qθ(z|If;Im)p(z))𝔼qθ(logp(If|z;Im))+const.\displaystyle\begin{split}&\text{KL}(q_{\theta}(z|I_{f};I_{m})\|p(z|I_{f};I_{m% }))=\text{KL}(q_{\theta}(z|I_{f};I_{m})\|p(z))\\ &\qquad\qquad\qquad\qquad\;\;\;-\mathbb{E}_{q_{\theta}}(\log p(I_{f}|z;I_{m}))% +\text{const.}\end{split}start_ROW start_CELL end_CELL start_CELL KL ( italic_q start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_z | italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ; italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) ∥ italic_p ( italic_z | italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ; italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) ) = KL ( italic_q start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_z | italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ; italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) ∥ italic_p ( italic_z ) ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL - blackboard_E start_POSTSUBSCRIPT italic_q start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( roman_log italic_p ( italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT | italic_z ; italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) ) + const. end_CELL end_ROW (1)

For the warped image distribution p(If|z;Im)𝑝conditionalsubscript𝐼𝑓𝑧subscript𝐼𝑚p(I_{f}|z;I_{m})italic_p ( italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT | italic_z ; italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ), we assume an additive zero-mean heteroscedastic Gaussian noise with spatially varying variance σI2:Ω+:superscriptsubscript𝜎𝐼2Ωsuperscript\sigma_{I}^{2}:\Omega\rightarrow\mathbb{R}^{+}italic_σ start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT : roman_Ω → blackboard_R start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT to incorporate input-dependent heteroscedasticity

p(If|z;Im)=𝒩(If;I^f,σI2).𝑝conditionalsubscript𝐼𝑓𝑧subscript𝐼𝑚𝒩subscript𝐼𝑓subscript^𝐼𝑓superscriptsubscript𝜎𝐼2\displaystyle p(I_{f}|z;I_{m})=\mathcal{N}(I_{f};\hat{I}_{f},\sigma_{I}^{2}).italic_p ( italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT | italic_z ; italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) = caligraphic_N ( italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ; over^ start_ARG italic_I end_ARG start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT , italic_σ start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) . (2)

Similar to [3, 12], we assume pixel-wise independence, isotropic Gaussian for the displacement posterior qθ(z|If;Im)=𝒩(z;μz,σz2𝕀)subscript𝑞𝜃conditional𝑧subscript𝐼𝑓subscript𝐼𝑚𝒩𝑧subscript𝜇𝑧superscriptsubscript𝜎𝑧2𝕀q_{\theta}(z|I_{f};I_{m})=\mathcal{N}(z;\mu_{z},\sigma_{z}^{2}\mathbb{I})italic_q start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_z | italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ; italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) = caligraphic_N ( italic_z ; italic_μ start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT , italic_σ start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_I ), and uniform Gaussian for the displacement prior p(z)=𝒩(0,𝕀)𝑝𝑧𝒩0𝕀p(z)=\mathcal{N}(0,\mathbb{I})italic_p ( italic_z ) = caligraphic_N ( 0 , blackboard_I ). Then, we arrive at the preliminary loss to be minimized for each image pair

=𝔼Ω[1σI2[IfI^f]2data+logσI2+λz2],\mathcal{L}=\mathbb{E}_{\Omega}\left[\underbrace{\frac{1}{\sigma_{I}^{2}}[I_{f% }-\hat{I}_{f}\|]^{2}}_{\mathcal{L}_{\text{data}}}+\log\sigma_{I}^{2}+\lambda\|% \nabla z\|^{2}\right],caligraphic_L = blackboard_E start_POSTSUBSCRIPT roman_Ω end_POSTSUBSCRIPT [ under⏟ start_ARG divide start_ARG 1 end_ARG start_ARG italic_σ start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG [ italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT - over^ start_ARG italic_I end_ARG start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ∥ ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT data end_POSTSUBSCRIPT end_POSTSUBSCRIPT + roman_log italic_σ start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_λ ∥ ∇ italic_z ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] , (3)

where I^f=Imzsubscript^𝐼𝑓subscript𝐼𝑚𝑧\hat{I}_{f}=I_{m}\circ zover^ start_ARG italic_I end_ARG start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT = italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ∘ italic_z. σI2superscriptsubscript𝜎𝐼2\sigma_{I}^{2}italic_σ start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT is estimated by our variance estimator as σ^I2=hϕ(If,I^f)superscriptsubscript^𝜎𝐼2subscriptitalic-ϕsubscript𝐼𝑓subscript^𝐼𝑓\hat{\sigma}_{I}^{2}=h_{\phi}(I_{f},\hat{I}_{f})over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = italic_h start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT , over^ start_ARG italic_I end_ARG start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ), using the fixed and moved image pair as input. z𝑧zitalic_z is estimated by our displacement estimator as z^=qθ(Im,If)^𝑧subscript𝑞𝜃subscript𝐼𝑚subscript𝐼𝑓\hat{z}=q_{\theta}(I_{m},I_{f})over^ start_ARG italic_z end_ARG = italic_q start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ), using the moving and fixed pair as input.

4 Our approach

4.1 Collaborative heteroscedastic uncertainty estimation

Thus far, the formulation implies a single objective (Eq. 3) for the optimization of both displacement and variance estimators. Even with perfectly estimated variance, such naïve application may result in undersampling of higher intensity regions [22], as their noise levels tend to be elevated due to the relatively preserved SNR throughout the image space. To mitigate this, we propose a learning strategy that utilizes separate objectives for the two modules, but with collaborative information exchange (Fig. 1). Specifically, the displacement estimator is optimized using Eq. 4, which largely depends on the quality of the predicted variance. In parallel, the variance estimator directly uses the displacement estimator output (i.e. the warped image) as a part of its input to predict the variance.

Refer to caption Refer to caption
Figure 1: Left: We propose a heteroscedastic uncertainty estimation scheme to adaptively weight the data-fidelity term accounting for the non-uniform variations of noise across the image. Right: Overview of our proposed method. The noise variance estimator uses a U-Net backbone that takes reconstructed frame I^fsubscript^𝐼𝑓\hat{I}_{f}over^ start_ARG italic_I end_ARG start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT along with frame Ifsubscript𝐼𝑓I_{f}italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT to predict the heteroscedastic variance for the noise in Eq. 2.

4.2 Adaptive weighting using exponentiated relative SNR

Our proposed formulation above offers the flexibility of designing separate objectives for the displacement and variance estimators. For the displacement loss, we addressed the undersampling issue by modifying the image fidelity weighting term from the inverse of absolute variance to the γ𝛾\gammaitalic_γ-exponentiated relative SNR. Our proposed displacement loss is defined as

θ=𝔼Ω[𝒯[(Ifσ^I)2γ]IfI^f22+λz^2],subscript𝜃subscript𝔼Ωdelimited-[]𝒯delimited-[]superscriptsubscript𝐼𝑓subscript^𝜎𝐼2𝛾superscriptsubscriptnormsubscript𝐼𝑓subscript^𝐼𝑓22𝜆superscriptnorm^𝑧2\displaystyle\mathcal{L}_{\theta}=\mathbb{E}_{\Omega}\left[\mathcal{T}\left[% \left(\frac{I_{f}}{\lfloor\hat{\sigma}_{I}\rfloor}\right)^{2\gamma}\right]\|I_% {f}-\hat{I}_{f}\|_{2}^{2}+\lambda\|\nabla\hat{z}\|^{2}\right],caligraphic_L start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT = blackboard_E start_POSTSUBSCRIPT roman_Ω end_POSTSUBSCRIPT [ caligraphic_T [ ( divide start_ARG italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT end_ARG start_ARG ⌊ over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT ⌋ end_ARG ) start_POSTSUPERSCRIPT 2 italic_γ end_POSTSUPERSCRIPT ] ∥ italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT - over^ start_ARG italic_I end_ARG start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_λ ∥ ∇ over^ start_ARG italic_z end_ARG ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] , (4)

where 𝒯()𝒯\mathcal{T}(\cdot)caligraphic_T ( ⋅ ) is the sigmoid activation function. The hyperparameter γ𝛾\gammaitalic_γ indicates the confidence of the displacement model on the estimated variance; when γ=0𝛾0\gamma=0italic_γ = 0 the image fidelity term reduces to MSE. Gradient stopping (denoted as \lfloor\cdot\rfloor⌊ ⋅ ⌋) was necessary to treat the weighting term as a scalar map and to prevent duplicate back-propagation by the two estimators.

For optimizing the variance estimator, we used the β𝛽\betaitalic_β-NLL objective [22]

ϕ=𝔼Ω[σ^I2β(1σ^I2IfI^f22+logσ^I2)].subscriptitalic-ϕsubscript𝔼Ωdelimited-[]superscriptsubscript^𝜎𝐼2𝛽1superscriptsubscript^𝜎𝐼2superscriptsubscriptnormsubscript𝐼𝑓subscript^𝐼𝑓22superscriptsubscript^𝜎𝐼2\displaystyle\mathcal{L}_{\phi}=\mathbb{E}_{\Omega}\left[\lfloor\hat{\sigma}_{% I}^{2\beta}\rfloor\left(\frac{1}{\hat{\sigma}_{I}^{2}}\|I_{f}-\lfloor\hat{I}_{% f}\rfloor\|_{2}^{2}+\log\hat{\sigma}_{I}^{2}\right)\right].caligraphic_L start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT = blackboard_E start_POSTSUBSCRIPT roman_Ω end_POSTSUBSCRIPT [ ⌊ over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_β end_POSTSUPERSCRIPT ⌋ ( divide start_ARG 1 end_ARG start_ARG over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ∥ italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT - ⌊ over^ start_ARG italic_I end_ARG start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ⌋ ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + roman_log over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) ] . (5)

Our network was implemented to output logσ^I2superscriptsubscript^𝜎𝐼2\log\hat{\sigma}_{I}^{2}roman_log over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT for numerical stability. We updated each estimator in an alternating fashion using two separate optimizers.

5 Datasets and results

5.1 Datasets

We tested our method on three distinctive cardiac datasets: (1) ACDC [4]: Human MRI, multiple 2D slices + time, 150 total sequences. (2) CAMUS [14]: Human echocardiography, 2D + time, 1000 total sequences from 500 patients, (3) Private 3D Echo: in vivo Porcine, in vivo canine, and synthetic echocardiography, 3D + time, 99 total sequences. All datasets included segmentations of the left ventricular myocardium at end-diastole (ED) and end-systole (ES). For all of our experiments, we warped the ED frame to reconstruct the ES frame. The training, validation, and testing splits were 60/20/20(%). Images were preprocessed with resizing and normalization (Supp. Mat.). Implementation details (hyperparameters, compute, etc.) are in the Supp. Mat.

5.2 Registration accuracy

5.2.1 Quantitative and qualitative evaluations.

We present extensive quantitative evaluations of our method performance, where we show a consistent improvement from both classical and deep learning baselines across all datasets (Table 1, 3) with statistical significance (Table 2). Qualitative visualizations also demonstrate improved registration performance of our approach with smoother contour edges and more locally consistent myocardium (Fig. 2).

Table 1: Contour-based metrics compared against baselines. Units: DSC (%) HD (px) ASD (px). Our method consistently improves across different architectures and datasets. The second-best method is highlighted with \dagger.

ACDC [4] CAMUS [14] DSC \uparrow HD \downarrow ASD \downarrow DSC \uparrow HD \downarrow ASD \downarrow Undeformed 47.98 7.91 2.32 66.77 10.87 2.61 Elastix [13] 77.26 4.95 1.28 80.18 10.02 1.81 CNN vxm (NCC) [3] 78.55 4.94 1.29 77.01 10.23 1.89 vxm (MI) [3] 78.04 5.25 1.35 78.18 9.83 1.99 vxm (MSE) [3] \dagger 80.20 4.64 1.24 81.76 8.93 1.70 NLL [12] 76.49 5.46 1.45 75.24 11.05 2.20 β𝛽\betaitalic_β-NLL [22] 78.74 5.07 1.33 79.75 9.39 1.93 AdaFrame [27] 66.38 5.80 1.67 77.88 10.54 1.93 AdaReg [28] 78.75 5.13 1.33 79.31 9.78 1.88 Ours 80.73 4.57 1.21 81.96 8.80 1.66 Transformer tsm (NCC) [5] 73.77 6.64 1.12 73.03 11.87 1.70 tsm (MI) [5] 73.57 6.57 1.11 74.83 11.94 1.83 tsm (MSE) [5] \dagger 76.94 5.51 1.30 79.24 10.30 1.79 NLL [12] 73.12 7.22 1.27 75.08 11.60 1.79 β𝛽\betaitalic_β-NLL [22] 75.74 6.12 1.29 77.39 10.99 1.86 AdaFrame [27] 67.95 5.72 1.59 78.06 9.86 1.91 AdaReg [28] 76.22 5.68 1.29 78.12 10.62 1.84 Ours 78.12 5.04 1.26 80.38 9.86 1.72

Table 2: Paired t-tests of ours v.s. \dagger in Table 1 in terms of DSC.

Voxelmorph Transmorph ACDC p=0.02𝑝0.02p=0.02italic_p = 0.02 p=1.02×1041𝑝1.02superscript1041p=1.02\times 10^{-41}italic_p = 1.02 × 10 start_POSTSUPERSCRIPT - 41 end_POSTSUPERSCRIPT CAMUS p=1.36×106𝑝1.36superscript106p=1.36\times 10^{-6}italic_p = 1.36 × 10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT p=0.005𝑝0.005p=0.005italic_p = 0.005

Table 3: Quantitative evaluation on 3D private Echo dataset using vxm-based architecture.

Private 3D Echo DSC \uparrow HD \downarrow ASD \downarrow Voxelmorph [3] 74.61 5.59 0.94 NLL [12] 72.59 6.35 0.97 β𝛽\betaitalic_β-NLL [22] 73.81 5.77 0.96 AdaFrame [27] 73.01 5.74 0.94 AdaReg [28] 73.41 5.91 0.95 Ours 75.04 5.55 0.93

Table 4: Effect of γ𝛾\gammaitalic_γ in Eq. 4 for σ^I2superscriptsubscript^𝜎𝐼2\hat{\sigma}_{I}^{2}over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT estimation.

ACDC [4] CAMUS [14] DSC \uparrow HD \downarrow ASD \downarrow DSC \uparrow HD \downarrow ASD \downarrow Ours (γ=0.25𝛾0.25\gamma=0.25italic_γ = 0.25) 79.74 4.74 1.26 82.07 8.53 1.65 Ours (γ=0.5𝛾0.5\gamma=0.5italic_γ = 0.5) 80.73 4.57 1.21 81.96 8.80 1.66 Ours (γ=0.75𝛾0.75\gamma=0.75italic_γ = 0.75) 80.00 4.69 1.24 81.82 8.45 1.66 Ours (γ=1𝛾1\gamma=1italic_γ = 1) 79.78 4.71 1.25 81.31 9.08 1.69

Refer to caption
Refer to caption
Figure 2: Qualitative evaluation of the registration accuracy via segmentation warping for all datasets (top two rows: Voxelmorph architecture [3], bottom two rows: Transmorph architecture [5]). Our method in the last column (overlayed with ground truth (GT) ES myocardium label in yellow) predicts more natural and accurate deformations compared to baselines, evidenced by better matching with the GT, smoother contour edges, and locally consistent myocardial region.

Our consistent improvements from the vanilla version of Voxelmorph (vxm) [3] and Transmorph (tsm) [5], which both utilize homoscedastic noise assumptions, especially confirm the benefit of heteroscedastic noise modeling (Table 1).

We note that both NLL [12] and β𝛽\betaitalic_β-NLL failed to improve upon the vanilla baselines (Table 1). This validates our analysis that the joint training of displacement and variance estimators would degrade registration performance due to the undersampling from inverse absolute variance weighting (Section 4). This also shows the advantage of our proposed collaborative learning framework, which provides the flexibility of designing separate objectives for the two estimators. Furthermore, we observe that existing adaptive weighting schemes of AdaReg [28] and AdaFrame [27] are ineffective, potentially due to the their assumptions on error residuals failing to represent the complicated real-world data distribution; this is in contrast to our proposed data-driven SNR-based weighting scheme. For 3D echo data, we show quantitative results in Table 3 and qualitative results in the Supp. Mat. Note: Improvements come solely from smoother optimization and does not incur additional complexity during inference.

5.2.2 Effect of γ𝛾\gammaitalic_γ.

The exponentiated hyperparameter γ𝛾\gammaitalic_γ provides the flexibility to adjust the estimated variance’s degree of influence on the displacement estimator during training. This can be viewed as a trade-off in the predicted uncertainty estimates, where γ=0𝛾0\gamma=0italic_γ = 0 means no confidence, reducing the adaptive weighting map to a uniform scalar map, and γ=1𝛾1\gamma=1italic_γ = 1 indicating full confidence. We empirically chose γ=0.5𝛾0.5\gamma=0.5italic_γ = 0.5 for the best performance across all datasets (Table 4).

5.3 Evaluation on heteroscedastic uncertainty

We utilize sparsification error plots [20] to provide quantitative measure of accuracy for our estimated variance σ^I2superscriptsubscript^𝜎𝐼2\hat{\sigma}_{I}^{2}over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. To obtain the plots, we removed one pixel at a time from largest to smallest uncertainty magnitudes, and measured the MSE of the remaining pixels (𝔼ΩΩx[IfI^f]2subscript𝔼ΩsubscriptΩ𝑥superscriptdelimited-[]subscript𝐼𝑓subscript^𝐼𝑓2\mathbb{E}_{\Omega-\Omega_{x}}[I_{f}-\hat{I}_{f}]^{2}blackboard_E start_POSTSUBSCRIPT roman_Ω - roman_Ω start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT - over^ start_ARG italic_I end_ARG start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT) [25]. An ideal sparsification error plot should monotonically decrease, which would indicate that the estimated uncertainty map is able to correctly identify pixels with the largest errors. Fig. 3 shows our estimated uncertainty is sensible across different datasets – based on the overall shape of the plots and the Area Under the Sparsification Error (AUSE) metrics – with similar calibration levels compared to β𝛽\betaitalic_β-NLL [22] and better calibration than NLL [12]. This illustrates the effectiveness of our variance estimator and optimization framework.

Refer to caption Refer to caption
Refer to caption
Figure 3: Left: Estimated σ^I2superscriptsubscript^𝜎𝐼2\hat{\sigma}_{I}^{2}over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT and the corresponding weighting map of (top row: ACDC [4]; bottom row: CAMUS [14]). Right: Sparsification error plots of σ^I2superscriptsubscript^𝜎𝐼2\hat{\sigma}_{I}^{2}over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. Both plots are from our proposed framework under Voxelmorph architecture [3]

We further provide a qualitative visualization on both datasets, where we demonstrate that our estimated σ^Isubscript^𝜎𝐼\hat{\sigma}_{I}over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT provides a sensible heteroscedastic noise variance map between the fixed image and reconstructed image according to our assumption (Fig. 3). From the figure, our estimated noise variance shown in the third column reflects the intensity mismatches of the corresponding regions between the fixed image Ifsubscript𝐼𝑓I_{f}italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT in the first column and our reconstructed/moved image (I^fsubscript^𝐼𝑓\hat{I}_{f}over^ start_ARG italic_I end_ARG start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT) in the second column. This corroborates the validity of our heteroscedastic noise assumption (Eq. 2) and shows the effectiveness of our proposed variance estimator. Furthermore, our computed adaptive weighting maps (last column, Fig. 3) accurately reflect the relative importance based on SNR, as evidenced by the highlighting of regions with potential imaging artifacts. Nonetheless, we do have failure cases, as shown in the Supp. Mat., where the myocardium (typically challenging due to irregular volumes) is considerably thin.

6 Conclusion

We propose a heteroscedastic uncertainty estimation framework for probabilistic unsupervised image registration to adaptively weight the displacement estimation with relative γ𝛾\gammaitalic_γ-exponentiated signal-to-noise, which improves registration performance from the commonly used homoscedastic assumption while also providing accurate and sensible uncertainty measures. Our proposed framework consists of a displacement and a variance estimator, optimized under an alternating collaborative strategy. We demonstrate the effectiveness and versatility of our proposed framework on two representative registration architectures across diverse cardiac datasets and show consistent statistically significant improvements over other baselines. Though our proposed framework is promising, it still relies on a manually crafted adaptive map on the data fidelity term, which might not be able to fully reflect the complicated characteristics of large-scale real-world data. Future work will aim to explore a more data-driven objective with further validation of clinical datasets for more potential impact.

{credits}

6.0.1 Acknowledgements

This work is supported by NIH grant R01HL121226.

6.0.2 \discintname

The authors have no competing interests to declare that are relevant to the content of this article.

References

  • [1] Ashburner, J.: A fast diffeomorphic image registration algorithm. NeuroImage 38(1), 95–113 (Oct 2007)
  • [2] Bae, G., Budvytis, I., Cipolla, R.: Estimating and Exploiting the Aleatoric Uncertainty in Surface Normal Estimation. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 13117–13126. IEEE, Montreal, QC, Canada (Oct 2021)
  • [3] Balakrishnan, G., Zhao, A., Sabuncu, M.R., Guttag, J., Dalca, A.V.: VoxelMorph: A Learning Framework for Deformable Medical Image Registration. IEEE Transactions on Medical Imaging 38(8), 1788–1800 (Aug 2019)
  • [4] Bernard, O., Lalande, A., Zotti, C., Cervenansky, F., Yang, X., Heng, P.A., Cetin, I., Lekadir, K., Camara, O., Ballester, M.A.G., et al.: Deep learning techniques for automatic mri cardiac multi-structures segmentation and diagnosis: is the problem solved? IEEE transactions on medical imaging 37(11), 2514–2525 (2018)
  • [5] Chen, J., Frey, E.C., He, Y., Segars, W.P., Li, Y., Du, Y.: TransMorph: Transformer for unsupervised medical image registration. Medical Image Analysis 82, 102615 (Nov 2022)
  • [6] Dalca, A.V., Balakrishnan, G., Guttag, J., Sabuncu, M.R.: Unsupervised learning of probabilistic diffeomorphic registration for images and surfaces. Medical Image Analysis 57, 226–236 (Oct 2019)
  • [7] Eklund, A., Lindquist, M.A., Villani, M.: A Bayesian heteroscedastic GLM with application to fMRI data with motion spikes. NeuroImage 155, 354–369 (Jul 2017)
  • [8] Hill, D.L.G., Batchelor, P.G., Holden, M., Hawkes, D.J.: Medical image registration
  • [9] Hoffmann, M., Billot, B., Greve, D.N., Iglesias, J.E., Fischl, B., Dalca, A.V.: SynthMorph: learning contrast-invariant registration without acquired images. IEEE Transactions on Medical Imaging 41(3), 543–558 (Mar 2022), arXiv:2004.10282 [cs, eess, q-bio]
  • [10] Hong, B.W., Koo, J.K., Burger, M., Soatto, S.: Adaptive Regularization of Some Inverse Problems in Image Analysis (May 2017), arXiv:1705.03350 [cs]
  • [11] Hoopes, A., Hoffmann, M., Fischl, B., Guttag, J., Dalca, A.V.: HyperMorph: Amortized Hyperparameter Learning for Image Registration (May 2021), arXiv:2101.01035 [cs, eess]
  • [12] Kendall, A., Gal, Y.: What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? (Oct 2017), arXiv:1703.04977 [cs]
  • [13] Klein, S., Staring, M., Murphy, K., Viergever, M., Pluim, J.: elastix: A Toolbox for Intensity-Based Medical Image Registration. IEEE Transactions on Medical Imaging 29(1), 196–205 (Jan 2010)
  • [14] Leclerc, S., Smistad, E., Pedrosa, J., Østvik, A., Cervenansky, F., Espinosa, F., Espeland, T., Berg, E.A.R., Jodoin, P.M., Grenier, T., et al.: Deep learning for segmentation using an open large-scale dataset in 2d echocardiography. IEEE transactions on medical imaging 38(9), 2198–2210 (2019)
  • [15] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. pp. 10012–10022 (2021)
  • [16] Ma, T., Dai, X., Zhang, S., Wen, Y.: Pivit: Large deformation image registration with pyramid-iterative vision transformer. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 602–612. Springer (2023)
  • [17] Monteiro, M., Le Folgoc, L., Coelho de Castro, D., Pawlowski, N., Marques, B., Kamnitsas, K., van der Wilk, M., Glocker, B.: Stochastic Segmentation Networks: Modelling Spatially Correlated Aleatoric Uncertainty. In: Advances in Neural Information Processing Systems. vol. 33, pp. 12756–12767. Curran Associates, Inc. (2020)
  • [18] Oliveira, F.P.M.: Medical Image Registration: a Review
  • [19] Ouzir, N., Ollila, E., Vorobyov, S.A.: Data-Adaptive Similarity Measures for B-mode Ultrasound Images Using Robust Noise Models. IEEE Journal of Selected Topics in Signal Processing 14(6), 1244–1254 (Oct 2020), conference Name: IEEE Journal of Selected Topics in Signal Processing
  • [20] Poggi, M., Aleotti, F., Tosi, F., Mattoccia, S.: On the Uncertainty of Self-Supervised Monocular Depth Estimation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3224–3234. IEEE, Seattle, WA, USA (Jun 2020)
  • [21] Rueckert, D., Sonoda, L., Hayes, C., Hill, D., Leach, M., Hawkes, D.: Nonrigid registration using free-form deformations: application to breast MR images. IEEE Transactions on Medical Imaging 18(8), 712–721 (Aug 1999)
  • [22] Seitzer, M., Tavakoli, A., Antic, D., Martius, G.: On the Pitfalls of Heteroscedastic Uncertainty Estimation with Probabilistic Neural Networks (Apr 2022), arXiv:2203.09168 [cs, stat]
  • [23] Shi, J., He, Y., Kong, Y., Coatrieux, J.L., Shu, H., Yang, G., Li, S.: XMorpher: Full Transformer for Deformable Medical Image Registration via Cross Attention (Jun 2022), arXiv:2206.07349 [cs]
  • [24] Ta, K., Ahn, S.S., Thorn, S.L., Stendahl, J.C., Zhang, X., Langdon, J., Staib, L.H., Sinusas, A.J., Duncan, J.S.: Multi-task learning for motion analysis and segmentation in 3d echocardiography. IEEE Transactions on Medical Imaging (2024)
  • [25] Truong, P., Danelljan, M., Van Gool, L., Timofte, R.: Learning Accurate Dense Correspondences and When to Trust Them (Apr 2021), arXiv:2101.01710 [cs]
  • [26] Wegmann, B., Eklund, A., Villani, M.: Bayesian heteroscedastic regression for diffusion tensor imaging. In: Modeling, Analysis, and Visualization of Anisotropy. pp. 257–282. Springer (2017)
  • [27] Wong, A., Fei, X., Hong, B.W., Soatto, S.: An Adaptive Framework for Learning Unsupervised Depth Completion. IEEE Robotics and Automation Letters 6(2), 3120–3127 (Apr 2021)
  • [28] Wong, A., Soatto, S.: Bilateral Cyclic Constraint and Adaptive Regularization for Unsupervised Monocular Depth Prediction. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5637–5646. IEEE, Long Beach, CA, USA (Jun 2019)
  • [29] Zhang, J., He, Q., Xiao, Y., Zheng, H., Wang, C., Luo, J.: Ultrasound image reconstruction from plane wave radio-frequency data by self-supervised deep neural network. Medical Image Analysis 70, 102018 (May 2021)
  • [30] Zhang, X., Dong, H., Gao, D., Zhao, X.: A comparative study for non-rigid image registration and rigid image registration. arXiv preprint arXiv:2001.03831 (2020)
  • [31] Zhang, X., Noga, M., Martin, D.G., Punithakumar, K.: Fully automated left atrium segmentation from anatomical cine long-axis mri sequences using deep convolutional neural network with unscented kalman filter. Medical image analysis 68, 101916 (2021)
  • [32] Zhang, X., You, C., Ahn, S., Zhuang, J., Staib, L., Duncan, J.: Learning Correspondences of Cardiac Motion from Images Using Biomechanics-Informed Modeling. In: Camara, O., Puyol-Antón, E., Qin, C., Sermesant, M., Suinesiaputra, A., Wang, S., Young, A. (eds.) Statistical Atlases and Computational Models of the Heart. Regular and CMRxMotion Challenge Papers. pp. 13–25. Lecture Notes in Computer Science, Springer Nature Switzerland, Cham (2022)
Dataset and implementation details
  • ACDC (2D MRI) [4]. We randomly selected 80 patients for training, 20 for validation, and 50 for testing. ED and ES image pairs are extracted from each sequence in a slice-by-slice manner from the longitudinal stacks. We center crop each slice pair to 128×128128128128\times 128128 × 128 w.r.t. myocardium centroid, yielding 751 image pairs for training, 200 for validation, and another 538 for testing.

  • CAMUS (2D Echo) [14]. We resize each image pair to size 128×128128128128\times 128128 × 128 and randomly select 300 subjects for training, 100 subjects for validation, and 100 subjects for testing. This yields in total 600 image pairs for training, 200 pairs for validation, and 200 pairs for testing.

  • Private 3D Echo. The private 3D echo dataset contains 99 cardiac ultrasound scans. ED and ES frames are manually identified and myocardium segmentation labels are provided for each sequence by experienced radiologists. Each 3D image is resized to 64×64×6464646464\times 64\times 6464 × 64 × 64. We randomly select 60 3D pairs for training, 19 pairs for validation, and another 20 pairs for testing.

  • Implementation details. All our experiments are conducted under the Pytorch framework and trained on NVIDIA V100/A5000 GPUs. The architecture of the variance estimator is implemented based on a U-Net. We use λ=0.01𝜆0.01\lambda=0.01italic_λ = 0.01 as the hyperparameter in Eq. 4. Both displacement and variance estimators are trained with learning rates 1×1041superscript1041\times 10^{-4}1 × 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT for 300 epochs.

Incorporating displacement uncertainty.

To further demonstrate the versatility, we conducted a direct extension by simultaneously estimating heteroscedastic displacement uncertainty with the isotropic assumption. We add an additional layer in the displacement estimator to predict σ^zsubscript^𝜎𝑧\hat{\sigma}_{z}over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT, where σ^z(x)subscript^𝜎𝑧𝑥\hat{\sigma}_{z}(x)\in\mathbb{R}over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ( italic_x ) ∈ blackboard_R and the original prediction as displacement mean μ^zsubscript^𝜇𝑧\hat{\mu}_{z}over^ start_ARG italic_μ end_ARG start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT. We train our proposed displacement estimator using objective:

θ=𝔼Ω[𝒯[(Ifσ^I)2γ][IfI^f]2+α(σ^z2logσ^z2)+λz^2]subscript𝜃subscript𝔼Ωdelimited-[]𝒯delimited-[]superscriptsubscript𝐼𝑓subscript^𝜎𝐼2𝛾superscriptdelimited-[]subscript𝐼𝑓subscript^𝐼𝑓2𝛼superscriptsubscript^𝜎𝑧2superscriptsubscript^𝜎𝑧2𝜆superscriptnorm^𝑧2\mathcal{L}_{\theta}=\mathbb{E}_{\Omega}\left[\mathcal{T}\left[\left(\frac{I_{% f}}{\lfloor\hat{\sigma}_{I}\rfloor}\right)^{2\gamma}\right][I_{f}-\hat{I}_{f}]% ^{2}+\alpha\left(\hat{\sigma}_{z}^{2}-\log\hat{\sigma}_{z}^{2}\right)+\lambda% \|\nabla\hat{z}\|^{2}\right]caligraphic_L start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT = blackboard_E start_POSTSUBSCRIPT roman_Ω end_POSTSUBSCRIPT [ caligraphic_T [ ( divide start_ARG italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT end_ARG start_ARG ⌊ over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT ⌋ end_ARG ) start_POSTSUPERSCRIPT 2 italic_γ end_POSTSUPERSCRIPT ] [ italic_I start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT - over^ start_ARG italic_I end_ARG start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_α ( over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - roman_log over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) + italic_λ ∥ ∇ over^ start_ARG italic_z end_ARG ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] derived from Eq. 1, with z^^𝑧\hat{z}over^ start_ARG italic_z end_ARG sampled from distribution z^𝒩(μ^z,σ^z2𝕀)similar-to^𝑧𝒩subscript^𝜇𝑧superscriptsubscript^𝜎𝑧2𝕀\hat{z}\sim\mathcal{N}(\hat{\mu}_{z},\hat{\sigma}_{z}^{2}\mathbb{I})over^ start_ARG italic_z end_ARG ∼ caligraphic_N ( over^ start_ARG italic_μ end_ARG start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT , over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_I ) during training with re-parameterization trick. We compare the quality of our predicted displacement along with its uncertainty estimate σ^z2superscriptsubscript^𝜎𝑧2\hat{\sigma}_{z}^{2}over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT with vxm-diff [6].

Refer to caption
Figure 4: Comparison of σ^z2superscriptsubscript^𝜎𝑧2\hat{\sigma}_{z}^{2}over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT between our vxm-based framework and vxm-diff [6].
Table 5: Our method raises the upper bound on registration accuracy while providing useful displacement uncertainty estimates σ^zsubscript^𝜎𝑧\hat{\sigma}_{z}over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT.

Uncertainty ACDC [4] CAMUS [14] σz2superscriptsubscript𝜎𝑧2\sigma_{z}^{2}italic_σ start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT σI2superscriptsubscript𝜎𝐼2\sigma_{I}^{2}italic_σ start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT DSC \uparrow HD \downarrow ASD \downarrow DSC \uparrow HD \downarrow ASD \downarrow Vxm [3] 80.20 4.64 1.24 81.76 8.93 1.70 Vxm-diff [6] 76.19 5.75 1.19 76.74 10.76 1.88 Ours 79.80 4.74 1.22 81.47 8.67 1.69 Ours 80.73 4.57 1.21 81.96 8.80 1.66 Ours 79.87 4.62 1.20 81.91 8.54 1.65

We present our quantitative results in Table 5, illustrating the superiority of our formulation. We further present the qualitative visualization as shown in Fig. 4, demonstrating that our estimated heteroscedastic uncertainty σ^z2superscriptsubscript^𝜎𝑧2\hat{\sigma}_{z}^{2}over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT accurately captures the randomness in the displacement prediction more accurately.

Additional private 3D Echo results.

We present our qualitative result in Fig. 5 for registration accuracy and left Fig. 6 for noise heteroscedastic variance evaluation. We also quantitatively evaluate the result by repeating the sparsification error plot similar to Section 5.3. We observe that our predicted σ^Isubscript^𝜎𝐼\hat{\sigma}_{I}over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT achieves a better error curve than β𝛽\betaitalic_β-NLL and NLL, which is consistent with our main results shown in Fig. 3.

Refer to caption
Figure 5: Qualitative evaluation for our private 3D Echo dataset on voxelmorph architecture. We extract cross-sectional slices from the 3D volume for visualization. We overlay ground truth segmentation in yellow for comparison.
Refer to caption Refer to caption
Figure 6: Left: Estimated σ^I2superscriptsubscript^𝜎𝐼2\hat{\sigma}_{I}^{2}over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT and the corresponding weighting map of our proposed framework under Voxelmorph architecture [3] using our private 3D Echo dataset. Right: Sparsification error plots of logσ^I2superscriptsubscript^𝜎𝐼2\log\hat{\sigma}_{I}^{2}roman_log over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT on our private 3D Echo dataset.
Failure case.

We present an example shown in Fig. 7 that all methods fail to match myocardium when it is considerably thin.

Refer to caption
Figure 7: An example of failure case on ACDC dataset.