
Authors

Yungeng Zhang, Yuru Pei, Hongbin Zha

Abstract

Diffeomorphic registration is widely used in medical image processing owing to its invertible, one-to-one mapping between images. Recent progress in diffeomorphic registration has been made by utilizing convolutional neural networks for efficient, end-to-end inference of registration fields from an image pair. However, existing deep learning-based registration models neglect attention mechanisms for handling long-range cross-image relevance in embedding learning, which limits their ability to identify semantically meaningful correspondences of anatomical structures. In this paper, we propose a novel dual transformer network (DTN) for diffeomorphic registration, consisting of a learnable volumetric embedding module, a dual cross-image relevance learning module for feature enhancement, and a registration field inference module. The self-attention mechanisms of the DTN explicitly model both the inter- and intra-image relevance in the embeddings of both the separate and the concatenated volumetric images, facilitating semantic correspondence of anatomical structures in diffeomorphic registration. Extensive quantitative and qualitative evaluations demonstrate that the DTN performs favorably against state-of-the-art methods.
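As a concrete illustration of the dual relevance idea described in the abstract, here is a minimal PyTorch sketch: one self-attention branch over each image's embedding (intra-image dependencies) and one over the concatenated pair (inter-image relevance), with the enhanced features fused for a downstream registration field head. This is not the authors' implementation; the module names, dimensions, shared weights, and fusion step are all assumptions.

```python
# Hypothetical sketch of dual intra-/inter-image self-attention; not the DTN code.
import torch
import torch.nn as nn

class DualRelevance(nn.Module):
    """Self-attention over each image's embedding separately (intra-image)
    and over the concatenated pair embedding (inter-image)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(4 * dim, dim)

    def forward(self, f_mov, f_fix):
        # f_mov, f_fix: (B, N, C) flattened coarse volumetric embeddings
        a_mov, _ = self.intra(f_mov, f_mov, f_mov)   # intra-image relevance
        a_fix, _ = self.intra(f_fix, f_fix, f_fix)
        pair = torch.cat([f_mov, f_fix], dim=1)      # concatenate along tokens
        a_pair, _ = self.inter(pair, pair, pair)     # inter-image relevance
        a_mov_x, a_fix_x = a_pair.chunk(2, dim=1)
        fused = torch.cat([a_mov, a_fix, a_mov_x, a_fix_x], dim=-1)
        return self.fuse(fused)                      # enhanced features

mod = DualRelevance(dim=64)
out = mod(torch.randn(2, 216, 64), torch.randn(2, 216, 64))  # 216 = 6^3 tokens
print(out.shape)  # torch.Size([2, 216, 64])
```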

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87202-1_13

SharedIt: https://rdcu.be/cyhPU

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a novel dual transformer network architecture for image registration. The aim of the dual transformer network is to better capture and model inter- and intra-image relevance. The approach focuses on mono-modal registration and is evaluated on a brain MR dataset (OASIS) which includes 450 subjects.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The idea of using a transformer approach for embedding the images and to capture intra- and inter-image dependencies is novel as far as I can see.
    • The transformer approach is combined with other SOTA registration components (SVF deformation model, etc.) to yield a convincing registration technique with quite reasonable performance
    • The paper is clearly written and easy to follow
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The evaluation on inter-subject brain image registration using the Dice metric as the primary evaluation metric is quite limiting
    • The evaluation doesn’t demonstrate any clinical usefulness of the proposed method.
    • I am not sure why the study of the regularization on the velocity fields is useful in this context. The regularization and deformation model here are not new, so it is not clear what to conclude from this.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reasonable reproducibility

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    The idea of using a transformer approach for embedding the images and capturing intra- and inter-image dependencies is novel, but it would be good if the authors could demonstrate more concretely how the transformer model helps with image registration; in particular, which registration problems can be solved better using a transformer approach.

    It would also be good if the authors could demonstrate the clinical value of their approach. Evaluating registration using Dice metrics in the context of inter-subject brain registration is quite weak; in the age of deep learning, the use of registration for atlas-based segmentation is not that common.

    Furthermore, the segmentations are produced automatically using FreeSurfer, so a comparison of Dice metrics is very questionable.

  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is an interesting novel architecture for DL-based image registration. While the clinical application and evaluation are not the strong point of the paper, the method is interesting for the MICCAI community

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    The authors propose to use the transformer idea, originating from natural language processing, for the task of image registration. This is a highly novel technique with promising results and, as far as this reviewer is aware, it has never been used for image registration. The method has been evaluated on the OASIS dataset (425 scans) and compared with a number of other methods, with a favorable outcome.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Novelty: first application of transformers to image registration

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The transformers seem to be applied only at the bottom of the U-Net, at the hidden representation. At this point all feature maps have been downsampled four times, i.e., by a factor of 16. This may limit the contribution of the transformer and requires some discussion.
    • The authors should clarify whether a crucial experiment is missing from the paper: a registration net without any transformer components, to assess the contribution of the transformers at all.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • open data
    • no reference to the source code in the paper
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • A gentle conceptual introduction to the method section would help the reader understand the paper more quickly. The introduction, on the other hand, is quite long, so some space could be reallocated to the method section. For example, it would be helpful if Fig. 1 made clearer where the data are still represented as a volume and where they are vectorized. I found Fig. 1 in the supplement clearer in this respect.
    • The bottom encoder c in Fig. 1 learns cross-image feature relevance by concatenation, at least before registration. Would it be helpful to concatenate after applying the spatial transformation, i.e., by feature warping, to learn more meaningful cross-image relevance? Or will the transformer module T_c right after that encoder handle this, since it is able to model long-range relationships? Please discuss.
    • Would it be helpful to enable transformer modules in the skip connections as well? Currently, the transformer components are added only at the bottom of the U-Net, where only coarse information is available.
    • It would be best to add statistical tests to the results; this can be done without much space cost.
    • It would add to the value of the experiments if not only Dice is reported but also surface distances. Dice has its issues as I am sure the authors are aware.
    • VM -> VoxelMorph would be clearer
    • To establish the contribution of the transformer components, I would expect an experiment in which these components are left out completely, resulting in a plain U-Net structure. Is the VM_diff experiment exactly that, or is this experiment missing?
  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    good novelty in the paper. evaluation on large open dataset.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    4

  • Reviewer confidence

    Very confident



Review #3

  • Please describe the contribution of the paper

    This paper proposes a dual transformer network for diffeomorphic registration, consisting of a learnable volumetric embedding module, a dual cross-image relevance learning module, and a registration field inference module. The method is evaluated with brain MRI scans of the OASIS dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main novelty is in applying encoder-decoder transformers as embedding enhancements in learning-based registration.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • This paper looks like an incremental extension of the existing registration pipeline with the well-known concept of transformers (see also [18], Fast Symmetric Diffeomorphic Image Registration with Convolutional Neural Networks).
    • The experiments were on a single dataset with a single training/test split. It is unclear to me how the hyperparameters were chosen without a validation set.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    This paper should have good reproducibility, given the additional information provided in the supplementary material.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • Enhance the comparisons with existing methods, and explore other ways of linearizing the spatial information for the transformer.
    • Run additional experiments with other datasets.
    • Make the results more comparable by running all baselines with the same train/val data split, and add statistical tests.
  • Please state your overall opinion of the paper

    borderline reject (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper has limited novelty with a single dataset used to validate the method.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    3

  • Reviewer confidence

    Confident but not absolutely certain




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The reviewers appreciate the idea of using transformers, an obvious and interesting aspect to try in registration. Overall, the reviewers (and I) liked the paper, but still had some concerns.

    Importantly, clarification is needed on how the transformer affects registration. Both supportive reviewers emphasize this, and an ablation study might help. I also think this is crucial when the contribution of the paper is the architecture: it is important to understand which part of it helps, and where the insight lies. Should we all use dual transformers, or were the authors simply very good at engineering the network? Adding statistical significance tests would strengthen the results, and going beyond Dice as an evaluation metric would also be important in the future.

    Finally, reviewer 3 has valid and important concerns regarding the experiments. While the authors are not required to run new experiments for the rebuttal, explaining the validation split and the hyperparameter choices for all methods is an important part of understanding the significance of the method.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    2




Author Feedback

We thank the reviewers for their efforts in reviewing our paper and for their constructive comments. The reviewers recognize our contribution of exploiting self-attention-based global and cross-image relevance for volumetric diffeomorphic registration with semantically meaningful correspondence.

Effects of the dual transformer. The proposed dual transformer network addresses relevance modeling and feature enhancement on two kinds of image embedding to obtain semantically meaningful correspondences of anatomical structures. We conducted an ablation study to analyze the proposed network (Table 1). The full DTN with the dual transformer architecture outperformed the single-transformer variants: DTN_s, with only T_s modeling dependencies within the separate image embeddings, and DTN_c, with only T_c modeling relevance in the concatenated image embedding. We also compared against transformer-free deep diffeomorphic registration models, including VMdif [9] and SYMNet [18]. As shown in Fig. 1, the encoder-decoder architecture with \phi_c and \phi_r reduces to VMdif [9] when the dual transformer is removed, as recognized by Reviewer #2. The attention-based global and cross-image relevance enhanced the volumetric embedding and improved registration accuracy compared with the transformer-free models (Table 1).

We designed the dual architecture to identify semantically meaningful correspondences of anatomical structures. We note that the U-Net Transformer [s1] places transformer modules in the skip connections for segmentation, and we thank Reviewer #2 for raising this point. Multiple transformers are a promising way to further enhance features, though they enlarge the memory and computational cost; we implemented self-attention learning on the voxel sequence at the bottom of the U-Net. We agree that more extensive experiments with metrics beyond Dice would be important future work.

We further conducted statistical significance tests. The improvements in average DSC were statistically significant (t-test, p < 0.05). The proposed DTN outperformed the compared transformer-free models with p-values below 5e-15 (VM [4]), 5e-75 (VMdif [9]), and 5e-20 (SYMNet [18]). The transformer branches are complementary: the DTN with dual transformers outperformed DTN_s and DTN_c with p-values below 1e-6 and 1e-4, respectively.
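As a concrete illustration of the significance test described above, the sketch below runs a paired t-test on per-test-pair Dice scores of two methods. The pairing, variable names, and synthetic scores are assumptions for illustration; the rebuttal states only that a t-test on the average DSC was performed.

```python
# Illustrative paired t-test on per-pair DSC values; data are synthetic stand-ins.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
dice_dtn = rng.normal(0.78, 0.02, size=150)               # hypothetical DTN scores
dice_vmdif = dice_dtn - rng.normal(0.02, 0.01, size=150)  # hypothetical baseline

t, p = stats.ttest_rel(dice_dtn, dice_vmdif)  # paired over the same test pairs
print(f"t = {t:.2f}, p = {p:.2e}")            # p < 0.05 -> significant improvement
```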

Responses to Reviewer #3: Dataset. We used the publicly available OASIS dataset with 425 scans [16]. We performed the standard preprocessing for affine alignment and brain structure extraction using FreeSurfer [11], as in [18]. The OASIS dataset provides visually inspected brain segmentations, which we treat as the ground truth in the evaluation. The dataset was split into 256 scans for training, 19 for validation, and 150 for testing.
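For reference, 256 + 19 + 150 accounts for all 425 scans. A hypothetical sketch of such a split follows; the shuffling and seed are assumptions, not the authors' protocol.

```python
# Illustrative 256/19/150 split of 425 scan indices (shuffle and seed assumed).
import numpy as np

rng = np.random.default_rng(42)
idx = rng.permutation(425)
train, val, test = idx[:256], idx[256:275], idx[275:]
assert len(train) == 256 and len(val) == 19 and len(test) == 150
```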

Experimental settings of the compared methods. The proposed model was compared with deep-learning-based registration models, including VM [4], the diffeomorphic variant VMdif [9], and SYMNet [18], using the publicly available implementations provided by the authors with the suggested parameter settings:

• VM [4]: MSE similarity loss, smoothness regularization weight 0.01, learning rate 1e-4.
• VMdif [9]: MSE similarity loss, smoothness regularization weight 0.01, learning rate 1e-4.
• SYMNet [18]: NCC similarity loss, orientation consistency weight 1000, smoothness regularization weight 10, magnitude weight 0.001, learning rate 1e-4.

We used the same training, validation, and test splits for all compared methods.
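For reference, the stated baseline settings can be collected into a single configuration, as sketched below. The keys are illustrative; the actual flag names depend on each method's public implementation.

```python
# Stated baseline hyperparameters, gathered for reference; key names are assumed.
BASELINE_CONFIGS = {
    "VM":     {"similarity": "MSE", "smooth_weight": 0.01, "lr": 1e-4},
    "VMdif":  {"similarity": "MSE", "smooth_weight": 0.01, "lr": 1e-4},
    "SYMNet": {"similarity": "NCC", "orientation_weight": 1000,
               "smooth_weight": 10, "magnitude_weight": 0.001, "lr": 1e-4},
}
```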

[s1] Petit, O., Thome, N., Rambour, C., Soler, L.: U-Net Transformer: Self and Cross Attention for Medical Image Segmentation. arXiv:2103.06104 (2021).




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Overall, the idea of using dual transformers was appreciated by everyone (reviewers and myself); it is an interesting direction to explore. There were several limitations to the submission, especially around isolating the key insight: what is the source of the results? Is it really the dual transformers, or other details of the architecture? The authors address this in the rebuttal with an ablation study, which is a good first step. I believe a more thorough analysis is required, but perhaps that is more appropriate for a journal submission.

    I believe the paper should be accepted. However, everything discussed in the rebuttal has to appear in the camera-ready version (with appropriate care given to putting these answers into context). The important thing is to focus on understanding where the insight/improvement comes from, and on where and when this is useful in registration.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    3



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors did not use similarity measures other than DSC, but this is unlikely to matter for diffeomorphic registration, since changes of topology should not be an issue. The authors explain that the ground-truth labels were supplied with the OASIS dataset and visually inspected, so concerns about the algorithm merely matching what FreeSurfer does are partially alleviated. The train/validation/test split has been explained and seems consistent. Reviewer 3's concern about novelty, namely that transformers have been used previously, seems to stem from a confusion with spatial transformers rather than transformers per se.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    2



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This work proposes to use transformers integrated into a U-Net architecture for registration. The approach is based on a stationary velocity field parameterization and an L1 similarity measure. Various comparisons to other approaches are provided (including SyN, VoxelMorph, and SYMNet). The proposed approach is evaluated on OASIS for atlas-to-image registration based on segmentation overlap (Dice) from FreeSurfer segmentations. Slightly improved results are shown for the proposed approach compared to the other methods. An ablation study also shows that the dual transformer strategy yields small improvements over using a single transformer. There were some reviewer concerns regarding the evaluation, in particular the comparison to the other approaches (with respect to train/test splits, hyperparameter tuning, etc.), and some concerns regarding the insights obtained. Some of these concerns were addressed in the rebuttal, but others remained. Specifically, given the marginal accuracy differences between the different transformer variants, it is not clear how much the transformer approach really helped. The closest method (as acknowledged in the rebuttal) is VM_diff; however, it was not tested with the same similarity measure (MSE versus L1), and its parameters do not appear to have been tuned specifically for this dataset. There is also no evaluation with respect to manual segmentations or landmarks; the evaluation appears to be based purely on FreeSurfer. Hence, while the method may have merit (as highlighted by the reviewers), it would benefit from an improved experimental design to increase confidence in the presented registration quality differences.
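    For readers unfamiliar with the stationary velocity field (SVF) parameterization mentioned above, the following background sketch (not from the paper or this review) integrates an SVF into a displacement field via scaling and squaring, the standard construction in VMdif-style diffeomorphic models. It is 2D NumPy for brevity, with simplified boundary handling.

```python
# Background sketch: SVF integration by scaling and squaring (2D, simplified).
import numpy as np
from scipy.ndimage import map_coordinates

def integrate_svf(v, steps=7):
    """v: (2, H, W) stationary velocity field. Returns a displacement field
    approximating exp(v) after `steps` squaring iterations."""
    phi = v / (2 ** steps)                       # scale down to a small step
    grid = np.mgrid[0:v.shape[1], 0:v.shape[2]].astype(float)
    for _ in range(steps):                       # square: phi <- phi o phi
        coords = grid + phi                      # sample phi at warped locations
        phi = phi + np.stack([
            map_coordinates(phi[c], coords, order=1, mode="nearest")
            for c in range(2)
        ])
    return phi

phi = integrate_svf(0.5 * np.random.randn(2, 32, 32))
print(phi.shape)  # (2, 32, 32)
```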

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    10


