
Authors

Jeya Maria Jose Valanarasu, Poojan Oza, Ilker Hacihaliloglu, Vishal M. Patel

Abstract

Over the past decade, deep convolutional neural networks have been widely adopted for medical image segmentation and shown to achieve adequate performance. However, due to inherent inductive biases present in convolutional architectures, they lack understanding of long-range dependencies in the image. Recently proposed transformer-based architectures that leverage the self-attention mechanism encode long-range dependencies and learn representations that are highly expressive. This motivates us to explore transformer-based solutions and study the feasibility of using transformer-based network architectures for medical image segmentation tasks. The majority of existing transformer-based network architectures proposed for vision applications require large-scale datasets to train properly. However, compared to the datasets for vision applications, in medical imaging the number of data samples is relatively low, making it difficult to efficiently train transformers for medical imaging applications. To this end, we propose a gated axial-attention model which extends the existing architectures by introducing an additional control mechanism in the self-attention module. Furthermore, to train the model effectively on medical images, we propose a Local-Global training strategy (LoGo) which further improves the performance. Specifically, we operate on the whole image and patches to learn global and local features, respectively. The proposed Medical Transformer (MedT) is evaluated on three different medical image segmentation datasets and it is shown that it achieves better performance than the convolutional and other related transformer-based architectures. Code will be made publicly available after the review process.
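To make the "additional control mechanism" more concrete, the following is a minimal sketch of gated attention along a single image axis. It is not the authors' implementation: the function name, the use of scalar gates (`g_q`, `g_k`, `g_v1`, `g_v2`) and their placement on the relative positional terms are assumptions based on the description of the gating in this page, and the paper should be consulted for the exact formulation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_axial_attention_1d(q, k, v, r_q, r_k, r_v, g_q, g_k, g_v1, g_v2):
    """Attention along one axis with gated positional terms (illustrative).

    q, k, v       : lists of N vectors (one per position on the axis)
    r_q, r_k, r_v : N x N tables of relative positional vectors, r[j][w]
    g_*           : scalar gates (assumed learnable in a real model)
    """
    n = len(q)
    dim = len(v[0])
    out = []
    for j in range(n):
        # Content term plus gated positional terms for query and key.
        logits = [
            dot(q[j], k[w])
            + g_q * dot(q[j], r_q[j][w])
            + g_k * dot(k[w], r_k[j][w])
            for w in range(n)
        ]
        weights = softmax(logits)
        # Weighted sum of gated values and gated positional values.
        y = [0.0] * dim
        for w in range(n):
            for d in range(dim):
                y[d] += weights[w] * (g_v1 * v[w][d] + g_v2 * r_v[j][w][d])
        out.append(y)
    return out
```

With the positional gates set to zero, the positional tables no longer influence the output and the function reduces to plain content-based attention, which is the intuition behind gating a positional bias that is hard to learn on small datasets.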

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87193-2_4

SharedIt: https://rdcu.be/cyhLw

Link to the code repository

https://github.com/jeya-maria-jose/Medical-Transformer

Link to the dataset(s)

N/A

Reviews

Review #1

Please describe the contribution of the paper

This paper proposes the Medical Transformer network for medical image segmentation. Specifically, it introduces a gating mechanism to better learn the positional encoding, which is useful for training transformer networks on smaller datasets, and makes use of a local-global training strategy. The authors show their method performs better than fully convolutional networks on 3 medical segmentation tasks. It also performs better than other transformer-based networks when the training dataset is relatively small.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- the need of large training datasets for transformer based models makes it difficult to be used for most medical segmentation tasks → the gating mechanism seems to help here
- there is not much work published yet for transformer networks in medical image segmentation
- the paper is well written and easy to follow
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- the authors put a lot of relevant information into the supplemental material – e.g. information about the datasets and description of the MedT architecture. Some of this information is crucial for understanding the paper and experiments.
- the experiment tables only provide mean values; no measures of variance or statistical significance tests are reported
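As an illustration of the kind of reporting requested in the last point, the sketch below computes mean and standard deviation of per-image scores and a paired significance test between two methods. The score lists are hypothetical placeholders, not results from the paper, and an exact sign-flip permutation test is used here instead of a t-test so that the example needs only the Python standard library.

```python
import itertools
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) of per-image scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_permutation_pvalue(a, b):
    """Two-sided exact sign-flip permutation test on paired scores.

    Enumerates all 2**n sign assignments, so it is only practical for
    small n; for larger test sets a random subset of flips is typical.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    count = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:
            count += 1
    return count / total


# Hypothetical per-image Dice scores for two methods on the same images.
method_a = [0.80, 0.82, 0.79, 0.85]
method_b = [0.78, 0.80, 0.78, 0.81]
mean_a, std_a = summarize(method_a)
p_value = paired_permutation_pvalue(method_a, method_b)
print(f"method A: {mean_a:.3f} +/- {std_a:.3f}, p = {p_value:.3f}")
```

With only four paired samples the smallest attainable two-sided p-value is 2/16, which shows why per-image scores on a reasonably sized test set are needed before a small mean improvement can be called significant.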
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
- the authors use publicly available datasets and plan to publish their code
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
- In Table 1 the results for MoNuSeg show no improvement of the gated axial attention network over the fully attention baseline. Do the authors have an explanation for this? Since the fully attention baseline performs worse than convolutional baselines I would guess this is due to small dataset size which is exactly what the gated attention is supposed to overcome.
- page 2, first paragraph: ‘…learning discriminating features’ probably means ‘discriminative features’
- Figure 1 caption: ‘…misclassified due to lack of learned long-range dependencies’ this is a hypothesis by the authors and should be marked as such
- page 2: last paragraph: ‘given a single pixel, the network needs to understand whether it is more close…’ in convolutional neural networks the network is never just given a single pixel
- page 2, last paragraph: ‘…prevent misclassifying a pixel as the mask leading to reduction of true negatives’ - I think you mean reduction of false negatives here
- page 3 second paragraph: ‘We observe that the transformer-based models work well….’ here the authors cite another paper, and even though they observe the same thing in their experiments later, this should be written as ‘It was observed..’ or similar
Please state your overall opinion of the paper

Probably accept (7)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Even though the authors put a lot of relevant information into the supplemental material, I believe this paper is interesting to the community and shows promising results which could be useful for training transformer networks on smaller medical datasets.
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

5
Reviewer confidence

Somewhat confident

Review #2

Please describe the contribution of the paper

The authors propose a new transformer model for medical image segmentation.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

(1) proposing a gated axial-attention model which extends the existing architectures by introducing an additional control mechanism in the self-attention module. (2) proposing a Local-Global training strategy (LoGo) which further improves the performance to train the model effectively on medical images.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

(1) Experiments are only on 2D images.

Most medical images are 3D, but the authors only run their experiments on 2D medical image segmentation tasks. Since the 3D format is one of the main differences between medical images (CT, MR, X-Ray, etc.) and natural images, the authors should include 3D medical image segmentation tasks to show that MedT outperforms other SOTA results.

(2) The LoGo strategy is not very transformer-like; it uses a UNet-style (encoder & decoder) design to remedy its performance on 2D images.

The popularity of transformers in computer vision is closely tied to their novel design, which crops a whole image into many patches (sequences) while preserving non-local correlations through the attention mechanism. However, Fig 2 suggests that MedT could not meet this requirement, and so it has to introduce a “Global Branch”, which uses an “encoder and decoder” framework on the whole image (128×128 or 512×512), to remedy the shortcomings of the “Local Branch”. The LoGo strategy thus seems like a makeshift compromise between a transformer model (like the “Local Branch”) and an encoder & decoder model.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

I think this paper should meet reproducibility requirements of MICCAI.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

(1) Run more experiments on 3D medical images, especially on well-known 3D medical segmentation datasets such as KiTS (https://kits19.grand-challenge.org/), LiTS (https://competitions.codalab.org/competitions/17094) and MSD (http://medicaldecathlon.com/). A comparison with the actual SOTA (i.e., nnUNet, https://github.com/MIC-DKFZ/nnUNet/) would be highly appreciated.

(2) Show that MedT can outperform most encoder & decoder models (such as FCN, UNet, etc.) with a transformer-only architecture, or show ablation results with and without the “Global Branch”.
Please state your overall opinion of the paper

probably reject (4)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

All experiments are on 2D images only, which is not typical for medical image segmentation tasks.
What is the ranking of this paper in your review stack?

2
Number of papers in your stack

5
Reviewer confidence

Confident but not absolutely certain

Review #3

Please describe the contribution of the paper

This paper introduces a Transformer based network for medical image segmentation called Medical Transformer which leverages 1) a gated axial attention 2) a two-branch structure where one of the branches focuses on the whole image the other one on patches. The proposed network is evaluated on three datasets and achieves better results than previous CNN/Axial-attention based methods.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

1) The introduced gated axial attention requires less attention computation which is beneficial for medical images where image sizes are larger than natural images but the datasets are smaller for training KQV attention.

2) The two-branch design could leverage both global and local information and combine them together.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

1) The authors mention that ‘the positional bias is difficult to learn and hence will not always be accurate in encoding long-range interactions’. From the paper’s current results, I could not tell whether traditional axial attention learns long-range interactions worse or not. For example, in Figure 3, both Axial Attn U-Net and MedT are able to capture what the authors call the ‘long-range dependency’ (the pixels in the red box), which suggests to me that this long-range dependency is captured by the axial attention rather than by the gating mechanism.

2) Regarding the same statement, ‘the positional bias is difficult to learn and hence will not always be accurate in encoding long-range interactions’: since the authors aim to improve the positional encoding, it is necessary to compare with fixed positional encodings, e.g. sine/cosine encoding.

3) When comparing with related works, the authors mention that ‘However, the encoder and decoder of these networks still have convolutional layers as the main building blocks.’ Can the authors describe in more detail how they extract feature maps from the input image? Is it by directly slicing the image into patches and then flattening them?
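For reference, the fixed sine/cosine encoding mentioned in point 2) can be sketched as follows. This is the standard sinusoidal formulation from the original transformer literature, not something taken from the paper under review; the function name and arguments are illustrative.

```python
import math


def sinusoidal_encoding(num_positions, dim):
    """Fixed sinusoidal positional encoding table.

    Row `pos` holds sin/cos pairs with geometrically spaced wavelengths:
    entry 2i is sin(pos / 10000**(2i/dim)), entry 2i+1 is the cosine.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim // 2):
            angle = pos / (10000 ** (2 * i / dim))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


# One encoding vector per position along an image axis of length 8.
encoding = sinusoidal_encoding(num_positions=8, dim=4)
```

Unlike a learned relative positional bias, this table has no trainable parameters, which is why it is a natural baseline when the concern is that positional terms are hard to learn from little data.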
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

No code is submitted. The authors mention that the code will be made publicly available after review process.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

Please see above comments. Besides, it seems inappropriate to call the LoGo a training strategy. It is more like a network structure design.
Please state your overall opinion of the paper

Probably accept (7)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

1) the introduction of transformer in medical image segmentation is novel.

2) the introduced gated mechanism improves segmentation results.

3) however, I believe the paper can be polished further by adding more ablation studies, e.g. comparing with sine/cosine position encodings and with standard full KQV transformers, to support this paper’s claims on solving the ‘long-range dependency’ and ‘not sufficient medical data to train’ problems.
What is the ranking of this paper in your review stack?

3
Number of papers in your stack

5
Reviewer confidence

Very confident

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

This work attempts to introduce the recently widely studied transformer concept for medical image segmentation. Reviewers agree that this is novel, however there are concerns regarding (1) the influence of the global encoder decoder network at the end, which seems to be required, and a lack of ablation studies investigating this influence. Comparisons to standard encoder decoder based networks show very similar results in the benchmarked datasets. (2) lack of statistical evaluation of results to see if the seemingly small improvements are significant. (3) lack of experiments on 3D benchmark datasets or discussion/limitation section explaining why these were omitted. (4) the paper relying too heavily on the supplemental material to show aspects. Note that the MICCAI 2021 submission guidelines state that Supplementary Material may only contain images, videos, tables or proofs, but not textual components needed to understand the work.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

8

Author Feedback

We sincerely thank reviewers for their valuable feedback. In what follows we provide clarification to points raised by the reviewers.

Influence of the global transformer branch (Meta-reviewer, R2):

Contrary to R2’s comment that “transformers are relevant only because they work on patches”, many transformer models like [ref 7,24] work on the full scale of the image. The major benefit of transformer architectures is their ability to encode long-range dependencies in the input. Transformer designs that work on patches do so to reduce computational complexity. We leverage both types of architecture by having a shallow global branch and a deeper patch-based local branch. The motivation behind using the global branch is to learn long-range dependencies across the full image scale, while the local branch follows the patch processing used in the works referred to by R2. Although the global branch follows an encoder-decoder pipeline, the architecture design is significantly different from networks like U-Net, as it is fully modeled using the proposed novel gated axial-attention blocks. These details are explained in Sec 2.3. We will update it to emphasize this point further. Also, please note that the influence of the global branch becomes very clear from the ablation study provided in the supplementary material (Table 1), where the addition of the global branch (LoGo baseline) improves the performance of the local-only baseline by 10%.
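The two-branch idea described above can be sketched roughly as follows. This is an illustrative outline only: the grid size, the helper names, and the element-wise sum used to combine the two branch outputs are assumptions, and `global_branch` and `local_branch` are hypothetical callables standing in for the actual networks.

```python
def split_into_patches(image, grid):
    """Split a 2D list (H x W) into grid x grid patches, row-major order."""
    h, w = len(image), len(image[0])
    if h % grid or w % grid:
        raise ValueError("image size must be divisible by grid")
    ph, pw = h // grid, w // grid
    patches = []
    for gy in range(grid):
        for gx in range(grid):
            rows = image[gy * ph:(gy + 1) * ph]
            patches.append([row[gx * pw:(gx + 1) * pw] for row in rows])
    return patches


def merge_patches(patches, grid):
    """Inverse of split_into_patches: place patches back by location."""
    ph = len(patches[0])
    rows = []
    for gy in range(grid):
        for y in range(ph):
            row = []
            for gx in range(grid):
                row.extend(patches[gy * grid + gx][y])
            rows.append(row)
    return rows


def logo_forward(image, global_branch, local_branch, grid=4):
    """Run the global branch on the full image and the local branch per patch.

    Both branches are assumed to return a map of the same size as their
    input; the outputs are combined by element-wise sum (an assumption).
    """
    global_map = global_branch(image)
    local_patches = [local_branch(p) for p in split_into_patches(image, grid)]
    local_map = merge_patches(local_patches, grid)
    return [
        [g + l for g, l in zip(g_row, l_row)]
        for g_row, l_row in zip(global_map, local_map)
    ]
```

The split keeps attention in the local branch restricted to each patch, while the global branch sees the whole image, which matches the motivation stated in the rebuttal.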

Lack of Ablation studies (Meta-reviewer, R2): Due to lack of space, the ablation studies were added to the supplementary material. Please refer to Table 1 in supplementary where we conduct an ablation study using only Global branch and Local Branch which addresses R2’s concern on the importance of the global branch.

Significance of performance improvement (Meta-reviewer, R2): We would again like to correct R2’s misunderstanding of the global branch as a simple “encoder-decoder” model. The global branch is a fully transformer-based model designed using the proposed gated axial-attention model. Both branches of MedT are transformer blocks without any convolutions, as shown in Fig 2. We will update the caption to make this point clearer. R2’s misunderstanding leads to the incorrect critique that the proposed “transformer architecture” does not outperform the FCN and UNet baselines. Contrary to that statement, the “transformer architecture”, i.e., MedT, does give a significant improvement of 6.05 %, 14.41 %, 50.71 % over FCN and 3.14 %, 3.24 %, 0.12 % over U-Net across all 3 datasets, respectively.

Lack of experiments on 3D datasets (Meta-Reviewer, R2): R2’s suggestion to explore the benefits of the MedT architecture on 3D medical segmentation applications (CT, MR, etc.) is indeed interesting. However, due to lack of space, in this paper we focus on 2D medical segmentation and thoroughly evaluate MedT on multiple applications such as microscopic and ultrasound segmentation. Also note that we evaluate MedT across 3 different datasets, where we obtain a significant performance boost over baseline methods.

Details in supplementary Material (Meta-Reviewer, R1): We will make sure to add the tables of ablation studies and comparison of parameters currently found in the supplementary material to the main paper in our final version. We will not need extra space for this as we plan to append these details to Table 1 of the paper. As we will provide the link to the code in the final version, other details found in the supplementary material will be substituted by the code.

Gated vs Axial attention (R3): Please refer to the 1st, 3rd and 4th rows of Fig 3, which show how gated axial attention performs better at learning long-range dependencies than a network based on normal axial attention.

Comparison with fixed positional embedding (R3): We see from [ref 24] that using learnable positional embedding works better than having fixed embedding for segmentation tasks.

Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This work attempts to translate the transformer type ideas to medical image analysis, which is original and of merit as R1 and R3 indicate. While there are still challenges in this transition, as indicated by the necessity of an encoder decoder structure in the network, nevertheless the idea is worth being discussed at MICCAI. The rebuttal convincingly clarifies some misunderstandings that R2 may have had. This meta reviewer therefore votes in favor of this paper.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

7

Meta-review #2

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The authors have addressed the reviewers’ and meta-reviewer’s comments, particularly those related to additional ablation studies and the use of another (3D) dataset. A major concern, however, is that the paper and the empirical validation are not self-contained. Based on this (the reviewers’ comments and the authors’ rebuttal) I would recommend acceptance. Nevertheless, I would suggest stressing the differences between this work and [24], as the technical contribution with respect to [24] is marginal. Furthermore, I believe that the baselines used in Table 1 are weak baselines, as there exist more recent attention-based methods that outperform the approaches listed.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

4

Meta-review #3

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This paper applies the recently popular transformer to medical image segmentation. However, due to the page limit, many important experiments and results are not included in the main body. Although the authors provide some information about these issues, this information will be very hard to include in the final version. Moreover, statistical significance tests, such as a t-test, should be added to verify the effectiveness of the proposed method.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Reject
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

16

Medical Transformer: Gated Axial-Attention for Medical Image Segmentation