
Authors

Yunhe Gao, Mu Zhou, Dimitris N. Metaxas

Abstract

Transformer architecture has emerged to be successful in a number of natural language processing tasks. However, its applications to medical vision remain largely unexplored. In this study, we present UTNet, a simple yet powerful hybrid Transformer architecture that integrates self-attention into a convolutional neural network for enhancing medical image segmentation. UTNet applies self-attention modules in both encoder and decoder for capturing long-range dependency at different scales with minimal overhead. To this end, we propose an efficient self-attention mechanism along with relative position encoding that reduces the complexity of the self-attention operation significantly from $O(n^2)$ to approximately $O(n)$. A new self-attention decoder is also proposed to recover fine-grained details from the skipped connections in the encoder. Our approach addresses the dilemma that Transformer requires huge amounts of data to learn vision inductive bias. Our hybrid layer design allows the initialization of Transformer into convolutional networks without the need for pre-training. We have evaluated UTNet on the multi-label, multi-vendor cardiac magnetic resonance imaging cohort. UTNet demonstrates superior segmentation performance and robustness against the state-of-the-art approaches, holding the promise to generalize well on other medical image segmentation tasks.

SharedIt: https://rdcu.be/cyl3I

Reviews

Review #1

• Please describe the contribution of the paper

This paper presents a hybrid CNN-Transformer architecture for medical image segmentation. The contributions include architecture design and a spatial reduction technique to improve efficiency.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The hybrid Transformer architecture is technically sound.
2. The spatial dimension reduction can effectively reduce the computation cost, which is important in the field of medical image segmentation.
3. Experimental results are solid and the ablation studies are sufficient.
• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. Although the spatial reduction is effective and technically sound, I don’t think the computational complexity is reduced to O(n), which is a major flaw of the description. It is correct that, due to this design, the complexity is reduced from O(n^2 d) to O(nkd). However, k is still proportional to n (1/64 of n, as stated in the paper). For example, if the input image is twice as large in width and height, the total computation still increases by a factor of 16 = (2*2)^2, so the complexity is still O(n^2). In my view this flaw should be fixed, as it strongly affects the credibility of the overall description.
2. The experimental results show promising improvement when using 2D images. However, since this approach can be easily applied to 3D, 3D segmentation should also be tried, because many papers have shown that 3D networks are more effective than 2D networks on 3D MRI/CT scans.
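
The scaling argument in point 1 can be checked with a small calculation. The sketch below is illustrative only: it counts multiply-adds for the two matrix products in one attention layer (roughly 2·n·k·d), and the function names, the channel width d = 64, and the fixed-k value of 64 are assumptions made for the example, not values taken from the paper.

```python
def attention_cost(n, k, d):
    """Approximate multiply-adds for Q K^T (n*k*d) plus attention-weighted V (n*k*d)."""
    return 2 * n * k * d


def cost_proportional_k(height, width, d=64, ratio=64):
    """Keys/values reduced to k = n / ratio, so k grows with the input size."""
    n = height * width
    return attention_cost(n, n // ratio, d)


def cost_fixed_k(height, width, d=64, k=64):
    """Keys/values reduced to a constant k, independent of the input size."""
    return attention_cost(height * width, k, d)


base, doubled = (64, 64), (128, 128)

# k proportional to n: doubling width and height multiplies the cost by 16 (quadratic in n).
print(cost_proportional_k(*doubled) / cost_proportional_k(*base))  # 16.0

# k held constant: the same doubling multiplies the cost by 4 (linear in n).
print(cost_fixed_k(*doubled) / cost_fixed_k(*base))  # 4.0
```

Under these assumptions, the reduction only yields linear growth in n when k is held fixed as the input grows; with k tied to n by a constant ratio, the constant factor improves but the growth rate stays quadratic.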
• Please rate the clarity and organization of this paper

Very Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Good reproducibility.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

The overall quality of this paper is good. Please refer to the weaknesses listed above. To summarize, the time-complexity claim is inaccurate, and I recommend that the authors add 3D experiments.

borderline accept (6)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This paper provides a reasonably designed hybrid Transformer for medical image segmentation. The paper is generally of good quality. The major flaw lies in the overall justification of the time complexity, which I strongly recommend the authors fix.

• What is the ranking of this paper in your review stack?

3

• Number of papers in your stack

8

• Reviewer confidence

Very confident

Review #2

• Please describe the contribution of the paper

In this paper, the authors introduce the Transformer technique into the medical vision domain and propose a hybrid Transformer architecture for enhancing medical image segmentation. Sufficient experiments demonstrate its robustness and superior segmentation performance.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

This paper is overall well-organized and clear enough for readers. The idea of hybrid transformer strategy for solving medical image segmentation is simple and novel to me. Definitions and formulas are clear and logical, with detailed explanations.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

– I would prefer more discussion of the computational complexity of the proposed method.
– Fig. 3 is too small; it is not easy to check the numbers and curves, and the visual comparison in Fig. 4 is not clear enough.
– Some related comparable results may further show that the method is efficient yet effective.

• Please rate the clarity and organization of this paper

Very Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Please make the code and at least a few samples of your dataset available for the research interest.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

See Section 4

Probably accept (7)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

See Section 3

• What is the ranking of this paper in your review stack?

1

• Number of papers in your stack

3

• Reviewer confidence

Very confident

Review #3

• Please describe the contribution of the paper

The paper proposes a self-attention mechanism that fuses the standard convolutional architecture with a modified Transformer architecture that reduces the complexity by means of a subsampling operation. The proposed method achieves state-of-the-art performance in the recent “multi-label, multi-vendor cardiac magnetic resonance imaging (MRI) challenge cohort”.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The method achieves state-of-the-art results that match those of the winners of the challenge. It is a simple idea that might be adapted for other tasks in biomedical image processing.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Lack of novelty. Self-attention mechanisms are well known for image segmentation and biomedical image segmentation, including those that mix standard convolutional operations and self-attention [A][B]. In my opinion, the only contribution of this paper is to show that the subsampling strategy is compatible with the self-attention mechanism.

Unsubstantiated claims. The authors claim that the proposed subsampling operation results in an efficient implementation of self-attention. Table 1 shows that UTNet is the second largest network by parameter count and the second slowest network by inference time. Moreover, UTNet should be faster than alternatives that use standard attention but [4] uses dual attention and is about 22% faster than UTNet.

Finally, the performance increase seems very marginal when compared to the ResUnet baseline. On average the authors show an improvement of 1.2% over the baseline, but UTNet also contains 2% more parameters.

[A] Sinha, Ashish, and Jose Dolz. “Multi-scale self-guided attention for medical image segmentation.” IEEE Journal of Biomedical and Health Informatics (2020).
[B] Wu, Yan, et al. “Self-attention convolutional neural network for improved MR image reconstruction.” Information Sciences 490 (2019): 317-328.

• Please rate the clarity and organization of this paper

Satisfactory

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

While the proposed idea is simple, the implementation of Transformers is highly sensitive to hyperparameters and weight initialization. Without the code (or at the very least in-depth details of the implementation) it will be hard to reproduce the reported results.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

There are multiple typos and grammar errors in the text; it must be thoroughly revised. Additionally, there are some redundant sentences that make it hard to understand some of the detailed descriptions of the method.

Overall, the paper needs to be improved to highlight the effectiveness of the proposed method. While the results are highly competitive with the state-of-the-art, it is not clear whether the improvement should be attributed to the extra parameters included in the attention module. The claims regarding efficiency must be toned down, as the method is clearly slow and contains a large number of parameters.

probably reject (4)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

While the paper shows competitive results with the state-of-the-art, its contribution is not clearly validated. I cannot assert that the improvement in the results is directly attributable to the proposed model or, in particular, to the proposed subsampling.

• What is the ranking of this paper in your review stack?

3

• Number of papers in your stack

5

• Reviewer confidence

Confident but not absolutely certain

Primary Meta-Review

• Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

This paper has received mixed scores. R1 and R2 believe that the proposed method is novel, whereas R3 highlights that the technical contribution is incremental. In particular, R3 states that self-attention mechanisms are known in medical image segmentation (R3 gives very relevant references not included in the paper - neither in the discussion nor in the comparison) and argues that the only contribution is to show the compatibility of subsampling with self-attention mechanisms. R1 and R3 also have major concerns regarding the claims about the time complexity of the proposed approach. Furthermore, R3 raises an important question regarding the source of the improvement of the proposed method, pointing to the lack of a proper ablation study in the experimental section. R2 and R3 stressed that this work is hard to reproduce, as no code is given and in-depth implementation details are largely missing. Last, this AC wonders why the authors did not compare to participants in the M&Ms challenge, as the dataset employed in this work was provided in this MICCAI satellite event. Please highlight the response to these issues in your rebuttal.

• What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

7

Author Feedback

We appreciate all valuable comments from the three reviewers (R1 to R3) and the meta review. The following is our response to the concerns on novelty clarification, computation complexity, and comparison. All updates will be made in the final submission.

Novelty clarification (R3) We would like to reiterate that the proposed Transformer architecture merges the advances of convolutional layers and Transformer blocks towards multi-scale medical image segmentation. Our method addresses the dilemma that Transformer requires huge amounts of data to learn vision inductive bias. Our hybrid layer design allows the initialization of the Transformer with convolution without the need for pre-training. Such a design enables simultaneous aggregation of local texture and global information from low-level to high-level feature maps. This is achieved by dramatically reducing the complexity of self-attention and extending it to high-resolution feature maps, which was impossible before due to the time and GPU memory limitations arising from the quadratic complexity.

For mentioned related works [A, B], besides efficiency, the key differences are evident: (1) We apply a Transformer block including multiple attention heads, query/key projection and position encodings, while regular self-attention doesn’t use them. (2) They use self-attention on top of backbone networks, while we propose a backbone architecture that integrates Transformer blocks with convolutional layers.

Computational efficiency (R1 and R3) The actual time complexity is O(nkd), where k can be set to a small number (not necessarily proportional to n). We therefore describe the complexity as approximately O(n) in the paper, as a major contribution of our work.

Comparison to Dual-Attention (R3) The time cost of self-attention is not from the parameters, but mostly from the pairwise similarity computation. The Dual-Attention model in Table 1 only applies self-attention at the lowest resolution (after 4 downsampling), while our UTNet applies the efficient self-attention in four resolutions (1,2,3,4 downsampling).

For a head-to-head comparison, we added an experiment that applies Dual-Attn in four resolutions (1,2,3,4, same as UTNet), and used the same input image size and batch size (256×256×16) to test the inference time and memory consumption (only inference, no gradient backpropagation). Our method shows a clear advantage over Dual-Attn with its quadratic complexity: GPU memory is 36.9GB (Dual-Attn) vs 3.8GB (ours), and time is 0.243s (Dual-Attn) vs 0.146s (ours).
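
As a rough illustration of why full pairwise attention becomes memory-hungry at high resolution, the back-of-envelope sketch below sums the size of the attention maps alone. Everything in it is an assumption made for the example: float32 values, one attention map per sample and resolution, feature maps obtained by halving a 256×256 input 1 to 4 times, a batch of 16, and a hypothetical reduced key count of 64. It is not intended to reproduce the measured 36.9GB and 3.8GB figures, which also include weights, activations, and framework overhead.

```python
def attention_map_bytes(tokens, keys, batch, bytes_per_value=4):
    """Bytes needed to store one tokens-by-keys attention map per sample (float32 assumed)."""
    return batch * tokens * keys * bytes_per_value


def total_attention_bytes(input_side=256, batch=16, levels=(1, 2, 3, 4), reduced_keys=None):
    """Sum attention-map storage over several resolutions.

    reduced_keys=None means full attention (keys == tokens);
    otherwise every level attends to a fixed number of keys.
    """
    total = 0
    for level in levels:
        side = input_side >> level          # halve the side length `level` times
        tokens = side * side
        keys = tokens if reduced_keys is None else reduced_keys
        total += attention_map_bytes(tokens, keys, batch)
    return total


full = total_attention_bytes()
reduced = total_attention_bytes(reduced_keys=64)  # hypothetical k = 64

print(f"full attention maps:    {full / 2**30:.2f} GiB")    # 17.07 GiB
print(f"reduced attention maps: {reduced / 2**20:.2f} MiB")  # 85.00 MiB
```

In this estimate the 128×128 level dominates the full-attention total, which is consistent with the general point that quadratic attention is hardest to afford at the highest resolutions.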

The improvement of performance might come from more parameters. (R3) More parameters don’t always mean higher performance. For example, the number of base channels of the ResUNet in Table 1 is 32. We also did an experiment with a base channel of 64 (100% more parameters), which has even lower performance (87.7 vs 88.3). Thus we strongly believe that 2% more parameters leading to 1.2% performance gain is reasonable.

Reproducibility (R2 and R3) We’ve created an anonymous GitHub and released our model definition code. The full training code will be released after acceptance. https://github.com/anonymousgit873/UTNet

Comparison with M&Ms challenge (Meta-review) Our approach is competitive with the top 3 on the M&Ms challenge leaderboard. We didn’t directly compare with them because we are in different settings. They used adaptation mechanisms to train their models and leveraged the unlabelled vendor C images, while UTNet is trained regularly on vendors A and B and directly generalized to vendors C and D. Most domain adaptation mechanisms are orthogonal to our method, and we are not aiming at the adaptation task. For example, the top 2 teams used extensive augmentation or leveraged unlabelled sequences and style transfer among vendors for enhanced training. Even so, our method retains results comparable to the top 2, and is higher than the third-place team in all three tasks (ours vs. theirs): LV: 0.908 vs. 0.902, MYO: 0.837 vs. 0.835, RV: 0.876 vs. 0.874.

Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The authors have addressed most comments in their rebuttal, whereas the arguments on other concerns are not very convincing (e.g., the authors claim that [A,B] are backbones integrating attention, while their method is a new backbone - however, their model is basically UNet integrating the Transformer attention module). Multi-scale feature aggregation (both with and without attention) has been proposed in the literature. Thus, the technical contribution is rather limited. Given that the authors have addressed the raised concerns and the empirical validation is relevant, I recommend acceptance of this work.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

5

Meta-review #2

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The majority of reviewers made positive remarks and found at least a good incremental methodological contribution. The main concerns and remarks (complexity and openness of code) have been addressed by the authors (they even provided an anonymous GitHub with code). The memory reduction is impressive - but I agree that similar ideas have been employed for self-attention. They did not answer the questions regarding 3D - it appears to be challenging, but not necessarily important for MRI (with thick slices). The only important change I would request is to reference the winning approach of the M&M challenge, nnUNet: https://zenodo.org/record/4288464#.YLomyS0RpEJ

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

4

Meta-review #3

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

Most of the concerns have been satisfactorily addressed during the rebuttal process. The authors have also released their source code anonymously to demonstrate its reproducibility. However, some concerns were not discussed during the rebuttal process: 1) the key difference from the existing self-attention mechanisms is still unclear; 2) both R1 and R3 raised concerns about the time complexity. Based on the rebuttal, it seems that the time complexity of O(n) was derived experimentally rather than theoretically? Overall, this paper introduces a hybrid approach that combines Transformer and CNN for medical image segmentation, which is novel and interesting to the field. The experiments are sound, with a MICCAI challenge dataset.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

8