
Authors

Davood Karimi, Serge Didenko Vasylechko, Ali Gholipour

Abstract

Like other applications in computer vision, medical image segmentation has been most successfully addressed using deep learning models that rely on the convolution operation as their main building block. Convolutions enjoy important properties such as sparse interactions, weight sharing, and translation equivariance. These properties give convolutional neural networks (CNNs) a strong and useful inductive bias for vision tasks. However, the convolution operation also has important shortcomings: it performs a fixed operation on every test image regardless of the content, and it cannot efficiently model long-range interactions. In this work we show that a network based on self-attention between neighboring patches and without any convolution operations can achieve better results. Given a 3D image block, our network divides it into n^3 3D patches, where n = 3 or 5, and computes a 1D embedding for each patch. The network predicts the segmentation map for the center patch of the block based on the self-attention between these patch embeddings. We show that the proposed model can achieve higher segmentation accuracies than a state-of-the-art CNN. For scenarios with very few labeled images, we propose methods for pre-training the network on large corpora of unlabeled images. Our experiments show that with pre-training the advantage of our proposed network over CNNs can be significant when labeled training data is small.
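The data flow described in the abstract (split a block into n^3 patches, embed each patch as a 1D vector, apply self-attention, predict labels for the center patch) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single attention head, the linear embedding and output head, the random weights, and every name used here (`block_to_patches`, `init_params`, `segment_center_patch`, `dim`) are assumptions made purely for illustration.

```python
import numpy as np

def block_to_patches(block, n):
    """Split a cubic block of side n*p into n^3 patches, each flattened to p^3 values."""
    side = block.shape[0]
    assert block.shape == (side, side, side) and side % n == 0
    p = side // n
    # (n, p, n, p, n, p) -> (n, n, n, p, p, p) -> (n^3, p^3)
    x = block.reshape(n, p, n, p, n, p).transpose(0, 2, 4, 1, 3, 5)
    return x.reshape(n ** 3, p ** 3)

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention between patch embeddings."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v

def init_params(rng, n, p, dim, num_classes):
    """Random weights (hypothetical shapes); a real model would learn these."""
    s = 0.02
    return {
        "embed": rng.normal(0, s, (p ** 3, dim)),   # patch -> 1D embedding
        "pos": rng.normal(0, s, (n ** 3, dim)),     # positional encoding
        "w_q": rng.normal(0, s, (dim, dim)),
        "w_k": rng.normal(0, s, (dim, dim)),
        "w_v": rng.normal(0, s, (dim, dim)),
        "head": rng.normal(0, s, (dim, p ** 3 * num_classes)),
    }

def segment_center_patch(block, n, params, num_classes):
    """Predict a label map for the center patch of the block (n must be odd)."""
    p = block.shape[0] // n
    tokens = block_to_patches(block, n) @ params["embed"] + params["pos"]
    # One attention layer with a residual connection.
    tokens = tokens + self_attention(tokens, params["w_q"], params["w_k"], params["w_v"])
    center = (n ** 3) // 2  # index of the middle patch in row-major order
    logits = (tokens[center] @ params["head"]).reshape(p, p, p, num_classes)
    return logits.argmax(axis=-1)
```

With `n = 3` and a 12×12×12 block, `block_to_patches` yields 27 patches of 64 voxels each, and `segment_center_patch` returns a 4×4×4 label map for the middle patch.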

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87193-2_8

SharedIt: https://rdcu.be/cyhLA

Link to the code repository

N/A

Link to the dataset(s)

N/A

Reviews

Review #1

Please describe the contribution of the paper

This paper presents an attention-based model for the segmentation of 3D medical images. The results show that the proposed network can outperform the U-Net on three medical image datasets.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- This paper is easy to follow
- Using attention for the segmentation of medical images is a novel idea
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- Lack of novelty.
- Lack of comparison with the state of the art
- Lack of implementation details.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

not reproducible
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
- Only compares with U-Net, which is not the state of the art for 3D image segmentation.
- Lack of novelty. Using only several attention layers for segmentation is not novel enough.
- Lack of experimental details. As is well known, data augmentation, network size, and the number of network layers are all essential settings for network training, especially for small datasets such as those used in this paper. All these settings should be detailed for a fair comparison.
Please state your overall opinion of the paper

probably reject (4)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- Lack of novelty.
- Lack of comparison with the state of the art
- Lack of implementation details
What is the ranking of this paper in your review stack?

4
Number of papers in your stack

5
Reviewer confidence

Very confident

Review #2

Please describe the contribution of the paper

This paper explores a Transformer-inspired, self-attention-based network architecture for image segmentation.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

- Explores transformer models as an alternative to convolutional models.
- Compares the proposed model with an existing method (UNet++).
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

- The dataset size is too small (20-200 labeled images). This makes it hard to comment on the generalization of the proposed model or on whether the models are just overfitting.
- Missing motivation and detailed discussion of how convolution-free models help.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The dataset is not publicly available. The results should be reproducible given the model hyper-parameters described in the paper.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

- A more extensive comparative study should be presented on reasonably large, publicly available data. As the paper argues for replacing convolutions, which are used extensively in so many applications, a more detailed discussion is needed: why is this better than a convolution-based model, and why can’t we have a mix of both convolution and self-attention? Please also compare the number of trainable parameters for all these versions.
Please state your overall opinion of the paper

borderline accept (6)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Given the results on very small datasets, it’s hard to tell whether the slightly improved results are just due to the higher model capacity of transformer networks or to overfitting.
What is the ranking of this paper in your review stack?

4
Number of papers in your stack

5
Reviewer confidence

Confident but not absolutely certain

Review #3

Please describe the contribution of the paper

The manuscript proposed a new neural network architecture without convolutions for 3D medical image segmentation. The proposed network incorporated the multi-head self-attention modules from transformer models in the NLP domain. The manuscript also introduced a model pre-training strategy with unlabeled data to improve segmentation accuracy.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The contents of the submission are about applications of transformer layers in the field of medical image segmentation, which is highly relevant to the MICCAI audience.
2. Experimental results support the claims made in the paper.
3. The paper is well-organized and well-written.
4. The manuscript is technically sound.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. It is unfair to claim that the proposed network is the first “convolution-free” neural network model for medical image segmentation. The following works studied the multi-layer perceptron (MLP) for segmentation tasks more than a decade ago:
   a. Suzuki, K., 2017. Overview of deep learning in medical imaging. Radiological Physics and Technology, 10(3), pp. 257-273.
   b. Zhao, Y., Zan, Y., Wang, X. and Li, G., 2010, May. Fuzzy C-means clustering-based multilayer perceptron neural network for liver CT images automatic segmentation. In 2010 Chinese Control and Decision Conference (pp. 3423-3427). IEEE.
2. Performance is not close to that of state-of-the-art methods. Both pancreas and hippocampus segmentation show a clear gap compared to the performance of state-of-the-art methods.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The proposed network function might be difficult to implement.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
1. Please describe how the full segmentation mask is created in practice. Is a sliding-window algorithm used, or some other patch-based method?
2. Please add a description of the datasets. It is unclear whether the tasks in the experiments are binary or multi-class segmentation.
3. Pre-training might be necessary for transformer-based models. Why not use pre-trained models from the ViT paper and check the performance, since the transformer or self-attention layers use 1D embeddings anyway?
4. What do the failure cases look like? Please explain the reasons for the failures of the proposed method and describe its limitations.
5. Please further compare the capacity of the proposed network with that of the other networks in terms of parameter count, FLOPs, and GPU memory consumption.
6. What is the motivation for conducting inference for the central patch only instead of all 27 patches?
7. Would adding more patches as input (e.g., 125 patches) improve the segmentation performance? What is the per-volume inference efficiency of the proposed network and of the baseline network?
8. Does the serialization order of the patches matter for the final accuracy?
9. Why not use the same method to pre-train U-Net++ for a fair comparison?
10. How well does the network perform on the pre-training tasks after training? It would be helpful to visualize the reconstructed images.
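To make comment 1 concrete: the reviews do not say how the authors assemble a full-volume mask, so the following is only one possible sliding-window scheme, sketched under assumptions. The volume is tiled with non-overlapping p×p×p patches, and each patch is labeled by a model that sees the surrounding (n·p)^3 block. The function name `sliding_window_segment`, the zero padding at the borders, and the `predict_center` callable are all hypothetical.

```python
import numpy as np

def sliding_window_segment(volume, n, p, predict_center):
    """Label every p^3 patch of `volume` from the (n*p)^3 block centered on it.

    `predict_center(block)` is assumed to return a (p, p, p) integer label map
    for the center patch of a block of shape (n*p, n*p, n*p). n must be odd.
    """
    assert n % 2 == 1
    margin = (n // 2) * p          # context needed on each side of a patch
    side = n * p
    # Extra padding so that each dimension becomes a multiple of p.
    extra = [(-s) % p for s in volume.shape]
    padded = np.pad(volume, [(margin, margin + e) for e in extra], mode="constant")
    out = np.zeros([s + e for s, e in zip(volume.shape, extra)], dtype=np.int64)
    for z in range(0, out.shape[0], p):
        for y in range(0, out.shape[1], p):
            for x in range(0, out.shape[2], p):
                # The block starting at (z, y, x) in the padded volume has its
                # center patch at (z, y, x) in the unpadded coordinates.
                block = padded[z:z + side, y:y + side, x:x + side]
                out[z:z + p, y:y + p, x:x + p] = predict_center(block)
    # Crop away the padding added to reach a multiple of p.
    d, h, w = volume.shape
    return out[:d, :h, :w]
```

Because each patch is predicted independently, the number of forward passes grows with the number of patches in the volume, which is relevant to the inference-efficiency question in comment 7.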
Please state your overall opinion of the paper

borderline accept (6)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The manuscript presented a new neural network for 3D medical image segmentation. The novelty is okay, but necessary details and analyses were missing.
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

5
Reviewer confidence

Very confident

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The paper presents an attention-based model for 3D segmentation. The reviewers have identified several weaknesses. The novelty is limited, as only several attention layers are used. The comparison is conducted only with U-Net, and the paper lacks many experimental details, which may make the proposed method difficult to reproduce.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

6

Author Feedback

We thank the reviewers and Area Chairs for their efforts.

** Novelty Reviewer 2 and Meta-Reviewer have questioned the novelty of our work. As we have stated in the paper, our paper is “the first work” to propose a convolution-free model based on transformer networks for 3D medical image segmentation. Our proposed network is “completely novel”. We have shown that this model achieves “statistically significantly” better results than a state of the art FCN on 3 datasets. This contribution is novel and highly impactful. Moreover, we are surprised that Reviewer 2 and Meta-Reviewer have completely missed our other contribution that we have explicitly stated on page 3: novel methods for pre-training these models using unlabeled images. We have demonstrated the effectiveness of these methods (Figure 3). This is a completely novel and very important contribution because these networks are well known to depend on large labelled datasets. Our proposed pre-training methods are clearly an important step in addressing this challenge.

** Comparison with state of the art Reviewer 2 and Meta-Reviewer have criticized us for not comparing our method with the state of the art, because they say that we have compared against UNet. That is not true. We have compared with UNet++, not UNet. We have explicitly mentioned this throughout the paper and we have cited the UNet++ paper in Section 2.2. Nowhere in the paper have we mentioned UNet. We also applied three other state-of-the-art FCNs (not included in the paper) to these datasets, and they did not perform better than UNet++. UNet++ is indeed a state-of-the-art FCN, and our results show that our method significantly outperforms UNet++ on 3 datasets (Table 2).

** Implementation details Reviewer 2 has mentioned that our paper “lacks implementation details”. This opinion has been echoed by Meta-Reviewer. Meta-Reviewer has not mentioned any specific missing implementation detail. Reviewer 2 has mentioned “data augmentation, network size, # of network layers”. We used the same data augmentation for our method and UNet++. Because data augmentation was the same for our method and UNet++, we did not include it in the paper, but we can easily add this information. On the other hand, we have clearly specified “network size” and “number of layers” in the paper (beginning of Section 3). We have even reported and discussed experimental results regarding the size and number of layers for our model (Table 3). Overall, we are surprised by this comment because we have spent approximately 3 pages to describe our method in full detail (Sections 2.1 and 2.2). Although Section 2.2 is titled “Implementation”, Section 2.1 also provides details of our method and Table 3 presents related experiments.

** Small datasets Reviewer 3 (and Reviewer 2) have stated that our datasets are small. We respectfully disagree. One of our datasets (brain cortical plate) had approximately 20 training images because the cortical plate is extremely time-consuming to segment manually. But each of the other two datasets had more than 200 training images. This is indeed large for medical image segmentation. Even some public datasets are not this large.

In Summary: We believe our work does have sufficient novelty and significance to be presented at MICCAI: (1) We have developed “the first” transformer-based network for 3D medical image segmentation. (2) We have shown on 3 datasets that our model statistically significantly outperforms a state-of-the-art FCN: UNet++. This is very important as FCNs are considered to be the best medical image segmentation models. (3) We have proposed completely novel methods for pre-training such networks on unlabeled data. This is an important and significant contribution as transformer networks are known to be very data-hungry. We also believe that major criticisms mentioned by the respected reviewers are incorrect and based on misreading of our paper. We look forward to sharing our work with the MICCAI community.

Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The rebuttal has clarified the novelty and implementation details. The paper is sufficient for publication at MICCAI.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

10

Meta-review #2

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

After reading the paper, the reviewers’ comments, and the rebuttal, I feel that the authors strongly overclaim their contributions. While the authors argue that this is the first convolution-free segmentation method, from my point of view this fact does not warrant acceptance of the work. In recent months there have been many segmentation methods based on transformers, and I do not see much difference from them. Thus, I agree with the reviewers and the meta-reviewer about the limited novelty. Furthermore, the authors also claim as their own contribution ‘novel methods’ to pre-train deep models with unlabeled images. Similarly, this is a strong claim, as it is well established that pretext tasks (inpainting, denoising, or others such as solving puzzles or predicting image rotation) are useful for pre-training a network in a self-supervised manner. As a simple example, the authors of [1] demonstrated that other pretext tasks help to pre-train a model for 3D segmentation (and many more examples exist). Thus, the strongly claimed third contribution is not novel either. Lastly, I agree with the reviewers about the incomplete empirical validation, particularly the comparison with the state of the art. Given that transformers rely on attention, the authors should have compared to attention-based models. Their answers in the rebuttal are therefore not convincing and, from my point of view, overclaimed.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Reject
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

20

Meta-review #3

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This work proposed a transformer network approach (instead of a CNN) for medical image segmentation. It also proposed a pre-training strategy. Results show that improved segmentation accuracy can be obtained over a UNet++ segmentation approach, as demonstrated on three different datasets. Reviewers were somewhat split in their evaluation, but the main criticisms revolved around novelty and comparison to other methods. The authors compellingly argued in their rebuttal that the comparison to UNet++ should in fact be considered a comparison to a state-of-the-art segmentation approach (and is different from a comparison to a UNet) and that the contribution lay not only in applying a transformer, but also in providing an effective pre-training approach. However, the effect of pre-training does not seem to have been assessed via an ablation study.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

10

Convolution-Free Medical Image Segmentation using Transformer Networks