
Authors

Yutong Xie, Jianpeng Zhang, Chunhua Shen, Yong Xia

Abstract

Convolutional neural networks (CNNs) have been the de facto standard for 3D medical image segmentation. The convolutional operations used in these networks, however, inevitably have limitations in modeling the long-range dependency due to their inductive bias of locality and weight sharing. Although Transformer was born to address this issue, it suffers from extreme computational and spatial complexities in processing high-resolution 3D feature maps. In this paper, we propose a novel framework that efficiently bridges a Convolutional neural network and a Transformer (CoTr) for accurate 3D medical image segmentation. Under this framework, the CNN is constructed to extract feature representations and an efficient deformable Transformer (DeTrans) is built to model the long-range dependency on the extracted feature maps. Different from the vanilla Transformer which treats all image positions equally, our DeTrans pays attention only to a small set of key positions by introducing the deformable self-attention mechanism. Thus, the computational and spatial complexities of DeTrans have been greatly reduced, making it possible to process the multi-scale and high-resolution feature maps, which are usually of paramount importance for image segmentation. We conduct an extensive evaluation on the Multi-Atlas Labeling Beyond the Cranial Vault (BCV) dataset that covers 11 major human organs. The results indicate that our CoTr leads to a substantial performance improvement over other CNN-based, transformer-based, and hybrid methods on the 3D multi-organ segmentation task. Code is available at: https://github.com/YtongXie/CoTr.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87199-4_16

SharedIt: https://rdcu.be/cyl3S

Link to the code repository

https://github.com/YtongXie/CoTr

Link to the dataset(s)

https://www.synapse.org/#!Synapse:syn3193805/wiki/217789


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors of this paper present a method that combines a CNN encoder, a transformer-based mapping and a decoder for the segmentation of multiple organs in abdominal HRCT scans. The main contribution lies in the transformer, which the authors modify so that it focuses only on a small set of key sampling locations instead of all possible locations, and in its application to 3D medical images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well written and a clear ablation study is presented.
    2. The use of sampled locations as input to the transformer.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weakness of the paper is that the authors use a single metric (Dice) and do not report the standard deviation of the performance. Dice can be biased toward large organs. Additionally, the database seems to me a bit small. Results on more datasets would be highly appreciated, considering that a number of them are publicly available.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    High reproducibility:

    1. Publicly available data
    2. Code attached
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    1. “… , and hence is superior to convolutional operations in modeling the long-range dependency”: Such sentence better suits in a discussion section rather than an introduction.
    2. A figure with a few examples of the BCV dataset can enhance the quality and readability of the paper.
    3. “the CNN-encoder cannot capture the long-range dependency of pixels”: As was also mentioned in the introduction, long-range dependencies are heavily associated with the receptive field of the network. I find these statements a bit strong. I would suggest clarifying these points.
    4. The term deformable is typically associated with registration pipelines.
  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Mostly because of the ablation study and the fact that a transformer pipeline is utilized for segmentation.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain
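
The request in this review for a standard deviation alongside Dice can be illustrated with a minimal sketch. It computes a per-organ Dice score for each case and then reports the mean and sample standard deviation per organ, so that small organs are not averaged away by large ones. The function names, the toy label maps, and the convention that an organ absent from both masks scores 1.0 are all assumptions made for illustration; none of this is taken from the paper or its code.

```python
from statistics import mean, stdev

def dice(pred, truth, label):
    """Dice coefficient for one label over two flat label sequences."""
    p = [v == label for v in pred]
    t = [v == label for v in truth]
    inter = sum(a and b for a, b in zip(p, t))
    denom = sum(p) + sum(t)
    # Assumed convention: an organ absent from both masks scores 1.0.
    return 1.0 if denom == 0 else 2.0 * inter / denom

def summarize(cases, labels):
    """Return {label: (mean Dice, sample std)} across (pred, truth) cases."""
    out = {}
    for lab in labels:
        scores = [dice(p, t, lab) for p, t in cases]
        spread = stdev(scores) if len(scores) > 1 else 0.0
        out[lab] = (mean(scores), spread)
    return out

# Hypothetical flattened label maps: 0 = background, 1 and 2 = organs.
cases = [
    ([0, 1, 1, 2, 2, 2], [0, 1, 1, 2, 2, 0]),
    ([0, 1, 0, 2, 2, 2], [0, 1, 1, 2, 2, 2]),
]
summary = summarize(cases, labels=[1, 2])
for lab, (m, s) in summary.items():
    print(f"organ {lab}: Dice mean={m:.3f}, std={s:.3f}")
```

Reporting the per-organ spread in this way makes it visible when an average is dominated by a few easy cases.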



Review #2

  • Please describe the contribution of the paper

    This paper develops a Convolutional neural network and a Transformer (CoTr) with a CNN encoder, a deformable transformer (DeTrans) layer, and a decoder for accurate 3D medical image segmentation. DeTrans pays attention only to a small set of key positions by introducing the deformable self-attention mechanism. The model is validated on the Multi-Atlas Labeling Beyond the Cranial Vault (BCV) dataset and achieved superior performance over other variations of transformer architectures.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    a. Develops a computationally efficient transformer-based segmentation network for 3D medical image segmentation. b. Introduces the deformable self-attention mechanism, which attends to a small set of key positions in the image. This helps to reduce the computational cost. c. The proposed model obtains competitive performance over other transformer and hybrid methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    a. My main concern with the paper is the extremely small dataset. The dataset contains only 30 CT scans, with 15 scans for training, 6 scans for validation, and 9 scans for testing. b. The way the Dice and cross-entropy losses are summed should be mentioned in the paper. c. Though this architecture is an interesting way of combining a CNN encoder, a deformable transformer layer, and a decoder, most of these modules are well-established in computer vision. d. The meta-information of the CT scan dataset is missing, for example voxel spacing and original dimensions. e. The model should be validated on a bigger dataset such as BraTS or LiTS.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The work uses an open dataset and the code is provided with instruction in readme file.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    a. The model needs to be validated on a larger dataset. b. Meta-information of the CT scans is missing. c. How are the two loss functions combined?

  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    a. Efficient transformer-based segmentation network for 3D medical image segmentation. b. Interesting way to fuse a deformable transformer with a CNN encoder and decoder. c. The proposed model outperforms other forms of transformer-based architectures.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident
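
The question raised in this review about how the Dice and cross-entropy losses are combined can be made concrete with a small sketch. It shows one common option, a weighted sum of a soft Dice loss and a binary cross-entropy loss. The equal default weights, the binary (single-organ) setting, the smoothing constants, and all function names are assumptions for illustration; the reviews do not state what the authors actually used.

```python
import math

def soft_dice_loss(probs, targets, eps=1e-6):
    """1 - soft Dice over flat lists of foreground probabilities and 0/1 targets."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def cross_entropy_loss(probs, targets, eps=1e-12):
    """Mean binary cross-entropy, with probabilities clamped away from 0 and 1."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)

def combined_loss(probs, targets, w_dice=1.0, w_ce=1.0):
    """Weighted sum of the two losses; the weights here are hypothetical."""
    dice_term = soft_dice_loss(probs, targets)
    ce_term = cross_entropy_loss(probs, targets)
    return w_dice * dice_term + w_ce * ce_term

# Hypothetical voxel probabilities and ground-truth labels.
probs = [0.9, 0.8, 0.2, 0.1]
targets = [1, 1, 0, 0]
loss = combined_loss(probs, targets)
print(f"combined loss: {loss:.4f}")
```

Stating the weights explicitly, even when both are 1, is what would let a reader reproduce the training objective.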



Review #3

  • Please describe the contribution of the paper

    This paper proposes a hybrid framework for efficient training of transformers and CNNs for 3D medical image segmentation applications. The proposed model leverages a CNN model for feature extraction and a deformable transformer model to capture long-range dependencies in extracted features. The extracted features are further upsampled to the image resolution using a separate decoder. The model is tested on the Multi-Atlas Labeling Beyond the Cranial Vault (BCV) dataset and compares favorably to competing approaches.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well-written and easy to follow. Additionally, it provides a comprehensive review of existing methodologies in this area.

    2. The model provides a rather unique architecture by connecting the output of the transformer to different resolutions of the decoder, as opposed to merely using the transformer as an attention layer in the TransUnet model. This architecture should better capture multi-scale representations in extracted features.

    3. The use of the deformable transformer layer seems to be novel in this context, although it has been previously proposed by [1].

    4. The model has been extensively evaluated and compared against competing approaches.

    [1] Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020)

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The deformable transformer layer is presented as a key novelty in this work. However, the paper does not fully demonstrate the benefits of utilizing such a layer as opposed to an ordinary transformer layer. For instance, feature map visualizations can better delineate the inner working of such layers. Additionally, a compelling comparison is to replace all the deformable transformer layers with ordinary ones and quantitatively compare the performance of the two models.

    2. The best variation of the proposed model marginally outperforms the TransUnet model. Considering the enhancement in the architecture and the fact that it directly operates on 3D volumes as opposed to a 2D slice-wise scheme, the performance gain should be higher.

    3. The paper also compares against the SETR [1] model, which was originally proposed for segmentation in the natural image domain. SETR proposes 3 different decoder variants. Which decoder architecture is used in this evaluation? The choice of the decoder can impact the presented benchmarks.

    4. The model is only evaluated on a single segmentation dataset for organ segmentation in CT modality. More tasks/modalities should be investigated.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The responses meet the reproducibility criteria on the checklist.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    1. Provide visual comparisons for segmentation outputs of different methods in the main text.
  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper proposes a novel hybrid framework for image segmentation by bridging the gap between CNNs and transformers. Although the experimental results do not show a significant gain with respect to the TransUnet architecture, it still has merits and can be considered for publication.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident
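
The comparison requested in this review, deformable versus ordinary transformer layers, comes down largely to how many query-key interactions each has to evaluate. The sketch below counts them for a 3D feature map: vanilla self-attention compares every position with every other position, while deformable attention lets each position attend to only K sampled keys. The feature-map shape and the value of K are hypothetical, and the count ignores attention heads, multiple scales, and the cost of predicting sampling offsets, so it is a rough illustration rather than a measurement of either model.

```python
def vanilla_attention_pairs(d, h, w):
    """Query-key pairs when every position attends to every position."""
    n = d * h * w
    return n * n

def deformable_attention_pairs(d, h, w, k):
    """Query-key pairs when every position attends to k sampled keys."""
    return d * h * w * k

# Hypothetical 3D feature-map size and number of sampled keys per query.
shape = (24, 48, 48)
k = 4

full = vanilla_attention_pairs(*shape)
sparse = deformable_attention_pairs(*shape, k)
print(f"vanilla pairs:    {full}")
print(f"deformable pairs: {sparse}")
print(f"reduction factor: {full // sparse}")
```

The reduction factor equals N / K for N positions, which is why the gap widens as feature maps become larger or higher in resolution.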




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes to improve on a recent trend of using Transformers by integrating them into a convolutional setting, with an added deformable Transformer that focuses attention on a few key points. This alleviates the computational complexity of standard Transformers.

    One reviewer appreciates the proposed reduction of the complexity of transformers, but wishes there were more evaluation metrics than Dice.

    A second reviewer also notes the computational efficiency of the novel self-attention mechanism, but wishes the evaluation had been on a larger dataset.

    A third reviewer appreciates the novel deformable transformer, and suggests improvements in the evaluation, but is less impressed with the reported improvements in performance over the standard TransUNet.

    All reviewers seem to agree on the originality of the proposed Transformer approach, which adds a deformable self-attention mechanism to alleviate the computational burden by focusing the decoding on a few key points. The improvement may not appear large, but the proposed approach is a new alternative to TransUNet and a contribution to the field. As one reviewer suggested, the dataset used is small, and evaluating on a larger scale may better clarify in which scenarios the proposed transformer architecture performs better.

    For these reasons, I believe this paper offers a contribution to the field that may prove more valuable in certain future scenarios, but in its current form, a response is needed to clarify the motivation of the experimental setting. Recommendation is towards an invitation for a Rebuttal.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    4




Author Feedback

We sincerely thank all reviewers and ACs for their invaluable comments and the approval of originality. The code of this work is available.

1. Using a larger dataset (R1&R2&R3) We supplemented an experiment on the Liver and Tumor Segmentation (LiTS) dataset, a large-scale dataset with 103 CT scans for training and 27 CT scans for testing. Here are the obtained average Dice and Hausdorff distance (HD).

SETR (ViT-B/16-pre) => LD: 94.15%, TD: 58.84%, LHD: 14.19, THD: 44.99
CoTr w/o CNN-encoder => LD: 95.73%, TD: 66.37%, LHD: 4.67, THD: 19.80
TransUnet => LD: 96.08%, TD: 67.17%, LHD: 4.56, THD: 19.04
CoTr => LD: 96.56%, TD: 68.92%, LHD: 4.08, THD: 16.93
*LD: Liver Dice, TD: Tumor Dice, LHD: Liver HD, THD: Tumor HD

It reveals that our CoTr consistently surpasses the other three methods on both liver and tumor segmentation, demonstrating again the advantages of CoTr over the pure Transformer encoder, pure CNN encoder, and vanilla Transformer.

2. Using more metrics (R1) We used another metric, HD, to measure the degree of mismatch between the predicted boundaries and the ground truth. Here are the HD values to be supplemented to Table 1.

SETR (ViT-B/16-rand) => 7.47
SETR (ViT-B/16-pre) => 8.47
CoTr w/o CNN-encoder => 7.23
CoTr w/o DeTrans => 5.18
APSS => 4.85
PP => 5.10
Non-local => 4.70
TransUnet => 4.77
CoTr∗ => 5.04
CoTr† => 4.39
CoTr => 4.01

3. Modules are well-established (R2) Despite using well-established modules, our CoTr is the first to explore Transformer for 3D medical image segmentation in a computationally and spatially efficient way. Amid the surge of research applying Transformer to CV, applications to 3D scenarios, e.g., 3D medical image segmentation, are rare due to the high complexity. We introduce 3D deformable self-attention, which focuses only on a few key sampling points to alleviate the computation. Thus, it is possible for the Transformer to process the multi-scale feature maps and keep abundant high-resolution information for segmentation. The results show that our CoTr significantly beats the competing CNN-based, Transformer-based, and more recent TransUNet models, suggesting our contribution to the field.

4. Benefit of deformable transformer (DeTrans) over vanilla transformer (VaTrans) (R3) The benefit of DeTrans stems from the 3D deformable attention mechanism, which enables the model to focus on a few sparse positions, instead of all positions (in VaTrans). Thanks to this mechanism, our CoTr can characterize the long-range dependency in multi-scale features, which plays a critical role in segmentation. The superiority of DeTrans over VaTrans has been demonstrated by comparing TransUNet and CoTr in Tab.1 and Fig.2 (in Appendix). We used the same 3D CNN-encoder and decoder in TransUNet and CoTr, but replaced all DeTrans layers with VaTrans layers in TransUNet. Limited by the computation and memory capacity, TransUNet with VaTrans is only able to process single-scale and low-resolution feature maps. By contrast, our CoTr with DeTrans can process multi-scale feature maps simultaneously, resulting in higher accuracy.

5. Marginally beat TransUnet (R3) We performed the paired t-test to compare CoTr with TransUnet, and obtained a p-value of 0.035 on the BCV dataset. It suggests that the performance gain of our CoTr over TransUnet on this dataset is statistically significant.

6. Performance gain comes from 3D operations (R3) The Dice reported in the TransUnet paper (using 2D pipeline) is only 55.9% for pancreas segmentation, while our reproduced 3D TransUnet obtains a Dice of 81.6% on the same dataset. For fairness, we implemented both TransUNet and CoTr using a 3D pipeline, i.e., using the same 3D CNN-encoder and decoder. The results on BCV (Table 1) and LiTS (supplemented) indicate that our CoTr steadily beats TransUNet, and the performance gain is statistically significant.

7. Decoder in SETR (R3) The best-performing SETR decoder with a progressive upsampling strategy was used.

More details (e.g., meta-information of CT scans, loss, and visualizations) will be provided.
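
The Hausdorff distance (HD) reported in the rebuttal measures the worst-case mismatch between two boundaries. A minimal sketch of the plain symmetric HD between two sets of boundary points follows. The rebuttal does not say whether the maximum or a percentile variant (such as the 95th percentile) was used, nor the units, so this brute-force version and its toy coordinates are assumptions for illustration only.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from a point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Hypothetical boundary voxels (z, y, x) of a prediction and a ground truth.
pred = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
truth = [(0, 0, 0), (1, 0, 0), (5, 0, 0)]

hd = hausdorff(pred, truth)
print(f"Hausdorff distance: {hd}")
```

Because HD is driven by the single worst boundary point, it complements Dice, which is an overlap measure and is less sensitive to isolated outliers. The pairwise loop costs O(|A||B|), so real volumes would normally use an optimized routine instead.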




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have provided further information on their validation and linked it to their novelty, notably the focus on only sparse positions during segmentation. This paper could possibly be noted as an original application of Transformers in medical image analysis.

    For these reasons, Recommendation is toward Acceptance.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    5



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper adds a deformable transformer to the common CNN architecture and shows the benefits. The paper is easy to understand and well written. The main weakness is that it is only tested on a small dataset. In the rebuttal, the authors evaluated on LiTS but only split the training data into train/test. Meanwhile, the testing server is open. Overall, this paper could benefit the MICCAI community and is above the acceptance bar. Recommend to accept.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    2



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors’ rebuttal successfully addressed concerns regarding dataset, metrics, and also the comparison to the recently proposed TransUnet. I believe an efficient segmentation model based on the transformer network is valuable. I recommend acceptance of the paper. The authors should include discussions presented in the rebuttal in the final version if accepted.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    1


