
Authors

Rongtao Xu, Changwei Wang, Shibiao Xu, Weiliang Meng, Xiaopeng Zhang

Abstract

Medical image segmentation is essential for disease diagnosis and analysis. Many variants of U-Net based on attention mechanisms and dense connections have made progress. However, CNN-based U-Net lacks the ability to capture the global context, and the context information of different scales is not effectively integrated. These limitations lead to the loss of potential context information. In this work, we propose a Dual Context Network (DC-Net) to aggregate global context and fuse multi-scale context for 2D medical image segmentation. To aggregate the global context, we present the Global Context Transformer Encoder (GCTE), which reshapes the original image and the multi-scale feature maps into sequences of image patches and exploits the Transformer encoder's strength in global context aggregation to improve the performance of the encoder. For the fusion of multi-scale context, we propose the Adaptive Context Fusion Module (ACFM), which adaptively fuses context information by learning Adaptive Spatial Weights and Adaptive Channel Weights to improve the performance of the decoder. We apply DC-Net with GCTE and ACFM to skin lesion segmentation and cell contour segmentation tasks. Experimental results show that our method outperforms other advanced methods and achieves state-of-the-art performance.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87193-2_48

SharedIt: https://rdcu.be/cyhMr

Link to the code repository

N/A

Link to the dataset(s)

N/A

Reviews

Review #1

Please describe the contribution of the paper

This paper proposes a Transformer-based method for 2D medical image segmentation. The model consists of two parts: a global context transformer encoder that improves the performance of the encoder, and an adaptive context fusion module for information fusion. The proposed method achieves state-of-the-art results on the ISIC 2018 and ISBI 2012 datasets.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The transformer is very popular in computer vision. This paper introduces the transformer to medical image segmentation.
2. Multi-scale fusion compensates for the loss of local information.
3. This paper shows extensive experimental results on the ISIC 2018 and ISBI 2012 datasets, and outperforms state-of-the-art approaches.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. According to the ablation study, the effectiveness of GCTE is not obvious. It only improves the result by approximately 0.001, which can be considered fluctuation.
2. The novelty is limited. The core Transformer module is introduced without any special design that highlights its advantages.
3. The initialization of the Transformer is not clear: did it use ImageNet-pretrained weights?
4. The evaluation metrics are monotonous. Dice and IoU, to some degree, measure the same thing.
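The remark about Dice and IoU can be made concrete: for a single pair of masks, IoU is a monotone function of Dice, IoU = Dice / (2 - Dice), so the two metrics rank individual predictions identically. The snippet below is a minimal sketch of this identity on a toy pair of flattened binary masks; the function names and the toy masks are illustrative assumptions, not taken from the paper. The identity holds per mask pair and need not hold exactly for scores averaged over a dataset.

```python
def dice_and_iou(pred, target):
    """Return (dice, iou) for two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty: treat this as perfect agreement by convention.
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


def iou_from_dice(dice):
    """Per-mask identity: IoU = Dice / (2 - Dice)."""
    return dice / (2.0 - dice)


# Toy masks (hypothetical data): 3 overlapping pixels, 4 foreground pixels each.
pred = [1, 1, 1, 0, 0, 1, 0, 0]
target = [1, 1, 0, 0, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
print(dice, iou, iou_from_dice(dice))
```

Because one metric determines the other per image, a boundary-based metric adds information that a second overlap metric does not.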
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

This paper can be reproduced.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
1. Most importantly, I suggest the authors rethink the effectiveness of the Global Context Transformer part, or design a new part that shows the advantage of the Transformer.
2. Existing Transformer-based methods such as TransUNet and MedT could be compared in this paper to show the effectiveness of your method.
3. I suggest the authors add visualization results of other methods.
4. The writing and the presentation quality of the figures can be improved. I suggest the authors show their own result at the bottom of Fig. 4 so that it is easier to recognize.
5. I suggest the authors cut Section 2 to free up space for more experimental results.
Please state your overall opinion of the paper

borderline reject (5)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
1. According to the ablation study, the effectiveness of GCTE is not obvious. The novelty is quite limited.
2. This paper introduces the popular transformer module but doesn’t show its advantages in the experiments.
3. This paper outperforms state-of-the-art approaches on two datasets.
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

2
Reviewer confidence

Very confident

Review #2

Please describe the contribution of the paper

This paper targets the problem of exploiting context information for segmentation tasks.

The authors work on a smart approach combining 2D U-Net and the recently prevailing topic of Transformers. Overall, this is a well-written paper, which works on the depth of context information.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper made a good study on context encoding models. The method covers several prevailing modules such as transformer, SoE, which are used as powerful encoders in benchmarks.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Please see the more detailed comments below.

Major:

• The literature study lacks a discussion of recently emerged works on transformer-based medical image segmentation. In addition, the discussion of context encoding models is weakly motivated in this paper. The abstract and introduction claim that U-Net does not effectively integrate global and multi-scale context, which is not convincing because U-Net is basically designed for capturing multi-scale features. A figure highlighting the context problem would help readers understand the motivation of this work.
• Page 2: the related work needs more discussion of transformer-based segmentation methods.
• Page 2: the approach of fusing multi-scale feature maps on the U-Net decoder side has been studied previously and probably corresponds to “deep supervision”. This section is confusing and lacks motivation. The paper needs to explain why this fusion strategy can obtain context information.
• Method section: the method section needs a theoretical description of how the two context encoding modules (transformer and multi-scale fusion) capture context information, to justify the stated claims. To readers, this section seems to cover only previous works (UNets, transformer and SoE + deep supervision).
• APW and ACW are hard to grasp. These two concepts need further explanation, and their novelty should be highlighted in the paper (compared to simply combining transformers).
• Figure 2 does not seem necessary, as it is the same as the basic transformer concept. I suggest just adding a citation.
• Result section: the comparisons with many segmentation baselines are impressive, and the improvement seems significant. However, as mentioned above, the method lacks comparisons with state-of-the-art transformer-based medical image segmentation methods.
• The qualitative figure (Fig. 4) is not convincing; the improvements are marginal. The qualitative result would be better explained by adding attention maps demonstrating that the proposed method captures better context.
• Conclusion and discussion: overall, the paper needs further study of its capability to capture context information. Currently, it is not convincing that the improvements are related to context encoding. The paper could also discuss context in relation to medical applications, as the rationale for medical context is rarely mentioned.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

This paper presents fair reproducibility: the U-Net and transformer are standard, but the adaptive context fusion module needs more clarity on APW and ACW.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

In addition to the major comments:

Minor:

• Page 3: Section 3 may be renamed to “Method” instead of “Our proposed Method”.
• Dice and IoU are similar metrics. I would suggest removing one of them and adding Hausdorff distance or surface distances as the second metric to study a different view of performance.
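For readers unfamiliar with the boundary metric suggested here, the following is a minimal sketch of the classical symmetric Hausdorff distance between two sets of boundary points. The function names and toy contours are illustrative assumptions. It is not necessarily the "average" variant the authors mention in their feedback, and it uses a brute-force O(n·m) search rather than an optimized routine.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from a point in `a` to its nearest neighbour in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance: the worse of the two directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Toy contours (hypothetical data): they agree except for one outlying point.
contour_a = [(0, 0), (0, 1), (1, 1)]
contour_b = [(0, 0), (0, 1), (4, 1)]
print(hausdorff(contour_a, contour_b))
```

A single far-off boundary point dominates this score, which is exactly the kind of error that overlap metrics such as Dice and IoU tend to hide.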

Overall, this work targets a very important topic: medical context for segmentation. The depth of usage of U-Net, transformer and SoE is studied. To further improve this paper, the method and results need a basic proof of context encoding capabilities. A discussion of the medical rationale can be added to improve the paper. More detailed suggestions are given in the weaknesses section. Thanks.
Please state your overall opinion of the paper

probably reject (4)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Comprehensiveness of the study and literature review; problem definition and motivation; method description and comparison; novelty; medical applications and meaning
What is the ranking of this paper in your review stack?

2
Number of papers in your stack

3
Reviewer confidence

Very confident

Review #3

Please describe the contribution of the paper

In order to improve the CNN’s awareness of context, the current study proposes a Dual Context Network (DC-Net) to aggregate global context and fuse multi-scale context for 2D medical image segmentation. To aggregate the global context, the study presents the Global Context Transformer Encoder (GCTE), which reshapes the original image and the multi-scale feature maps into a sequence of image patches, and combines the advantages of the Transformer encoder on global context aggregation to improve the performance of the encoder. For the fusion of multi-scale context, the Adaptive Context Fusion Module (ACFM) is proposed to adaptively fuse context information by learning Adaptive Spatial Weights and Adaptive Channel Weights to improve the performance of the decoder. DC-Net was applied to skin lesion segmentation and cell contour segmentation tasks; experimental results show that this method can outperform other advanced methods and achieve state-of-the-art performance.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The current study proposed a Dual Context Network (DC-Net) to aggregate global context and fuse multi-scale context based on U-Net for 2D medical image segmentation. Compared with previous studies, DC-Net combines the advantages of self-attention mechanisms with a method for integrating multi-scale context information, which gives U-Net stronger context awareness. A GCTE block based on ViT was proposed to aggregate global context information by modeling the long-range dependence between pixels, and linear projection and patch embedding were applied to the multi-scale feature maps to further exploit the global context modeling ability of the transformer. Meanwhile, an ACFM block was proposed to adaptively fuse context information by learning Adaptive Spatial Weights and Adaptive Channel Weights to improve the performance of the decoder, integrating feature maps with different receptive fields in a reasonable way. The combination of different methods of aggregating global and local feature maps, and the improvements to the self-attention mechanism and to the integration of multi-scale context information, are enlightening.
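The serialization step described above, turning a feature map into a sequence of patches before a transformer encoder, can be sketched as follows. This is a ViT-style illustration under stated assumptions: a single-channel map stored as a nested list, non-overlapping square patches, and a hypothetical `patch_size` argument standing in for the paper's P. The learned linear projection and position embeddings are omitted, and the authors' actual GCTE may differ.

```python
def to_patch_sequence(feature_map, patch_size):
    """Split an H x W map into non-overlapping patches, each flattened row-major.

    Returns a list of (H/P * W/P) tokens, each of length P * P.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    sequence = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                feature_map[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            sequence.append(patch)
    return sequence


# Toy 4 x 4 map (hypothetical data) split into four 2 x 2 patches.
toy_map = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = to_patch_sequence(toy_map, patch_size=2)
print(len(tokens), tokens[0])
```

Applying the same routine to feature maps at several encoder scales would yield one token sequence per scale, which is one plausible reading of "multi-scale feature serialization".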
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. Although the introduction of GCTE and ACFM has improved the segmentation efficiency of U-Net, the structures of GCTE and ACFM are relatively complex. Will they introduce too many parameters and lead to high network complexity?
2. The meaning of the parameters in the formulas should be explained clearly, for example in equation (1).
3. In section 4.1, ‘We augmented all 30 images of the ISBI training set to obtain 300 images. We use 240 of them as the training set and 60 as the test set.’ What method was used for augmentation? Are you sure that the data augmented from the same original data were not in both the training set and the test set?
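The leakage concern in point 3 is avoided if the split is made on the original images before any augmentation. The sketch below shows one such procedure that happens to reproduce the 240/60 counts from 30 originals; it is an assumption about how a leakage-free split could be done, not a description of what the authors did. The function name, the 24/6 split of originals, and the ten variants per image are all hypothetical, and the augmentation itself is left abstract as a variant index.

```python
import random


def split_then_augment(image_ids, n_test, copies_per_image, seed=0):
    """Split original images first, then expand each side into augmented variants.

    Returns (train, test) lists of (image_id, variant_index) pairs.
    """
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    test_ids, train_ids = ids[:n_test], ids[n_test:]

    def expand(subset):
        # Each original contributes `copies_per_image` augmented variants.
        return [(image_id, k) for image_id in subset for k in range(copies_per_image)]

    return expand(train_ids), expand(test_ids)


train, test = split_then_augment(range(30), n_test=6, copies_per_image=10)
train_sources = {image_id for image_id, _ in train}
test_sources = {image_id for image_id, _ in test}
print(len(train), len(test), train_sources.isdisjoint(test_sources))
```

Splitting after augmentation, by contrast, can place flipped or rotated copies of the same original on both sides and inflate test scores.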
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The code for this work was not provided, while the datasets used for training and testing are public. The reproducibility is average.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
The current study proposed a Dual Context Network (DC-Net) to aggregate global context and fuse multi-scale context based on U-Net for 2D medical image segmentation. Compared with previous studies, DC-Net combines and improves on the advantages of self-attention mechanisms and methods for integrating multi-scale context information, which gives U-Net stronger context awareness. DC-Net was applied to skin lesion segmentation and cell contour segmentation tasks; experimental results show that this method can outperform other advanced methods and achieve state-of-the-art performance. Comments:
1. The complexity of DC-Net should be reported in the article, as it is an important factor in evaluating a segmentation method.
2. The reason for the chosen setting of P should be explained clearly in section 4.1, along with how the segmentation performance changes as P varies.
3. The meaning of the parameters in the formulas should be explained clearly.
4. The details of MHA and MLP in Figure 2 should be shown in sub-diagrams.
5. There are several spelling mistakes in the article; please check carefully.
Please state your overall opinion of the paper

Probably accept (7)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The current study explored current methods of retaining contextual information and integrating global and local features, and utilized and improved two popular methods to propose a Dual Context Network (DC-Net). The proposed method is innovative and clearly described. Thus, I suggest accepting it after revision.
What is the ranking of this paper in your review stack?

3
Number of papers in your stack

4
Reviewer confidence

Confident but not absolutely certain

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The reviews of this work are quite divergent. The authors are suggested to provide a rebuttal to clarify main issues raised by reviewers, i.e. the motivation is unclear and the evaluation is insufficient.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

5

Author Feedback

We thank all the reviewers for their constructive comments. We present our responses to the main concerns.

Q1 of R1, R2: Lack of comparison with the latest transformer-based segmentation methods?

A1: Considering that arXiv papers that have not been peer-reviewed should not be directly compared, we did not add a comparison with MedT to the paper (in accordance with the requirements of the review process on the official MICCAI website). However, we discuss more transformer-based segmentation methods in the related work. Our Global Context Transformer Encoder (GCTE) adopts Multi-scale Feature Serialization, which differs from existing methods such as TransUNet, which directly applies ViT. Unlike TransUNet, we did not use ImageNet pre-trained weights to initialize the transformer. Under the same settings on the ISBI 2012 dataset, we achieve extremely competitive performance compared with the latest MedT method. Specifically, MedT’s Dice score is 0.9455, while our DC-Net’s Dice score is 0.9634. Our method not only uses a transformer (our GCTE) to aggregate global context, but also employs the adaptive context fusion module (ACFM) to fuse multi-scale context. ACFM adaptively fuses context information by learning adaptive spatial weights (APW) and adaptive channel weights (ACW) to improve the performance of the decoder. To improve the context awareness of CNNs, we propose a complete solution called Dual Context Network (DC-Net) that combines GCTE and ACFM.

Q2 of R1, R2: The improvement in the ablation study is limited and the evaluation metrics are monotonous?

A2: The misunderstanding may have been caused by the unclear presentation of our table. For the ISBI 2012 dataset, the Dice score of our Backbone + GCTE increased by 2.5% compared to the baseline, and the IoU increased by 5.1%, both of which are better than directly applying ViT to the backbone (Backbone + ViT). Against a relatively high baseline (Dice of 0.9382), the Dice of our DC-Net is also 2.7% higher than the baseline, and the IoU is improved by 5.5%. The same trend holds on the ISIC 2018 dataset. Compared with ISIC 2018, ISBI 2012 requires more contextual information due to its complex boundaries, so DC-Net improves it even more. Thank you for pointing out the monotony of our metrics; we will add comparisons of the average Hausdorff distance and ACC in the final paper. Specifically, the average Hausdorff distance of our DC-Net is more competitive than that of the advanced Inf-Net and CE-Net on ISIC 2018 (27.85mm vs 32.34mm vs 37.82mm).
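The percentages in A2 are easier to interpret once absolute and relative gains are separated. The quick check below assumes that 0.9382 is the baseline Dice on ISBI 2012 and that 0.9634 (the DC-Net Dice quoted in A1) is the corresponding DC-Net score; under that assumption the quoted 2.7% matches a relative gain, while the absolute gain is about 2.5 Dice points. The helper name is illustrative.

```python
def improvement(baseline, new):
    """Return (absolute gain, relative gain) of `new` over `baseline`."""
    absolute = new - baseline
    relative = absolute / baseline
    return absolute, relative


# Figures quoted in the author feedback; their pairing is an assumption.
absolute, relative = improvement(baseline=0.9382, new=0.9634)
print(f"absolute: {absolute:.4f}, relative: {relative:.1%}")
# prints "absolute: 0.0252, relative: 2.7%"
```

Stating which of the two is reported would remove much of the ambiguity between the reviewers' reading of the ablation table and the authors' reply.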

Q3 of R1, R2, R3: The writing in the paper is unclear and attention maps should be added?

A3: Thanks to all the reviewers for their time and feedback. We will correct the grammatical and layout issues mentioned in the final version. We address the individual points below, and will be more explicit on all of these in the final version.

Based on the description of DeeplabV3 and CE-Net in the Introduction, we will supplement the discussion of the context coding model, and supplement the background related to medical applications.

We will explain the reasons for setting P and the changes in segmentation performance that it brings in the Implementation Details.

We will add attention visualization to better show that our method can capture better context.

We will introduce APW and ACW in detail in the paper.

For the ISBI dataset, we apply simple data augmentation methods such as flipping and random rotation. We will address all remaining minor suggestions in the final revision.

Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The authors have partially addressed the reviewers’ concerns, i.e. transformer baselines and the ablation study. However, the novelty is still limited. A variety of Transformer-based models have been proposed in the literature. The authors propose a complex structure, but the experiments do not show its advantages.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Reject
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

10

Meta-review #2

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

In this work, the authors propose a Dual Context Network (DC-Net) to aggregate global context and fuse multi-scale context for 2D medical image segmentation. The main novelty is two new modules, GCTE and ACFM, which aim to aggregate the global context and to fuse multi-scale context, respectively. Overall, the rebuttal addresses most of the concerns raised by the reviewers well. After reading the paper, I feel the concept is interesting, the experiments are solid, and the results are promising. I recommend accepting this paper.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

6

Meta-review #3

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The authors followed the recent trend of using transformers. They combined a transformer with U-Net and proposed a multi-scale fusion module for segmentation. The rebuttal helped to clarify some points raised by the reviewers, such as the effectiveness of GCTE and related work on transformer-based segmentation. They promise to add some details about the method in the final version.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

9


DC-Net: Dual Context Network for 2D Medical Image Segmentation