
Authors

Yundong Zhang, Huiye Liu, Qiang Hu

Abstract

Medical image segmentation - the prerequisite of numerous clinical needs - has been significantly advanced by recent progress in convolutional neural networks (CNNs). However, CNNs exhibit general limitations in modeling explicit long-range relations, and existing cures, which resort to building deep encoders along with aggressive downsampling operations, lead to redundant deepened networks and loss of localized details. Hence, the segmentation task awaits a better solution to improve the efficiency of modeling global contexts while maintaining a strong grasp of low-level details. In this paper, we propose a novel parallel-in-branch architecture, TransFuse, to address this challenge. TransFuse combines Transformers and CNNs in a parallel style, where both global dependency and low-level spatial details can be efficiently captured in a much shallower manner. Besides, a novel fusion technique, the BiFusion module, is created to efficiently fuse the multi-level features from both branches. Extensive experiments demonstrate that TransFuse achieves the newest state-of-the-art results on both 2D and 3D medical image sets including polyp, skin lesion, hip, and prostate segmentation, with significant parameter decrease and inference speed improvement.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87193-2_2

SharedIt: https://rdcu.be/cyhLu

Link to the code repository

https://github.com/Rayicer/TransFuse

Link to the dataset(s)

N/A

Reviews

Review #1

Please describe the contribution of the paper

The authors propose to combine vision transformers and CNNs in a parallel fashion to effectively model long-range relations for medical image segmentation, together with a fusion module for combining features from the two branches. The models were developed using pre-trained CNN and vision transformer networks as backbones for three different applications. Performance is reported for different sizes of backbone models, and ablation experiments are provided to understand the effect of parallel vs serial configuration and of the fusion module. The developed models have slightly better performance and faster inference than SOTA models.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Good summary of the different components of the proposed model
- evaluation on multiple datasets and comparison with many SOTA approaches
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- The improvement in performance and inference speed is marginal compared to the serial configuration
- The authors claim that the proposed model does not need deeper architectures, however only the last deeper set of features (H/32, W/32) is removed
- Low level features seem to be ignored in the proposed model. It is unclear as to how this affects segmentation of smaller lesions / objects
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

There are sufficient implementation details to reproduce the model if needed.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
- Given that CNNs and vision transformer have been used in a serial configuration and achieved better performance than Unet based models, is there a reason to believe a parallel configuration will perform better than the serial configuration and why?
- The listed advantage (2) of not building very deep nets seems to be driven by patch size chosen in the vision transformer and the proposed model is only one level shallower than the base resnet models. This is misleading as the proposed model is not a shallow network.
- Transformer branch: is L the same as in the original paper, L=6? The description provided is from ref [7] but the base transformer model is from ref [22], which replaces the MLP layer with a feed forward network. A brief description of the differences between the transformer models in ref [7] and [22] will be helpful as both are used as base models.
- CNN branch: are the features from the first block ignored? If so, how does it affect the segmentation quality, especially for smaller objects?
- The bifusion module is heavily engineered with channel attention for transformer features, spatial attention for CNN features, dot product to fuse the CNN and transformer features and gated skip attention for the concatenated features. A simple concatenate and residual connections for fusing the transformer and CNN features provides nearly equivalent performance (Table 4). Can this module be simplified? An ablation study on the effect of individual components of the bifusion module will help in explaining the need for the various components.
- Data acquisition: Why were these particular datasets chosen to evaluate the proposed model? What was the original resolution for polyp and skin lesion datasets? What is the sample size for different polyp datasets? This would be helpful to interpret the results presented in Table 1. Consider cross validation splits for smaller datasets (hip segmentation).
- Implementation details: What is the training time for the TransFuse-S, TransFuse-L and TransFuse-L* models?
- Polyp segmentation: The performance of HarDNet-MSEG is comparable to TransFuse-S on Kvasir and ClinicDB (datasets used for training). How does pre-training on a larger dataset contribute to the improved generalization of all transformer-based models to the ColonDB, EndoScene and ETIS datasets? Please provide the model size for TransUnet, TransFuse-L and TransFuse-L*.
- Ablation Study: Rationale for E.2 is unclear as Resnet and VGG16 are probably extracting similar / redundant features. Consider providing the performance of CNN alone and transformer alone models.
- Fig 2: Please provide a more diverse set of illustrative examples. All lesion examples contain only one lesion, though of different sizes. Are there cases with multiple lesions?
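The fusion design summarized in the comments above (channel attention on the transformer features, spatial attention on the CNN features, a dot product between the two branches, then a residual over the concatenated result) can be sketched roughly as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names, the random weights, the reduction ratio, the 1x1 channel mixing that stands in for every convolution, and the exact form of the residual are all hypothetical choices made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, w):
    """1x1 convolution as channel mixing: x is (C_in, H, W), w is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def channel_attention(t, w_down, w_up):
    """Squeeze-and-excitation style gate applied to transformer features t."""
    squeezed = t.mean(axis=(1, 2))               # global average pool -> (C,)
    hidden = np.maximum(w_down @ squeezed, 0.0)  # bottleneck with ReLU
    gate = sigmoid(w_up @ hidden)                # per-channel weights in (0, 1)
    return t * gate[:, None, None]

def spatial_attention(g, w_pool):
    """Spatial gate applied to CNN features g.
    Assumption: a 1x1 mix of the mean/max maps replaces a larger conv kernel."""
    pooled = np.stack([g.mean(axis=0), g.max(axis=0)])  # (2, H, W)
    gate = sigmoid(conv1x1(pooled, w_pool))             # (1, H, W)
    return g * gate

def bifusion(t, g, p):
    """Fuse transformer features t and CNN features g of equal spatial size."""
    t_hat = channel_attention(t, p["se_down"], p["se_up"])
    g_hat = spatial_attention(g, p["sa"])
    # Element-wise (Hadamard) product of the projected branches.
    b_hat = conv1x1(t, p["w_t"]) * conv1x1(g, p["w_g"])
    stacked = np.concatenate([b_hat, t_hat, g_hat], axis=0)
    # Hypothetical residual block: nonlinear path plus a linear skip path.
    main = np.maximum(conv1x1(stacked, p["w_main"]), 0.0)
    return main + conv1x1(stacked, p["w_skip"])

def make_params(rng, c_t, c_g, c_b, c_out, reduction=2):
    """Random weights for illustration only (hypothetical helper)."""
    c_cat = c_b + c_t + c_g
    shapes = {
        "se_down": (c_t // reduction, c_t), "se_up": (c_t, c_t // reduction),
        "sa": (1, 2), "w_t": (c_b, c_t), "w_g": (c_b, c_g),
        "w_main": (c_out, c_cat), "w_skip": (c_out, c_cat),
    }
    return {k: 0.1 * rng.standard_normal(s) for k, s in shapes.items()}
```

Dropping `g_hat`, `t_hat`, or `b_hat` from the concatenation in this sketch is one way to set up the per-component ablation that the reviewer asks for.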
Please state your overall opinion of the paper

borderline accept (6)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The proposed model improves upon the performance of the serial configuration only marginally, and the improvement in inference speed is not drastic. As these models are typically deployed over the cloud and not on edge devices, the slight reduction in parameters and inference time might not affect user experience.
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

5
Reviewer confidence

Confident but not absolutely certain

Review #2

Please describe the contribution of the paper

The authors propose a fusion technique (BiFusion) for feature maps captured by a Transformer and a CNN in a parallel manner. Extensive experiments were performed on three different medical image segmentation datasets.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The authors explored and exploited transformer-alike architectures on medical segmentation tasks.
- Extensive experiments were demonstrated on different medical image segmentation tasks, including polyp, skin lesion, and hip segmentation.
- Overall, the paper is well-written.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- The contribution of each Transformer/CNN branch was not explicitly and visually justified. It would be great if more qualitative analysis could be added to demonstrate the contributions of each branch, for example, by inserting more contents in Fig. 2.
- In terms of method, the design of BiFusion module is mostly engineering or system building built on slightly weak motivations, using several existing attention mechanisms.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The release of source code is not mentioned in the manuscript.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
- Apart from the aforementioned qualitative analysis, adding more detailed ablation studies of each CNN/Transformer branch would help the readers to get a clue about their individual contribution towards the final segmentation results.
- The design motivation of the proposed BiFusion module should be enhanced.
Please state your overall opinion of the paper

borderline accept (6)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Both of the weaknesses mentioned in Sec. 4.
What is the ranking of this paper in your review stack?

2
Number of papers in your stack

3
Reviewer confidence

Very confident

Review #3

Please describe the contribution of the paper

TransFuse combines Transformers and CNNs in a parallel style, with which both global dependency and low-level spatial details can be efficiently captured in a much shallower manner. Besides, a novel fusion technique - BiFusion module is created to fuse the multi-level features from both branches.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

Potentially low GPU memory requirements: “Extensive experiments demonstrate that TransFuse achieves newest state-of-the-arts on different medical image sets including polyp, skin lesion and hip segmentation, with up to 20% parameter decrease and 15% inference speed (FPS) improvement compared to CNN-based methods.”
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Experiments are only on 2D images: most medical images are 3D, but the authors only run their experiments on 2D medical image segmentation tasks. Considering that the 3D format is one of the main differences between medical images (CT, MR, X-Ray, etc.) and natural images, the authors should include 3D medical image segmentation tasks to show that TransFuse outperforms other SOTA results.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

I don’t know the reproducibility of the paper and I can’t believe the huge result gap between U-Net and TransUnet/TransFuse.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

Run experiments on 3D medical image segmentation tasks and compare with nnUNet (https://github.com/MIC-DKFZ/nnUNet/) to show the method's real performance.
Please state your overall opinion of the paper

borderline accept (6)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

All results are on 2D data only, which is not typical in medical image segmentation tasks.
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

5
Reviewer confidence

Somewhat confident

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

This paper combines two parallel branches, a CNN and a vision transformer, for medical image segmentation. A fusion branch is designed to integrate features from these two branches. Experimental results on three different 2D segmentation tasks demonstrate consistently better performance in comparison with other methods. However, further justification and comments are required for major issues raised by reviewers. For example, the improvement in performance and inference speed is marginal compared to the serial configuration and SOTA methods. Further justification of the performance improvement and of the motivation for the parallel configuration design is required. The rebuttal should also address the efficacy of the proposed method with shallow architectures and how the low-level features can be handled properly, the motivation of BiFusion and whether it can be simplified, and the extension and efficacy of the proposed method on volumetric medical images.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

5

Author Feedback

Sincere thanks for all the insightful feedback! For conciseness, we use “CC” as “common comment”, “R” as “reply”, “r” as “reviewer”, “mr” as “meta reviewer”, “TFM” as “Transformer”. The structured response is provided as follows:

CC1(mr; r1: 6.5; r2: 3.2/6.2) The design motivation of BiFusion Module and if it can be simplified

R1:The motivation is explained in Sec.2 BiFusion Module. Essentially, 1)spatial attention, which can be viewed as a spatial filter, is to smooth low-level but noisy CNN features while preserving fine-grained details; 2)channel attention is to promote information related to the current TFM patch from a wealthy global context; 3)dot product is to model the cross dynamics between the two branches. As suggested by r1, an additional ablation study of BiFusion is conducted on the skin lesion task, which is more challenging and has more test samples compared to Polyp in Tab.4. The table below starts from naive concatenation and each row adds a BiFusion component. As each component shows its unique benefit and requires little computational cost, we believe this should not be simplified.

| Fusion | IoU | Dice |
|---|---|---|
| Concat | 0.778 | 0.857 |
| +Spatial Attn | 0.782 | 0.861 |
| +Channel Attn | 0.787 | 0.865 |
| +Dot Product | 0.795 | 0.872 |
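For readers comparing the IoU and Dice columns, the two overlap scores can be computed on binary masks as in the sketch below. It assumes masks flattened to lists of 0/1 values; the function names and the convention of scoring two empty masks as 1.0 are illustrative choices, not taken from the paper.

```python
def iou(pred, target):
    """Intersection over union (Jaccard index) of two flat binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter / union if union else 1.0  # assumed convention for empty masks

def dice(pred, target):
    """Dice coefficient of two flat binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 2.0 * inter / total if total else 1.0

def dice_from_iou(j):
    """For a single mask pair, Dice = 2 * IoU / (1 + IoU)."""
    return 2.0 * j / (1.0 + j)

# Toy example: 3 predicted pixels, 3 true pixels, 2 in common.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
pair_iou = iou(pred, target)
pair_dice = dice(pred, target)

# Averaging over images: a second, perfectly segmented image.
mean_iou = (pair_iou + 1.0) / 2
mean_dice = (pair_dice + 1.0) / 2
```

The identity between the two scores holds per mask pair but not for averages over a test set: the mean Dice is at most the value obtained by converting the mean IoU. That may be why the reported pairs do not satisfy the identity exactly, although how the rebuttal's scores were averaged is an assumption here.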

CC2(mr; r3: 3/6) Performance on 3D medical data

R2:As suggested by r3, we extended the experiments to the Prostate Multi-modality MRI Segmentation task hosted in Medical Segmentation Decathlon, where nnUNet ranked 1st. Under the same setting, the 5-fold cross-validation results of TransFuse-S are shown below (could be added to the final version):

| Model | nnUNet-2d | nnUNet-3d_fullres | TransFuse-S |
|---|---|---|---|
| mDice | 0.7333 | 0.7537 (+2.78%) | 0.7639 (+4.17%) |
| #Params (M) | 29.97 | 44.80 | 26.30 |
| Inference (vol/s) | 0.21 | 0.38 | 0.19 |

where TransFuse-S surpasses nnUNet-2d by a large margin. Compared to nnUNet-3d, TransFuse-S achieves better performance while reducing parameters by 41% and increasing speed by 50%.
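The percentages quoted with this table can be re-derived from its raw values. The short check below is only an arithmetic sketch: the variable names are hypothetical, and the numbers are copied from the rebuttal table.

```python
def relative_change_pct(new, base):
    """Signed relative change of `new` with respect to `base`, in percent."""
    return (new - base) / base * 100.0

# Values copied from the rebuttal table.
m_dice = {"nnUNet-2d": 0.7333, "nnUNet-3d_fullres": 0.7537, "TransFuse-S": 0.7639}
params_m = {"nnUNet-2d": 29.97, "nnUNet-3d_fullres": 44.80, "TransFuse-S": 26.30}
inference = {"nnUNet-2d": 0.21, "nnUNet-3d_fullres": 0.38, "TransFuse-S": 0.19}

# Dice gains are relative to the nnUNet-2d baseline.
dice_gain_3d = relative_change_pct(m_dice["nnUNet-3d_fullres"], m_dice["nnUNet-2d"])
dice_gain_tf = relative_change_pct(m_dice["TransFuse-S"], m_dice["nnUNet-2d"])

# Parameter reduction of TransFuse-S relative to nnUNet-3d_fullres.
param_cut = -relative_change_pct(params_m["TransFuse-S"], params_m["nnUNet-3d_fullres"])

# Ratio of the inference figures, TransFuse-S over nnUNet-3d_fullres.
inference_ratio = inference["TransFuse-S"] / inference["nnUNet-3d_fullres"]

print(f"Dice gain, nnUNet-3d_fullres vs nnUNet-2d: {dice_gain_3d:.2f}%")
print(f"Dice gain, TransFuse-S vs nnUNet-2d: {dice_gain_tf:.2f}%")
print(f"Parameter reduction vs nnUNet-3d_fullres: {param_cut:.1f}%")
print(f"Inference ratio vs nnUNet-3d_fullres: {inference_ratio:.2f}")
```

The Dice gains come out at about 2.78% and 4.17%, and the parameter reduction at about 41.3%, matching the rebuttal. The inference ratio is 0.5, which agrees with the stated 50% figure if the inference row is read as time per volume; that reading is an assumption, since the row is labelled vol/s.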

CC3(mr; r1: 3.1/8) Is the improvement of TransFuse marginal? Explain the significance of TransFuse

R3:We believe TransFuse shows non-marginal performance improvement: 1)as discussed in the paper, parallel execution of both branches increases inference speed, e.g., by 20% on PolyP while improving generalization on unseen set by a large margin (5.2% as in Sec.3 and 3.2% as in Tab.4 E.4 vs E.1); 2)in the skin lesion task, our performance improvement is on par with prior works; 3)TransFuse-S also shows large improvement on 3D data (see in R2). With significantly better efficiency, our solution can be deployed at both edge and cloud, which is essential to meet the growing demands in preventive, personalized and privacy-preserving edge healthcare. Also, TransFuse is the 1st parallel-in-branch model synthesizing CNN and TFM.

CC4(mr; r1: 3.2/6.2) Is the net used in TransFuse shallow?

R4: Compared with previous CNN backbones, e.g. HarDNet with 68 CNN layers, the TransFuse-S backbone (28 CNN + 8 TFM layers) is much shallower. Besides, TransFuse removes the last block of the CNN branch, which typically contains most of the params. TransFuse achieves promising performance and is more efficient, which validates the effectiveness of our design.

CC5(mr; r1: 3.3/6.4) TransFuse on low-level features

R5: The low-level features are captured by the CNN branch and fused in later stages. In fact, TransFuse-S performs well on small skin lesions: on the 100 lesions with the smallest area in our test set, TransFuse-S achieves a Jaccard of 0.792, which is comparable to 0.795 in Tab.2. As is typical in SOTA models such as HRNet and PraNet, the 1st level CNN features are not used, which does not affect the performance much.
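A subset evaluation of this kind (mean Jaccard over the k lesions with the smallest ground-truth area) can be sketched as below. The rebuttal does not describe its exact protocol, so the ranking by ground-truth pixel count, the flat 0/1 mask representation, and all names are assumptions for illustration.

```python
def jaccard(pred, target):
    """Jaccard index (IoU) of two flat binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter / union if union else 1.0  # assumed convention for empty masks

def mean_jaccard_on_smallest(preds, targets, k):
    """Mean Jaccard over the k cases with the smallest ground-truth area."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Rank cases by lesion area, measured as foreground pixels in the target.
    order = sorted(range(len(targets)), key=lambda i: sum(targets[i]))
    chosen = order[:k]
    return sum(jaccard(preds[i], targets[i]) for i in chosen) / len(chosen)

# Toy data: three cases with ground-truth areas 1, 3 and 5 pixels.
targets = [
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],
]
preds = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
]
small_score = mean_jaccard_on_smallest(preds, targets, k=2)
full_score = mean_jaccard_on_smallest(preds, targets, k=len(targets))
```

Comparing the small-lesion score against the full-set score, as the rebuttal does with 0.792 vs 0.795, indicates whether performance degrades on small objects.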

CC6(r1: 6.9; r2: 6.1) Performance of each branch alone in TransFuse

R6:On Polyp, DeiT-S alone achieves mDice of 0.889(Kvasir) and 0.727(ColonDB); R34 alone gets 0.890(Kvasir) and 0.645(ColonDB). As stronger baselines are shown in Tab.4: E.1 and E.2, we didn’t provide single branch performance in the original paper.

Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The authors have addressed most of the concerns raised by the reviewers in the rebuttal and other minor ones could be addressed in the final version. Therefore, I am happy to recommend accept.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

6

Meta-review #2

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The two additional experiments conducted during the rebuttal period are impressive, especially considering the short rebuttal period. In the additional experiments, the proposed method demonstrates a large improvement over the state-of-the-art nnUNet method on the 3D Prostate MRI segmentation dataset. These results illustrate the efficiency and generalizability of the proposed method. Unfortunately, there are still some concerns which were not properly addressed during the rebuttal. 1) As indicated by Reviewer 1, the proposed method shows incremental improvement over the state-of-the-art in Table 1 and Table 2. It’s relatively difficult to understand whether these results are significant; 2) As indicated by Reviewer 2, the design motivation of the BiFusion module and the parallel configuration is still unclear, and the design of BiFusion seems to be a minor modification of the existing attention methods. We appreciate that the authors explained the main components of the BiFusion module. However, their main differences from the existing attention methods are still not clear.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Reject
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

18

Meta-review #3

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The authors have thoroughly addressed the main concerns summarized by the MR through additional experimental data, and included quantitative results validating their design choices and the method’s performance improvement. With these additional justifications and quantitative results (which must be included in the camera-ready version), the paper is recommended for MICCAI.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

1


TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation