
Authors

Yuanfeng Ji, Ruimao Zhang, Huijie Wang, Zhen Li, Lingyun Wu, Shaoting Zhang, Ping Luo

Abstract

The recent vision transformer (e.g., for image classification) learns non-local attentive interactions among different patch tokens. However, prior works fail to learn the cross-scale dependencies of different pixels, the semantic correspondence of different labels, and the consistency between feature representations and semantic embeddings, all of which are critical for biomedical segmentation. In this paper, we tackle these issues by proposing a unified transformer network, termed Multi-Compound Transformer (MCTrans), which incorporates rich feature learning and semantic structure mining into a single framework. Specifically, MCTrans embeds the multi-scale convolutional features as a sequence of tokens and performs intra- and inter-scale self-attention, rather than the single-scale attention of previous works. In addition, a learnable proxy embedding is introduced to model semantic relationships and to enhance features, using self-attention and cross-attention, respectively. MCTrans can be easily plugged into a UNet-like network, and it attains significant improvements over state-of-the-art methods on six standard biomedical image segmentation benchmarks. For example, MCTrans outperforms UNet by 3.64%, 3.71%, 4.34%, 2.8%, 1.88%, and 1.57%, respectively.
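
To make the mechanism concrete, below is a minimal PyTorch sketch of the idea described above: flattening multi-scale convolutional feature maps into one token sequence, applying self-attention across all scales, and refining the tokens by cross-attention with a learnable proxy embedding. All names, dimensions, and module choices are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class MCTransSketch(nn.Module):
        # Illustrative sketch of the MCTrans idea (not the authors' code):
        # multi-scale feature maps are flattened into one token sequence,
        # processed with intra- and inter-scale self-attention (TSA), and
        # refined by cross-attention with a learnable proxy embedding (TCA).
        def __init__(self, dim=128, num_classes=2, num_heads=8):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # One learnable proxy token per semantic category.
            self.proxy = nn.Parameter(torch.randn(num_classes, dim))

        def forward(self, feats):
            # feats: list of multi-scale maps, each of shape (B, dim, H_i, W_i).
            tokens = torch.cat(
                [f.flatten(2).transpose(1, 2) for f in feats], dim=1
            )  # (B, sum_i H_i*W_i, dim): a set of tokens from all scales
            tokens, _ = self.self_attn(tokens, tokens, tokens)   # TSA
            proxy = self.proxy.unsqueeze(0).expand(tokens.size(0), -1, -1)
            tokens, _ = self.cross_attn(tokens, proxy, proxy)    # TCA
            return tokens

    # Example: three scales of a 128-channel feature pyramid.
    feats = [torch.randn(2, 128, s, s) for s in (16, 8, 4)]
    out = MCTransSketch()(feats)  # shape (2, 16*16 + 8*8 + 4*4, 128)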

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87193-2_31

SharedIt: https://rdcu.be/cyhL9

Link to the code repository

https://github.com/JiYuanFeng/MCTrans

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

1) The transformer is used to segment 2D medical images. 2) A novel learnable proxy embedding is introduced to enhance feature representation. 3) The model is evaluated on six datasets; the experiments show that MCTrans outperforms state-of-the-art methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    A novel learnable proxy embedding is introduced to enhance feature representation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Some parts of the paper are not described clearly, for example: 1) On page 4, first paragraph, fourth sentence: the number of tokens differs across scales, yet the final token is said to be the sum of the tokens from different scales. How can such tokens be summed? 2) On page 4, first paragraph, last sentence, “where N is the number of categories of the dataset”: what does N indicate? 3) The structure of the proxy embedding is not described clearly.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

There are many existing implementations of transformers, so I think the method would be easy for others to implement.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

1) Please describe the details more clearly. 2) Transformers always consume much more computational resources; if you can extend your model to 3D medical images, it will become an even stronger work.

  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The experiments are convincing, but the novelty is limited, and some details are not described clearly.

  • What is the ranking of this paper in your review stack?

    5

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain



Review #2

  • Please describe the contribution of the paper

This paper improves transformer-based segmentation methods by (1) introducing the Transformer-Self-Attention (TSA) module to achieve cross-scale pixel-level contextual modeling via the self-attention mechanism; (2) developing the Transformer-Cross-Attention (TCA) module to automatically learn the semantic correspondence of different semantic categories by introducing a proxy embedding; and (3) using this proxy embedding to interact with the feature representations via the cross-attention mechanism, which can effectively improve feature correlations within the same category and feature discriminability between different classes.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper clearly lists the bottlenecks of current SOTA segmentation methods (both CNN-based and transformer-based) and introduces its method for improving on them. The designed MCTrans can be plugged into and used in different models.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The paper uses many mathematical notations to explain the method but does not provide definitions for all of them. Also, some concepts may be self-explanatory for people familiar with transformers but unclear to the rest. It would be better to explain them in the paper and provide references, for example for key, query, and positional embedding (a generic sketch of these terms follows below).
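
    For readers in the position this review describes, the query/key/value terminology reduces to standard scaled dot-product attention, summarized in the generic textbook sketch below (not specific to MCTrans):

        import torch
        import torch.nn.functional as F

        def scaled_dot_product_attention(Q, K, V):
            # Q: queries (B, Lq, d); K: keys (B, Lk, d); V: values (B, Lk, d).
            # Each query is compared with every key; the softmax-normalized
            # similarity scores weight the corresponding values.
            scores = Q @ K.transpose(-2, -1) / (Q.size(-1) ** 0.5)  # (B, Lq, Lk)
            return F.softmax(scores, dim=-1) @ V                    # (B, Lq, d)

        # Self-attention: queries, keys, and values all come from one sequence.
        x = torch.randn(1, 10, 64)
        y = scaled_dot_product_attention(x, x, x)

    A positional embedding is simply a learned or fixed vector added to each token so that the otherwise permutation-invariant attention can distinguish token positions.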

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The paper has good reproducibility: the authors list their data sources, data preparation, training schedule, and hyper-parameters, and they state that the code will be released soon.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

In the transformer section of Part 2 (related work), the highlights of others' work could be more specific. Since most readers are probably in the medical imaging area, citing more medical-imaging-related papers could be more beneficial. Fig. 3: please clarify the notations in the figure caption. Section 3: please specify what the positional token Epos is and how it is computed. Section 3: the authors wrote, “The output enhanced tokens are further pass through the TCA module and interact with the proxy embedding Epro, where N is the number of categories of the dataset.” Please specify what N is referring to.

  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This paper introduces a popular concept from computer vision, the transformer, to address segmentation challenges in biomedical images. The highlight of this paper is that the authors not only clearly introduce their proposed method, but also present the bottlenecks of current SOTA, why transformers can help in this area, and what modifications can be made to adapt them to the biomedical imaging domain. These discussions can help other researchers tackle their own challenges.

At the same time, the paper does not clearly explain the concepts it uses or provide references for people who are not familiar with them, such as positional embedding.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #3

  • Please describe the contribution of the paper

The authors propose a logical development of a semantic segmentation model integrating conventional CNN-based and trending Transformer-based modules. The suggested modification comprises two blocks - transformer self-attention and transformer cross-attention - that are supposed to improve the learning of (I) cross-scale patterns over a wide range of spatial distances and (II) semantically conditioned patterns. The proposed modifications are implemented in two settings (with VGG- and UNet-based encoders) and compared to popular UNet-based architectures, as well as the recent TransUNet. Training and evaluation are done on 6 datasets, with the major part of the analysis done on the multi-class PanNuke data. The results show consistently higher performance with the proposed method, providing new insights into combined UNet+Transformer methods and suggesting directions for further work.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Well-written and dense manuscript. Very good introduction to the domain and related work;
    2. Clear justification of the proposed changes;
    3. Rich experimental work: (I) comparison to the competing prior art, (II) two backbones explored, which improves the statistical significance of the findings, (III) ablation studies with respect to the proposed blocks and their quantities;
    4. Balancing of the compared models parameter- and FLOPs-wise;
    5. Deformable self-attention is implemented to limit the computational costs, which is an important issue in combined models;
    6. Comprehensive evaluation on 6 datasets. Clear visual support for the findings;
    7. The code is open source.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. It is not entirely clear what the authors imply by “cross-validation”. For PanNuke, the original article says that the data is organized into “3 randomized training, validation and testing folds for a fair model comparison“, suggesting a hold-out evaluation. The authors say, however, that they used “officially divided 3-fold cross-validation”. Importantly, “3-fold cross-validation” by definition does not imply any testing data. Even assuming that the subsets for PanNuke are the same as in the original paper, the question remains open for the remaining datasets. The authors “report the 5-fold cross-validation results”, which are presumably obtained from 5 models in an out-of-fold prediction setting. Since model selection is typically done on validation data, the presented scores do not represent the true generalization of the trained models and are prone to overfitting.
    2. It is not clear how exactly the authors tweaked the models in Table 3 to have roughly the same number of parameters, i.e., how exactly the UNet in the 1st row differs from the UNet in the 6th row (the same applies to the other models);
    3. No ablation study with respect to an auxiliary loss.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The article provides a sufficient high-level overview of the method and how it can be reimplemented. Many details on the compared architectures are missing (presumably due to the page limit), but those should be available from the provided source code. The scores are reported only as means (no std. dev.). No statistical analysis is conducted.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    In addition to the above:

    1. Page 4: “interact with the proxy embedding E_pro, where N is the number”. N is missing from the notation. Furthermore, N is later used as the “number of transformer block”. It is recommended to revise the notation.
    2. The authors write “VGG-style”, whereas the ResNet backbone is specified precisely as ResNet-34. If applicable, the text should mention which VGG model (-11/-19/etc.) was used;
    3. Please, mention whether the backbones were used pretrained or randomly initialized;
    4. “Sensitivity to the Setting” paragraph should refer to Table 2, not Table 1;
    5. Please, revise the punctuation throughout the manuscript, particularly, around the Figure and Table references;
    6. Please, clarify on how the models were tweaked for Table 3 in order to have a matching number of params (see also my comment above).
  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Well-written manuscript with a sufficient review of the prior art;
    • Clear rationale behind the proposed modifications;
    • Comprehensive and targeted experimental work supporting the hypotheses. Training and evaluation on multiple datasets;
    • The results produce new insights into the combined CNN+transformers models and potential directions for further improvements;
    • Open source code, which is particularly valuable for such methodologically-oriented articles.
  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

Although the reviewers raise some concerns about this paper, including the unclear description of the proxy embedding structure and incorrect annotations and writing, they approve of this work applying transformers to medical image segmentation, especially the novel proxy embedding structure. The proposed transformer modules can be applied in other deep learning frameworks and in other image segmentation tasks. This paper is well written.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    2




Author Feedback

We sincerely thank the reviewers for their comments. We respond to the main concerns below. We will release the source code and models for reproducibility.

Q1-R1: Some parts of the paper are not described clearly. Thanks. The multi-scale token sequence L is not the sum of the tokens from different scales, but the set of them; we apologize for this formulation error and will correct it in the updated version. Regarding “where N is the number of categories of the dataset”: N denotes the total number of labeling categories of the dataset; for example, for polyp segmentation, N equals 2. Regarding the structure of the proxy embedding: structurally, a proxy embedding is a feature/tensor with a specific dimension (128).

Q2-R1: If you can extend your model to 3D medical images, it will become an even stronger work. Thanks; extending a sparse transformer to 3D data is our future work.

Q3-R2: Provide an essential introduction to the key components of the transformer. Thanks for the suggestion; we will extend the notation and cite related papers to help explain our method in the revised paper.

Q4-R3: It is not entirely clear what the authors imply by “cross-validation”. We adopt standard k-fold cross-validation. For 5-fold cross-validation: (1) divide all the data into 5 parts randomly; (2) take each part in turn as the test set, without repetition, train the model on the other four parts, and compute the model's performance on the test set; (3) average the performance over the 5 runs to obtain the final performance.

Q5-R3: No ablation study with respect to the auxiliary loss. Thanks; please refer to the 6th row in Table 1.
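
The procedure described in Q4-R3 is standard k-fold cross-validation; a minimal runnable sketch using scikit-learn's KFold is below. The data indices, train_model, and evaluate are placeholders for the authors' actual pipeline and metric (e.g., Dice), not their code.

    import numpy as np
    from sklearn.model_selection import KFold

    def train_model(train_ids):
        # Placeholder for the actual segmentation training loop.
        return {"trained_on": train_ids}

    def evaluate(model, test_ids):
        # Placeholder for computing the segmentation metric (e.g., Dice).
        return float(len(test_ids))

    samples = np.arange(100)  # indices of all cases in the dataset
    fold_scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(samples):
        model = train_model(samples[train_idx])                 # train on 4 folds
        fold_scores.append(evaluate(model, samples[test_idx]))  # test on the held-out fold
    print("final performance:", np.mean(fold_scores))           # average over the 5 folds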


