
Authors

Andriy Myronenko, Ziyue Xu, Dong Yang, Holger R. Roth, Daguang Xu

Abstract

Multiple instance learning (MIL) is a key algorithm for classification of whole slide images (WSI). Histology WSIs can have billions of pixels, which create enormous computational and annotation challenges. Typically, such images are divided into a set of patches (a bag of instances), where only bag-level class labels are provided. Deep learning based MIL methods calculate instance features using convolutional neural network (CNN). Our proposed approach is also deep learning based, with the following two contributions: Firstly, we propose to explicitly account for dependencies between instances during training by embedding self-attention Transformer blocks to capture dependencies between instances. For example, a tumor grade may depend on the presence of several particular patterns at different locations in WSI, which requires to account for dependencies between patches. Secondly, we propose an instance-wise loss function based on instance pseudo-labels. We compare the proposed algorithm to multiple baseline methods, evaluate it on the PANDA challenge dataset, the largest publicly available WSI dataset with over 11K images, and demonstrate state-of-the-art results.
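
For orientation, the sketch below illustrates the general shape of such a pipeline: per-patch CNN features, a Transformer encoder over the bag of instances, and attention pooling to a bag-level prediction. It is a minimal illustration only; the backbone, feature dimensions, pooling head, and the 6-class output are assumptions and not the authors' exact architecture.

# Minimal illustrative sketch of a Transformer-based MIL pipeline --
# not the authors' exact architecture (backbone, dimensions, pooling head,
# and the 6 ISUP-grade classes are assumptions for illustration).
import torch
import torch.nn as nn
import torchvision

class TransformerMILSketch(nn.Module):
    def __init__(self, feat_dim=512, n_heads=8, n_layers=2, n_classes=6):
        super().__init__()
        backbone = torchvision.models.resnet34(weights=None)
        backbone.fc = nn.Identity()                 # per-instance (patch) feature extractor
        self.backbone = backbone
        enc_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.attn = nn.Linear(feat_dim, 1)          # simple soft-attention pooling
        self.head = nn.Linear(feat_dim, n_classes)  # bag-level classifier

    def forward(self, bag):                         # bag: (N, 3, 224, 224) patches of one WSI
        feats = self.backbone(bag).unsqueeze(0)     # (1, N, feat_dim)
        feats = self.encoder(feats)                 # instances attend to each other
        weights = torch.softmax(self.attn(feats), dim=1)   # (1, N, 1) instance weights
        bag_feat = (weights * feats).sum(dim=1)     # weighted average over instances
        return self.head(bag_feat), weights.squeeze(-1)

bag = torch.randn(16, 3, 224, 224)                  # toy bag of 16 patches
logits, instance_weights = TransformerMILSketch()(bag)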

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87237-3_32

SharedIt: https://rdcu.be/cyman

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    Self-attention Transformer blocks are proposed to be embedded in the model to capture dependencies between instances.

    An instance-wise loss function based on instance pseudo-labels is proposed to add supervision at the instance level.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    - This paper proposed to explicitly account for dependencies between instances during training by embedding self-attention Transformer blocks.

    - It generated instance-level pseudo-labels to add instance-level loss supervision.

    - The attention-weight selection strategy improved the quality of the pseudo-labels.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1. The use of Transformers in this article is not new; many similar methods have been proposed for both natural and medical images [1,2,3,4]. The difference lies in a deeper integration of the Transformer with the backbone CNN, but this still relies only on common operations.
    2. The experimental part contains comparisons with other methods and an ablation study, but no analytical corroboration experiment. For example, if the Transformer were replaced with a non-local block, the model could also learn relationships between instances, yet there is no such experimental comparison.
    3. The paper proposes a method that improves the model's performance, but it lacks a mechanistic explanation of why the performance improves.

    [1] Medical Transformer: Gated Axial-Attention for Medical Image Segmentation
    [2] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
    [3] End-to-End Object Detection with Transformers
    [4] Pre-Trained Image Processing Transformer

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility is good because the method details are described very clearly.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Please further explain the difference between adding a Transformer and a classic non-local block. Also, please further elaborate on the novelty of the article and add analytical experiments to support the authors' point of view.

  • Please state your overall opinion of the paper

    borderline reject (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Although this article proposes a model that did not previously exist for the WSI task and demonstrates its effectiveness through experiments, the analysis of the experiments is not sufficient.
    • At the same time, the novelty of this paper is limited.
    • The proposed method is similar to methods that already exist; for example, it resembles the use of Transformers on natural images and of pseudo-labels in semi-supervised learning.
  • What is the ranking of this paper in your review stack?

    3

  • Number of papers in your stack

    6

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    This paper embeds self-attention Transformer blocks to explicitly account for dependencies between instances, which makes better use of the spatial information between patches. An instance-wise loss based on instance pseudo-labels has also been developed to counter vanishing gradients.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper is a novel application that introduces Transformers into MIL to explicitly capture dependencies between instances.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The question of how the Transformer accounts for dependencies among instances is not thoroughly analyzed.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    According to the reproducibility checklist, the code will not be made available. However, since the details of the model are provided in the paper, it may be possible to reproduce its results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    The paper shows that the Transformer module can help improve the performance of deep-learning-based MIL for WSI classification. However, it should give more discussion of how the Transformer captures dependencies between instances, and more ablation experiments should be performed to evaluate the instance-wise loss.

  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Experimental results show the best performance compared with other MIL methods and with the top-three winning teams of the PANDA Kaggle challenge.

  • What is the ranking of this paper in your review stack?

    3

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #3

  • Please describe the contribution of the paper

    The paper highlights a common weakness in the multiple instance learning framework which is that instances are normally assumed to be independent. To address this, the authors propose using transformer encoding blocks to achieve cross instance communication. The authors also propose using pseudo-labels during training to improve signal for training the neural network.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This method of obtaining cross-instance communication is novel and works well, as demonstrated in the experiments section. The paper is well written and easy to understand. The use of Transformers is well motivated by the literature review.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Pseudo-labeling appears to add only marginal improvements to the results; an analysis deeper than just the QWK would be interesting.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The algorithm is relatively complex given the additional step of constructing pseudo-labels, but the majority of hyper-parameters are given.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    More analysis of why pseudo-labeling is useful during training would be interesting. Additionally, confusion matrices for the different models would add more clarity than the QWK alone.

  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The contribution is interesting to the community and well presented, the results show the model works better than other approaches.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper integrates a visual Transformer within a multiple instance learning framework to model the interdependence of instances. Additionally, it proposes a pseudo instance labelling scheme for countering vanishing gradients. The paper is generally well written, with a clear explanation of the methodology motivation, network architecture, experimental results, and ablation studies. The evaluation dataset is relatively large and comprehensive, and the comparison is fair (the comparisons to other top-ranked teams are particularly convincing). Yet R1 and R2 have some concerns about how the Transformer accounts for dependencies among instances (see R1's and R2's main weaknesses), and this needs to be addressed in the rebuttal. Additionally, given the complexity of the method, I think releasing the code would greatly improve its reproducibility.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    5




Author Feedback

We would like to thank all reviewers for their valuable comments. We are grateful for their acknowledgement of the strengths and contributions of our paper, as pointed out by R3 (“This method is novel and works well as is demonstrated by the experiments section”) and R2 (“This paper is a novel application to capture dependencies between instances”).

We would like to address how the “transformer account for dependency among instances”: Transformers were initially introduced to capture long-range dependencies between words in sentences [1] and were later applied to vision [2]. Whereas traditional convolutions are local operations, the self-attention block of a Transformer directly computes attention between all combinations of tokens over a longer range.
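
As a concrete illustration (a minimal numpy sketch under simplifying assumptions, not our actual implementation), single-head scaled dot-product self-attention scores every pair of instance tokens in the bag, so any instance can attend to any other regardless of its location:

# Minimal numpy sketch of single-head scaled dot-product self-attention over a
# bag of N instance features -- illustrative only, not our actual implementation.
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """x: (N, d) instance features; Wq, Wk, Wv: (d, d_k) projection matrices."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (N, N): every pair of instances
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)               # row-wise softmax over instances
    return A @ V, A                                  # updated features, attention matrix

rng = np.random.default_rng(0)
N, d = 16, 64                                        # toy bag of 16 instances
x = rng.standard_normal((N, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out, A = self_attention(x, Wq, Wk, Wv)               # A[i, j]: how much instance i attends to j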

R2 asked us to analyze “how transformer accounting for dependency among instances”, and R1 asked us to explain “why the performance of the model is improved”: We have inspected the self-attention matrices (the inner product of Q and K, normalized). We found that for many of the tested WSI pathology images, the self-attention matrices have distinct off-diagonal high-value elements, indicating a higher weight for certain combinations of instances. In particular, instances with WSI tumor cells of different Gleason scores have higher off-diagonal values, indicating that such a combination is valuable for the final classification and was captured by the Transformer self-attention. We will add the visualization of the self-attention matrices to the supplementary materials.
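
Such an inspection can be done, for example, by masking the diagonal of the attention matrix and reading off the strongest remaining entries; the snippet below is a small illustrative helper (the attention matrix A is assumed to be extracted from a trained model, e.g., as in the sketch above):

# Illustrative helper: find the strongest off-diagonal entries of an (N, N)
# attention matrix A, i.e., the instance pairs the model weights most heavily.
# A is assumed to come from a trained model (or the toy sketch above).
import numpy as np

def top_offdiagonal_pairs(A, k=5):
    B = A.copy()
    np.fill_diagonal(B, -np.inf)                     # ignore self-attention entries
    order = np.argsort(B, axis=None)[::-1][:k]       # indices of the k largest values
    pairs = [np.unravel_index(i, B.shape) for i in order]
    return [(int(i), int(j), float(A[i, j])) for i, j in pairs]

# e.g. top_offdiagonal_pairs(A) -> [(i, j, weight), ...] for the most attended instance pairs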

R1 asked to “replace transformer with non-local blocks to learn the relationship between instances”: We have re-implemented these blocks based on the official implementation of [3] and re-ran the experiments for several Non-Local (NL) block configurations. We considered the 4 NL block types from the reference paper (Embedded Gaussian, Gaussian, Dot product, and Concatenation). We replaced the Transformer encoder blocks in our network with NL blocks (at the end of the encoder) and computed the QWK scores:

NL Embedded Gaussian: 0.947 ± 0.035
NL Gaussian: 0.952 ± 0.065
NL Dot product: 0.951 ± 0.043
NL Concatenation: 0.943 ± 0.023
Attention MIL: 0.948 ± 0.036
Transformer MIL [ours]: 0.960 ± 0.034
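
For context, the values above are quadratic weighted kappa (QWK) scores, the agreement metric used by the PANDA challenge. A minimal way to compute QWK (with toy grade labels standing in for real predictions) is:

# Sketch: computing quadratic weighted kappa (QWK) with scikit-learn.
# The grade labels below are toy values for illustration only.
from sklearn.metrics import cohen_kappa_score

y_true = [0, 1, 2, 3, 4, 5, 2, 3]    # reference ISUP grades
y_pred = [0, 1, 2, 4, 4, 5, 1, 3]    # model predictions
qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print(f"QWK = {qwk:.3f}")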

In all tests, the NL blocks' performance was similar only to the baseline Attention MIL method, whereas our method demonstrated noticeable improvements. We also tried stacking NL blocks (4x) and embedding them in a feature pyramid, but this did not lead to any improvements. Even though, as pointed out in the non-local NN paper [3], self-attention can be considered one form of non-local operation (Embedded Gaussian), its implementation in Transformers differs (it is multi-headed and is followed by an MLP and layer normalization). We will add these (more detailed) additional ablation experiments to the supplementary materials.
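
To make this distinction concrete, the sketch below contrasts a single embedded-Gaussian non-local operation on a bag of instance features with a Transformer encoder layer, which adds multi-head attention, an MLP, and layer normalization. It is an illustrative simplification, not the exact block configurations used in the experiments above:

# Illustrative sketch contrasting an embedded-Gaussian non-local block (Wang et al. [3])
# with a Transformer encoder layer on a bag of instance features.
# Not the exact block configurations used in the experiments above.
import torch
import torch.nn as nn

class NonLocalEmbeddedGaussian(nn.Module):
    def __init__(self, d, d_inner=None):
        super().__init__()
        d_inner = d_inner or d // 2
        self.theta = nn.Linear(d, d_inner)
        self.phi = nn.Linear(d, d_inner)
        self.g = nn.Linear(d, d_inner)
        self.out = nn.Linear(d_inner, d)

    def forward(self, x):                                   # x: (B, N, d) instance features
        attn = torch.softmax(self.theta(x) @ self.phi(x).transpose(1, 2), dim=-1)
        return x + self.out(attn @ self.g(x))               # single head, no MLP / LayerNorm

# Transformer encoder layer: multi-head self-attention + MLP + LayerNorm + residuals
transformer_block = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)

x = torch.randn(1, 16, 512)                                 # one bag of 16 instances
y_nonlocal = NonLocalEmbeddedGaussian(512)(x)
y_transformer = transformer_block(x)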

While Reviewer 1 criticizes the limited novelty of our paper, Reviewers 2 and 3 acknowledge our novel contributions, e.g., “this method is novel and works well as is demonstrated by the experiments section”. To the best of our knowledge, our work is the first to introduce Transformers for cross-instance communication in Multiple Instance Learning, as well as the first of its kind for the pathology Whole Slide Imaging task and for medical images in general. Our work also includes a novel integration of the Transformer block at various levels of the feature pyramid of the backbone network. Finally, our paper considers dependencies between large image regions (224x224 px); in comparison, the vision transformer [2] uses tiny 16x16 px patches, which mostly represent low-level features. In our work, each instance contains a complete picture, e.g., a different cancer cell group; the dependencies we want to capture are thus between large, complete regions of several tumor subtypes.
[1] Attention Is All You Need, Vaswani et al.
[2] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al.
[3] Non-local Neural Networks, Wang et al.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have nicely addressed the major concerns of the reviewers, and I therefore recommend acceptance of the paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    4



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This work proposes to use transformers and attention to model dependencies between instances for multi-instance learning. The transformer is integrated with a backbone CNN to model inter-instance dependencies. Evaluation is on a large whole slide image (WSI) dataset for prostate cancer. Good performance of the proposed method is reported. There were some concerns that were raised during the review, in particular, related to the performance of non-local approaches and what the transformer approach captured / how it worked. These were, in my opinion, well addressed in the rebuttal. While there was some criticism regarding novelty in the reviews, the approach has novel aspects and shows compelling results which is appreciated.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    5



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper introduces Transformer blocks into a Multiple Instance Learning framework for Whole Slide Imaging analysis. It presents strong experimental results on the PANDA Kaggle challenge. The rebuttal adequately addresses the main concerns expressed by the reviewers: experimental details and ablations were included, as well as an extended motivation for using Transformers to model instance dependencies. Among the references pointed out by R1, the only one on medical images is an unpublished preprint; the other three references should be discussed in the related-work review. In summary, sufficient technical novelty and competitive results on a public benchmark are adequate for acceptance.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    1


