
Authors

Hang Li, Fan Yang, Yu Zhao, Xiaohan Xing, Jun Zhang, Mingxuan Gao, Junzhou Huang, Liansheng Wang, Jianhua Yao

Abstract

Learning informative representations is crucial for classification and prediction tasks on histopathological images. Due to the huge image size, whole-slide histopathological image analysis is normally addressed with multi-instance learning (MIL) scheme. However, the weakly supervised nature of MIL leads to the challenge of learning an effective whole-slide-level representation. To tackle this issue, we present a novel embedded-space MIL model based on deformable transformer (DT) architecture and convolutional layers, which is termed DT-MIL. The DT architecture enables our MIL model to update each instance feature by globally aggregating instance features in a bag simultaneously and encoding the position context information of instances during bag representation learning. Compared with other state-of-the-art MIL models, our model has the following advantages: (1) generating the bag representation in a fully trainable way, (2) representing the bag with a high-level and nonlinear combination of all instances instead of fixed pooling-based methods (e.g. max pooling and average pooling) or simply attention-based linear aggregation, and (3) encoding the position relationship and context information during bag embedding phase. Besides our proposed DT-MIL, we also develop other possible transformer-based MILs for comparison. Extensive experiments show that our DT-MIL outperforms the state-of-the-art methods and other transformer-based MIL architectures in histopathological image classification and prediction tasks. An open-source implementation of our approach can be found at https://github.com/yfzon/DT-MIL.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87237-3_20

SharedIt: https://rdcu.be/cyl9Z

Link to the code repository

https://github.com/yfzon/DT-MIL

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    In this paper, the deformable transformer is introduced into deep multi-instance learning: instance features are globally aggregated, and the position context information of instances is encoded during the training phase.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This work demonstrates an original way to improve MIL performance on histopathological images. It is the first time the Transformer has been introduced into histopathological image analysis.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper is novel, but the phrasing of some sentences needs improvement.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have clarified in the reproducibility checklist that the code will be available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    1. How to obtain the 2D reference point rq in the Deformable Transformer Encoder part?
    2. In Eq. (3), how are K, Amqk, and ∆rmqk determined?
    3. In Eq. (4), what do i and j stand for?
    4. On the BREAST-LNM dataset, the recall is very poor, but the authors did not analyze this in the results and discussion part.

  • Please state your overall opinion of the paper

    strong accept (9)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novel idea using deformable transformer for embedded-space MIL model.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    The authors propose using a deformable transformer for MIL-based whole slide image (WSI) analysis in histopathology applications. The idea is that the transformer will learn to identify discriminative information in the WSI which can then be used to derive a final assessment for the case.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Using transformers to determine WSI instance discriminability is novel and appealing. Having an end-to-end system that can discriminate between crops brings us one step closer to accurate WSI classification.
    2. The method appears to outperform other MIL architectures in this domain on two datasets.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The computational cost of the architecture is unclear; how do the apparent reductions in computational cost affect performance?
    2. Performance with respect to other architectures is oversimplified; what are the differences in architecture complexity between these models?
    3. Passing crops without overlap is questionable and will certainly affect performance depending on the tissue domain.
    4. Clarity is a problem throughout the paper; the writing should be improved to make the ideas clearer.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The method described in the paper appears to be reproducible. I hope that the authors provide their source code for the community.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    The authors utilize deformable transformers to determine which crop (instance) embeddings are most discriminative; this information is then aggregated to produce WSI (bag) level classification outputs, following a classification scheme similar to what is described in BERT. This differs from previous MIL papers in this field, which utilized off-line methods to determine whether or not a given crop of a WSI was discriminative. Conceptually, this end-to-end approach makes the work attractive, as the network can learn to produce meaningful feature embeddings for each area of the WSI with respect to the overall WSI classification task. Such a technique would therefore be expected to outperform previous MIL works in this domain.

    This being said, while the idea is aligned with the research community's goals for WSI classification, there are some key weaknesses that detract from the overall impact of the paper. The authors go to great lengths to reduce the computational cost of their architecture; this is to be expected, as self-attention is a highly cumbersome operation. To address this limitation, they utilize an EfficientNet-B0 encoder (which is quite small) and employ a deformable transformer, which only utilizes select elements of an input sequence. I would want to see the memory requirements of this network to understand how these adjustments improved efficiency with respect to the other architectures. More importantly (and if I am understanding correctly), the authors further improve computational efficiency by passing crops without overlap. This raises concerns in our domain: limiting the field of view to adjacent tissue crops significantly reduces the discriminative power of the network. I would want to see how passing more crops with overlap affects results (I suspect this will improve your recall metric), so that we can judge whether the performance improvements justify the added computational cost.

    Finally, I had a hard time reading this paper. I understand if English is not the primary language of the authors, and do not intend to reject the paper on those grounds, but I did struggle to understand various parts of the paper, which made it difficult for me to feel confident about the content. I strongly encourage the authors to simplify and improve the writing, as I believe the idea holds merit. Overall, I am recommending the paper as a borderline reject, as I believe this paper is worth publishing once these issues are addressed.

  • Please state your overall opinion of the paper

    borderline reject (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While I believe that the idea presented in this paper is impactful, there needs to be a more transparent presentation of computational cost. I suspect that this architecture is very cumbersome, which may detract from its usefulness in a clinical setting. Understanding performance improvements with respect to this cost would give us a better idea of how this architecture would fare in our domain. Additionally, a lack of clarity makes it difficult to easily determine the impact of the paper. I would be happy to accept a paper like this one if memory, time, FLOPs, etc. were provided.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    2

  • Reviewer confidence

    Confident but not absolutely certain



Review #3

  • Please describe the contribution of the paper

    The paper deals with the learning of informative representations for Whole Slide Images of tissues. To tackle issues with weakly supervised strategies normally used for such large images (such as MIL scheme), the authors explore the added value of Deformable Transformer architectures. The associated bag representation learning allows the encoding of both the position and the context of the instance features (patches normally) aggregated for the most effective representation learning.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    I clearly understand the added value concerning:

    • the fully trainable strategy
    • the encoding of position relationships and context information via Transformer architectures
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I would need more explanation concerning the third added value:

    • representing the bag with a high-level and non-linear combination of all instances: why is this way of aggregating higher-level than fixed pooling-based methods or simple attention-based linear aggregation?
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    For formula (5), it is written “typical key-value attention”; can you add a reference, please?
    Then the descriptions of equations (5) and (6) are a bit tricky to me. But this is perhaps due to my unfamiliarity with the architecture for 2D WSIs.

    Considering the impact of your work, I would explain the basic representation of a WSI better (huge image, tiles, patches, instance features, bags, etc.), make the Transformer Decoder paragraph clearer, and discard the three different bag-embedding modules if there are space issues fitting the 10 pages.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Paragraph 2 / Introduction: I would give a clear definition of bag and instance level (in correspondence with the patch/tile notion over a WSI, to make it clearer for readers not used to WSIs). This would clarify both the ES paradigm and your approach for the rest of the paper. I know that space is limited (so perhaps state that you explain more in the Methods section), but readers need to clearly understand the differences and similarities between instances, instance-level features, bags, patches, etc.

    The introduction clearly explains the issues and the state of the art, nicely introducing the use of the Transformer (a sequence-to-sequence architecture) in this work, much as it is replacing RNNs in NLP nowadays. I do not know if it is the first time in histopathological image analysis, but it is clearly a new trend, as another paper in this year's MICCAI reviewing process is using it for the first time as well.

    I would treat the other possible transformer-based MIL architectures as an optional add-on to this paper, and focus on the Transformer-like architecture experiments for histopathological image analysis.

    Methods.

    The method spans three modules/heads: PPDR + TBBE + classification. Can you try to explain what “Deformable” stands for? I have an intuition, but please make it clearer.

    To reduce the size of a WSI, do you really reduce its scale, or its size in terms of information? Open question.

    Paragraph 2 / Methods: a superpixel has a specific meaning to me (an aggregation of pixels following a segmentation process); for you, it is features extracted from patches. Are they still pixel-like? Open question. Please define W and H. The size of the patch, right? How is it set? These questions are answered at the end of the paragraph, which is fine, but then comes R × C, which is not clear to me.

    Can you give a magnitude for D? Are you sure that in your compression formula it is D and not d? And in the settings, then? This paragraph deserves to be as clear as possible concerning the notation; we are close to that, and a small effort will make it crystal clear for the reader. Paragraphs 3 and 4 are very technical, and they sound good to me, with a lot of tweaking for the 2D Transformer that I hope will be easily reproducible.

    Transformer Decoder paragraph: “that [is] using six blocks”, I guess; “the model complex[ity]”, I guess. Can you explain “we set a learnable embedding as the classification token” in more detail?

    For formula (5), it is written “typical key-value attention”; can you add a reference, please?
    Then the descriptions of equations (5) and (6) are a bit tricky to me. But this is perhaps due to my unfamiliarity with the architecture for 2D WSIs.

    Considering the impact of your work, I would explain the basic representation of a WSI better (huge image, tiles, patches, instance features, bags, etc.), make this Transformer Decoder paragraph clearer, and discard the three different bag-embedding modules if there are space issues fitting the 10 pages.

    Experiments

    They are carried out on two cohorts:

    1. Breast cancer: 4000 H&E slides at 20x, reviewed by 2 experts; a fairly balanced set for lymph node metastasis prediction. Can you specify the size of the slides?

    2. Lung adenocarcinoma cancer diagnosis: 1000 H&E WSIs at 20x (2/3 cancer).

    Prediction and diagnosis? Are both binary classification? If not, explain the task of metastasis prediction (tumor progression) in more detail.

    Could you describe a possible evaluation against the CLAM method, which is an interesting alternative to MIL in weakly supervised classification? https://arxiv.org/abs/2004.09666

    Results and Discussion

    “The comparison results of DT-MIL and RNN show that the superiority of the proposed model that fully” → remove the first “that”, I guess. Can you elaborate on the usefulness of long-range context information? How is it achieved technically, and why is it so important biologically (and what range is reasonable in a micro-tumoral environment)? The results are convincing for the prediction task and for the effectiveness of the deformable transformer attention method (with long- or medium-range interactions). “has to potential to efficiently capture the instance feature and interaction” → express this more clearly (“has the potential”, I guess) + what do you mean by “capture instance features”? Once again, I would not write about the ViT and DTEC-MIL experiments here, but rather visually describe a patch-specific descriptor that DT-MIL finds compared to other methods and explain its “superiority”.

    Many thanks for this excellent work.

  • Please state your overall opinion of the paper

    strong accept (9)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Technically sound. Important challenge. Scientifically justified and well validated.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    3

  • Reviewer confidence

    Very confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper studies learning informative representations for whole slide images of tissues using a deformable transformer under the multi-instance learning setting. The reviewers consistently think the presented method is novel and the paper is clearly written. The AC agrees with the reviewers and recommends acceptance of this paper. The AC also encourages the authors to carefully read all the review comments and address the issues raised therein in the final version.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    2




Author Feedback

We appreciate that the reviewers and AC consistently think the presented method is novel and the paper has been clearly written. We will address the issues briefly here and will clarify the details in the final version.

R#1: How to obtain the 2D reference point rq? The numerical matrix of rq consists of the grid points generated by the meshgrid function, which represents a uniform sampling of the input feature map at a fixed interval.
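
A minimal PyTorch sketch of such a meshgrid construction (the function name and the (0, 1) normalization mirror Deformable DETR's reference-point generation and are our assumption, not the authors' exact code):

```python
import torch

def make_reference_points(h, w):
    # Uniform grid of 2-D reference points, one per feature-map location,
    # placed at pixel centers and normalized to (0, 1).
    ys, xs = torch.meshgrid(
        torch.linspace(0.5, h - 0.5, h),
        torch.linspace(0.5, w - 0.5, w),
        indexing="ij",
    )
    return torch.stack((xs / w, ys / h), dim=-1).reshape(-1, 2)  # (h*w, 2)
```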

R#1: In Eq. (3), how are K, Amqk, and ∆rmqk determined? K is predefined as 4; Amqk and ∆rmqk are learnable parameters, and the training process adjusts the attention weights Amqk and offsets ∆rmqk toward their optimal values. The proposed method uses multi-head attention, so there are multiple sets of weights Amqk and offsets ∆rmqk for the same reference point.
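
For reference, a reconstruction of the deformable attention that Eq. (3) is based on, in the notation of Deformable DETR (Zhu et al., 2021) with the symbols used above; the exact form in the paper may differ slightly:

```latex
\mathrm{DeformAttn}(z_q, r_q, x)
  = \sum_{m=1}^{M} W_m \left[ \sum_{k=1}^{K} A_{mqk}\, W'_m\,
    x\!\left(r_q + \Delta r_{mqk}\right) \right],
  \qquad \sum_{k=1}^{K} A_{mqk} = 1
```

Here z_q is the query feature, x the input feature map, M the number of attention heads, K = 4 the number of sampling points per head, and W_m, W'_m learnable projections.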

R#1: In Eq. (4), what do i and j stand for? The position encoding is extended to the 2D case: for each dimension of the 2D case, i indicates the order of patches along that dimension, while j indicates whether the encoding channel is odd or even.
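
A plausible reconstruction of the 1D building block of Eq. (4), assuming the standard sinusoidal encoding of “Attention Is All You Need” applied to each spatial axis separately (how the d channels are split between the two axes is our assumption):

```latex
PE_{(i,\,2j)} = \sin\!\left(\frac{i}{10000^{2j/d}}\right), \qquad
PE_{(i,\,2j+1)} = \cos\!\left(\frac{i}{10000^{2j/d}}\right)
```

Here i is the patch index along one axis and j indexes the channel pair (even channels take the sine, odd channels the cosine); the encodings of the row and column axes are then combined into the 2D position embedding.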

R#2: The computational cost. R#2 points out that we make many attempts to reduce the computational cost, and we would like to clarify some details here. In the deformable transformer, conventional self-attention is replaced with a deformable self-attention head, which significantly reduces the demand on computational resources. The transformer part not only optimizes the element-selection process over an input sequence but also provides a high-level embedding. We do not think performance is directly related to computational cost; for instance, EfficientNet-B0's computational requirements are one-tenth of ResNet-50's, yet its performance is slightly better. The memory requirement has been provided in the experiments part: a Tesla P40 has 24 GB of GPU memory and the training batch size is 2, so the proposed model is fully feasible in a clinical setting.

R#2: Passing crops without overlap affects performance. Due to the translation invariance of convolution, taking overlapping crops in the proposed method is equivalent to adding duplicated data to the extracted feature map. Since this paper does not deal with region detection or segmentation tasks, we believe that taking overlap would not bring significant performance improvement, while it would increase the computational cost, which is also a concern of R#2.

R#3: Comparison with fixed pooling-based methods and simple attention-based linear aggregation. Both fixed pooling-based methods and simple attention-based linear aggregation are single-level feature transformations. In the transformer-based model, the multi-head attention mechanism is able to capture different combinations of attention weights, and these multiple combinations produce higher-level features.

R#3: Add a reference for “typical key-value attention”. Typical key-value attention refers to the attention mechanism within the transformer; the corresponding paper is “Attention Is All You Need”.
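
For completeness, the scaled dot-product (key-value) attention from “Attention Is All You Need” that this answer refers to:

```latex
\mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```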

R#3: What does “Deformable” stand for? “Deformable” in this paper means that when an image patch is processed, a weighted combination of its surrounding patches is applied simultaneously, and the selection of the surrounding patches is based not on a fixed grid but on learnable offsets.
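
A hedged, single-head PyTorch sketch of this idea: learnable offsets select the sampled neighbours and learnable weights combine them. All module and variable names are hypothetical; real implementations (e.g. Deformable DETR) also scale the offsets and use multiple heads and feature levels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSamplingSketch(nn.Module):
    """Each query samples K learnable-offset locations around its
    reference point and mixes them with learned attention weights."""

    def __init__(self, dim, k=4):
        super().__init__()
        self.k = k
        self.offset_proj = nn.Linear(dim, k * 2)  # learnable (x, y) offsets
        self.weight_proj = nn.Linear(dim, k)      # learnable attention weights

    def forward(self, query, feat, ref):
        # query: (B, N, C); feat: (B, C, H, W); ref: (N, 2) normalized to (0, 1)
        B, N, _ = query.shape
        offsets = self.offset_proj(query).view(B, N, self.k, 2)
        weights = self.weight_proj(query).softmax(dim=-1)         # (B, N, K)
        # grid_sample expects sampling coordinates in [-1, 1]
        grid = (ref[None, :, None, :] + offsets) * 2.0 - 1.0      # (B, N, K, 2)
        sampled = F.grid_sample(feat, grid, align_corners=False)  # (B, C, N, K)
        out = (sampled * weights[:, None]).sum(dim=-1)            # (B, C, N)
        return out.transpose(1, 2)                                # (B, N, C)
```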

R#3: Do you really reduce the scale of a WSI? Are the features extracted from patches still pixel-like? The scale reduction refers to reducing the scale of the feature map, and the feature extracted from an image patch is a point in the feature map.

R#3: The magnitude of D? It depends on the output size of the feature-extraction network. For example, for EfficientNet-B0, D is 1280.
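
A hedged sketch of how such features might be extracted and compressed (the timm backbone call is standard; arranging the per-patch embeddings into an R × C feature map and the compressed size d = 256 are illustrative assumptions, not values from the paper):

```python
import timm
import torch
import torch.nn as nn

# EfficientNet-B0 with the classifier removed yields D = 1280 per patch.
backbone = timm.create_model("efficientnet_b0", pretrained=False, num_classes=0)
compress = nn.Conv2d(1280, 256, kernel_size=1)  # 1x1 conv: D -> d (d = 256 assumed)

patches = torch.randn(8, 3, 224, 224)       # a toy bag of 8 patches
feats = backbone(patches)                    # (8, 1280) patch embeddings
grid = feats.T.reshape(1, 1280, 2, 4)        # arrange into an R x C map (R=2, C=4)
compressed = compress(grid)                  # (1, 256, 2, 4)
```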

R#3: The writing, typos, symbol definitions, and additional references. We will follow the reviewers' and AC's suggestions on the above issues to improve the soundness and clarity of the paper and make sure the final version meets the high quality standard of MICCAI.


