Paper Info Reviews Meta-review Author Feedback Post-Rebuttal Meta-reviews

Authors

Shuang Yu, Kai Ma, Qi Bi, Cheng Bian, Munan Ning, Nanjun He, Yuexiang Li, Hanruo Liu, Yefeng Zheng

Abstract

With the advancement and prevailing success of Transformer models in the natural language processing (NLP) field, an increasing number of research works have explored the applicability of Transformer for various vision tasks and reported superior performance compared with convolutional neural networks (CNNs). However, as the proper training of Transformer generally requires an extremely large quantity of data, it has rarely been explored for medical imaging tasks. In this paper, we attempt to adopt the Vision Transformer for retinal disease classification tasks, by pre-training the Transformer model on a large fundus image database and then fine-tuning on downstream retinal disease classification tasks. In addition, to fully exploit the feature representations extracted by individual image patches, we propose a multiple instance learning (MIL) based ‘MIL head’, which can be conveniently attached to the Vision Transformer in a plug-and-play manner and effectively enhances the model performance for the downstream fundus image classification tasks. The proposed MIL-VT framework achieves superior performance over CNN models on two publicly available datasets when being trained and tested under the same setup. We will release the implementation code and pre-trained weights for public access.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87237-3_5

SharedIt: https://rdcu.be/cyl9B

Link to the code repository

https://github.com/greentreeys/MIL-VT

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The main idea of the paper is to propose a multiple instance learning (MIL) based ‘MIL head’, which can be conveniently attached to the Vision Transformer in a plug-and-play manner and effectively enhances the model performance for the downstream fundus image classification tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The idea of MIL head is interesting and novel.
    • The use of MIL head is effective and may be helpful to improve the classification performance.
    • The experiments are presented with an ablation study and comparisons. A detailed analysis of the results is given.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • What is the difference between the MIL Embedding and the Patch Embedding? There are similar operations in these two modules.
    • More details should be given on why the MIL head helps improve the classification performance.
    • The low-dimensional embedding (Eq. 1) is similar to the FFN in the Transformer; could it be understood as a 1x1 convolution operation here to extract nonlinear features?
    • The authors claim that individual patches may contain important complementary feature information, but the attention aggregation function only includes two linear layers, layer normalization, a ReLU layer, a dropout layer, and a softmax layer. I am not convinced that such an operation can extract more complementary feature information than patch embedding.
    • The Vision Transformer already contains self-attention, but the MIL head still adopts an attention mechanism. Is this attention redundant?
    • Transformer in Transformer (https://arxiv.org/pdf/2103.00112.pdf) is used to handle the same problem, namely that patch embedding loses complementary feature information. What is the difference between TNT and MIL, and which is better?
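For reference, the attention aggregation the reviewer describes (two linear layers, layer normalization, a ReLU, dropout, and a patch-wise softmax) can be sketched roughly as below. This is an illustrative reconstruction, not the authors' code; the module name, embedding dimensions, and layer ordering are assumptions.

```python
import torch
import torch.nn as nn

class AttentionMILPool(nn.Module):
    """Attention-based MIL pooling over patch embeddings (a sketch,
    assuming the layer inventory listed in the review)."""
    def __init__(self, dim=384, hidden=128, drop=0.1):
        super().__init__()
        # Score each patch token with a small two-layer MLP.
        self.score = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Dropout(drop),
            nn.Linear(hidden, 1),
        )

    def forward(self, patches):            # patches: (B, N, dim)
        w = self.score(patches)            # (B, N, 1) raw attention scores
        w = torch.softmax(w, dim=1)        # normalize over the N patches
        return (w * patches).sum(dim=1)    # (B, dim) aggregated bag feature

x = torch.randn(2, 196, 384)               # 2 images, 196 patch tokens each
pooled = AttentionMILPool()(x)
print(pooled.shape)                         # torch.Size([2, 384])
```

The softmax over the patch dimension is what makes this a MIL-style pooling: the bag-level feature is a convex combination of patch features, with learned per-patch weights.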
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    no code

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    see 4

  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed MIL head for the Vision Transformer is novel for fundus image classification, and the experimental results demonstrate its effectiveness.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    6

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    This paper applies the vision transformer on the retinal disease classification task with pretraining on a large fundus image database. This paper also develops a multiple instance learning head, which can be combined with vision transformer to enhance the model performance. Superior performance is demonstrated on two datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    By using the large fundus database rather than the ImageNet dataset, fine-tuning on retinal disease classification can achieve better performance. The feature representations of individual patches are utilized by the multiple instance learning head to help improve the classification performance.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The pre-training on the large fundus image database starts from a model pre-trained on ImageNet, which means this model uses extra data compared to the purely ImageNet-pretrained counterpart. Moreover, since a weighted average of the two classification predictions is used as the final prediction result, the contribution of each part is unclear when only results for an empirically set weight of 0.5 are reported. The motivation to utilize the patch feature representations is easy to understand, but the motivation for using MIL specifically is not clear.
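For context, the weighted fusion of the two heads that this point questions can be written in a few lines; with λ fixed at 0.5 it reduces to a plain mean of the two probability vectors. A minimal sketch (the function and variable names are assumptions, not the paper's code):

```python
import numpy as np

def fuse_predictions(p_vit, p_mil, lam=0.5):
    """Weighted average of the ViT-head and MIL-head class probabilities.
    lam = 0.5 weights both heads equally, as set empirically in the paper."""
    return lam * np.asarray(p_vit) + (1.0 - lam) * np.asarray(p_mil)

p = fuse_predictions([0.8, 0.2], [0.6, 0.4], lam=0.5)
print(p)  # [0.7 0.3]
```

Sweeping `lam` over, say, {0.0, 0.25, 0.5, 0.75, 1.0} would expose the contribution of each head, which is the ablation the reviewer asks for.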

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Code not provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    How does the value of lambda influence the performance? Different aggregation strategies, for example mean pooling, could be considered as baselines to study the effect of the proposed MIL head. There is also a typo on page 2: ‘only they can it perform well on downstream classification’.

  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    In addition to applying ViT to fundus image classification, the paper proposes an original MIL head to further take advantage of the patch feature representations. Experimental results on two datasets demonstrate the effectiveness of the proposed method.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    3

  • Reviewer confidence

    Confident but not absolutely certain



Review #3

  • Please describe the contribution of the paper

    In this paper, the authors adopt the Vision Transformer for the retinal disease classification tasks. The Transformer model is pre-trained on a large fundus image database and then fine-tuned on retinal disease classification tasks. The authors propose a multiple instance learning (MIL) module to fully exploit the feature representations extracted by individual image patches. The proposed method has been tested on two fundus image classification datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. It is the first to adopt the Vision Transformer for retinal disease classification tasks, by pre-training on a large fundus image database.
    2. The authors propose a multiple instance learning (MIL) module which exploits the features extracted from individual patches.
    3. The proposed method has been evaluated on two fundus image classification datasets, APTOS2019 and RFMiD2020, and achieves good performance compared with existing methods.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The proposed method should be compared with state-of-the-art methods. The compared methods are mainly based on ResNet34, which is no longer an advanced baseline; stronger baseline models should be considered for comparison.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The methodology is clearly described. However, the authors appear to use a large private dataset for pre-training; therefore, it may be hard to reproduce the experiments without this dataset.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Regarding Tables 1 and 2, it would be more convincing to also show the results of VT (ImageNet) with the MIL head. A typo: in the last sentence of Section 2.2, ‘The proposed MIL head… take full utilization of…’, ‘take’ should be ‘takes’.

  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The introduction for the method is good. The experiment designs and results are satisfactory. But the novelty is limited.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes a ‘MIL head’ that can be conveniently attached to the Vision Transformer in a plug-and-play manner, to improve the classification performance on fundus images. Given three consistent positive reviews, I recommend accepting this submission. The authors should address the detailed comments from the reviewers in the camera-ready manuscript.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    1




Author Feedback

N/A


