
Authors

Jiahui Li, Wen Chen, Xiaodi Huang, Shuang Yang, Zhiqiang Hu, Qi Duan, Dimitris N. Metaxas, Hongsheng Li, Shaoting Zhang

Abstract

When both coarse-level (e.g., image-level labels) and fine-level (e.g., pixel-level delineations) annotations are partially available, an intuitive idea is to improve classification accuracy by leveraging both types of supervision. However, in computational pathology this becomes a very challenging task, since the high resolution of whole slide images makes it nearly unattainable to perform end-to-end training of classification models. To handle this problem, we propose a novel hybrid supervision learning framework for high-resolution images with sufficient image-level coarse annotations and a few pixel-level fine labels. During the training of patch models, our framework carefully makes use of coarse image-level labels to refine the generated pixel-level pseudo labels, and a comprehensive strategy is designed to suppress pixel-level false positives and false negatives. Over 10,000 whole slide images totaling 20 terabytes (one of the largest digital pathology datasets in the literature) are labeled and employed to evaluate the efficacy of our hybrid supervision learning method. By extracting pixel-level pseudo labels from initially image-level-labeled samples, we reduce the false positive rate by around one third compared to the state of the art while retaining 100% sensitivity in the task of image-level classification.
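The abstract describes the refinement mechanism only in prose. The following is a minimal, hypothetical sketch of one way coarse image-level labels could be used to clean patch-level pseudo labels; the function name, the thresholds, and the hard/soft labeling rule are illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np

def refine_pseudo_labels(patch_scores: np.ndarray, image_label: int) -> np.ndarray:
    """Refine patch-level pseudo labels with a coarse image-level label.

    Hypothetical rule: in a slide labeled negative, no patch can be
    positive, so all pseudo labels are zeroed (suppressing false
    positives); in a positive slide, confident predictions become hard
    labels and uncertain ones are kept as soft labels.
    """
    if image_label == 0:
        return np.zeros_like(patch_scores)   # negative slide: every patch negative
    pseudo = patch_scores.copy()
    pseudo[patch_scores >= 0.9] = 1.0        # confident positives -> hard positive
    pseudo[patch_scores <= 0.1] = 0.0        # confident negatives -> hard negative
    return pseudo                            # the rest remain soft targets

# Example: a positive slide with four patch predictions.
print(refine_pseudo_labels(np.array([0.95, 0.05, 0.6, 0.2]), image_label=1))
```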

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87237-3_30

SharedIt: https://rdcu.be/cymal

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

The authors propose a hybrid-supervised WSI classification network that uses both WSI-level and coarse pixel-level labels for representation learning. They evaluate the method on 6,213 WSIs and show that it outperforms both WSI-level and pixel-level classifiers.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The idea is really interesting: it utilizes the limited pixel-level annotations available in histology image datasets to improve WSI-level performance.

    The authors evaluate the proposed method on a fairly large cohort of 6,213 WSIs to demonstrate its superior performance over WSI-level and pixel-level classifiers.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The authors should clearly state whether they used a separate validation set for best-model selection or used the test set for optimal model selection.

    The numbers of positive and negative samples in the dataset are quite imbalanced, with a ratio of 1:9.

    The authors should also evaluate the proposed method on cohorts with a fairly balanced number of samples per class.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have given enough implementation details to reproduce the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

A comparison with other methods that follow a similar pseudo-labelling approach would help to show the significance of the proposed approach.

    Lu, Ming Y., et al. “Data-efficient and weakly supervised computational pathology on whole-slide images.” Nature Biomedical Engineering (2021): 1-16.

  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The utility of the proposed method is relatively new in histology image analysis. Moreover, the significant improvement in performance compared to its counterpart make this method an interesting read for MICCAI community.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    This paper proposes a framework to utilize both strongly supervised and weakly supervised labels. This framework was further implemented to classify histopathology images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    First, the task that this paper attempts to solve is indeed clinically relevant. Besides, the EM framework proposed in this paper is novel.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There are several weaknesses of this paper.

First, many statements in this paper are somewhat unclear. For instance, in the second sentence of the introduction, why does high resolution make it difficult to train end-to-end? If it is a GPU memory concern, what would the problem be if high-resolution images were trained in patches? Also, in the second paragraph, what does it mean to convert false negatives to true positives?

Secondly, the related work is insufficient. Only a few works on weakly supervised and semi-supervised learning are listed. What about works, such as [1], that use a mixed-supervision framework similar to this paper's?

Thirdly, the experimental setup in this paper is flawed. No validation set was used to select hyper-parameters or to determine which model to use. For instance, the EM algorithm requires several iterations to terminate, and the authors point out that the specific iteration depends on the sensitivity/specificity currently achieved. Are those metrics reported on the training set or the test set?

In addition, this paper compares only a limited number of methods (two) on a private dataset, so the generalizability of this work remains unclear.

Lastly, why did the authors choose to compare specificity at 100% sensitivity? Is there a strong clinical reference for this choice of evaluation setting?

    [1] Mlynarski, Pawel, et al. “Deep learning with mixed supervision for brain tumor segmentation.” Journal of Medical Imaging 6.3 (2019): 034002.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    No code is provided. Dataset is private.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Please see weakness.

  • Please state your overall opinion of the paper

    probably reject (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The major factors for this score are the experimental setup and the limited comparison with SOTA methods on publicly accessible datasets.

  • What is the ranking of this paper in your review stack?

    4

  • Number of papers in your stack

    4

  • Reviewer confidence

    Confident but not absolutely certain



Review #3

  • Please describe the contribution of the paper

The authors propose a hybrid supervision learning framework trained with a limited number of pixel-level annotations and abundant image-level labels. The model maintains sensitivity at 100% to mitigate false negatives.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. A large dataset of over 22 TB is collected for performance evaluation.
    2. The motivation to train with hybrid annotation is useful in real clinical practice, since fine-grained annotation is scarce in histology.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

1. The relation to other approaches trained with limited annotations, e.g., weakly supervised or active learning, should be discussed at least in the Introduction.

2. Even though the authors re-use code from previous works, they should describe the backbone CNN architecture in the paper.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Probably convincing. The hyper-parameters for evaluation are explained in detail, but the annotation strategy (how many pathologists labeled the data?) and the pre-processing stage (was color normalization applied?) are missing. What network architecture is used?

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
1. Please clarify what the convergence criterion in Algorithm 1 is.
    2. I strongly suggest that the authors include a comparison with other pure state-of-the-art MIL methods for a better evaluation.
  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is quite novel. The paper is well-organized.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

While reviewers gave positive comments on the idea of using both image-level and pixel-level labels for whole-slide image classification and on the algorithm evaluation on a large-scale dataset, they raised several critical issues, such as an unclear experimental setup (whether a separate validation set is used for model selection), a lack of comparison with recent state-of-the-art methods in the experiments, and an insufficient survey/discussion of the relationship between the proposed method and other related work. In addition, the paper's presentation needs to be improved.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    6




Author Feedback

We thank all reviewers for their valuable comments.

1. About the experimental setup: 1) Whether a separate validation set is used for model selection: no separate validation set is used, but model selection does not rely on the test set. Specifically, validation is performed by minimum-training-loss selection for stage 2: choosing round 2 means the top-K patches selected in round 2 are the optimal inputs for stage 2 to fit the image-level labels of the training data. Scores are then reported on the test set (a toy sketch of this selection rule follows). 2) Data ratio 1:9: the 1:9 ratio of positive to negative samples follows the population distribution in clinical practice. 3) 100% sensitivity: misdiagnosis may lead to serious medical accidents and should be avoided as much as possible, which is why the metrics in the paper require models to retain 100% sensitivity. Under this premise, higher specificity means a larger portion of the diagnostic workload can be saved by model-based screening.
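A toy illustration of the selection rule described above, with made-up loss values: the pseudo-labelling round whose stage-2 model attains the lowest training loss on the image-level labels is kept, without ever touching the test set.

```python
# Hypothetical sketch of the described model-selection rule; the loss
# values are illustrative only, not results from the paper.
stage2_train_loss = {1: 0.31, 2: 0.24, 3: 0.27}
best_round = min(stage2_train_loss, key=stage2_train_loss.get)
print(best_round)  # -> 2, i.e. the top-K patches from round 2 are kept
```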

2. Lack of comparison with recent state-of-the-art methods: on Camelyon17, this method achieved 0.9243, 4th on the leaderboard (submitted in Feb. 2020). As a comparison, the pixel-level code compared in the paper reaches 0.9090 (5th). The 3rd-place score of 0.9273 is quite close to ours, and the 2nd-place score of 0.9386 comes from heavy ensembles of large numbers of models. The champion does not provide enough details. The challenge is public and still open for submission: https://camelyon17.grand-challenge.org/evaluation/challenge/leaderboard/

3. Insufficient survey/discussion of the relationship between the proposed method and other related work:

Most weakly or semi-supervised learning methods require the entire image as input together with its image-level label, which is not feasible for our whole slide images of 10,000 × 10,000 pixels.

1) “Data-efficient and weakly supervised computational pathology on whole-slide images”: this paper focuses on optimizing the image-level pipeline. However, a few pixel-level annotations are reasonably affordable, and our experimental results show that a hybrid pipeline exhibits a significant improvement over its image-level counterpart.

2) “Deep learning with mixed supervision for brain tumor segmentation”: this work can feed a complete 2D brain-tumor slice of 300 × 300 pixels to train a model with image-level and pixel-level annotations, which again is not feasible for our whole slide images of 10,000 × 10,000 pixels.

4. Writing, data and code: we will improve the presentation. Part of the data has already been released (the link is withheld to comply with double-blind review) and more will be released upon IRB approval. The code has been uploaded to the authors' GitHub and will be released to the public after the review process.

5. GPU concern: a high-resolution pathological image of 10,000 × 10,000 pixels with only an image-level label cannot be trained end to end due to limited GPU memory. If trained in patches, most patches in a positive image are negative; only a few patches, unknown in advance, are positive (see the sketch below).
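To make the memory argument concrete, here is a minimal sketch of patch-based handling of a 10,000 × 10,000 image. The patch size and stride are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

def tile_slide(slide: np.ndarray, patch: int = 512, stride: int = 512):
    """Yield fixed-size patches from a whole slide image.

    A 10,000 x 10,000 image cannot be fed through a CNN end to end on a
    typical GPU, but its patches of 512 x 512 can be processed one batch
    at a time.
    """
    h, w = slide.shape[:2]
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            yield slide[y:y + patch, x:x + patch]

slide = np.zeros((10_000, 10_000), dtype=np.uint8)  # stand-in for one WSI level
print(sum(1 for _ in tile_slide(slide)))            # 19 * 19 = 361 patches
```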

6. “Converting false negatives to true positives”: for actually positive patches that the model predicts to be negative, we multiply the predicted confidence to form the soft pseudo label for training in the next round, which steers the model prediction toward positive.
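One plausible reading of this update, sketched below: the predicted positive confidence of a false-negative patch is multiplied by a factor greater than one to form a soft target above the current prediction, pulling the next round toward positive. The function name, the boost factor, and this exact interpretation of "multiply" are assumptions; the rebuttal's wording is ambiguous.

```python
def soft_label_update(pred_pos_conf: float, boost: float = 1.5) -> float:
    """Next-round soft pseudo label for a false-negative patch.

    Hypothetical reading of the rebuttal: boost the predicted positive
    confidence (clipped to a valid probability) so the training target
    sits above the current prediction.
    """
    return min(1.0, pred_pos_conf * boost)

# Example: a patch predicted negative (positive confidence 0.4) receives
# soft target 0.6, nudging the next round toward a positive prediction.
print(soft_label_update(0.4))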

7. CNN architecture: ours is Deep Layer Aggregation with 34 layers; the image-level baseline uses ResNet-34; the pixel-level baseline is an ensemble of DeepLabv3+, DenseNet-121 and Inception-ResNet-v2. The pixel-level models are much more complex and deeper than ours but show worse performance; data utilization is the key factor influencing results. This is also evidenced on the Camelyon17 leaderboard, where we achieved 0.9243 (4th), outperforming the 5th place by 0.0153 with simple architectures, whose code we used for the pixel-level experiment.
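As a hedged illustration only, the image-level baseline named above (ResNet-34 with a binary head) could be instantiated as follows; the actual training code is not public, so everything beyond the architecture name is an assumption.

```python
import torch
import torchvision.models as models

# Hypothetical instantiation of the ResNet-34 image-level baseline with
# two output classes (positive/negative slide); input size is illustrative.
image_level = models.resnet34(num_classes=2)
x = torch.randn(1, 3, 224, 224)   # one patch-sized input
print(image_level(x).shape)       # torch.Size([1, 2])
```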




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This paper introduces a hybrid supervision learning method that uses both image-level and pixel-level annotations for whole-slide image classification, and the method shows better performance than approaches using only image or pixel labels. The rebuttal has addressed the reviewers' major concerns, such as the clarification of the experimental setup, the comparison with recent state-of-the-art methods, and the survey of other closely related work and its differences. Thus, the paper is recommended for acceptance.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    4



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The paper proposes a hybrid supervision approach to classify whole slide images, i.e., a large amount of image-level labels and a small amount of pixel-level labels are used. The authors clarified the experimental setup and compared their method on a public dataset in the rebuttal. The current related-work section is too short; more discussion of and comparison with related works would better show the motivation and contributions of this paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    3



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The paper has good aspects, as the original reviews commented. However, after reading the paper, I agree with the primary AC's statement about “several critical issues, such as unclear experimental setup (whether a separate validation set is used for model selection), lack of a comparison with recent state-of-the-art methods in the experiments, and insufficient survey/discussion of the relationship between the proposed method and other related work.” The authors used a well-known open dataset, so the above issues are too critical to ignore or forgive. At least the current version is not ready for acceptance, though the paper has future potential.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    12


