
Authors

Mara Graziani, Iam Palatnik de Sousa, Marley M.B.R. Vellasco, Eduardo Costa da Silva, Henning Müller, Vincent Andrearczyk

Abstract

Being accountable for the signed reports, pathologists may be wary of high-quality deep learning outcomes if the decision-making is not understandable. Applying off-the-shelf methods with default configurations such as Local Interpretable Model-Agnostic Explanations (LIME) is not sufficient to generate stable and understandable explanations. This work improves the application of LIME to histopathology images by leveraging nuclei annotations, creating a reliable way for pathologists to audit black-box tumor classifiers. The obtained visualizations reveal the sharp, neat and high attention of the deep classifier to the neoplastic nuclei in the dataset, an observation in line with clinical decision making. Compared to standard LIME, our explanations show improved understandability for domain-experts, report higher stability and pass the sanity checks of consistency to data or initialization changes and sensitivity to network parameters. This represents a promising step in giving pathologists tools to obtain additional information on image classification models. The code and trained models are available on GitHub.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87199-4_51

SharedIt: https://rdcu.be/cyl4L

Link to the code repository

https://github.com/maragraziani/MICCAI2021_replicate

Link to the dataset(s)

https://camelyon17.grand-challenge.org/Data/

https://warwick.ac.uk/fac/cross_fac/tia/data/pannuke


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a methodology to improve the reliability and explainability of saliency maps produced for histopathology. Their method improves over the standard LIME approach, a black-box, model-agnostic methodology for creating saliency maps, and was tested on three publicly available datasets. The main contribution of the paper seems to be the observation that the default superpixel approach is not optimal. The authors propose to use a pre-segmentation as the superpixel equivalent, followed by a division of the input image into nine smaller blocks, which improves the foreground-to-background balance.
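
    Illustratively, assuming a labeled nuclei mask is available (e.g., from PanNuke-style annotations), such a segmentation scheme could be built as follows; the helper name and the 3x3 background grid (nine blocks) are assumptions for this sketch, not taken from the paper's code:

    ```python
    import numpy as np

    def sharp_lime_segments(nuclei_mask: np.ndarray, grid: int = 3) -> np.ndarray:
        """Superpixel map: each annotated nucleus keeps its own integer label,
        and the remaining background is split into grid x grid blocks."""
        h, w = nuclei_mask.shape
        segments = nuclei_mask.astype(int)               # 0 = background, 1..K = nuclei
        n_nuclei = segments.max()
        # Assign every background pixel to one of the grid*grid rectangular blocks.
        rows = np.minimum(np.arange(h) * grid // h, grid - 1)
        cols = np.minimum(np.arange(w) * grid // w, grid - 1)
        block_id = rows[:, None] * grid + cols[None, :]  # shape (h, w)
        segments[nuclei_mask == 0] = n_nuclei + 1 + block_id[nuclei_mask == 0]
        return segments
    ```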

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • Very relevant topic, and the need for clinical translation of deep learning technologies
    • The selected datasets are well chosen, and the fact that they are publicly available is a good characteristic of this study
    • The approach is simple yet effective. I also highlight the fact that many interpretability approaches focus on gradient-based explanation frameworks, whereas here the authors work on the LIME system, which aims at model agnosticism

    • I appreciated the efforts from the authors to add the sanity checks as proposed by others in the past
    • The experimental setup and statistical analysis are well structured
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • It is not clear to what extent the 9-by-9 blocking approach or the segmentation-based superpixels make a difference with respect to the original LIME approach. An ablation study could probably have been added, or results mentioned within the text

    • There is no argumentation as to why the proposed combination of block division and segmentation improves the saliency maps
    • The qualitative part of the study is indeed a limitation
    • The quality of the figures could be improved: some are of low quality (for example, Figure 3A and B), and Figure 5 is very small (but I guess there was a problem of page limitation)
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Good reproducibility, which is appreciated

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • The approach seems to be more consistent and more reliable (through the sanity checks) than the original LIME approach, which is very good news. However, the qualitative analysis shows maps that resemble a segmentation mask more closely (Fig. 2). My concern here is that such resemblance might bias the evaluation, with the expert believing that these maps are more in line with an explanation. This recalls some of the arguments in the works on sanity checks, where a “good-looking” saliency map might not necessarily mean a better explanation.
    • Did the authors run a sensitivity analysis on segmentation quality and on the 9x9 division step? Overall, it would have been interesting to elucidate why this combination works (it is only mentioned that there is a better foreground/background balance).
    • Is the approach limited to the way LIME works? I believe it is not, but it would be nice to comment on this possibility.
    • Would it have been possible to include standard LIME in Fig. 3 as well? I think this would have helped to show the benefits of Sharp-LIME.
    • The authors mention improved results of their approach over GradCAM, but I could not find those results in the paper. I would also argue that a better comparison would eventually be to systems like LRP, DeepTaylor, or DeepSHAP. Actually, I was surprised to read about GradCAM, since I believed the paper to be about model-agnostic interpretability (minor).
  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Very relevant study and topic
    • Simple yet effective
    • To be improved: (i) more rationale as to why the changes to standard LIME improve the saliency maps, (ii) a (semi-)quantitative evaluation of the newly produced saliency maps (e.g., a randomized rating-based approach).
  • What is the ranking of this paper in your review stack?

    3

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    This paper proposes Sharp-LIME: a refined application of LIME to histopathology images by leveraging nuclei annotations, creating a reliable way for pathologists to audit black-box tumor classifiers. The obtained visualizations reveal sharp, neat, and significantly high attention of the deep classifier to the neoplastic nuclei in the dataset, an observation in line with clinical decision making. Compared to standard LIME, the Sharp-LIME explanations show improved understandability for domain-experts, report higher stability, and pass the sanity checks of consistency to data or initialization changes and sensitivity to network parameters.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This work focuses on a very relevant topic, as it is very important to provide better explanations to clinicians. The proposed method represents a promising step in giving pathologists tools to obtain additional information on image classification models. The paper also supports reproducibility, as the authors claim that the code and trained models will be released on GitHub.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Figures 4 and 5 are not clear to the reader.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper supports reproducibility, as the authors state that the code and trained models will be released on GitHub.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Figures 4 and 5 are not clear to the reader; they should be improved.

  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work proposes a novel method that builds upon a well-established method - LIME - with shown improvements. It is scientifically sound, well written, and well organised. Thus my recommendation is Accept.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    4

  • Reviewer confidence

    Somewhat confident



Review #3

  • Please describe the contribution of the paper

    The paper proposes an attribution method building on top of the LIME method. The original LIME method perturbs a random selection of superpixels to find perturbed samples in the neighborhood of the image. The paper proposes using superpixels based on nuclei contours; the resulting attribution is visually aligned with the contours of the nuclei in the image (sharper, thus referred to as Sharp-LIME). The attribution method is evaluated qualitatively by experts and quantitatively with a randomization sanity check, with consistency across different optimization seeds, and through the values of the explanation weights (nuclei vs. background).
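
    As a hedged sketch, this core change amounts to replacing LIME's default quickshift superpixels with fixed, annotation-driven segments via the `segmentation_fn` hook of the `lime` Python package; `image`, `model.predict`, and `nuclei_mask` are placeholder inputs, and `sharp_lime_segments` is the illustrative helper sketched under Review #1:

    ```python
    from lime import lime_image

    explainer = lime_image.LimeImageExplainer()
    explanation = explainer.explain_instance(
        image,                                  # H x W x 3 histopathology patch
        classifier_fn=model.predict,            # batch of images -> class probabilities
        top_labels=1,
        hide_color=0,                           # perturbed superpixels are blacked out
        num_samples=1000,
        # The only change from standard LIME: fixed nuclei-contour superpixels
        # instead of the default unsupervised quickshift clustering.
        segmentation_fn=lambda img: sharp_lime_segments(nuclei_mask),
    )
    label = explanation.top_labels[0]
    _, mask = explanation.get_image_and_mask(label, positive_only=True, num_features=10)
    ```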

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1- The paper proposes a clever variation of LIME suited for the specific problem. Although, in the reviewer’s opinion, this does not make it a superior method compared to LIME, the new method reveals different information (see detailed comments). This can be an interesting contribution if discussed as such (currently it is not).

    2- The methodology is clear and the background method is explained properly

    3- The paper evaluates the method in terms of the randomization of network parameters (randomization sanity check), showing it is sensitive to the model.

    4- The paper evaluates the method in terms of runs with different seeds, showing its reproducibility.

    5- The paper shows that the explanation values for nuclei are significantly higher than those for background superpixels.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    W1- (Qualitative evaluation) Three experts compare maps of LIME, GradCAM (which is not visualized in the paper), and Sharp-LIME. It is natural that when contours are used for superpixels, the attribution looks visually pleasing and is easier to understand. But is the attribution map reflecting the model behavior? Do the values show the contribution of the features to the output? In a trivial case, one could show a segmentation of the image where each segmented component is colored with a random value; the resulting artificial attribution map would be considered more understandable by experts.

    W2- (Is Sharp-LIME revealing contributing features?) This evaluation is missing; the “quantification of network attention” experiment is not enough. Refer to the detailed comments.

    W3- (Comparison to other attribution methods) Among the many attribution methods available (e.g., GradCAM, SHAP, Extremal Perturbations, Integrated Gradients, Information Bottleneck, LIME, …), the authors have resorted to LIME only and improved upon that. The reasoning for resorting to LIME (rather than GradCAM or Extremal Perturbations, for example) and comparative experiments are missing (only in the qualitative evaluation is it stated that GradCAM was shown to experts).

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • The code is stated to be available on GitHub but is not provided in the supplementary materials
    • The details regarding experiments are provided
    • Considering the clarity of results, experiments and the method, the results are trustworthy, even if the code is absent
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    W2 - (evaluation of contributing features) There are several approaches to evaluate this objectively [1,2]. The underlying principle behind all these methods is to remove (perturb) the pixels (superpixels) and observe the change in the output of the neural network. If the removed pixels have a significant effect on the output and are also revealed to be important by the attribution map, then the attribution map is correct, whether it looks visually pleasing or not. The reviewer acknowledges that the quantification of network attention is a step in this direction, and it is shown that the values are higher for nuclei compared to the background. But are the values for the nuclei correct? How about the comparison to the original LIME?
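
    A minimal sketch of such a deletion-style test in the spirit of [1,2]; `predict_fn`, the `segments` map, and the per-segment attribution `weights` are assumed inputs for this illustration (for a lime explanation, the weights could be taken as `dict(explanation.local_exp[label])`):

    ```python
    import numpy as np

    def deletion_curve(image, segments, weights, predict_fn, target_class, fill=0):
        """Class score as the most important superpixels are occluded in turn."""
        order = sorted(weights, key=weights.get, reverse=True)  # ids, most important first
        perturbed = image.copy()
        scores = [predict_fn(perturbed[None])[0, target_class]]
        for seg_id in order:
            perturbed[segments == seg_id] = fill                # occlude next superpixel
            scores.append(predict_fn(perturbed[None])[0, target_class])
        return np.array(scores)  # a faithful attribution yields a steep early drop
    ```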

    (Sharp-LIME not better, but a different view) It is the reviewer’s understanding that Sharp-LIME is not better than LIME; the two reveal different information. In the original LIME method, where random superpixels are used, the goal is to find which regions contribute to the output without any priors about what exists in the image. This has the advantage of revealing contributing features objectively: some features (even background features) could be contributing to the output that we do not expect, and LIME (like many other attribution methods) aims to find the contribution of all regions without knowing what features are in the image. The proposed methodology is evaluating something else. It places a prior: it states that we know some objects exist in the image and we want to see whether these objects are important for the output, therefore assigning a single contribution score to each object (which could also be done using occlusion, for example). The paper uses LIME to compute the contribution of this superpixel of interest. The method is therefore evaluating whether this superpixel is important compared to background superpixels, which is informative for users. But it does not mean that it is better than, for example, GradCAM, which highlights any region that may contribute to the output. It may be that only part of a nucleus is important; if we have selected the entire nucleus as a superpixel, we will assign a single score to it. In this specific case, a method such as GradCAM/LIME (or a flawless method with the same concept) can reveal such information. In general, the reviewer believes that this difference should be clearly highlighted throughout the paper (starting from the abstract) to make the contribution clear for users, instead of stressing that Sharp-LIME is sharp and better.

    [1] Hooker, Sara, et al. “A benchmark for interpretability methods in deep neural networks.” NeurIPS (2019).
    [2] Samek, Wojciech, et al. “Evaluating the visualization of what a deep neural network has learned.” IEEE Transactions on Neural Networks and Learning Systems 28.11 (2017): 2660-2673.

  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper modifies LIME in a reasonable way. The qualitative evaluation has issues, and a quantitative experiment for quantifying the importance of features is missing. However, the strengths of the paper balance the weaknesses. In the reviewer’s opinion, selecting pre-defined superpixels reveals the behavior of the model from a different perspective, which is informative but does not mean that it is better. This aspect is not discussed in the paper; a clear discussion of this aspect, which is the main difference from LIME, is “necessary” for the contribution to be clear to readers.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper proposes a methodology to improve the explainability of saliency maps produced for histopathology. The work is novel and useful.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    3




Author Feedback

We thank the reviewers and the area chair for the positive feedback. We read the comments carefully, and there were no major misunderstandings or criticisms. We clarify a few minor concerns in the following:

  • Clarify what the improvement is with respect to the original LIME (R1 and R3): Sharp-LIME visualizations show lower variability and have higher explanation weights. The explanations are also clearer and more understandable to the consulted clinicians than those of traditional LIME. The improvement results from a more appropriate segmentation choice than the unsupervised pixel-clustering methods used by standard LIME (i.e., Quickshift).

  • Minor misunderstanding; comparison to standard LIME was not in the Figures (R1): The qualitative comparison to standard LIME is in Figure 2. For the quantification of the explanation weights in Figure 3, we refer to a previous study on standard LIME. We added the reference to the figure caption to clarify this point.

  • “Prior” on the superpixel shape (R3): We understand and appreciate the point made by the reviewer. However, whether a particular region has a high explanation weight is independent of the heatmap being visually pleasing to the user. A heatmap with superpixels of familiar shapes but low explanation weights indicates that the particular segmentation is not generating a meaningful explanation. In our case, however, such segmentations did yield high explanation weights. At the same time, we agree that trying additional segmentation techniques and approaches may give other interesting, complementary visualizations. In particular, the 9x9 boxing applied to the background could be replaced with other segmentation methods in future studies. Rather than seeking a single “best” explanation, the goal should be to compare multiple good (high explanation weight) heatmaps and perhaps find regions of agreement. This discussion will be added to the camera-ready version.

  • Improve Figures 3, 4, and 5 (R1 and R2): The size of the figures will be increased for better visualization, and we will further increase their resolution.

  • Minor misunderstanding on reproducibility (R3): We could not provide the GitHub link upon submission due to the anonymity requirements. The code is already prepared for sharing and will be available in our repository, with a link added to the camera-ready manuscript.


