
Authors

Catarina Barata, Carlos Santiago

Abstract

Explainability is a key feature for computer-aided diagnosis systems. This property not only helps doctors understand their decisions, but also allows less experienced practitioners to improve their knowledge. Skin cancer diagnosis is a field where explainability is of critical importance, as lesions of different classes often exhibit confounding characteristics. This work proposes a deep neural network (DNN) for skin cancer diagnosis that provides explainability through content-based image retrieval. We explore several state-of-the-art approaches to improve the feature space learned by the DNN, namely contrastive, distillation, and triplet losses. We demonstrate that the combination of these regularization losses with the categorical cross-entropy leads to the best performances on melanoma classification, and results in a hybrid DNN that simultaneously: i) classifies the images; and ii) retrieves similar images justifying the diagnosis. The code is available at https://github.com/catarina-barata/CBIR_Explainability_Skin_Cancer.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87199-4_52

SharedIt: https://rdcu.be/cyl4M

Link to the code repository

https://github.com/catarina-barata/CBIR_Explainability_Skin_Cancer

Link to the dataset(s)

https://challenge.isic-archive.com/data


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper explores explainability of image-based skin cancer diagnosis by incorporating a content-based image retrieval (CBIR) module that provides the user with similar images and their diagnoses. As motivation, the paper notes that latent spaces trained for classification often do not perform well for image retrieval, and applies additional losses to strengthen separability in the learned latent space: a triplet loss, a contrastive loss, and a distillation loss.

    • Results are validated on ISIC, which is a good-sized dataset. Improvements are moderate.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The idea is logical and easy to follow. It seems to have some effect on the output, but not a huge one.
    • There is a visible improvement in the latent space visualizations.
    • Paper is relatively easy to read.
    • I like the comparison with multiple architectures
    • Empirical performance looks good
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The methodological novelty is limited
    • The idea of using CBIR to provide explainability is not new, as this was done in [1].
    • The authors do not appear to compare to [1], which differs from the current approach in that its triplet loss for CBIR was trained separately, not jointly with the diagnostic predictor.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducibility is fine.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Please see above.

  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is nice but has limited novelty.

  • What is the ranking of this paper in your review stack?

    3

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain



Review #2

  • Please describe the contribution of the paper

    This paper tackles the problem of explainability of modern deep learning techniques by using a content-based image retrieval (CBIR) approach. This solution allows the authors to account for two of the main design needs for methodological translation to the clinic, among other listed applications: (i) reliability and (ii) a visualization interface.

    The approach presented here incorporates regularization losses into the standard cross-entropy loss for prediction: a triplet loss, a contrastive loss, and a distillation loss. The authors show that these three losses force the latent space learned by the neural network to have a better structure (e.g., examples from the same class lie closer to each other and farther from other classes). This helps the CBIR step and, in a positive feedback loop, the result of the CBIR step can be used to generate an aggregate score (e.g., by majority voting) over the retrieved images that improves upon the original classification.
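
    To make the aggregation step concrete, here is a minimal sketch of majority voting over the K retrieved neighbors; the function and variable names are illustrative, not taken from the paper:

    ```python
    import numpy as np
    from collections import Counter

    def retrieve_and_vote(query_emb, gallery_embs, gallery_labels, k=5):
        """Toy CBIR voting: diagnose the query from the labels of its K nearest gallery images."""
        dists = np.linalg.norm(gallery_embs - query_emb, axis=1)  # Euclidean distance to every gallery image
        nearest = np.argsort(dists)[:k]                           # indices of the K closest embeddings
        votes = Counter(np.asarray(gallery_labels)[nearest])      # count the neighbors' diagnoses
        return votes.most_common(1)[0][0], nearest                # majority class and retrieved indices
    ```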

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strengths of the paper can be summarized as:

    • Very well written manuscript and pleasant to read. The problem is properly introduced and motivated, the goals and contributions are clearly stated, and the methodology is well explained. Training and evaluation procedures are also detailed.

    • A novel formulation for explainability in skin cancer: they incorporate into the standard classification scheme a CBIR module that retrieves the images most similar to the input. This is especially interesting for helping doctors in the clinic and for training less experienced practitioners.

    • Extensive testing of different configurations of the framework: hyperparameters and losses.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weakness of this work may be the presentation of the results:

    • In Table 1, I believe @GAP means that just one head of the network is used, but clarification is needed.
    • It’s not entirely clear why the combination of the three losses is not shown in Table 1.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have committed to make the code publicly available on acceptance. They also provide a link to the training data. The training strategy is explained and the hyperparameter selection technique is also detailed.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    I have just some minor comments:

    1. If space allows, a simple phrase defining what a positive and a negative example is for the triplet loss would make the reading a bit clearer (a minimal sketch is given after this list).
    2. Fig. 3 can be improved, as the legend and captions are hardly readable. Some suggestions: increase the font size, place all shared legends (per row) outside the plot, use the seaborn package, and include a grid.
    3. The caption of Fig. 4 can be placed outside the plots, as it is shared by all three subfigures.
    4. In the caption of Fig. 5, a sentence indicating what each row shows (no regularization / regularization losses) may be needed.
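
    For item 1, here is a minimal sketch of a standard triplet loss, assuming the common definition in which the positive shares the anchor's class and the negative belongs to a different class (PyTorch; all names are illustrative, not from the paper):

    ```python
    import torch
    import torch.nn.functional as F

    def triplet_loss(anchor, positive, negative, margin=1.0):
        """Standard triplet loss: pull the anchor toward a same-class embedding
        (positive) and push it at least `margin` away from a different-class
        embedding (negative)."""
        d_pos = F.pairwise_distance(anchor, positive)  # anchor <-> same-class example
        d_neg = F.pairwise_distance(anchor, negative)  # anchor <-> different-class example
        return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
    ```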

    Suggestion on future work: after going through the images provided in the results section, it would be very interesting to see how this method performs in tasks such as outlier detection or detecting label noise.

  • Please state your overall opinion of the paper

    strong accept (9)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    In my opinion, this is a strong accept. The paper is very well written and pleasant to read. The methodological approach is properly presented, and design decisions are justified or analysed in the experiments section. Finally, I believe that the benefits of the method, as well as its potential clinical impact, are of interest to the medical imaging community.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    4

  • Reviewer confidence

    Confident but not absolutely certain



Review #3

  • Please describe the contribution of the paper

    This paper proposes a deep learning method to help with skin cancer diagnosis. The method simultaneously performs a classification of images and retrieves similar images justifying the diagnosis.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strength of the paper is the capability of the model to perform the classification and the retrieval simultaneously. The second strength is the combination of several loss functions, including some that are generally used in self-supervised methods. Each of these functions is very well explained. Moreover, the results section provides a comparison of the different loss functions.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weakness of the paper is the lack of clarity of the results. It is unclear whether only one type of loss function is used or a combination of multiple loss functions. Table 1 presents results for the distillation loss; one could suppose these are results with only this loss function, but the text states “when the regularization losses are incorporated…”. Moreover, what is the significance of “@GAP” in this table? Finally, the results do not seem to include the performance when all four losses are used.

    The title and keywords of the paper suggest a focus on the explainability of the network. The content-based image retrieval module provides an interesting tool to understand which factors lead to the classification and on which criteria the images are considered similar. However, the paper does not exploit this sufficiently to support its claims about the explainability of the network.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper provides all the information needed to reproduce the work, including the range of hyperparameters used to compute the global loss function. However, the exact values used to compute the results are not specified.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Fig. 3 is difficult to read for several reasons. First, the legend and the axis values are not readable; increasing the axis font size would improve clarity. If the legend cannot be made larger, it could instead be provided in the figure caption. Second, each subfigure currently has its own range, which does not allow easy comparison between networks. Since the rest of the paper focuses only on DenseNet-121, one solution could be to present only the results for this network and provide the others as supplementary material.

    The range of hyperparameters tested is very complete; however, it is not specified for which values the results are given, except for tau_d in Table 1. One of the evaluation metrics is precision@K, but it is not clear what it refers to. Is it the mean precision over the K images provided by the CBIR?
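
    For reference, the reading suggested above would correspond to the following sketch (this is an assumption about the metric's definition, not a confirmed description of the paper's implementation):

    ```python
    import numpy as np

    def precision_at_k(query_label, gallery_labels, ranked_indices, k=10):
        """Fraction of the top-K retrieved images whose label matches the query's.
        Assumes ranked_indices orders the gallery by increasing distance to the query."""
        top_k = ranked_indices[:k]
        return float(np.mean(np.asarray(gallery_labels)[top_k] == query_label))
    ```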

    Fig. 4 presents the embedding space obtained with each loss with DenseNet-121. The authors compare it to the left part of Fig. 1. However, it is not clear from which model the embedding space in Fig. 1 is obtained. This needs to be clarified when Fig. 1 is mentioned.

    Due to the unavailability of the test set ground truth, the authors are not able to compute the recall and the precision@K. However, it is not clear which metric is used in Table 1 for the test set. Moreover, to allow a comparison, it could be interesting to compute the same metric on the validation set.

    The authors present a global loss function, but it is not used in the results section.

    In Fig. 5, it is not clear which row corresponds to the setting with or without regularization. The authors refer to Fig. 4 for the color legend, but the legend in that figure is small. It would be better to refer to Fig. 1 for both figures.

    In Section 3.1, there is an empty reference after the dataset splitting.

  • Please state your overall opinion of the paper

    probably reject (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper needs a lot of improvement to be accepted, especially in the results section and in the exploitation of explainability. The strengths do not compensate for the weaknesses.

  • What is the ranking of this paper in your review stack?

    6

  • Number of papers in your stack

    7

  • Reviewer confidence

    Confident but not absolutely certain




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper explores the explainability of image-based skin cancer diagnosis by incorporating a content-based image retrieval (CBIR) module, which is logical and easy to follow. The main concerns are that the methodological novelty is limited and that the results are not described clearly. For example, it is unclear whether only one type of loss function is used or a combination of multiple loss functions. Also, the results do not seem to include the performance when all four losses are used. Regarding the four losses, the article does not introduce their specific functions in detail or explain which problem each is meant to solve in the classification task. The authors are requested to address the reviewers’ comments point by point. Therefore, a rebuttal is appropriate.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    6




Author Feedback

We would like to thank the reviewers for their assessment of the paper and constructive suggestions. Below we address the major concerns and clarify misunderstandings.

We acknowledge that the manuscript was not clear regarding the results and the loss function used to train the model, mainly due to Table 1 and Fig. 3. First, we would like to clarify the meaning of “@GAP” in the results. As described in the last paragraph of Section 2.1, the proposed model simultaneously performs classification and CBIR. The former is provided by the “FC Classifier” head in Fig. 2, while for the latter we explored two alternatives: a) using the “FC Embedding” head (which is only used when the Triplet and/or Contrastive loss are considered); or b) using the embeddings given by the GAP layer. In the results section, we referred to option “b)” as @GAP, and we found this to lead to the best performing and more stable CBIR, when compared to the “FC Embedding”, as shown in Fig. 3. Nevertheless, we will add a more detailed quantitative comparison (as in Table 1) between these two options in supplementary material.
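
As an illustration of option b), here is a minimal sketch of nearest-neighbor retrieval on GAP-layer embeddings; the embedding extraction is assumed to have already happened, and all names are placeholders rather than the paper's actual code:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def retrieve_gap_neighbors(query_gap, gallery_gap, k=5):
    """Retrieve the K training images most similar to the query in the GAP embedding space.
    query_gap: (1, D) embedding of the query image; gallery_gap: (N, D) training embeddings."""
    nn = NearestNeighbors(n_neighbors=k, metric="euclidean").fit(gallery_gap)
    dists, idx = nn.kneighbors(query_gap)
    return dists[0], idx[0]  # distances and gallery indices of the retrieved images
```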

Second, the global loss specified in eq (2) comprises a term (L_CE) which is always used to train the classifier, and three regularization terms whose contribution is controlled by the hyperparameters (α, β, γ). In Fig. 3, the first row shows the results of combining L_CE with each regularization term (e.g., α=1, β=0, γ=0). The second row shows the results of combining multiple regularization terms, including using all four (in brown and pink), illustrating their synergistic effect. Table 1 shows the following results: classification loss alone (L_CE), the best combination of loss terms (L_CE + L_Cont + L_Dist) and their individual counterparts (L_CE + L_Dist and L_CE + L_Cont). In all these cases, the CBIR was obtained using the embeddings of the GAP layer (“@GAP” was mistakenly missing in the first two columns). We will address these issues by following the reviewers’ suggestions, namely: 1) adding the results for the missing combinations of terms in eq (2) – in supplementary material, due to the lack of space in the main document; 2) clarifying the definition of eq (2) in Section 2.1; and 3) improving readability of Fig. 3 and the captions of Table 1.
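
To make the role of the hyperparameters concrete, here is a minimal sketch of the global loss in eq (2) as described above; note that the assignment of α, β, γ to the triplet, contrastive, and distillation terms is an assumption made for illustration:

```python
def global_loss(l_ce, l_trip, l_cont, l_dist, alpha=0.0, beta=0.0, gamma=0.0):
    """Eq (2) as described: cross-entropy plus weighted regularization terms.
    alpha = beta = gamma = 0 recovers the plain classifier; a single nonzero
    weight adds one regularizer (first row of Fig. 3), and nonzero combinations
    reproduce the second row of Fig. 3."""
    # NOTE: which Greek letter weighs which term is assumed here, not confirmed by the paper.
    return l_ce + alpha * l_trip + beta * l_cont + gamma * l_dist
```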

We agree with the reviewer that exploring CBIR for explainability is not new. In fact, Section 1 identifies previous works that use CBIR on CNN features learned for classification tasks. However, they led to conflicting results about the utility of CBIR. In our paper, we argue that this arises from the lack of structure of the feature space learned for a diagnostic task, which does not account for similarities between lesions, as exemplified in Fig. 1.

To the best of our knowledge, ours is the first work to address this issue by augmenting the standard cross-entropy loss function with regularization terms. This encourages the model to also learn similarities between dermoscopy images. Fig. 3 and Table 1 demonstrate the validity of our formulation over the standard CBIR performed on features learned for the classification task (L_CE column vs. the remaining). We show that using regularization terms improves both the class recall and precision@K metrics. Additionally, two of the regularization losses explored in our work (contrastive and distillation) have never been explored as mechanisms to improve the CBIR in skin cancer. The third loss (triplet) has only been used in [1], which is conceptually different from our work. Their work aims to develop a model for retrieval only, while our model is capable of performing retrieval and classification simultaneously, as mentioned in the last paragraph of page 2 (Section 1). Our results also showed that triplet loss is the least suitable strategy to improve CBIR when using small batch sizes. We will make the differences between our work and [1] clearer in the paper.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper simultaneously performs the classification of images and retrieves similar images for skin cancer diagnosis. In this way, it can help doctors understand its decisions and allows less experienced practitioners to improve their knowledge. Thus, the method has important clinical application value. In addition, the authors have clarified the main issues raised by the reviewers, such as adding the results for the missing combinations of Eq. (2) in supplementary material and improving the readability of Fig. 3 and the captions of Table 1.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    10



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    One of the concerns for this paper is that the methodological novelty is limited. The rebuttal does not address this point; however, it addresses all other concerns, especially those regarding the loss function. This paper presents an important clinical application and is thus appropriate for MICCAI.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    4



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The problem of improving the explainability of CNNs is an important topic for skin cancer diagnosis. Unfortunately, the motivation for using four losses is still unclear; the authors did not clearly explain or discuss how/why such losses (e.g., the contrastive and distillation losses) improve skin cancer diagnosis.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    14


