
Authors

Zhen Yu, Victoria Mar, Anders Eriksson, Shakes Chandra, Paul Bonnington, Lei Zhang, Zongyuan Ge

Abstract

The concept of ugly ducklings was introduced in dermatology to improve the likelihood of detecting melanoma by comparing a suspicious lesion against its surrounding lesions. The ugly duckling sign suggests that nevi in the same individual tend to resemble one another, while malignant melanoma often deviates from this nevus pattern. Differentiating by the ugly duckling sign was more discriminatory between malignant melanoma and other nevi than quantitatively assessing dermoscopic patterns. In this study, we propose a framework for modeling the ugly duckling context for early melanoma identification (called UDTR hereafter). To this end, we construct our model in three parts: first, we extract multi-scale features using a deep neural network from lesions of the same individual; then, we learn lesion context by modeling the dependencies among lesion features using a transformer encoder; finally, we design a two-branch architecture that performs patient-level and lesion-level prediction concurrently. We also propose a group contrastive learning strategy that enforces a large margin between benign and malignant lesions in feature space for better contextual feature learning. We evaluate our method on the ISIC 2020 dataset, which consists of ∼30,000 images from ∼2,000 patients. Extensive experiments demonstrate the effectiveness of our approach and highlight the value of detecting lesions using clues from surrounding lesions rather than evaluating only the lesion in question.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87234-2_17

SharedIt: https://rdcu.be/cyl8c

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper at hand addresses the problem of melanoma detection from dermoscopic images. The authors aim to incorporate the ugly duckling principle into a deep learning model by capturing the context with multiple lesions from one patient. For this purpose, the authors employ a transformer that processes pooled feature vectors for each lesion from an encoding CNN as its tokens. Furthermore, the authors use a lesion-level loss, a patient-level loss, and a contrastive loss between features of benign and malignant lesions. The authors use the publicly available ISIC 2020 dataset. Multiple ablation experiments are performed, showing improved performance compared to a lesion-wise CNN.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • The idea of using context in terms of multiple lesions is well motivated.
    • Transformers are excellently suited for the proposed problem.
    • The authors clearly describe their validation/testing strategy.
    • The authors provide ablation experiments.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • The authors’ way of presenting results is partially confusing:
      o The authors should state which model variant they used in Table 1. The reader can guess this by comparing values, but it should be stated.
      o The authors leave out the patient-level result in Table 2 without clearly stating so.
      o The authors leave out sensitivity and specificity in Table 3.
    • There is no proper comparison to other methods in the literature:
      o Where do “score post-processing” and “feature post-processing” come from? Did the authors create these? If so, why compare to them if they do not work?
      o The ISIC 2020 challenge provides plenty of methods to compare against. I acknowledge that many of them use lots of tricks to achieve high performance, but the authors could have reproduced one of those methods in their setting (only 10-crop as test augmentation).
    • The value of Fig. 3 and 4 is unclear:
      o Fig. 3: uncalibrated softmax probabilities are generally a suboptimal measure of model uncertainty – what is the main takeaway from this figure?
      o Fig. 4: The authors do not provide an interpretation of the visualized attention weights. To the reader, it looks like a few arbitrary lesions (not the malignant one) are selected as important by the attention weights.
    • The authors do not provide significance tests (they do provide CV standard deviations, however).

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    • The authors evaluate on a public dataset (ISIC 2020 challenge), helping reproducibility.
    • The authors do not comment on the release of their code, hindering reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    • In general, I believe the authors’ approach is very reasonable, and transformers are well suited to incorporate the ugly-duckling principle. However, there are multiple points the authors should fix (presentation/description of results) and discuss (Fig. 3, 4, related work). See weaknesses for details.
    • Something else that is missing for me is a set of standard aggregation techniques for patient-level predictions, e.g., in terms of ensembling (averaging (weighted) probabilities, selecting the largest probability, etc.). This could have been done for the standard CNN.
    • The dataset is extremely unbalanced. The authors appear to achieve balanced metric results (sens./spec.); however, I cannot find any mention of balancing techniques such as loss weighting or balanced batch sampling. What did the authors use to account for class imbalance?

  • Please state your overall opinion of the paper

    borderline reject (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My rating is borderline. In general, the approach is very interesting for the MICCAI community, however, the authors should improve their results section/discussion before publication.

  • What is the ranking of this paper in your review stack?

    5

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    The authors propose a framework for modeling the ugly duckling context in melanoma identification (called UDTR hereafter). They first extract multi-scale features using a deep neural network from lesions of the same individual; then, they learn lesion context by modeling the dependencies among lesion features using a transformer encoder; finally, they design a two-branch architecture that performs patient-level and lesion-level prediction concurrently. They also propose a group contrastive learning strategy that enforces a large margin between benign and malignant lesions in feature space for better contextual feature learning.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Clear motivation and sufficient literature review.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    4.1 Please add a detailed description in the caption of Fig. 2.
    4.2 In Fig. 2, which part of the architecture on the left does the Encoder layer correspond to?
    4.3 What is the insight behind using group contrastive learning instead of image contrastive learning? Image contrastive learning here means performing the contrastive computation image by image, without grouping. I think image contrastive learning is a strong baseline for comparison.
    4.4 How is N defined? It directly relates to the contrastive learning results. How can this hyper-parameter be adapted to datasets with different numbers of images?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    good

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Figures should also have detailed captions for better readability.

  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Novelty, solid method.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    7

  • Reviewer confidence

    Very confident



Review #3

  • Please describe the contribution of the paper

    This paper proposes leveraging a transformer to detect melanoma using a patient’s entire set of dermoscopy images. First, a CNN is used to encode image features, which are then parsed by a transformer to model the dependency relationships. Finally, a two-branch network with a contrastive loss function is used to predict both lesion-level and patient-level results. The experimental results on the public ISIC 2020 dataset appear to be better than those of existing methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well organized and structured, making it easy for the audience to follow. The concept of leveraging all of a patient’s available lesion images is interesting and novel. There are sufficient experimental results to justify the proposed method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    It seems that the proposed method is a straightforward application of the transformer. Therefore, it is not clear why it is novel and why the transformer is the optimal choice.

    There is no comparison to the existing methods that are optimized for melanoma detection. Therefore, it’s challenging to understand that the proposed method has improved the state-of-the-art.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The proposed method was developed with a well benchmarked public dataset. The selected hyperparameters were also given in the manuscript.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Both lesion-level and patient-level prediction dropped beyond an image number of 30 (Table 1). Are there any reasons for that? Theoretically, the performance should increase or stay consistent as the number of images increases.

    It seems that the improvement from using global context is incremental (an increase of 0.5%; Table 2, UDTR-L vs. UDTR-LP). Is this significant? It would be good to have detection results for a CNN predicting patient-level melanoma. This would help in understanding the importance of global context.

    Some non-standard hyperparameters are used without justification. It would be better if the authors could explain their selection, e.g., c = 1.24 and m = 0.3.

  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The concept of using all of a patient’s available dermoscopy images for detecting melanoma is novel and interesting. The authors also conduct detailed experiments to justify the proposed method.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    All reviewers acknowledge that the paper introduces an interesting Transformer-based method coupled with the ugly-duckling principle for melanoma detection in dermoscopic images. However, some detailed descriptions of the proposed method are currently missing, and the presentation of the results also needs to be improved.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    3




Author Feedback

Q1. There is no proper comparison to other methods in the literature. It is unclear whether the proposed method has improved the state of the art. Also, why compare to “post-processing methods” if they do not work? Why not compare with methods from ISIC 2020? (R1 & R3)

Re: As far as we know, no previous research has explored modeling the “ugly duckling sign” for melanoma diagnosis. Thus, there is no publicly reported method for comparison. In our study, we proposed two baseline methods for comparison: 1) CNN score-based post-processing; and 2) CNN feature-based post-processing.

Since our aim is to compare lesions from the same patient to classify outlier lesions, performing post-processing on either the CNN scores or the CNN features is an intuitive and efficient way to assess the relationships among lesions. In our experiments, although we chose optimal hyper-parameters on the validation dataset, their performance dropped on the testing dataset, which indicates that these methods are unsuitable for such a complicated task as context-based outlier lesion detection.

In this study, we mainly focused on exploring how to effectively model the ugly duckling context to improve melanoma diagnosis. We compared our model with basic CNN models under the same standard training and data augmentation strategy on the ISIC 2020 dataset. The results demonstrate the benefit of using lesion context as well as the superiority of our method.

Indeed, there are plenty of methods in the ISIC 2020 challenge, but the top-ranking methods are based on heavy ensembles of multiple models with different architectures or trained with different input sizes. For instance, the 1st-ranked method ensembles 18 models for the final prediction. They also use external data to improve the performance of each single CNN model. These methods are heavily engineered for higher performance rather than being clinically inspired solutions.

Q2: What is the insight behind using group contrastive learning instead of image contrastive learning? I think image contrastive learning is a strong baseline for comparison. (R2)

Re: We use the group contrastive loss to enforce as large a margin as possible between benign lesions and melanoma in feature space. This enables the subsequent transformer model to capture the relationships among multiple lesions from a global view and make better decisions. Image-wise contrastive learning, in contrast, only learns the correlation between a pair of lesions, which does not suit the “ugly duckling” concept. In addition, we incorporated a hard-sample selection mechanism into our group contrastive learning to account for the extremely imbalanced dataset. We also tried directly using an image pair-wise contrastive loss, but it did not boost performance.
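The margin idea described above can be sketched minimally. The following is a hedged illustration, not the paper's exact formulation (which also includes hard-sample selection): a hinge loss that is zero once the benign and malignant feature-group centroids are at least a margin m apart. The feature vectors and the margin value m = 0.3 are illustrative assumptions.

```python
# Hedged sketch of a group contrastive (margin) loss between two lesion groups.
# Not the paper's exact loss; centroids and margin are illustrative assumptions.

def centroid(feats):
    """Mean feature vector of a group of lesions."""
    n = len(feats)
    return [sum(f[d] for f in feats) / n for d in range(len(feats[0]))]

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def group_contrastive_loss(benign_feats, malignant_feats, m=0.3):
    """Hinge loss: zero once the two group centroids are at least m apart."""
    d = euclidean(centroid(benign_feats), centroid(malignant_feats))
    return max(0.0, m - d)

benign = [[0.0, 0.1], [0.1, 0.0]]   # toy benign lesion features
malignant = [[0.05, 0.05]]          # toy melanoma feature, too close to benign
loss = group_contrastive_loss(benign, malignant, m=0.3)  # positive penalty
```

Operating on group centroids rather than individual image pairs is what distinguishes this from plain image-wise contrastive learning: the gradient pushes the two populations apart as a whole.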

Q4: How is N defined? How can this hyper-parameter be adapted to different datasets? (R2)

Re: N is the number of input lesion images for our model, which depends on the average number of lesions per patient. In ISIC 2020, the average number of lesions per patient is ~16. We performed an ablation study on N (Table 1), and the results show that increasing N above the average makes little difference.

Q5: The dataset is extremely unbalanced. The authors appear to achieve balanced sens./spec. What did the authors use to account for class imbalance? (R1)

Re: The sens./spec. results are strongly related to the threshold setting. We chose the optimal threshold with Youden’s index (sens. + spec. − 1), which yields balanced sens./spec. results.
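The threshold-selection step described above is simple enough to sketch. The snippet below is an illustrative, generic implementation of Youden's index maximization (not the authors' code); the toy scores and labels are made-up assumptions.

```python
# Hedged sketch: pick the decision threshold maximizing Youden's J = sens + spec - 1.
# Toy scores/labels are illustrative only, not from the ISIC 2020 experiments.

def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing sensitivity + specificity - 1."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_j = None, -1.0
    # Every observed score is a candidate threshold (predict positive if score >= t).
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        j = tp / pos + tn / neg - 1.0
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j

scores = [0.1, 0.2, 0.35, 0.4, 0.7, 0.8, 0.9]
labels = [0,   0,   0,    1,   0,   1,   1]
t, j = youden_threshold(scores, labels)  # t = 0.4, J = 0.75
```

Because J weights sensitivity and specificity equally, the selected threshold tends to balance the two, which matches the rebuttal's explanation for the balanced sens./spec. results despite class imbalance.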

Q6: The value of Fig. 3 and 4 is unclear. (R1)

Re: Fig. 3 shows how incorporating lesion contextual information changes the prediction scores for the same group of lesions. With context from surrounding lesions, the model tends to give more confident prediction scores. Fig. 4 illustrates the attention weights of our model; the line weights reflect the strength of the relation between two lesions. Usually, a melanoma is less similar to benign lesions, so it has weak connections to normal lesions.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors addressed most of the major comments in their rebuttal and clarified the missing/unclear information. While the limited comparative evaluation is still a concern, the concept of using the ugly duckling sign for melanoma detection is interesting, and hence I suggest acceptance for MICCAI 2021.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    4



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The submission proposes an ugly duckling sign detection algorithm for melanoma identification based on transformers. The proposed method is a reasonable solution to the targeted problem, and its effectiveness is demonstrated by the reported experimental results. The main concerns regarding the baselines and design choices were addressed in the rebuttal period. Therefore, this submission is recommended for acceptance. The authors are encouraged to incorporate the updates stated in the rebuttal letter into the final version.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    6



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The proposed method is of great interest and seems to provide suitable performance. However, some aspects of the experimental design remain unclear after the rebuttal. Further, there may be concerns about the validity of the study related to the choice of the final threshold used to calculate the evaluation metrics.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    12


