
Authors

Yassine Marrakchi, Osama Makansi, Thomas Brox

Abstract

Medical image datasets are hard to collect, expensive to label, and often highly imbalanced. The last issue is underestimated, as typical average metrics hardly reveal that the often very important minority classes have a very low accuracy. In this paper, we address this problem by a feature embedding that balances the classes using contrastive learning as an alternative to the common cross-entropy loss. The approach is largely orthogonal to existing sampling methods and can be easily combined with those. We show on the challenging ISIC2018 and APTOS2019 datasets that the approach improves especially the accuracy of minority classes without negatively affecting the majority ones.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87199-4_44

SharedIt: https://rdcu.be/cyl4E

Link to the code repository

https://github.com/YMarrakchi/CICL

Link to the dataset(s)

https://www.kaggle.com/c/aptos2019-blindness-detection/data

https://isic-challenge-data.s3.amazonaws.com/2018/ISIC2018_Task3_Training_Input.zip


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors address the problem of imbalanced class distributions in medical datasets. They propose a framework based on contrastive learning to better arrange the feature space for minority and majority classes. The feature space is explicitly separated into different clusters by minimizing the distance between samples from the same class and maximizing the distance between samples from different classes. The proposed method was evaluated on the ISIC2018 lesion diagnosis dataset, which contains 7 predefined categories, and on the APTOS2019 diabetic retinopathy dataset, which contains 5 classes. A comparison with different model-based approaches for addressing class imbalance was also performed.
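    This pull-together/push-apart objective corresponds to the supervised contrastive loss of Khosla et al. (2020), which the author feedback below confirms the paper builds on. A minimal PyTorch sketch of that general idea (an illustration only, not the authors' released implementation; the temperature value is an assumption):

        import torch
        import torch.nn.functional as F

        def supervised_contrastive_loss(features, labels, temperature=0.1):
            # features: (N, D) embeddings; labels: (N,) integer class ids.
            z = F.normalize(features, dim=1)              # normalized feature vectors
            sim = z @ z.T / temperature                   # pairwise similarities (N, N)
            not_self = ~torch.eye(len(z), dtype=torch.bool, device=z.device)
            # Positives: other samples in the batch with the same class label.
            pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & not_self
            # Log-softmax over all samples except the anchor itself.
            sim = sim.masked_fill(~not_self, float("-inf"))
            log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
            # Average log-probability over each anchor's positives;
            # anchors without any positive in the batch are skipped.
            n_pos = pos.sum(1)
            valid = n_pos > 0
            loss = -log_prob.masked_fill(~pos, 0.0).sum(1)[valid] / n_pos[valid]
            return loss.mean()

    Training the encoder with this loss and then fitting a standard classifier on the learned embedding is the two-stage setup the reviews below refer to.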

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well written and well organized. The implementation details are well described. The validation protocol is appropriate, and a quantitative evaluation was performed on two different datasets for different applications: skin lesion classification and diabetic retinopathy grading. The results were well discussed.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Relevant details are missing from the description of the methodology. How are the positive and negative samples generated in each batch? For the oversampling strategy, how is the original dataset extended with minority-class samples to obtain an artificially balanced dataset? Diabetic retinopathy has been widely researched, and there are many more DR-related datasets (FGADR, Messidor, EyePACS, …) that could have been used to evaluate the ability of the proposed architecture to generalize to other datasets. An F-score evaluation on the minority classes of the diabetic retinopathy dataset is missing. Overall, the improvement in accuracy with respect to the baseline and the previous methods is not significant.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The implementation details are included in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    The comparison with the baseline should include a quantitative evaluation of the accuracy on the minority classes in the two datasets. The authors should justify why resampling harms the performance, as shown in Table 2. The authors claim that the performance of the proposed method is much better than the cross-entropy baseline across all classes, but this is not clear from Figure 3.

  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The contribution is not original and the results are not convincing.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    The paper addresses imbalanced datasets with a novel contrastive learning framework that better arranges the feature space while accounting for majority/minority classes.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Excellent introduction clearly setting up problem, need, and solution
    • Great overview of previous work to provide context.
    • Intuitive approach, well-explained difference between contrastive and cross-entropy loss training
    • Reasonable experimental design with different permutations and comparative strategies
    • Results show clear advantages conferred via CL training in different settings and different classes
    • Intuitive results from ablation study
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Fig 2 could be a little more intuitive in illustrating the flow
    • No statistical testing reported
    • Improvements over comparative strategies appear minor
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • Data and code will be made available.
    • Detailed design and settings
    • No statistical analysis
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • Are there significant differences in performance in Tables 2 and 3?
    • Please fix Table 2 to be consistent with decimals/scientific notation
  • Please state your overall opinion of the paper

    strong accept (9)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Well written, with comprehensive experiments to validate the approach. While the resulting improvements are marginal, the core idea is of interest and addresses a key problem.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    4

  • Reviewer confidence

    Very confident



Review #3

  • Please describe the contribution of the paper

    This paper studies the problem of image classification under class imbalance, in particular in the setting of dermatology, and proposes a method based on representation learning with contrastive learning, followed by a standard classifier.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The problem addressed in this manuscript is timely and important
    • The presented solution is intuitive, building on extensions of other literature
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The paper strongly argues that learning under class imbalance has not been greatly studied. While I agree that more work is needed (and this manuscript contributes to that), there is extensive work that the authors may want to refer to. Besides the works the authors mention, there exists another approach to learning in these settings: minimizing a loss that is not sensitive to the class imbalance (such as the AUC). See e.g. Zhao, Peilin, et al. “Online AUC maximization.” Proceedings of the 28th International Conference on Machine Learning (2011), and Wang, Guanjin, Kok Wai Wong, and Jie Lu. “AUC-based extreme learning machines for supervised and semi-supervised imbalanced classification.” IEEE Transactions on Systems, Man, and Cybernetics: Systems (2020).
    • Parts of the manuscript are unclear. For example, the text around Eq. (1) that explains their approach is confusing, e.g. “z_i is the normalized feature vector of an image x_i, c_i is the positive set for sample i and corresponds to all samples in the batch with the same class label as i”. I’m confused as to the meaning of “sample i”: is it the image x_i? Is it a batch of examples? (A plausible reconstruction of Eq. (1) is sketched after this list.)
    • The numerical results are somewhat unclear: while the proposed method seems to provide benefits on one of the studied datasets, on the other it does not improve on methods which do not explicitly try to compensate for class imbalance. More importantly, the authors argue that their method provides the greatest advantage for the minority classes, in Figure 3. However, this does not appear to be the case: the largest improvements are obtained for some of the majority classes, and in fact, for the most extreme minority class, the method deteriorates w.r.t. the baseline that does not compensate for class imbalance, which is very strange.
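    For context, Eq. (1) presumably follows the standard supervised contrastive loss (Khosla et al., 2020), which the author feedback below confirms the paper builds on. Under that assumption, and in the reviewer's quoted notation, it reads roughly:

        \mathcal{L} = \sum_{i} \frac{-1}{|c_i|} \sum_{p \in c_i}
            \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \neq i} \exp(z_i \cdot z_a / \tau)}

    where z_i is the normalized feature vector of image x_i, c_i is the set of indices of the other in-batch samples sharing the class label of x_i (so “sample i” denotes the single image x_i), and \tau is a temperature hyperparameter.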
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • Based on my comments regarding the lack of clarity in the presentation of the method, I believe that the potential for reproducibility is only moderate, even though implementation details (hyperparameters and training settings) are provided.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • I think the work in [21, in the authors’ references] is conceptually the closest to this work. Is there a reason the authors chose not to compare with it?
    • The difference in general trends between the two datasets is somewhat surprising. Can the authors provide more details about the level of imbalance in each case?
    • The authors comment that “Besides the common average accuracy metric which is not sensitive to data imbalance…”. This comment is probably inaccurate, as accuracy is precisely very sensitive to class imbalance: consider e.g. a minority, negative, class that makes up just 1% of the samples; a constant classifier that always predicts the majority, positive, class reaches 99% accuracy while never detecting the minority class (see the short numerical sketch after this list).
    • The authors talk about “disentanglement” in the learned representations. The authors might want to rephrase this, as disentanglement is typically used to refer to representations where input features are coded by almost independent components in feature space, and not just a class separation. See e.g. Tran, Luan, Xi Yin, and Xiaoming Liu. “Disentangled representation learning gan for pose-invariant face recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
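    To make the accuracy example in the comment above concrete (a toy illustration with made-up numbers, using scikit-learn's standard metrics):

        import numpy as np
        from sklearn.metrics import accuracy_score, balanced_accuracy_score

        # Toy data: the minority (negative) class is 1% of the samples.
        y_true = np.array([0] * 10 + [1] * 990)
        y_pred = np.ones_like(y_true)  # constant classifier: always the majority class

        print(accuracy_score(y_true, y_pred))           # 0.99 -- looks excellent
        print(balanced_accuracy_score(y_true, y_pred))  # 0.50 -- chance level

    Plain average accuracy hides the complete failure on the minority class, which is the effect both the paper and the reviewer describe, just with opposite readings of the word “sensitive”.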
  • Please state your overall opinion of the paper

    probably reject (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty of the approach is somewhat limited, and the experimental comparison is unclear, as are some parts of the methodology.

  • What is the ranking of this paper in your review stack?

    4

  • Number of papers in your stack

    4

  • Reviewer confidence

    Confident but not absolutely certain




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper tackles the issue of class imbalance for medical image classification using contrastive learning. The paper seems well written for the most part. However, there are a few clarifications that need to be addressed in the final manuscript regarding:

    • including relevant references as suggested by the reviewers
    • discussion on trends observed in results
    • clarification about figures requested by reviewers
    • other writing clarifications requested by reviewers
  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    2




Author Feedback

We will revise the text to avoid the sources of confusion spotted by some reviewers, extend the related-work section to refer to the previous works mentioned by reviewer 3, and discuss the points that were highlighted by the reviewers and that we address in the detailed response below. We will have to balance these extensions against the other content, given the rather tight page limit.

  • The method is not new: The supervised contrastive loss itself is not new, but its value for classification under data imbalance has not been studied before and our paper focuses specifically on this aspect. Moreover, data imbalance is often ignored in medical imaging, and we highlight its relevance and strategies to fight it. The reviewer is right in the sense that we do not present a new fundamental method, but the paper contributes scientific insights on class imbalance and how to best work with it. We will include the relevant references suggested by reviewer 3.

  • Methodology: How are positives and negatives defined? How does oversampling work? Both questions are addressed in Section 3. We will reformulate the definition of positives and negatives and will fix the typo mentioned by reviewer 3.
    Given a sample image, all other images of the same class in the same batch form the set of positives, while all images in the batch that belong to a different class form the set of negatives. Oversampling creates copies of samples from the underrepresented classes so that the classes are balanced.
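    As a concrete illustration of this oversampling (a sketch assuming a PyTorch data pipeline; not taken from the released CICL code), each sample can be weighted inversely to its class frequency so that minority samples are drawn, i.e. effectively copied, more often:

        import torch
        from torch.utils.data import WeightedRandomSampler

        def oversampling_weights(labels):
            # Weight each sample by the inverse of its class frequency,
            # so every class is drawn equally often in expectation.
            counts = torch.bincount(labels)
            return 1.0 / counts[labels].float()

        # Hypothetical imbalanced labels: class 0 dominates.
        labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1, 2])
        sampler = WeightedRandomSampler(weights=oversampling_weights(labels),
                                        num_samples=len(labels), replacement=True)
        # Pass `sampler` to a DataLoader to obtain artificially balanced batches.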

  • Why do we not consider another diabetic retinopathy related dataset? Since we are mainly interested in validating the concept, we chose to evaluate on two datasets belonging to different domains. In this way, we guarantee that the reported benefit is due to the method and not to a domain-specific cue. Studying the generalization across datasets from the same domain remains an interesting aspect for further investigation. Since we will provide the code, the approach can be run on other datasets, also non-public ones.

  • Sensitivity of the accuracy metric to class imbalance: We agree that the formulation of the statement can be confusing and will reformulate it. We meant exactly the case the reviewer gives as an example.

  • There are different improvements on the two datasets in comparison with DANIL, OHEM, MTL, and CANet: We point out that all those methods start from a backbone pretrained on ImageNet. Therefore, their performance depends on how well ImageNet fits the target domain. All those methods fail to even outperform the baseline on ISIC18, while they give a significant improvement on APTOS19. Our method does not have this dependency on a matching pretraining dataset. It rather learns domain-specific features from scratch and performs on par with or better than those methods.

  • Why does resampling harm the performance in Table 2? We think the drop in performance on APTOS19 is mainly due to the small size of the dataset: making copies of the very few minority samples results in overfitting to them.

  • Why is there a decrease in performance for the class of lowest frequency? Our method does not guarantee an improvement for every class. Given that the 4 least frequent classes have roughly the same number of samples, we get the expected behavior for 3 of them. For all 5 non-majority classes that improve, the improvement is larger than for the majority class.

  • Missing classwise performance on APTOS19: Due to limited space, we present the classwise performance for ISIC18 only. It is more relevant since it is larger, has more classes and has stronger data imbalance. Moreover, we get the same behavior for both datasets.

  • Why is there no comparison to [21]? Since the idea of Huang et al. [21] is an extension of the triplet loss, which has been clearly outperformed by many recent methods, we restricted the comparison to LDAM, one of the most powerful recent methods that relies on a similar concept.


