
Authors

Xiaohan Xing, Yuenan Hou, Hang Li, Yixuan Yuan, Hongsheng Li, Max Q.-H. Meng

Abstract

Medical images for training deep classification models are typically scarce, making these deep models prone to overfitting the training data. Studies have shown that knowledge distillation (KD), especially the mean-teacher framework, which is more robust to perturbations, can help mitigate the over-fitting effect. However, directly transferring KD from computer vision to medical image classification yields inferior performance, as medical images suffer from higher intra-class variance and class imbalance. To address these issues, we propose a novel Categorical Relation-preserving Contrastive Knowledge Distillation (CRCKD) algorithm, which takes the commonly used mean-teacher model as the supervisor. Specifically, we propose a novel Class-guided Contrastive Distillation (CCD) module to pull closer positive image pairs from the same class in the teacher and student models, while pushing apart negative image pairs from different classes. With this regularization, the feature distribution of the student model shows higher intra-class similarity and inter-class variance. Besides, we propose a Categorical Relation Preserving (CRP) loss to distill the teacher’s relational knowledge in a robust and class-balanced manner. With the contributions of CCD and CRP, our CRCKD algorithm distills relational knowledge more comprehensively. Extensive experiments on the HAM10000 and APTOS datasets demonstrate the superiority of the proposed CRCKD method. The source code is available at https://github.com/hathawayxxh/CRCKD.
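
To make the CCD idea concrete, the following is a minimal PyTorch sketch of a class-guided contrastive loss as described above. It is our interpretation, not the authors' released implementation: the names are ours, and per the rebuttal the actual method draws teacher anchors from memory banks rather than from the current batch.

    import torch
    import torch.nn.functional as F

    def ccd_loss(stu_feats, tea_feats, labels, temperature=0.07):
        """Pull student features toward same-class teacher features
        (positives) and away from other-class ones (negatives)."""
        stu = F.normalize(stu_feats, dim=1)            # (B, D) student embeddings
        tea = F.normalize(tea_feats, dim=1).detach()   # no gradient through teacher
        logits = stu @ tea.t() / temperature           # (B, B) pairwise similarities
        # positive mask: 1 where a teacher anchor shares the sample's class
        pos = labels.unsqueeze(1).eq(labels.unsqueeze(0)).float()
        log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
        # average log-likelihood over all same-class teacher anchors
        return -((pos * log_prob).sum(1) / pos.sum(1).clamp(min=1)).mean()

Here stu_feats and tea_feats would be the student's and the EMA teacher's embeddings for the same batch, and labels the ground-truth class indices.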

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87240-3_16

SharedIt: https://rdcu.be/cyl5O

Link to the code repository

https://github.com/hathawayxxh/CRCKD

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a contrastive knowledge distillation (KD) framework that utilizes the relationships among target classes. The framework consists of two main modules: class-guided contrastive distillation (CCD) and categorical relation preserving (CRP). The CCD module guides the features of the student network by increasing the intra-class similarity between student and teacher features and decreasing their inter-class similarity. The CRP module addresses the class-imbalance problem common in medical datasets by constructing a relation graph from the centroids of each class. The effectiveness of the proposed framework is evaluated on two well-known public medical image datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The proposed method is a simple and straightforward way to induce better representations in the KD regime.
    • The paper is well organized and the description of the method is clear.
    • The experimental results support the validity of the proposed method.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The datasets selected for the experiments are not optimal for demonstrating the class-imbalance problem of medical datasets, which is the main motivation of the work.
    • The clinical significance of the study is limited, as the proposed method is neither externally validated nor evaluated on more complex images such as chest radiographs.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    • The datasets used in the experiments are publicly accessible.
    • The authors declare that they will share all the code for the experiments upon acceptance.
    • The authors also presented most of the relevant parameter settings for the experiments.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • As the main motivation of the paper is to resolve the class-imbalance problem of medical datasets, at least one experiment with a severely imbalanced dataset could have demonstrated the validity of the proposed method more clearly.
    • It would also be interesting to see how severe a class imbalance the proposed method can handle, by controlling the level of imbalance in the training dataset.
    • It would also be of interest whether the CRP module induces a more meaningful representation or relation among classes, e.g., the ordinal relationship among classes in the APTOS dataset.
  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • The motivation of the work is clear and relevant to medical image datasets.
    • The proposed method is a simple but reasonable approach to resolving an existing problem.
    • The experiments demonstrate the effectiveness of the proposed method well, with some limitations.
  • What is the ranking of this paper in your review stack?

    3

  • Number of papers in your stack

    7

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    The topic of the paper is the training of a medical classification network using knowledge distillation (KD) to avoid overfitting. The authors claim that, while KD has already been widely used for medical images, it has not yet been adapted to important challenges of working with medical images. They propose a learning framework to address intra-class variance and class imbalance with KD. The proposed approach is evaluated on two public datasets and compared with baseline approaches as well as other approaches from the literature.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    An important strength of this paper is that knowledge distillation is an interesting technique and additional research in this field is highly welcome and of high interest to the reader. Also, the authors identified some points where they see the need for improvement of the current state of the art. Last, but not least, the developed approach seems to be well-engineered.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    For me, the main weakness of this contribution is its complexity. The proposed approach consists of several different techniques and borrows from other papers. While this does not limit the value of the paper, it makes it quite difficult to read and to extract important information from. The authors had to make trade-offs due to the limited space that significantly affect the quality of the paper. Or, to rephrase it: I think the paper is very complex for a MICCAI paper and might be better published as a journal paper.
    Another weakness is the evaluation, which was performed only with a 5-fold CV. This leaves room for significant methodical overfitting, especially since the authors did not explain how they obtained or tuned their meta-parameters. This, in combination with the rather small improvement, raises doubts about the generalizability of the findings.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    In its current format, the paper should be sufficiently reproducible: enough details are described and public data has been used. The authors state in the checklist that the code will be available; however, the paper itself lacks any information about this. A big problem is that the authors do not describe how the hyper-parameters were determined.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    The paper is well written; however, there are still some points for improvement:
    1) As already stated before: the paper is quite complex for a MICCAI paper and is therefore lacking important information. While it is not possible to add this information to a MICCAI paper, it should be included in any journal version, for example:
       a. Do not offer only the formulas, but give an idea of what you intend with the math behind them. Give an interpretation of the formulas.
       b. Spend more space on explaining the workflow and give more details about the learning strategy.
       c. Discuss the pros and cons of the proposed approach. Rarely does any approach have only benefits; what are the downsides?
    2) I am not a complete expert on knowledge distillation. But as far as I know, it involves distilling knowledge from a complex to a simple model, whereas the proposed approach rather follows the “learning with teacher” setting. While this is similar to KD, it has a different aim. For me, this was confusing, especially since some of the major differences between the two techniques are ignored.
    3) Please explain how the meta-parameters such as learning rate, number of layers, temperature, etc. were obtained. Were they tuned using a single fold of the 5-fold CV? Were they set in advance? (Really?) Were they optimized over all folds?
    4) Reporting only the results of a 5-fold CV leaves room for overfitting. Was this somehow controlled for? Were other measures taken to avoid it?
    5) I would definitely appreciate a more detailed discussion. The discussion is, for me, the most important part of a paper, and the current one is rather limited.
    6) Include AUC as a score metric and provide confidence intervals or similar.

  • Please state your overall opinion of the paper

    probably reject (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is interesting and worth presenting. But the short format of a MICCAI paper might not be the best option, as the compact format requires too many compromises. Also, the evaluation of the paper might be affected by overfitting.

  • What is the ranking of this paper in your review stack?

    5

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain



Review #3

  • Please describe the contribution of the paper

    In the paper at hand, the authors propose improvements to student-teacher knowledge distillation approaches. The authors rely on the mean-teacher approach and try to improve it with respect to two crucial aspects of medical image classification problems: 1) intra-class variance/inter-class similarity and 2) class imbalance. For 1), the authors propose a contrastive learning loss that uses class labels to define positive/negative pairs. For 2), the authors enforce per-class similarity of the features learned by student and teacher. The method is evaluated on two public datasets, and an improvement over previous methods is demonstrated.
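
    To make the centroid-based relation idea concrete, here is one plausible reading in PyTorch. This is a sketch of the mechanism described above under stated assumptions (softmax relations to class centroids, matched by a KL divergence), not the authors' exact CRP loss; all names are ours.

        import torch
        import torch.nn.functional as F

        def crp_loss(stu_feats, tea_feats, stu_centroids, tea_centroids, T=1.0):
            """Match each sample's relation to the K class centroids between
            student and teacher. One centroid per class means every class
            contributes equally, regardless of its frequency."""
            def relations(feats, centroids):
                f = F.normalize(feats, dim=1)           # (B, D) sample embeddings
                c = F.normalize(centroids, dim=1)       # (K, D), e.g. running class means
                return F.softmax(f @ c.t() / T, dim=1)  # (B, K) relation distribution
            p_tea = relations(tea_feats, tea_centroids).detach()
            p_stu = relations(stu_feats, stu_centroids)
            return F.kl_div(p_stu.log(), p_tea, reduction="batchmean")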

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • The motivation, problem description, and proposed solution are very clear.
    • The paper is well-written and pleasant to read.
    • The two proposed losses are well explained and plausible.
    • A lot of details are provided, potentially aiding reuse/re-implementation by others.
    • The authors evaluate on two public datasets.
    • The authors compare to and outperform other relevant approaches.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • The authors provide a lot of detail on their hyperparameters, which is excellent and should help re-implementations. However, it is unclear how the authors chose the specific values of all these hyperparameters. In particular, a dedicated validation set or extra validation CV splits are not mentioned. How did the authors determine these hyperparameters? Tuning on the final sets that were used for reporting might indicate implicit overfitting.
    • The authors do not provide significance tests or confidence intervals. The authors appear to provide standard deviations across CV folds in Table 2; however, this is not specified.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    • The authors provide a lot of detail on hyperparameters. However, there appears to be no intent to publish the paper’s code, hindering reproducibility.
    • The authors use two public datasets, which would help in reproducing their results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    • See the weaknesses for the most important comments.
    • In general, I think there is not much that needs to be fixed. The only crucial point for me is hyperparameter selection.
    • More details on the weighted cross-entropy loss would be helpful. What did the authors use for weighting? The inverse class frequency?
    • A common way to address class imbalance is balanced batch sampling (see the sketch after this list). This would also counter one of the problems the authors mentioned with respect to [13,17], where relations within batches are used. An experiment on this would be interesting, e.g., in a follow-up journal paper.
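
    A minimal sketch of the balanced batch sampling suggested in the last point, using PyTorch's WeightedRandomSampler (the function name and batch size are illustrative):

        import torch
        from torch.utils.data import DataLoader, WeightedRandomSampler

        def balanced_loader(dataset, labels, batch_size=32):
            """labels: 1-D LongTensor with the class index of each training image."""
            class_counts = torch.bincount(labels)
            # draw each image with probability inversely proportional to its
            # class size, so batches are roughly class-balanced in expectation
            sample_weights = 1.0 / class_counts[labels].float()
            sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels),
                                            replacement=True)
            return DataLoader(dataset, batch_size=batch_size, sampler=sampler)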

  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well-written, sound in terms of methodology, and the results are interesting. I believe this paper would be interesting for the MICCAI community.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The work proposed a categorical relation-preserving contrastive knowledge distillation (KD) algorithm that applies KD to medical images while addressing their distinctive features: high intra-class variance and class imbalance. The authors took the mean-teacher model as the supervisor, proposing a class-guided contrastive distillation module to pull positive image pairs from the same class closer in the teacher and student models, while pushing apart negative image pairs from different classes. The work is particularly interesting for medical imaging research.

    The reviewers asked many detailed questions, e.g., how the hyperparameters were set, the pros and cons of the proposed approach, details on the weighted cross-entropy loss, etc. Please try to clarify these in the rebuttal; it will further enhance the current manuscript.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    1




Author Feedback

We thank all Reviewers for their comments and their agreement on the novelty of this paper. Before presenting our responses, we would like to point out that the information about class imbalance (R#1) and standard deviations (R#3 and R#4) is provided in the supplementary file of the previous submission.

AC & R#3 & R#4: hyperparameter selection. The temperature and the number of negative pairs followed the settings in Ref. [18]. The other parameters (number of epochs, learning rate, batch size, number of positive pairs) were tuned before the 5-fold CV. Specifically, we used the dataset split from Ref. [13] (70% train + 10% validation + 20% test) and tuned the hyperparameters on the 10% validation set. Once the hyperparameters were determined, we randomly split the dataset into five folds and conducted 5-fold CV.
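
A minimal sketch of this two-stage protocol, assuming scikit-learn for the splitting (the rebuttal does not name the tooling; the random seeds are arbitrary, and 10015 is HAM10000's image count):

    import numpy as np
    from sklearn.model_selection import train_test_split, KFold

    indices = np.arange(10015)  # e.g. HAM10000 contains 10015 images

    # Stage 1: fixed 70/10/20 train/val/test split (as in Ref. [13]);
    # hyperparameters are tuned on the 10% validation portion only.
    trainval, test = train_test_split(indices, test_size=0.20, random_state=0)
    train, val = train_test_split(trainval, test_size=0.125, random_state=0)  # 0.125 * 0.8 = 0.10

    # Stage 2: with hyperparameters frozen, the full dataset is re-split
    # into five folds; each fold serves once as the held-out test set.
    folds = list(KFold(n_splits=5, shuffle=True, random_state=0).split(indices))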

AC & R#3: pros and cons. Pros: our method is solid and well motivated by the challenges in medical datasets (intra-class variance/inter-class similarity and class imbalance). The proposed CCD module enlarges intra-class feature similarity and inter-class variance, and the CRP loss tackles the class-imbalance issue in relational KD. These two modules are well engineered and achieved performance gains on two public datasets. Cons: to compute the CCD and CRP losses, our method builds memory banks to store feature embeddings, thus requiring more GPU memory. Fortunately, inference speed is not affected, since only the student model is kept in the testing stage.

AC & R#4: weights used for the weighted CE loss? The weight for each class is inversely proportional to the class frequency (total number of images in the dataset / number of images in the class).
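
In PyTorch this amounts to the following sketch (the counts below are HAM10000's per-class training-image counts; whether the weights were further normalized is not stated):

    import torch
    import torch.nn as nn

    # HAM10000 counts for (nv, mel, bkl, bcc, akiec, vasc, df); total 10015,
    # majority/minority ratio 6705/115 ≈ 58, matching the figure quoted below
    class_counts = torch.tensor([6705., 1113., 1099., 514., 327., 142., 115.])
    weights = class_counts.sum() / class_counts   # total / per-class count
    criterion = nn.CrossEntropyLoss(weight=weights)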

R#1: a severely imbalanced dataset should be used for evaluation. The two datasets used in this paper are both severely imbalanced (see Tables 1-2 in the supplementary file). In the HAM10000 dataset, the imbalance ratio (number of images in the majority class / number of images in the minority class) is 58; for the APTOS dataset it is 9. Experiments on these two imbalanced datasets verify the efficacy of our method.

R#3: the paper is interesting but too complex for MICCAI and should be published as a journal paper. We thank R#3 for acknowledging the novelty of our work. However, we disagree with R#3’s judgment of this paper as “too complex and hard to read”. The other two reviewers rated the paper as “well-organized, method description is clear” (R#1) and “well-written, well explained, pleasant to read” (R#4). In spite of the limited space, this paper includes all necessary information. Some comments from R#3 suggest that he/she may have overlooked the information in the supplementary file and thus misunderstood some points of this paper. Upon acceptance, we will release the code for clarity and reproducibility.

R#3: 5-fold CV may cause overfitting. k-fold CV is widely used to avoid overfitting when evaluating models on limited datasets. In our paper, the hyperparameters were tuned on a randomly selected validation set rather than on the testing folds, so the reported 5-fold CV results are not affected by overfitting.

R#3 & R#4: confidence intervals (CI) & significance tests. For all experiments, we reported the mean ± standard deviation (stdev) in the paper (Table 2) and the supplementary file (Tables 3-4), from which the CIs can easily be computed. For instance, the 95% CIs of our method on HAM10000 are acc: [84.84, 86.5], AP: [75.5, 77.2], BMA: [76.95, 79.19], and F1: [75.88, 77.02]. The variance of each metric is caused by the varying difficulty of the different test folds. For almost every metric on every fold, our method outperforms the baselines and SOTA methods. Following R#3’s suggestion, we conducted significance tests (t-tests). For the HAM10000 dataset, the p-values for “our method > B2” are 0.01 (acc), 0.05 (AP), 0.05 (BMA), and 0.04 (F1). For the APTOS dataset, the p-values are 0.015, 0.05, 0.04, and 0.02. These results indicate that our method significantly outperforms B2. P-values for the other comparison experiments will be added to the supplementary file.
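
The rebuttal does not state how the CIs and t-tests were computed; a plausible reconstruction with SciPy, using made-up per-fold scores (the real fold-level numbers are not reported), would be:

    import numpy as np
    from scipy import stats

    ours = np.array([85.1, 86.3, 85.6, 86.0, 85.4])  # illustrative fold accuracies
    b2   = np.array([84.0, 85.2, 84.5, 84.9, 84.3])

    # 95% CI from the fold mean and standard error (t-distribution, df = 4)
    lo, hi = stats.t.interval(0.95, df=len(ours) - 1,
                              loc=ours.mean(), scale=stats.sem(ours))

    # paired t-test across the five folds; halve the two-sided p for "ours > B2"
    t_stat, p_two = stats.ttest_rel(ours, b2)
    p_one = p_two / 2 if t_stat > 0 else 1 - p_two / 2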




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The work proposed a categorical relation-preserving contrastive knowledge distillation (KD) algorithm that applies KD to medical images while addressing their distinctive features: high intra-class variance and class imbalance. The authors took the mean-teacher model as the supervisor, proposing a class-guided contrastive distillation module to pull positive image pairs from the same class closer in the teacher and student models, while pushing apart negative image pairs from different classes. The work is particularly interesting for medical imaging research.

    In the original submission, some key experimental settings and performance metrics were missing. The authors did a good job in the rebuttal of explaining them in detail.

    Among the three reviewers, R#3 raised a concern about the paper’s accessibility. With the supplementary materials together with the new material in the rebuttal, this problem may be alleviated. Therefore, an “Accept” recommendation is made to recognize the novelty of the work, its design tailored to medical images, and the solid experimental results.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    5



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposes a contrastive knowledge distillation (KD) framework that utilizes the relationships among target classes to solve a multi-class classification task. Experimental results on two public datasets demonstrate that the proposed method is effective under imbalanced data distributions. After reading the rebuttal, I think the authors have answered the reviewers’ questions, and this paper should be accepted at MICCAI.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    9



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper studies the medical image classification problem using a contrastive knowledge distillation (KD) framework that makes use of the relationships among target classes. The idea makes sense and has been validated by the experiments. The rebuttal from the authors has largely addressed the questions and concerns raised by the respective reviewers, so I recommend acceptance of this paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    3


