
# Authors

Kanghao Chen, Yifan Mao, Huijuan Lu, Chenghua Zeng, Ruixuan Wang, Wei-Shi Zheng

# Abstract

Intelligent diagnosis is often biased toward common diseases due to data imbalance between common and rare diseases. Such bias may persist even after applying re-balancing strategies during model training. To further alleviate the bias, we propose a novel method which works not in the training but in the inference phase. For any test input, based on the difference between the temperature-tuned classifier output and a target probability distribution derived from the inverse frequency of different diseases, the input is slightly perturbed in a way similar to adversarial learning. The classifier prediction for the perturbed input becomes less biased toward common diseases compared to that for the original input. The proposed inference-phase method can be naturally combined with any training-phase re-balancing strategy. Extensive evaluations on three different medical image classification tasks and three classifier backbones show that our method consistently improves classifier performance, even when the classifier has already been trained with a re-balancing strategy. The performance improvement is substantial particularly on minority classes, confirming the effectiveness of the proposed method in alleviating classifier bias toward dominant classes.

SharedIt: https://rdcu.be/cyl6d


# Reviews

### Review #1

• Please describe the contribution of the paper

The paper proposes a method that can alleviate class imbalance, which is common in medical imaging datasets. Specifically, the input data is slightly perturbed at inference time, using a method inspired by adversarial example generation. The perturbation is calculated based on the difference of the temperature-scaled classifier logits and a target probability distribution that is smoother across classes and allows the minority classes to dominate.
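For concreteness, the perturbation procedure described above might be sketched as follows. All function and parameter names, and the single-step signed-gradient (FGSM-style) update, are illustrative assumptions rather than the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def debias_inference(model, x, class_counts, temperature=2.0, epsilon=0.01):
    """Sketch of inference-phase debiasing: perturb the input toward an
    inverse-frequency target distribution, then re-run the model on the
    perturbed input. Hypothetical names; not the authors' exact code.
    """
    # Target distribution from inverse class frequencies, so minority
    # classes receive the largest target probabilities.
    inv_freq = 1.0 / torch.as_tensor(class_counts, dtype=torch.float32)
    target = inv_freq / inv_freq.sum()

    x = x.clone().requires_grad_(True)
    logits = model(x)
    probs = F.softmax(logits / temperature, dim=1)  # temperature-scaled output

    # Cross-entropy between the target distribution and the scaled output.
    loss = -(target * torch.log(probs + 1e-12)).sum(dim=1).mean()
    loss.backward()

    # Step *against* the gradient to reduce the gap to the target
    # (an adversarial-style signed-gradient update on the input pixels).
    x_perturbed = (x - epsilon * x.grad.sign()).detach()

    with torch.no_grad():
        return model(x_perturbed)  # final, less-biased prediction
```

Under these assumptions, inference costs one extra forward and backward pass per test image, consistent with the runtime discussion elsewhere on this page.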

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
• The method is very interesting. It draws inspiration from adversarial perturbations and allows models that have unavoidably overfitted to highly represented classes to improve their performance in underrepresented classes by using perturbed representations for inference.
• Experiments on 3 datasets show that the improvement brought by the proposed method is consistent.
• The method can be combined with other strategies to alleviate class imbalance successfully.
• The paper is well-written and easy to follow.
• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
• The related work section focuses on methods that combat class imbalance, with no mention or introduction of adversarial perturbations. However, later in the manuscript they form a substantial part of the method; introducing and defining them prior to the methodology would be crucial for the reader.
• The proposed method has similarities with defensive distillation [1] regarding the temperature scaling for softmax. There are substantial differences and they are used for different purposes, but I think it would be crucial to mention and cite this work. [1] Papernot, N., McDaniel, P., Wu, X., Jha, S., & Swami, A. (2016, May). Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE symposium on security and privacy (SP) (pp. 582-597). IEEE.
• In the Discussion it is mentioned that ‘…besides cross entropy for the difference measure, some other choices including the mean squared error and focal loss were also tried…’. However, the results for this statement are not shown. The exact numbers could be mentioned in the text for completion.
• The caption of Fig. 1 should be expanded to provide more details, notation and an outline of the method.
• Please rate the clarity and organization of this paper

Very Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
• There is no mention in the manuscript that the code and models will become available upon acceptance.
• The hardware utilized for training was not reported, along with the memory footprint.
• There is no statistical significance reported for the results.
• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
• In the last sentence of the abstract it is stated that the performance improvement is significant, but this is not shown with statistical tests, therefore I would replace it with the word ‘substantial’.
• Showing the confusion matrices for each dataset would highlight the improvement in each class individually and further show the potential of the method.
• In the first paragraph of the Introduction it is stated that ‘…the intelligent system especially for those rare fatal diseases…’. A disease can be rare without being fatal and the proposed method could be applied to any form of class imbalanced datasets so I would remove the word ‘fatal’.
• In 3.1 Experimental Settings there is a typographical error in the first sentence: ‘The proposed method were - was…’
• In the conclusion there is the following error ‘…with existing methods always further alleviate – alleviated…’
• In the first paragraph of the Methodology section it is stated that ‘…the classifier would be slightly induced toward minority classes…’ I would replace ‘induced’ with ‘inclined’.
• The time required to perform inference with the proposed method is not mentioned. I don’t expect the perturbation step to add a significant overhead to each test sample’s inference, but since it is a post-hoc step, it would be interesting to clarify this detail, especially to show the potential of the method for clinical deployment.

accept (8)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The method is novel and interesting. It is well explained and evaluated and could be combined with other methods to further improve the results.

• What is the ranking of this paper in your review stack?

1

• Number of papers in your stack

5

• Reviewer confidence

Very confident

### Review #2

• Please describe the contribution of the paper

This paper proposes an inference-phase input perturbation for avoiding bias in imbalanced medical image classifiers. Evaluation is reported on three different classification tasks.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
• Clearly discussed background and motivation.
• Innovative technique of alleviating data imbalance issue.
• Experimental setup is good; performance evaluation comparing against a number of methods is outstanding.
• Hyperparameter robustness is well depicted.
• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
• Inconsistent input preparation across the datasets.
• Inference requires multiple passes, resulting in potentially slower execution.
• Improving classification performance on minority classes without compromising performance on the dominant classes would be more viable.
• It should be clarified in Fig. 1 which components are used during training/testing. Also, Fig. 1 is never cited in the text.
• Please rate the clarity and organization of this paper

Satisfactory

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

It will be hard to reproduce the results.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

• “In this study, inspired by the strategy of generating adversarial examples”: the authors should add the relevant citation.

• A different acronym could be used for focal loss. ‘FC’ is a bit misleading.
• “While such perturbation could downgrade the classification performance on dominant classes, it largely improves the performance on minority classes and the overall classification performance.” It would be nice to see the class-specific performance comparisons, at least for one of the datasets.
• If I were a user, how much could this system be trusted? Especially when performance on some classes is compromised in favor of others.
• The authors should report the class distribution in each of the datasets.
• $\epsilon$ is data-specific.
• It would be interesting to know the time burden during inference with the proposed method.
• The Skin7 and OCTMNIST classes should be mentioned.
• The OCTMNIST dataset is relatively less imbalanced and has more training data, yet the classification performance is relatively poorer compared to Skin7. Is it an implicit assumption that the training data is highly imbalanced?
• In Table 1, SmallestClass and LargestClass should be #SmallestClass and #LargestClass. Providing the class distribution for each dataset would be more helpful.
• No mention of any limitation of the proposed method.

probably reject (4)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

While the experimental design and presentation of results are good, there are some concerns on the methodological feasibility and result interpretation.

• What is the ranking of this paper in your review stack?

4

• Number of papers in your stack

5

• Reviewer confidence

Very confident

### Review #3

• Please describe the contribution of the paper

This work proposes a novel method to address the class-imbalance issue in intelligent diagnosis. Different from most existing methods, this method works only in the test phase: each test example is perturbed to shift its classification prediction slightly toward minority classes, i.e., the prediction of its adversarial/perturbed example is used as the final prediction. The authors validate the method on three class-imbalanced diagnosis tasks, and the results outperform different re-balancing baseline methods that work in the training phase.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

1) The proposed method is architecture-agnostic, which means that it can be directly applied to any off-the-shelf rebalancing models under various tasks. 2) The extensive experiments on various datasets and network architectures clearly demonstrate the advantages of the proposed method in long-tailed diagnosis tasks. 3) The paper is clearly written.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

1) The proposed inference-phase method involves more computational burden than traditional training-phase methods, namely one extra forward and backward pass for each test example. The authors should discuss the computational efficiency in the paper. 2) According to my understanding, the perturbation loss, i.e., \ell in Eq. (3), is kept the same as the training loss. What if the perturbation loss is different from the training loss? 3) In Table 2, the results on OCTMNIST seem to be rounded to one decimal place. If not, why is the second decimal always zero in the table?

• Please rate the clarity and organization of this paper

Very Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

1) The proposed method has been validated on three public datasets. 2) The authors promised to release the source code once the paper is accepted. 3) The implementation details, including the hyper-parameter configuration, are sufficient for reproducibility.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

1) Addressing the class-imbalance issue from a post-hoc logit adjustment perspective is interesting. Compared to recent methods, such as [17, 19], the proposed method involves more computational burden. Maybe the authors could perturb only certain intermediate features, which would help improve computational efficiency. On the other hand, deep features carry rich semantic information, and directly perturbing the features may offer a clearer explanation of the method. 2) The core idea of this paper is, to some extent, a kind of data augmentation, and it may be used directly in the training phase for computational efficiency. 3) It would be helpful for readers to understand the changes after perturbation, e.g., whether the pixel-wise perturbation affects the semantics of the original inputs, and how the perturbation changes the lesion areas.

Probably accept (7)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The basic idea is interesting, and the experiments are sufficient.

• What is the ranking of this paper in your review stack?

2

• Number of papers in your stack

5

• Reviewer confidence

Very confident

# Primary Meta-Review

• Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

This paper proposes an inference-time perturbation to address data imbalance between classes. The reviewers thought that the method is innovative and the evaluation sufficient. However, they also raised a number of concerns, including but not limited to lack of reproducibility, potentially lowered performance on dominant classes, lack of a discussion of related work on adversarial perturbations, and lack of clarity in certain respects. The authors should carefully address these criticisms and questions in their rebuttal.

• What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

5

# Author Feedback

Q1 (Reviewers 1, 2, 3): Time burden should be mentioned. R1: Our method is applied in the inference phase, which inevitably incurs slightly more inference time. On a single NVIDIA 2080Ti GPU, without any code optimization, the average inference time per image is 0.283 seconds with our method and 0.107 seconds with the corresponding baseline. Such near-real-time inference, and the slight difference, would not affect the potential clinical deployment of the method.

Q2 (Reviewers 1, 2, 3): About potentially lowered performance on dominant classes. R2: It is common that increased performance on minority classes is accompanied by slightly lowered performance on dominant classes. This is also observed with state-of-the-art methods (e.g., LDAM). It is worth noting that our method obtains substantial improvement on small classes (e.g., recall from 75.66% to 84.36% on the smallest class of Skin7), with some decreased performance on dominant classes (e.g., recall from 97.12% to 92.82% on the largest class of Skin7). The confusion matrix and class-specific performance will be included if space allows.

Q3 (Reviewers 1, 2): About lack of reproducibility. R3: We promised to release the code in the Reproducibility Response, and will release it once the paper is published. Section 3 also contains the detailed hyper-parameter settings.

Q4 (Reviewers 1, 2): No mention of related work on adversarial perturbations. R4: Similar to existing adversarial perturbation studies, our method modifies the input based on the gradient of a loss function over input pixels. Differently, our method aims to alleviate the bias across classes, while related work aims to lead the model to make mistakes. We will include the comparison and the related work.

Q5 (Reviewers 1, 3): What if other perturbation losses are applied during inference? R5: MSE loss and focal loss were tried as mentioned in Section 3.3, in addition to the cross-entropy training loss. The results are consistent with those reported in the original version and will be included in the new version.

Q6 (Reviewers 1, 2): Provide more details for Fig. 1 and clarify which components are for training/testing. R6: Fig. 1 demonstrates the inference procedure for any test image; therefore, all components in Fig. 1 are for testing. During inference, the perturbation is computed based on the gradient of the perturbation loss over input pixels, and the perturbed input is then fed to the CNN model to obtain the final prediction. These details will be included in the Fig. 1 caption.

Q7 (Reviewer 2): Inconsistent input preparation across the datasets. R7: This is due to different image sizes and aspect ratios across datasets. In particular, each OCTMNIST image is small (28×28 pixels) and need not be resized to 224×224.

Q8 (Reviewer 2): ε is data-specific. R8: ε is dataset-specific (not data-specific) and determined on a small validation set (Section 3.1), which is a common practice for hyper-parameter setting. In addition, as shown in Section 3.3, ε is robust across baseline methods and backbones.

Q9 (Reviewer 2): Is it an implicit assumption that the training data is highly imbalanced? R9: No. The relatively poor performance on OCTMNIST is probably due to the smaller image size (28×28 pixels) rather than the lower imbalance between classes. In fact, the improvement on minority classes of OCTMNIST is higher than that on the other two datasets.

Q10 (Reviewer 2): No mention of method limitations. R10: The main limitation is the slightly higher inference time (about 0.18 seconds more per image).

Q11 (Reviewer 3): Results on OCTMNIST seem to be rounded to one decimal place. R11: OCTMNIST contains 250 test images for each of 4 classes, i.e., 1000 test images in total. An erroneous prediction on one test image contributes 0.1% to the final prediction error, so one decimal place is sufficient and exact (i.e., not due to rounding).

Suggestions and other comments will be carefully considered as well in the new version.

# Post-rebuttal Meta-Reviews

## Meta-review # 1 (Primary)

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The rebuttal addressed the issues raised by the reviewers. One of the most important concerns was computational expense, which is understood to be slightly higher than the baseline but not prohibitively so. The drop in performance on the majority class is another drawback of the method, but not surprising in approaches that aim for balanced training. Overall, the idea is innovative and the rebuttal did a good job of addressing the concerns.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

6

## Meta-review #2

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

I think the authors have done a good job of addressing each criticism point by point. With this information cleared up, and given that the idea is novel and the evaluation is good, I would accept this paper.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

2

## Meta-review #3

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The authors proposed an interesting approach for alleviating the problem of data imbalance. The concerns raised during the review process were mostly addressed in the rebuttal. The authors are encouraged to address them in the final paper as well.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

8