
Authors

Lie Ju, Xin Wang, Lin Wang, Tongliang Liu, Xin Zhao, Tom Drummond, Dwarikanath Mahapatra, Zongyuan Ge

Abstract

In the real world, medical datasets often exhibit a long-tailed data distribution (i.e., a few classes occupy most of the data, while most classes have very few samples), which results in a challenging imbalanced learning scenario. For example, there are estimated to be more than 40 different kinds of retinal diseases with variable morbidity, yet more than 30 of these conditions are very rare in global patient cohorts, which results in a typical long-tailed learning problem for deep learning-based screening models. In this study, we propose class subset learning, dividing the long-tailed data into multiple class subsets according to prior knowledge, such as region and phenotype information. This enforces the model to focus on learning subset-specific knowledge. More specifically, some related classes reside in fixed retinal regions, and some common pathological features are observed in both majority and minority conditions. With teacher models learnt on those subsets, we then distil the multiple teachers into a unified model with a weighted knowledge distillation loss. The proposed framework proved effective for the long-tailed retinal disease recognition task. Experimental results on two different datasets demonstrate that our method is flexible and can easily be plugged into many other state-of-the-art techniques with significant improvements.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87237-3_1

SharedIt: https://rdcu.be/cyl9x

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents class subset learning, dividing the long-tailed data into multiple class subsets according to prior knowledge. This enforces the model to focus on learning subset-specific knowledge and can be effective for the long-tailed retinal disease recognition task.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper considers the challenge of multiple labels in long-tailed disease recognition and designs three rules (shot-based, region-based, and feature-based) to divide the long-tailed data into multiple subsets. In the division, prior knowledge is considered to enforce each sub-model to focus on its specific subset knowledge. The idea is sound and interesting.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The novelty is limited. The motivation of dividing long-tailed data into multiple subsets already exists in many published papers, such as [13]. Apart from the dividing rules, the proposed method is similar to ref. [13], including the multi-weighting formulation in Eqs. 1-3. More details about the differences from ref. [13] should be given.
    • As stated on p. 2, the multi-label challenge is not discussed and cannot be decoupled directly by ref. [13], but how the proposed method deals with the multi-label challenge is not explained.
    • The evaluation metrics are incomplete. The paper evaluates disease recognition (a classification problem) only with mAP, which is often used to evaluate localization problems in computer vision. Accuracy, precision, recall, AUC, F-score, and other metrics should be added to fully demonstrate the method's effectiveness.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The division of the long-tailed data is hard to follow because there are three rules; in some scenarios, those rules conflict.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    See the weaknesses listed above.

  • Please state your overall opinion of the paper

    probably reject (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is similar to an existing paper, and significant evaluations are missing.

  • What is the ranking of this paper in your review stack?

    6

  • Number of papers in your stack

    6

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    This paper aims to alleviate the negative impact of the long-tailed sample distribution on the training process for retinal disease recognition. Multiple teacher networks are trained on multiple subsets of the whole dataset, and a unified network (serving as the student) is then trained with simultaneous supervision from the ground truth and the teacher networks (weighted knowledge distillation).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    ++ The strategy of using knowledge distillation from multiple teachers to tackle the long-tailed effect is sensible and interesting.

    ++ Well written.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    – Some descriptions and discussions may not be clear enough, e.g., “There are two main advantages: (1)… (2)… ”.

    – Some notation definitions may need to be refined, e.g., ‘we divide them into several subsets {… …}’: the subscripts indicate there are only 3 subsets instead of k (should k be the number of subsets instead of the subset ID, as suggested?). Also, under a multi-class setting, it’s better to formulate the prediction of each class as a binary output.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It might be difficult to reproduce the same results shown in this paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Some detailed information should be explained clearly; for example, what are ERM and RW, how is OLTR used in this paper, and what is the difference between Ours (shot-based) and Ours + OLTR? Overall, too much of the information is unclear, and it is very difficult to judge whether the performance is convincing.

  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall this is an interesting work, but it is let down by the clarity of its presentation and its notation definitions.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain



Review #3

  • Please describe the contribution of the paper

    A method for handling long-tail disease classification on retinal fundus images is introduced. It is based on clustering classes together based on three different criteria: frequency of the categories (shot), typical localization of lesions (region), and phenotypical manifestations (feature). Different models are trained to distinguish such super-categories, and then weighted knowledge distillation is applied to generate a model that can perform fine-grained classification. Results seem to indicate improvement in performance on the tail part of two datasets, although no improvement is reported on the body and head parts.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The clinical significance of long-tail recognition is large in retinal image analysis, due to the existence of rare pathologies. The idea of grouping categories together and then distilling models trained on those groupings into a single model makes intuitive sense, and the benchmarking considered several other state-of-the-art techniques.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weakness of this paper is its confusing technical explanations. When reading this work, it feels as if a longer manuscript has been cut to an 8-page limit, losing in the process several key details, which makes the understanding quite difficult. For instance, on page 4, one can read:

    “subsets can help reduce the label co-occurrence in a multi-label setting. More specifically, for S_original with N_all instances. We randomly select some instances […].”

    In this case, the middle sentence is completely meaningless, and makes it very hard to understand what is explained at this point.

    This confusion is particularly concerning regarding the way the different groupings of super-classes were designed. As an example, in the region-based explanation on page 4, we can read:

    “Here, we divide the lesions/diseases which are frequently appeared in the specific regions into the same subsets, as Table 2 shows.”

    Table 2, which appears only on page 7, contains the level of label co-occurrence in each dataset. How can we understand from this the way in which diseases appearing in different locations were clustered? The mechanism for grouping together different localizations is very unclear. Also for the feature-based grouping, we have:

    “Many diseases share similar pathological features. […] Therefore, we divide the diseases which share similar semantics into the same class-subsets, and those common features can be shared and learned between the majority and minority.”

    There is no detail or explanation of how this grouping was created. What are the pathological features that were considered in order to separate the different classes?

    In general, the mechanism for creating the relational subsets is a critical component of this work, and its description does not seem to have received enough care (except for the obvious shot-based variant). Without further explanations, this work appears to me as highly irreproducible for practitioners.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Overall, it seems the authors complied with what they promised in their checklist, with one (relevant) exception:

    • A description of results with central tendency (e.g. mean) & variation (e.g. error bars): the authors answer yes, but there is no variation measure (std) in their tables.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    As I stated above, I believe the authors should substantially expand their explanations on how they created the regional and feature-based groupings in section 2.1, right now it is very hard to understand. This is, I believe, the main weakness of this paper. Below I give more detail on other less relevant (but important!) weaknesses.

    • Calibration seems to be a rather critical aspect of the soft target creation that drives the distillation (a sketch of temperature-scaled soft targets follows this comment list). Could you please clarify how the hyper-parameter T (temperature) was fixed? This is typically done on the validation set, but since the evaluation here is done in a 4-fold cross-validation manner, this is not clear. Did you generate four different models that were then used on the test set one by one? In that case, why not report mean ± standard deviation over those four values?

    • The discussion of reference [13] in the introduction seems out of place, since this is a generic method for long-tailed image recognition (there are tons of recent methods for general long-tailed classification). I believe this section should probably be restricted to retinal fundus images and long-tailed applications, in which case I would suggest removing [13], and citing and discussing this paper instead: https://www.sciencedirect.com/science/article/abs/pii/S1361841520300256

    • I understand that the BCE part of the final loss uses hard labels, whereas the KL loss uses soft labels. Is this the case? It would be better to make this explicit in equation (4), I think.

    • In the bottom part of Table 3, what does Ours refer to, is it shot-based, region-based, or feature-based?

    • The subsection “Visualization of the label co-occurrence” does not seem to have anything to do with ablation studies and should probably be moved out of this section. Moreover, I believe it is rather obvious that stratifying classes into many-shot, average-shot, or few-shot sub-groups would result in what we see in Fig. 4, so I would advise removing this subsection entirely (including Fig. 4), making room for expanded technical explanations in the body of the paper, particularly in the grouping section (2.1).

    • I am a bit skeptical about using lesion localization as a grouping criterion while at the same time resizing images to a resolution of 256×256. Considering the powerful cluster of 8 GPUs used in this work, why do we need to use such a low resolution? Is it in order to have a very large batch size? The authors forgot to indicate the batch size they used, by the way, which could be very important with a large class imbalance.

    • For all the technical machinery displayed in this paper, the results are a bit disappointing. The middle section of Table 3 shows that “Ours” does not achieve any top performance except on the Lesion-48 dataset for the head part of the data, and when used in conjunction with other losses it only seems to improve performance for the Focal loss (the 69.75 it achieves on the medium part of the data should be bold-faced, by the way). It is even more concerning that the ablation study in Table 4 shows the simple empirical risk minimization approach achieving greater performance on the head and medium parts of the data, being outperformed only in the tail. The improvement in the tail is large and results in greater average performance, but all in all, a lot of performance on the biggest part of the dataset is sacrificed to achieve better performance on the few-shot part. The authors should probably discuss whether this is of enough clinical relevance to warrant interest in the approach.
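
    To make the reviewer's question about the temperature T concrete, here is a minimal sketch (an illustration, not the authors' implementation) of temperature-scaled soft targets, assuming sigmoid outputs as in the paper's multi-label setting; the function name and logit values are hypothetical.

    ```python
    # Minimal sketch: how the distillation temperature T shapes soft targets.
    # Assumes sigmoid (multi-label) outputs; values are illustrative only.
    import torch

    def soft_targets(teacher_logits: torch.Tensor, T: float) -> torch.Tensor:
        # Higher T flattens the teacher's outputs, exposing more information
        # about near-threshold classes; T = 1 recovers the raw predictions.
        return torch.sigmoid(teacher_logits / T)

    logits = torch.tensor([4.0, -1.0, 0.5])
    print(soft_targets(logits, T=1.0))  # sharp:   ~[0.982, 0.269, 0.622]
    print(soft_targets(logits, T=4.0))  # flatter: ~[0.731, 0.438, 0.531]
    ```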

  • Please state your overall opinion of the paper

    probably reject (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I was inclined to assign this paper a “reject”, but it may be that adding further explanations of the relational subset generation could improve the quality of this work. I remain open to revising my opinion if this is properly addressed.

  • What is the ranking of this paper in your review stack?

    4

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This is an interesting work addressing the long-tailed data problem in retinal image analysis, but the reviewers have raised several key questions. First is the difference with respect to reference [13] and other previous methods for long-tailed data. Second is the clarity of presentation; many examples have been pointed out by Reviewer 3. Third, there is also some concern regarding the performance of the proposed method.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    8




Author Feedback

Q1. The difference with respect to reference [13]. R. a) [13] is designed for multi-class long-tailed learning and includes several components, such as class-balanced sampling, that have been shown to be not beneficial for multi-label long-tailed recognition due to label co-occurrence. Besides, the curriculum learning applied in [13] is not suitable for a multi-label setting. b) For multi-teacher knowledge distillation, the teacher and student models have different output dimensions; [13] separates the logits of the student. However, it is difficult to unify the scales of softmax outputs from multiple teachers, which makes training hard. Under the multi-label setting, the sigmoid outputs are relatively independent and the multi-weighting is easier to adjust. c) [13] simply divides the original samples into less imbalanced subsets and leverages knowledge distillation for model training. In our scenario, we fully consider the relevance between retinal diseases and leverage prior medical knowledge as auxiliary information. Overall, compared with [13], our proposed method has stronger clinical motivation and is more suitable for medical tasks and multi-label settings.
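
To make the multi-weighting over sigmoid outputs concrete, the following is a hypothetical sketch of a weighted multi-teacher distillation loss. It assumes each teacher's logits have already been mapped into the full label space; the function name, subset masks, and weights are illustrative rather than the authors' released code.

```python
# Hypothetical sketch of weighted multi-teacher distillation with sigmoid
# (multi-label) outputs; illustrative only, not the authors' implementation.
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, class_masks,
                          teacher_weights, targets, T=2.0, alpha=0.5):
    # Hard-label BCE over all classes.
    hard = F.binary_cross_entropy_with_logits(student_logits, targets)
    soft = 0.0
    for t_logits, mask, w in zip(teacher_logits_list, class_masks, teacher_weights):
        # Temperature-softened teacher predictions; sigmoid keeps classes
        # independent, so per-teacher terms can simply be weighted and summed.
        soft_target = torch.sigmoid(t_logits / T)
        per_class = F.binary_cross_entropy_with_logits(
            student_logits / T, soft_target, reduction="none")
        # Average only over the classes this teacher is responsible for.
        masked = (per_class * mask).sum(dim=1) / mask.sum().clamp(min=1)
        soft = soft + w * masked.mean()
    return alpha * hard + (1 - alpha) * soft

# Toy usage: 10 classes split between two subset teachers.
B, C = 4, 10
student = torch.randn(B, C)
teachers = [torch.randn(B, C), torch.randn(B, C)]
masks = [torch.zeros(C), torch.zeros(C)]
masks[0][:6] = 1.0
masks[1][6:] = 1.0
y = torch.randint(0, 2, (B, C)).float()
loss = multi_teacher_kd_loss(student, teachers, masks, [0.5, 0.5], y)
```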

Q2. Clarity of presentation of the subset generation. R. In this paper, we present three rules for subset generation. Besides the shot-based variant, we also provide some basis for the region-based variant in Table 1, and for the feature-based variant in Table 1 of the supplementary file. We have added an overview with a description of each subset-generation rule. We will release the code and datasets (including labels for common and rare diseases) used in this paper to make this work reproducible for other practitioners.

Q3. Concerns about performance. R. Our method achieves competitive performance in its plain form, gaining large improvements over many SOTA methods such as OLTR while achieving comparable results with DB Loss. It does not contain any complex or computationally heavy components, which makes it easy to combine with other methods to achieve SOTA performance. From the perspective of the long-tailed challenge, the numbers of categories in the head, medium, and tail parts are not the same, as we show in Fig. 5. In this work, we sacrifice minimal accuracy on a few head categories, which in return brings a huge performance improvement for many categories in the tail. From the perspective of real-world medical applications, for human experts, common diseases are easy to diagnose while rare diseases are easily missed. Improving the accuracy of rare-disease recognition can therefore make a CAD diagnosis system more useful to clinicians.

Q4. Multi-label challenge discussion and insights. R. As discussed in the Introduction and Method sections, resampling and re-weighting strategies are not applicable under the multi-label setting; from the experimental results, we observe that resampling makes the performance worse. This challenge is tackled by two strategies in this work: a) Apart from the shot-based variant, the feature-based and region-based variants fully consider the relevance between various diseases. As shown in Fig. 4, subset generation makes the distribution relatively less imbalanced, and learning from relational subsets enforces the model to learn subset-specific information. b) We leverage multi-teacher knowledge distillation to distill the separate knowledge into a unified model. In this manner, the risk of regarding out-of-subset classes (e.g., classes with label co-occurrence but not included in the subsets) as outliers is reduced.

Q5. Metrics such as AUC, recall, and precision should be added. R. We think Reviewer 1 misunderstood the mAP metric. As shown in reference [12], mAP is commonly used as the metric for multi-label long-tailed classification. mAP is calculated from the precision-recall curve, which fully considers the trade-off between recall and precision, like AUC, but is more suitable for evaluating performance on highly imbalanced data.
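
For concreteness, multi-label mAP is the macro average of per-class average precision, each computed from that class's precision-recall curve. The sketch below (with made-up toy data) illustrates this using scikit-learn's average_precision_score.

```python
# Toy sketch of multi-label mAP: the macro average of per-class average
# precision, each derived from that class's precision-recall curve.
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([[1, 0, 1],
                   [0, 0, 1],
                   [1, 1, 0]])            # ground-truth labels per class
y_score = np.array([[0.9, 0.2, 0.8],
                    [0.1, 0.3, 0.7],
                    [0.8, 0.6, 0.4]])     # per-class sigmoid scores

mAP = average_precision_score(y_true, y_score, average="macro")
print(f"mAP = {mAP:.3f}")
```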




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This is an interesting work on tackling the long tailed problem in retinal disease recognition. While there is still some concern regarding the performance of the method, the rebuttal provides a reasonable argument about the clinical utility of the proposed method.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    9



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors proposed class subset learning by dividing the long-tailed data into multiple class subsets according to prior knowledge. Overall, the concept is interesting, and the paper is well written. However, the novelty of this work is somewhat limited, and it might be difficult to reproduce the results shown in this paper. The experiments are not solid enough, and it would be great if the authors could provide more experimental results in a future submission.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    13



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The proposed student-teacher based solution is an interesting and sensible approach to addressing the long-tailed classification challenge in a multi-label setting. The solution is somewhat specific to retinal image applications, so it may not be directly applicable to other organs. In the rebuttal, the authors were able to argue why [13] is not applicable to their multi-label case. I also agree with their comment in the rebuttal that a small decrease on the head/medium categories is a worthwhile trade-off for the improvement gained on the more challenging tail categories.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    7


