
Authors

Yan Li, Kai Lan, Xiaoyi Chen, Li Quan, Ni Zhang

Abstract

The performance of anatomy site recognition is critical for computer-aided diagnosis systems, such as the quality evaluation of endoscopic examinations and the automatic generation of electronic medical records. Achieving an accurate recognition model requires extensive training samples and precise annotations from human experts, especially for deep learning based methods. However, due to the similar appearance of gastrointestinal (GI) anatomy sites, it is hard to annotate accurately, and it is expensive to acquire such a high-quality dataset at a large scale. Therefore, to balance the cost-performance trade-offs, in this work we propose an effective annotation refinement approach which leverages a small amount of trust data, accurately labelled by experts, to further improve training performance on a large amount of noisy-label data. In particular, we adopt noise-robust training on the noisy dataset with additional constraints of adaptively assigned sample weights that are learned from the trust data. Controlled experiments on synthetic datasets with generated noisy annotations validate the effectiveness of our proposed method. For practical use, we design a manual process to come up with a small amount of trust data with reliable annotations from a noisy upper GI dataset. Experimental evaluations validate that leveraging a small amount of trust data can effectively rectify incorrect labels and improve testing accuracy on noisy datasets. Our proposed annotation refinement approach provides a cost-effective solution to acquire high quality annotations for medical image recognition tasks.
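For context on the "generated noisy annotations" used in the controlled experiments, below is a minimal sketch of symmetric label-noise injection, the standard corruption model for such benchmarks (and the setting Review #1 comments on below). The function name and signature are illustrative assumptions, not taken from the paper.

    import random

    def add_symmetric_noise(labels, noise_rate, num_classes, seed=0):
        """Flip a fraction `noise_rate` of labels to a uniformly random other class."""
        rng = random.Random(seed)
        noisy = list(labels)
        # Pick the indices to corrupt, then reassign each to a different class.
        for i in rng.sample(range(len(noisy)), int(noise_rate * len(noisy))):
            noisy[i] = rng.choice([c for c in range(num_classes) if c != noisy[i]])
        return noisy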

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87240-3_54

SharedIt: https://rdcu.be/cyl6u

Link to the code repository

https://github.com/Liian/Few-trust-data-guided-annotation-refinement

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper addresses the data annotation problem by leveraging a small amount of trust data and applying sample weights during training. The proposed method is evaluated on a GI dataset to demonstrate its effectiveness.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper proposes a new solution to the problem of fast and cost-effective data annotation. The proposed method is able to rectify incorrect labels and improve training accuracy on noisy datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The description of the method is not very clear. For example, how is it possible to train Eq. (1) when epsilon is zero-initialized?
    2. As far as I can see, w_i is defined by the gradient g_i. Then, how is L_w trained?
    3. Using only symmetric noise is somewhat too simple.
    4. 100 trust samples is too many to be called "few".
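    To make the notation in items 1 and 2 concrete, here is a minimal sketch of the learning-to-reweight step (Ren et al., 2018) that the epsilon / g_i / w_i symbols appear to reference. This is an illustrative reconstruction, not the authors' code; the model and batch variables are assumptions, and it requires PyTorch >= 2.0 for torch.func.functional_call.

        import torch
        import torch.nn.functional as F
        from torch.func import functional_call

        def reweight_step(model, noisy_x, noisy_y, trust_x, trust_y, lr=0.1):
            """One learning-to-reweight step (hypothetical helper, Ren et al. 2018 style)."""
            # Per-sample weights epsilon, zero-initialized. Their value is never used
            # directly; only the gradient of the trust loss w.r.t. them matters, which
            # is why zero initialization (item 1 above) is not an obstacle to training.
            eps = torch.zeros(noisy_x.size(0), requires_grad=True)

            # Weighted loss on the noisy batch: zero at eps = 0, but differentiable in eps.
            per_sample = F.cross_entropy(model(noisy_x), noisy_y, reduction="none")
            weighted_loss = (eps * per_sample).sum()

            # Virtual SGD step theta' = theta - lr * grad, keeping the graph so that
            # theta' stays a differentiable function of eps.
            named = [(n, p) for n, p in model.named_parameters() if p.requires_grad]
            names, params = zip(*named)
            grads = torch.autograd.grad(weighted_loss, params, create_graph=True)
            virtual = {n: p - lr * g for n, p, g in zip(names, params, grads)}

            # Trust loss evaluated at the virtual parameters.
            trust_logits = functional_call(model, virtual, (trust_x,))
            trust_loss = F.cross_entropy(trust_logits, trust_y)

            # g_i = -d(trust_loss)/d(eps_i): positive when sample i's gradient aligns
            # with the trusted batch's gradient.
            g = -torch.autograd.grad(trust_loss, eps)[0]

            # Final weights w_i: clamp negatives to zero and normalize.
            w = torch.clamp(g, min=0.0)
            if w.sum() > 0:
                w = w / w.sum()
            return w.detach()

    The resulting w_i then weight an ordinary training step on the noisy batch (presumably the L_w term in item 2), so L_w is not trained separately: the weights are recomputed from fresh gradients at every iteration.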
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It is too hard to reproduce the results without the help of the authors.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    The description of the method details is not very clear and is hard to understand. The authors should provide more details if possible.

  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is very interesting. By using both an adaptive weighted loss and trust-data-guided annotation, it is able to improve on the performance of the baselines. The proposed method is evaluated on an upper GI dataset and the results prove its effectiveness.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    4

  • Reviewer confidence

    Confident but not absolutely certain



Review #2

  • Please describe the contribution of the paper

    This paper improves noise-robust learning by incorporating an additional clean dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • It is a practically valuable motivation to “use a small amount of clean data to further improve the noise tolerance”. This task definition lifts the unnecessary restriction that only noisy data should be used, even when a clean subset is available at low cost.
    • Good experimental results on a publicly available dataset are provided and show the effectiveness of the proposed method. The evaluations are conducted at different noise settings and noise rates.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • There is not a single figure of the GI tract, the anatomy sites, or a visualization of predictions, and the 30 classes are not named. This makes it hard to understand the task and gauge its difficulty.
    • The clarity needs significant improvements. See detailed comments.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • Code is claimed to be available, perhaps upon acceptance.
    • If the code gets released, the proposed method can at least be reproduced on the publicly available CIFAR-10 dataset.
    • The data is not downloadable, and no illustration of the data is available.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • It is even unclear what the “recognition” task is, i.e., classification, detection, or segmentation. One has to find evidence and guess from “coarsely labeled samples” that the task is classification. The authors should define the task in direct and plain language.
    • The paper lacks an estimate of how noisy the GI dataset is, which could be coarsely reported from the process of collecting the clean set. This has a high impact on the potential performance gain from building a clean dataset.
    • Fig. 2 could be designed to be more intuitive, e.g., rendered in different colors to ease comparison of the confusion matrices.
  • Please state your overall opinion of the paper

    Borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This could have been a better paper if more effort had been put into its writing and organization, i.e., if the missing materials were provided.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain



Review #3

  • Please describe the contribution of the paper

    The noisy label problem is a fundamental challenge in medical image analysis, as comprehensively labeling a large-scale medical dataset is expertise-demanding and time-consuming. The paper proposes a new method that can efficiently leverage a few images with clean labels to boost the accuracy of a model learned from large-scale noisy data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper proposes a new method that combines meta-learning-based loss re-weighting with DivideMix for the challenging problem of learning from noisy labels. The experimental results show improvements over DivideMix, a state-of-the-art noisy-label learning method, on both CIFAR-10 and a large-scale medical dataset, i.e., a GI anatomy classification dataset. The accuracy improvement on the large-scale medical dataset is by a large margin, which is appreciated.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The novelty of the paper is relatively limited. Specifically, using meta-learning to reweight losses for learning from noisy labels was already proposed in: Ren, Mengye, et al. “Learning to reweight examples for robust deep learning.” International Conference on Machine Learning. PMLR, 2018. DivideMix is likewise an existing method for addressing noisy labels. It is unclear what the performance of Learning-to-Reweight alone would be, i.e., removing the DivideMix part of the method and keeping only the meta-learning component. As the proposed method is largely a combination of these two components, this ablation study is important to figure out the contribution of each component.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The code will be released as the authors claimed in the checklist. The CIFAR-10 dataset is publicly available. Overall, the reproducibility is fair.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    • A detailed ablation study and more state-of-the-art comparison methods should be included to make the results more convincing.
    • It would be interesting to know how a change of backbone (e.g., ResNet18, ResNet34, DenseNets) would affect the final performance.
    • The authors are encouraged to quantify the noise ratio inside the medical dataset, e.g., by sampling a subset of the training data and confirming with doctors how many of the sampled data are incorrectly labeled (see the sketch after this list). This would make the paper more sound for real-world application.
    • “Hard-coded prediction” is not clearly explained.
    • It is not clear what anatomy sites are labeled and how many images there are per category. Supplementary information should be considered.
    • It would be interesting to know whether increasing the “label-refining and fine-tuning” iterations improves the final accuracy, as the paper seems to refine the labels only once.
    • It would also be interesting to know whether increasing the number of reliable annotations improves the final accuracy.
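    As one concrete reading of the noise-quantification suggestion above, here is a small sketch that estimates the noise rate as the disagreement fraction between original annotations and expert-confirmed labels on a random subset. All names are hypothetical; `original_label` and `expert_label` are assumed mappings from sample id to class.

        import random

        def estimate_noise_rate(sample_ids, original_label, expert_label,
                                sample_size=500, seed=0):
            """Estimate the label-noise rate from a doctor-reviewed random subset."""
            rng = random.Random(seed)
            subset = rng.sample(sample_ids, min(sample_size, len(sample_ids)))
            # A label counts as noisy when the expert-confirmed label disagrees
            # with the original annotation.
            mismatches = sum(original_label[i] != expert_label[i] for i in subset)
            return mismatches / len(subset)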

  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The noisy label problem is critical for medical image analysis, especially when it comes to industry-level applications. This paper proposes a potentially applicable solution to this problem using a few reliably labeled images, which is the major contribution from the reviewer’s point of view. The experimental results also show a large margin of improvement on the large-scale medical dataset. Although the novelty might be limited, overall the paper is appreciated for its contributions.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    While the reviews are all in favor, the AC thinks the following questions should be clarified before making a decision.

    1. Please address the issues in Box 4 of “Weakness” and in Box 7 of “detailed comment” raised by R#2.
    2. Please address the issues in Box 4 of “Weakness” and in Box 7 of “detailed comment” raised by R#3.
  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    3




Author Feedback

We thank all the reviewers for their detailed comments and constructive suggestions. We are encouraged that the reviewers are all in favor of our work, and credit our approach as a practical solution to the critical problem of noisy annotations in medical images that effectively leverages a few reliably labeled images. We drafted our paper using the Word template, which turned out to differ slightly from the LaTeX template in format and font size; we appreciate that the meta-reviewers adjusted their questions after figuring this out. Below we address the mentioned weaknesses:

  1. The issues in Box 4 of “Weakness” and in Box 7 of “detailed comment” raised by R#2. Our upper gastrointestinal anatomy recognition is a classification task; we will clarify this in the final manuscript. The 30 categories of key anatomical sites are defined following the standardized protocol of upper GI endoscopy examinations. Similar to [1] [2], they range from the oesophagus and the main body of the stomach to the duodenal bulb, and include various angular views. We coarsely estimate the overall noise rate of our collected GI dataset to be about 30%, based on the process of building the small set of clean labels that reached consensus among multiple doctors. We will supplement the detailed description and distribution of the 30 categories, as well as the noise estimation, in the final manuscript for better understanding.
  2. The issues in Box 4 of “Weakness” and in Box 7 of “detailed comment” raised by R#3. Our work aims at developing a practically effective solution to the noisy annotation problem in real clinical applications. It is true that we share the idea of learning-to-reweight [3], using a clean validation set to improve our label refinement performance. However, simply using meta learning demonstrated limited performance because it cannot effectively learn from noisy labels. We chose to build on DivideMix [6] due to its SOTA performance on noise learning tasks, but it does not cover the practical case in which reliably labelled samples exist. In our industry-level project, it is possible to obtain a small set of clean labels and use them to achieve a good balance of cost and performance. Therefore, we introduced the adaptive weighted loss term to guide the model to fit correct labels by giving more importance to samples whose gradients are similar to those of reliably labelled samples. We agree that it is better to analyze the effect of using only the meta-learning module and understand its contribution to the entire framework. In fact, we had conducted related experiments (which we regret were not reported). We found that the meta-learning module contributes especially in cases where the model degrades severely in the presence of high asymmetric noise rates. These ablation study results will be added to the final version. We also thank the reviewer for the helpful suggestions on our GI dataset; we will include a detailed description for better understanding, as in our reply to R#2. The “hard-coded prediction” means that we use the argmax function to convert probabilistic predictions into a single category number (see the sketch after this list). We regret that this was not explained in the paper and will add it for clarity. As for the comment on adopting an iterative process for label refining and fine-tuning, we expect that the final accuracy might be improved, but appropriate constraints would likely need to be designed to avoid trivial solutions (e.g., the model predicting the same output regardless of the input). Likewise, increasing the amount of reliable labels is sure to have a positive impact on performance. However, it involves additional cost and may thus not be practical in the short term; this inspires us to consider, in future work, which annotation strategy best addresses the cost-performance trade-offs for such tasks. We also believe that backbones with more complicated structures (and/or pre-trained weights) can bring better performance, as demonstrated by most deep learning methods. We will leave these validations for future work.
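  For concreteness, a minimal sketch of the argmax conversion described in point 2 above (the “hard-coded prediction”): probabilistic model outputs are collapsed to a single category index per image. The helper and loader names are illustrative placeholders, not the authors' implementation.

      import torch

      @torch.no_grad()
      def refine_labels(model, loader, device="cpu"):
          """Collapse probabilistic predictions into hard category indices."""
          model.eval()
          refined = []
          for images, _ in loader:
              probs = model(images.to(device)).softmax(dim=1)  # per-class probabilities
              refined.append(probs.argmax(dim=1).cpu())        # single category number
          return torch.cat(refined)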




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The concerns and questions raised by the reviewers have been addressed in the authors’ rebuttal. As this paper makes a good attempt at solving the data annotation problem by leveraging a small amount of trust data, which is of practical value for medical image computing, I recommend acceptance of this paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    1



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Handling training label uncertainty is likely to be a useful area of research. All reviewers tended more towards accepting the manuscript, and the authors say that many of the reviewers’ concerns will be addressed in the final manuscript:

    • Additional clarifications requested by the reviewers will be included, although little mention was made of clarifying the figures.
    • Authors say they will now include a measure of noise in the GI labels.
    • While the authors did not originally propose the DivideMix approach, they justify adopting it for their medical imaging task, and also introduced an adaptive weighted loss term to the model.
    • Results of a previously performed ablation study will be added to the manuscript.
    • Additional experiments to assess the effects of changing the number of reliable labels or the number of label-refining/fine-tuning iterations are unlikely to be performed in the time allowed.
  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    3



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The method addresses a timely and relevant problem. The original submission has certain limitations: the task at hand and its difficulty remained rather unclear, due to a complete lack of figures and missing information about the classes, and the importance of the previously proposed DivideMix approach for the overall success also remained unclear. These points are partly clarified in the rebuttal and should be reflected in the final version. I see no reason to overturn the reviewers’ unanimous recommendation to accept.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    4


