Authors
Zhe Xu, Donghuan Lu, Yixin Wang, Jie Luo, Jagadeesan Jayender, Kai Ma, Yefeng Zheng, Xiu Li
Abstract
Manually segmenting the hepatic vessels from Computed Tomography (CT) is far more expertise-demanding and laborious than for other structures due to the low contrast and complex morphology of vessels, resulting in an extreme lack of high-quality labeled data. Without sufficient high-quality annotations, the usual data-driven learning-based approaches struggle with deficient training. On the other hand, directly introducing additional data with low-quality annotations may confuse the network, leading to undesirable performance degradation. To address this issue, we propose a novel mean-teacher-assisted confident learning framework to robustly exploit the noisy labeled data for the challenging hepatic vessel segmentation task. Specifically, with the adapted confident learning assisted by a third party, i.e., the weight-averaged teacher model, the noisy labels in the additional low-quality dataset can be transformed from ‘encumbrance’ to ‘treasure’ via progressive pixel-wise soft-correction, thus providing productive guidance. Extensive experiments using two public datasets demonstrate the superiority of the proposed framework as well as the effectiveness of each component.
Link to paper
DOI: https://doi.org/10.1007/978-3-030-87193-2_1
SharedIt: https://rdcu.be/cyhLt
Link to the code repository
https://github.com/lemoshu/MTCL
Link to the dataset(s)
https://www.ircad.fr/research/3d-ircadb-01/
Reviews
Review #1
- Please describe the contribution of the paper
This paper presents a method to exploit two datasets in which the same structures were manually segmented, but where one dataset has significantly higher label quality (or was potentially produced under slightly different labeling instructions/criteria). The method uses a student-teacher paradigm and presents empirical results on hepatic vessel segmentation.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
First, the problem that this work is trying to address is very important and pervasive in our field. Label noise is common and often unavoidable, and with such a scarcity of segmentation labels, we are often forced to utilize labels from multiple sources with varying levels of quality.
Second, the paper is well-written and makes effective use of tables and figures to demonstrate the issue and support its arguments.
Finally, the authors used publicly available data and state that they will release their code, which will help facilitate the replication and extension of this work.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
The authors applied their method to only the single problem of hepatic vessel segmentation, and all empirical results are derived from a test set with only ten segmented cases. Because of this, even though the results look promising, it is difficult to rule out that the reported improvement is coincidental. No attempt is made to demonstrate its statistical significance.
- Please rate the clarity and organization of this paper
Excellent
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
The prospects for reproducibility of this paper are excellent: the paper uses publicly available data and the authors promise to release their code.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
I feel that this paper would be far stronger if the same experiments had been conducted on one or two other similar problems, ideally ones for which HQ test sets with more than ten cases are possible. There you would have far more statistical power to demonstrate that your results are not just a statistical anomaly.
- Please state your overall opinion of the paper
borderline accept (6)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This work is promising, but the empirical evidence that it truly represents a state-of-the-art advance is unconvincing.
- What is the ranking of this paper in your review stack?
4
- Number of papers in your stack
5
- Reviewer confidence
Confident but not absolutely certain
Review #2
- Please describe the contribution of the paper
To overcome the lack of high-quality labeled data, the authors propose a mean-teacher-assisted confident learning framework for hepatic vessel segmentation. By encouraging consistent segmentation under different perturbations of the same image, the network can exploit image information with low-quality annotations. The teacher model serves as a “third party” to provide guidance in identifying label noise and performing progressive pixel-wise soft-correction. Experiments using two public datasets (3DIRCADb, Medical Image Decathlon) demonstrate the effectiveness of the proposed contributions and show that the annotation quality of noisy labeled data can be improved with only a small amount of high-quality labeled data.
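For concreteness, the mean-teacher mechanism summarized above can be sketched as follows: the teacher is an exponential moving average (EMA) of the student, and the consistency loss compares their predictions under independent perturbations. This is a minimal PyTorch sketch with illustrative names (`student`, `teacher`, `noise_std`), not the paper's code.

```python
import copy
import torch
import torch.nn.functional as F

def ema_update(teacher, student, alpha=0.99):
    """Teacher weights track an exponential moving average of the student's."""
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

def consistency_loss(student, teacher, image, noise_std=0.1):
    """Penalize disagreement under different perturbations of the same image."""
    student_out = student(image + noise_std * torch.randn_like(image))
    with torch.no_grad():  # no gradient flows through the teacher
        teacher_out = teacher(image + noise_std * torch.randn_like(image))
    return F.mse_loss(torch.softmax(student_out, dim=1),
                      torch.softmax(teacher_out, dim=1))

# Typical usage: teacher = copy.deepcopy(student); after every optimizer step
# on the student, call ema_update(teacher, student).
```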
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper addresses a crucial but not widely investigated issue: how to robustly exploit abundant low-quality, noisily labeled data for segmentation?
- Nice set of contributions, including the use of a vessel probability map (from the Sato tubeness filter) as an auxiliary input modality and the adaptation of confident learning within a mean-teacher segmentation framework
- Methodological contributions are assessed rigorously through a detailed ablation study
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- The authors should provide more insight into the proposed progressive self-denoising process
- Lack of comparisons with the state of the art (only Huang et al. and a standard U-Net)
- Please rate the clarity and organization of this paper
Very Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
The associated code will be released after the anonymous review. In addition, the proposed method could easily be re-implemented based on the provided architecture and training information. Both the 3DIRCADb and Medical Image Decathlon datasets are publicly available.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
The submitted paper is innovative and well written. Comments provided below could be taken into account for further improvements.
Main comments :
- Method. The student model is optimized by minimizing a supervised loss on the high-quality annotated data as well as an unsupervised consistency loss between the student and teacher predictions on both datasets. The progressive self-denoising process added to the framework remains less clear. You should explicitly state that $\hat{p}_j$ is obtained from the teacher model. I also suggest providing more interpretation so that readers can precisely understand: (1) the estimation of $t_j$; (2) how the joint distribution matrix (Eq. 2) is derived from the confusion matrix (Eq. 1); and (3) the smoothly self-denoising operation (Eq. 3) (see the confident-learning sketch after these comments). Finally, I wonder why $\mathcal{L}_s$ involves cross-entropy, Dice, focal and boundary losses while $\mathcal{L}_{cl}$ uses only cross-entropy and focal losses.
- Experiments. The high-quality annotated dataset is randomly divided into two groups (10 cases for training, 10 for testing). The article does not mention any cross-validation strategy; was cross-validation used to strengthen the reported results? Even if the 3DIRCADb dataset is relatively small, you should include a validation subset to find the optimal network hyper-parameters.
- Ablation study. What does “MTCL(c) w/o SSDM” mean exactly? Does it correspond to hard-correction (instead of the proposed soft-correction, Eq. 3)?
- Comparisons. I suggest comparing against other state-of-the-art methods in addition to Huang et al. [8] and the standard U-Net to strengthen the experimental part.
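To make the confident-learning steps discussed in the method comment above concrete, here is a minimal sketch (per-class thresholds, confident confusion counts, and the calibrated joint distribution) following Northcutt et al.; all names are illustrative and this is not the paper's implementation.

```python
import numpy as np

def confident_joint(pred_probs, noisy_labels, num_classes=2):
    """Sketch of confident learning (after Northcutt et al.) for a binary
    vessel/background task; not the paper's implementation.

    pred_probs:   (N, num_classes) probabilities from the "third party"
                  (here, the teacher model).
    noisy_labels: (N,) given, possibly wrong, pixel labels.
    """
    # (1) Per-class threshold t_j: mean predicted probability of class j
    #     over the pixels currently labeled j.
    thresholds = np.array([
        pred_probs[noisy_labels == j, j].mean() for j in range(num_classes)
    ])
    # (2) Confident confusion counts C[i, j] (cf. Eq. 1): pixels labeled i
    #     that the model confidently assigns to class j (probability >= t_j).
    #     The loop is kept simple for clarity; vectorize in practice.
    counts = np.zeros((num_classes, num_classes), dtype=np.int64)
    for n, i in enumerate(noisy_labels):
        confident = [j for j in range(num_classes)
                     if pred_probs[n, j] >= thresholds[j]]
        if confident:
            j_star = max(confident, key=lambda j: pred_probs[n, j])
            counts[i, j_star] += 1
    # (3) Calibrate each row to the observed label counts and normalize,
    #     giving the estimated joint distribution Q[i, j] (cf. Eq. 2).
    label_counts = np.bincount(noisy_labels, minlength=num_classes)
    row_sums = counts.sum(axis=1, keepdims=True).clip(min=1)
    calibrated = counts / row_sums * label_counts[:, None]
    return calibrated / calibrated.sum()

# Off-diagonal mass of Q flags label noise; pixels with label i but a confident
# prediction j != i are then softly corrected (Eq. 3) rather than hard-flipped.
```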
Minor comments:
- Introduction. Instead of “noises”, I would refer to “wrong annotations” when dealing with unlabeled or mislabeled pixels.
- Related works. You should explain more precisely why Semi-Supervised Learning (SSL) fails to exploit the potentially useful information in noisy labels.
- Methods. An Exponential Moving Average (EMA) is employed. What is the influence of the EMA decay rate $\alpha$? In the same spirit, you could explain how the $n$ and $\tau$ parameters are selected and how sensitive the results are to them.
- Results. I would have integrated the quantitative results of sub-section “Effectiveness of label self-denoising” into Tab.1.
- Statistical analysis: performance gains could be confirmed using paired t-tests (see the sketch after these comments).
- Formulation. In the sentences “Extensive experiments […] demonstrate the superiority […]” (abstract), “The results demonstrate the superiority […]” (Sect. 1), and “The superior performance […]” (conclusion), you should explain with respect to what!
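As a concrete illustration of the statistical-analysis point, a paired test over per-case Dice scores could look like the following; the numbers are made-up placeholders, not results from the paper.

```python
from scipy import stats

# Paired comparison of per-case Dice scores for two models.
# The values below are illustrative placeholders, NOT results from the paper.
dice_baseline = [0.60, 0.62, 0.58, 0.65, 0.61, 0.59, 0.63, 0.60, 0.64, 0.57]
dice_proposed = [0.67, 0.66, 0.63, 0.70, 0.68, 0.64, 0.69, 0.66, 0.71, 0.62]

t_stat, p_value = stats.ttest_rel(dice_proposed, dice_baseline)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")

# With only 10 cases, a non-parametric alternative is also worth reporting:
w_stat, p_wilcoxon = stats.wilcoxon(dice_proposed, dice_baseline)
print(f"Wilcoxon signed-rank: W = {w_stat:.2f}, p = {p_wilcoxon:.4f}")
```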
- Please state your overall opinion of the paper
Probably accept (7)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- (+) Innovative contributions
- (+) Well-conducted ablation study
- (-) More insights required to explain the progressive self-denoising process
- What is the ranking of this paper in your review stack?
2
- Number of papers in your stack
5
- Reviewer confidence
Confident but not absolutely certain
Review #3
- Please describe the contribution of the paper
This paper introduces a method to train a segmentation network on a dataset containing incorrect or incomplete labels, which can happen even in public datasets. The method is based on a mean-teacher training procedure, in which a teacher and a student network are trained simultaneously. The high-quality images are used as usual, while the low-quality ones are used (1) in an unsupervised way via a consistency loss over random perturbations and (2) within a novel confidence estimation and relabeling procedure. The method is evaluated in the context of liver vessel segmentation using two public datasets: one with high-quality segmentations and the other with incomplete segmentations.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Datasets with bad segmentation quality are still quite common, so this method addresses an important problem which can be helpful to the community.
- The experimental setup is sound and seems convincing, and ablation studies have been performed.
- The authors plan to release their source code.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- The method is quite complex, with many parameters: three loss functions (one of them also having three sub-components), $\lambda_{cl}$ must be set to zero for the first few thousand iterations, etc. I expect it to be somewhat hard to re-implement.
- The description of the method could be improved (see comments below).
- The validation dataset of 10 images is quite limited.
- Some other baselines could have been tested (see comments below).
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
I disagree with the claim that “The average runtime for each result, or estimated energy cost.” is not applicable in this case. Training two networks instead of one definitely comes with a cost in computational time and memory footprint compared to a standard U-Net. This should have been discussed.
The code has not been attached as supplementary material.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
- It seems that most of the wrong pixel labels are false negatives: vessels that should have been labelled but are not (due to the tedious nature of the task). There are losses (for instance, the Tversky loss) to accommodate that kind of bias, but they are unfortunately not considered in the paper (see the sketch after these comments). Furthermore, there are simpler baselines: train a network, apply it to the training images, then retrain on the newly labeled images.
- Reporting average numbers is not enough: we need more information about the error distribution (min/max/std). In fact, since there are only 10 validation cases, the numbers could have been reported for each case individually.
- It is a bit startling to still see papers without any statistical test.
- I am not sure the experiment comparing training on HQ+LQ data vs. training only on HQ data is fair. Since there is much more LQ data, training on HQ+LQ effectively means training on LQ only. The HQ data should be sampled much more often, so that an HQ image appears as often as an LQ one (see the sampling sketch after these comments).
- The decision not to resample in the z-direction is very surprising. If the slice thickness varies, the network unnecessarily has to learn structures at different scales in this direction. I do not understand the invoked rationale of “avoiding resampling artifacts”. Furthermore, this is even acknowledged by the authors when they report that 3D networks perform worse than 2D ones.
- There is a bit of name dropping in the paper (particularly in Section 1), which may be intimidating for some readers (“Mean-Teacher-assisted Confident Learning”, “Classification Noise Process”, “Smoothly Self-Denoising Module”). Adding a primer on teacher-student networks would have made the paper self-contained.
- I feel that the subsection “Learn from Progressively Self-Denoised Soft Labels of Set-LQ” could have been explained more simply; the mathematical notation does not help much here. Consider moving the equations to supplementary material and giving a higher-level, more intuitive description of the approach, or providing pseudo-code.
- The word “treasure” in the title and the paper is a bit clickbaity (considering the overall improvement) and could be toned down.
- Figure 3 is not very readable and, in particular, not colorblind-friendly (consider using blue instead of green, for instance).
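Regarding the Tversky-loss suggestion in the first comment above, a minimal sketch for binary segmentation might look as follows (illustrative, not the paper's code). Setting alpha < beta softens the penalty on "false positives" that may in fact be unlabeled vessels, while still forcing the model to recover the vessels that are labeled.

```python
import torch

def tversky_loss(logits, target, alpha=0.3, beta=0.7, eps=1e-6):
    """Tversky loss for binary segmentation (sketch).

    With alpha < beta, predictions on unlabeled-but-real vessels ("false
    positives" w.r.t. a noisy label) are penalized less than missed
    labeled vessels (false negatives).
    """
    prob = torch.sigmoid(logits).flatten()
    t = target.float().flatten()
    tp = (prob * t).sum()
    fp = (prob * (1.0 - t)).sum()
    fn = ((1.0 - prob) * t).sum()
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)
```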
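And for the HQ/LQ sampling-balance comment, one possible implementation uses PyTorch's WeightedRandomSampler; the dataset sizes and `combined_dataset` below are placeholder assumptions.

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# Weight each case so the HQ subset as a whole is drawn as often as the LQ one.
n_hq, n_lq = 10, 100  # placeholder sizes; HQ cases listed first in the dataset
weights = torch.cat([
    torch.full((n_hq,), 1.0 / n_hq),  # total HQ sampling mass = 1
    torch.full((n_lq,), 1.0 / n_lq),  # total LQ sampling mass = 1
])
sampler = WeightedRandomSampler(weights, num_samples=n_hq + n_lq,
                                replacement=True)
# loader = DataLoader(combined_dataset, batch_size=4, sampler=sampler)
```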
- Please state your overall opinion of the paper
Probably accept (7)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Although I do think this paper could be improved, it addresses a relevant problem and the presented method seems to provide a significant boost in segmentation performance. I have some concerns about reproducibility (in particular, re-implementation), but the authors plan to release their code, so I recommend accepting it.
- What is the ranking of this paper in your review stack?
1
- Number of papers in your stack
5
- Reviewer confidence
Confident but not absolutely certain
Primary Meta-Review
- Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.
Strengths:
- First, the problem that this work is trying to address is very important and pervasive in our field. Label noise is common and often unavoidable, and with such a scarcity of segmentation labels, we are often forced to utilize labels from multiple sources with varying levels of quality.
- Second, the paper is well-written and makes effective use of tables and figures to demonstrate the issue and support its arguments.
- Finally, the authors used publicly available data and state that they will release their code, which will help facilitate the replication and extension of this work.
- Methodological contributions are assessed rigorously through a detailed ablation study
Weaknesses:
- Inclusion of typical baselines would be useful for the MICCAI audience
- A second application target would make an excellent journal extension.
Overall:
- A very strong, well-justified paper that could be improved with a typical baseline.
- What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).
3
Author Feedback
We are glad that the reviewers found our problem and motivation “not widely investigated but very important and promising” in the MIA field. They also found our work “innovative” and “well-written”, noted that it “makes effective use of tables and figures to demonstrate the issue and support its arguments”, and appreciated the “nice set of contributions” and that the “methodological contributions are assessed rigorously”.
We also appreciate all the constructive comments from the reviewers, such as the valuable suggestions about additional baselines. Besides, we thank the AC for noting that the second application target (i.e., using a small amount of HQ labeled data to improve the label quality of LQ labeled data) would be an interesting direction; we are pursuing it in our extended journal version. Nevertheless, this work is still an early attempt and can be further improved. We are grateful that this work was early-accepted and look forward to more research in this field.