Authors
Yuqian Zhou, Hanchao Yu, Humphrey Shi
Abstract
Retinal vessel segmentation from retinal images is an essential task for developing the computer-aided diagnosis system for retinal diseases. Efforts have been made on high-performance deep learning-based approaches to segment the retinal images in an end-to-end manner. However, the acquisition of retinal vessel images and segmentation labels requires onerous work from professional clinicians, which results in smaller training dataset with incomplete labels. As known, data-driven methods suffer from data insufficiency, and the models will easily over-fit the small-scale training data. Such a situation becomes more severe when the training vessel labels are incomplete or incorrect. In this paper, we propose a Study Group Learning (SGL) scheme to improve the robustness of the model trained on noisy labels. Besides, a learned enhancement map provides better visualization than conventional methods as an auxiliary tool for clinicians. Experiments demonstrate that the proposed method further improves the vessel segmentation performance in DRIVE and CHASE_DB1 datasets, especially when the training labels are noisy.
Link to paper
DOI: https://doi.org/10.1007/978-3-030-87193-2_6
SharedIt: https://rdcu.be/cyhLy
Link to the code repository
https://github.com/SHI-Labs/SGL-Retinal-Vessel-Segmentation
Link to the dataset(s)
https://github.com/SHI-Labs/SGL-Retinal-Vessel-Segmentation
Reviews
Review #1
- Please describe the contribution of the paper
The authors propose “Study Group Learning” (SGL), which ensembles models trained on subsamples through cross-validation to increase robustness to noisy labels in the retinal vessel segmentation task. In addition, they add an enhancement module in front of the segmentation module that may help segmentation training. Finally, they demonstrate the effectiveness of the proposed model on multiple datasets.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- They propose a constraint loss using pseudo labels generated by K models to avoid overfitting to the noisy labels.
- They analyze the feature map between the enhancement module and the segmentation module as an enhanced map, using a bottleneck network structure.
- To prove the robustness of the proposed model, they conducted experiments using synthetic noisy labels (erased by preprocessing), as shown in Fig. 3.
- Finally, they demonstrated the effectiveness of the proposed method on the DRIVE and CHASE_DB1 datasets compared to state-of-the-art methods.
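The SGL scheme summarized above (train K models on cross-validation splits, then let each held-out split receive pseudo labels from a model that never saw it) can be sketched roughly as follows. This is an illustration of the idea as the reviewers describe it, not the authors' code; `train_fn` and `predict_fn` are hypothetical placeholders.

```python
def study_group_pseudo_labels(samples, train_fn, predict_fn, k=4):
    """samples: list of (image, noisy_label) pairs.
    train_fn(train_subset) -> model; predict_fn(model, image) -> pseudo label.
    Returns one pseudo label per sample, produced by a model that did
    not train on that sample (cross-validation style)."""
    indices = list(range(len(samples)))
    folds = [indices[i::k] for i in range(k)]       # K disjoint "study groups"
    pseudo = [None] * len(samples)
    for fold in folds:
        # Train on everything except the current fold ...
        train_subset = [samples[j] for j in indices if j not in fold]
        model = train_fn(train_subset)
        # ... and pseudo-label only the held-out fold.
        for j in fold:
            pseudo[j] = predict_fn(model, samples[j][0])
    return pseudo
```

The pseudo labels can then enter a constraint loss alongside the (possibly incomplete) human labels, which is the mechanism the reviewer compares to bagging.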
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- The current results are limited to the mean value from a single experiment. More rigorous validation, reporting the mean and standard deviation over multiple experiments with different initializations while fixing the hyperparameters, would more robustly assess the performance of the proposed method.
- I am concerned that the performance gap between the proposed method and the others is small and not significant. The result could be the opposite for some other hyperparameters.
- Visual quality assessment alone is not sufficient to show that the bottleneck structure's enhanced map can help clinicians with visual inspection. Reader studies with experts or a quantitative evaluation would further support the novelty of the bottleneck structure.
- SGL is similar to bagging, which ensembles models trained on subsamples. Even though the method is applied as a loss constraint, SGL and bagging may show similar performance in my opinion.
- Please rate the clarity and organization of this paper
Very Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
To reproduce the results of this paper, we recommend publishing the code, because the paper does not include a detailed description of the training, e.g., learning rate, learning rate scheduler, optimizer, batch size, etc.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
- Weakness of the performance: As mentioned above, I am concerned that the small performance gap between the comparison methods and the proposed method could easily be invalidated by different hyper-parameters such as epochs, batch size, learning rate schedule, and optimizer. For example, in Table 1, the gap between BEFD-UNet and the proposed method is only 0.0004 in accuracy, 0.0019 in AUC, and 0.0049 in Dice score. The improvement from SGL itself is also not large (see the bottom two rows of Tables 1 and 2). To address this, we recommend reporting the mean and standard deviation over several experiments with different initializations instead of a single run. Furthermore, the method could be validated on an external dataset if one is available.
- Assessment of the enhanced map: The bottleneck structure creates an enhanced map with better visual quality than the other contrast enhancement methods. Quantitative results for training on the enhanced map instead of the original inputs (e.g., by stacking) could demonstrate the benefit of the bottleneck structure.
- Test data selection bias: We recommend nested cross-validation to avoid overfitting to the test set. Depending on how the test set is selected, results may differ; the outer cross-validation results of nested cross-validation would represent the overall performance more fairly.
- Please state your overall opinion of the paper
borderline accept (6)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The idea of using bottleneck structure for creating the enhanced map without the supervision of ground truth seems novel.
Extensive experiments demonstrate the best performance compared to the other state-of-the-art methods.
They show the robustness through the synthetic datasets for various noise levels in terms of measurement metrics.
- What is the ranking of this paper in your review stack?
2
- Number of papers in your stack
4
- Reviewer confidence
Very confident
Review #2
- Please describe the contribution of the paper
This paper introduces an enhancement map that is more visually plausible, to aid clinicians in visual inspection or manual segmentation. The authors modify the vessel segmentation labels with a designed pipeline. They further propose K-fold Study Group Learning (SGL) to better cope with noisy labels in small datasets.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Few previous works have studied noisy labels. The authors propose a Vessel Label Erasing method that controls the widths of the erased vessel labels; the controllable label noise can be observed in Fig. 3. However, the training and testing details of this method receive little attention in the paper.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- Compared with other enhancement methods (e.g., CLAHE), the proposed enhancement map may generate incorrect topology.
- The proposed model cascades two networks, which seems to have a large number of parameters.
- Vessel Label Erasing seems to be an interesting method. However, from Fig. 5 we observe that r=1 performs best; does this mean that not using Vessel Label Erasing is the best option? From this paper we cannot judge the effectiveness of Vessel Label Erasing.
- In testing, the authors did not clarify whether the metrics are computed only inside the retina or over the complete image. This is important because the retinal region is small, which affects the metrics.
- There are no visualization results compared with other methods. Uniform training and testing details are necessary to keep the experiments fair.
- Some writing errors.
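The Vessel Label Erasing discussed above simulates incomplete annotation by dropping thin vessels from the label. A toy version, under the assumption that "thin" can be read off a distance transform (the paper erases skeleton branches instead; this is only an illustration, not the authors' implementation):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def erase_thin_vessels(label, keep_ratio):
    """Estimate each vessel pixel's width via the distance transform and
    drop the thinnest (1 - keep_ratio) fraction of vessel pixels,
    simulating an annotator who misses fine vessels. Illustrative only:
    the paper operates on whole skeleton branches, not pixels."""
    width = distance_transform_edt(label)          # ~radius at each vessel pixel
    vessel_widths = width[label > 0]
    if keep_ratio >= 1.0 or vessel_widths.size == 0:
        return label.copy()
    cutoff = np.quantile(vessel_widths, 1.0 - keep_ratio)
    return np.where(width > cutoff, label, 0)
```

With keep_ratio = 1 the label is untouched, which is why the reviewer reads Fig. 5's r=1 result as "no erasing is best"; the rebuttal's point is that r is a noise level to be robust against, not a tunable augmentation knob.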
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
Poor.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
Major comments:
- For the enhancement map, there is no supervision. Many textures are generated, but I am not sure they are correct. In other words, from visual observation, sensitivity should improve, but precision may be reduced. Compared with other enhancement methods (e.g., CLAHE), this method may generate incorrect topology, so I do not think it is a better auxiliary tool for clinicians' visual inspection.
- The proposed model cascades two networks, which seems to have a large number of parameters. The parameter counts should be reported in comparison with other models. In addition, the authors did not state the optimizer, the initial learning rate and its schedule, the batch size, or other training details. Considering there is no open-source code, there may be problems with the reproducibility and fairness of the experiments.
- Vessel Label Erasing seems to be an interesting method. However, from Fig. 5 we observe that r=1 performs best; does this mean that not using Vessel Label Erasing is the best option? From this paper we cannot judge the effectiveness of Vessel Label Erasing.
- In testing, the authors did not clarify whether the metrics are computed only inside the retina or over the complete image. This is important because the retinal region is small, which affects the metrics. This issue appears in the metrics computed in Zhang et al.'s work [1]; as can be verified in the code [2], no mask limits the calculation to the retina, which is why the reported numbers are low. The compared baselines may therefore not share a unified evaluation criterion.
- The authors do not seem to have reproduced the previous methods, since there are no visualization results compared with other methods. Uniform training and testing details are necessary to keep the experiments fair.
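The evaluation concern above (metrics inside the retina vs. over the complete image) can be made concrete: the dark background outside the field of view (FOV) is trivially classified as non-vessel, so including it inflates specificity while leaving Dice nearly unchanged. A minimal illustration, not the paper's evaluation code:

```python
import numpy as np

def dice_and_specificity(pred, gt, fov=None):
    """Dice and specificity of a binary vessel prediction, optionally
    restricted to the field-of-view mask `fov`. Including the all-negative
    background outside the FOV adds true negatives and raises specificity."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    if fov is not None:
        pred, gt = pred[fov.astype(bool)], gt[fov.astype(bool)]
    tp = np.logical_and(pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    dice = 2 * tp / (pred.sum() + gt.sum() + 1e-8)
    spec = tn / (tn + fp + 1e-8)
    return dice, spec
```

This is why the rebuttal reports that sensitivity and Dice barely move with or without the mask, while specificity and AUC rise on the whole image.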
Minor comments:
- Some writing errors:
- “These materials must not exceed two pages and must NOT bear any identification markers.” Yet the supplementary material has three pages.
- “a complete annotations” should be “a complete annotation”.
- An abbreviation only needs to be defined once, but “Study Group Learning (SGL)” appears nine times in the article.
[1] Zhang, S., et al.: Attention guided network for retinal image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 797–805. Springer (2019)
[2] https://github.com/HzFu/AGNet/blob/master/code/test.py
- Please state your overall opinion of the paper
probably reject (4)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The first contribution is wrong and lacks experiments. Although the second contribution seems effective, it also lacks experimental support. As mentioned in the weaknesses, some important experiments/concepts are not explained clearly in this paper.
- What is the ranking of this paper in your review stack?
2
- Number of papers in your stack
3
- Reviewer confidence
Very confident
Review #3
- Please describe the contribution of the paper
Due to the onerous effort of manually labelling retinal images, it is hardly possible to provide a sufficient number (in amount and quality) of training images for end-to-end deep learning approaches. In this paper, a Study Group Learning (SGL) scheme is applied to overcome the noisy data, with experts from various domains (neurologists, cardiologists, ophthalmologists) involved. The enhanced-contrast visualization, a side product of the machine learning approach, can also be used as an auxiliary tool for clinicians.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Highly relevant topic, well addressed. Easy to read and understand. Leaving out some of the label branches allows to simulate the real-world nature of the low-scale retinal datasets.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
Paper already available/published via arXiv which is fine according to MICCAI guidelines.
Comparison to “standard” baseline methods on the utilized datasets DRIVE and CHASE_DB1 is provided, but no ensemble models or sophisticated deep learning approaches.
- Please rate the clarity and organization of this paper
Very Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
Use of the public datasets DRIVE and CHASE_DB1, but code, parameter details, and so on are not available.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
Consider using the common “U-Net” instead of “UNet” in the paper. Note that there is a clear difference between “skeleton” as used in the paper and thinning according to the cited reference. Regarding dataset and annotation, the parameter set is unclear (how much transformation, and so on); citing [19] alone is insufficient. The supplementary material, 25 MB for one page of images, is unnecessarily large because it is totally uncompressed. Fig. 4: the authors' own results should also be presented in a grayscale color range to allow comparability.
Level of novelty: application of common approaches such as skeletonization, contrast enhancement, and U-Net-based deep learning, iteratively trained by Study Group Learning (SGL). Nevertheless, it is shown that these approaches are applicable in the case of fuzzy data labels and a general lack of data. Randomly cutting branches can be seen as a slight innovation in the field of data augmentation.
- Please state your overall opinion of the paper
Probably accept (7)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
A nice application of common image processing strategies, and a somewhat adequate way to apply data augmentation to vessel (or graph) data by random pruning.
The work looks quite sound, with a good combination of various image processing strategies that are highly relevant for clinical practice.
- What is the ranking of this paper in your review stack?
3
- Number of papers in your stack
4
- Reviewer confidence
Confident but not absolutely certain
Primary Meta-Review
- Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.
The reviewers have divergent opinions on this paper. R2 pointed out that the first contribution is wrong and lacks experiments, that the second contribution also lacks experimental support, and that some important experiments/concepts are not explained clearly. Please clarify these issues in the rebuttal letter.
- What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).
9
Author Feedback
Enhancement Map (R1R2):
- Possible Wrong Topology (R2): R2 pointed out that the enhancement map is learned without supervision and may generate wrong textures. Thanks for this very interesting question. We want to clarify: (1) the map is implicitly supervised by the segmentation map through the network. The training forces the bottleneck features to be compressed and more related to the output topology for better visualization, so it will not generate new topological textures unrelated to the input and output. Our discriminative method should be differentiated from generative models such as GANs, which take random noise as input. (2) Some noise patterns may be highlighted, but the same problem occurs in the estimated segmentation results when balancing precision and sensitivity via thresholding, and experts can still differentiate them. Compared with CLAHE, the learned map has three advantages: (1) regions of interest are better highlighted, since it aggregates information from the RGB channels and extracts features relevant to the vessels, while CLAHE only uses grayscale input; (2) it has no saturation artifacts like CLAHE; (3) it helps us interpret segmentation errors to debug the system. In real applications, it can be combined with other enhancements for visual inspection. We will follow the advice and revise the claim from ‘better’ to ‘additional auxiliary tool’.
- Quantitative Evaluation (R1): We further evaluated the learned map by comparing it against using the raw image in the baseline model; the learned map yields better performance, e.g., Dice 0.826 vs. 0.823. We will add this discussion in the next version.
Training Details (R1R2R3): We trained the model using Adam (β1=0.9, β2=0.999) with a learning rate of 1e-4 and batch size 16, for 30 epochs in the first stage and 20 in the second. We will release the code, data, and pretrained models.
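The training details given in the rebuttal can be written down as a small PyTorch configuration sketch; `model` here is a throwaway placeholder for the authors' two-stage network, which was not public at review time:

```python
import torch

# Configuration as stated in the rebuttal: Adam with betas (0.9, 0.999),
# learning rate 1e-4, batch size 16; 30 epochs for the first stage and
# 20 for the second. This is a sketch, not the released training script.
model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)  # placeholder network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
BATCH_SIZE = 16
EPOCHS_STAGE1, EPOCHS_STAGE2 = 30, 20
```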
R1: Thanks for the suggestions. The training and testing sets of both datasets are very small, so we cannot expect the performance gap to be very large; it is fairly hard to boost the Dice score from 0.825 to 0.830. Following your suggestion, we ran the SGL with r=1 five times and found the results to be very stable (Dice 0.8307 ± 0.0013). The new results do not alter our previous conclusions. We will update the other entries of Tables 1 and 2.
R2:
- Parameters: Though we did not claim our baseline to be lightweight, we will follow the suggestion and add a parameter-size column to the tables. Our baseline is comparable to others in performance and parameter count (15M vs. 13M for IterNet), but it is the proposed SGL training strategy that boosts our baseline to state-of-the-art performance under the same testing settings. SGL can also easily be applied to other structures with different parameter sizes.
- Random Erasing: Our random erasing is not intended as a data augmentation strategy but rather to simulate noisy labels in the dataset. A lower r means the training labels contain more noise, so performance degrades; Figure 5 shows that SGL improves the performance of models trained with noisy labels. Random erasing as data augmentation will be explored in the future.
- Uniform Testing (4&5): Thanks for this comment. Like some of the most recent works, including SA-UNet, BEFD-UNet, and IterNet, we also compute the metrics on the whole image. We agree with the comment, so we reproduced the results of LadderNet, R2U-Net, and IterNet (DRIVE) computing the metrics over the whole image. Sensitivity and Dice are not influenced much with or without the mask, while specificity and AUC increase due to the smaller region but are still not better than ours. This does not change our conclusions; we will update the tables.
R3: Thanks for your kind comments. We used elastic warping in combination with random affine transformation and random offset. We implemented the code of [19] and set alpha = patch_size * 3, sigma = patch_size * 0.07, alpha_affine = patch_size * 0.09. We will address the other concerns.
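The elastic warping parameters quoted above follow the displacement-field deformation of the paper's ref. [19] (random fields smoothed by a Gaussian of width sigma, scaled by alpha). A rough re-implementation under those assumptions, for illustration only; it is not the authors' released code and omits the affine part:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_warp(image, alpha, sigma, rng=None):
    """Elastic deformation of a 2D image: sample per-pixel random
    displacements, smooth them with a Gaussian (sigma), scale by alpha,
    then resample the image at the displaced coordinates."""
    rng = rng or np.random.default_rng(0)
    h, w = image.shape
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    y, x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Bilinear resampling; reflect at borders so coordinates never fall off.
    return map_coordinates(image, [y + dy, x + dx], order=1, mode="reflect")
```

With the rebuttal's settings (alpha = patch_size * 3, sigma = patch_size * 0.07), a 64-pixel patch would use alpha = 192 and sigma ≈ 4.5.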
Post-rebuttal Meta-Reviews
Meta-review # 1 (Primary)
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
The authors have clarified the major issues regarding the contributions and experiments. The AC recommends “Accept” for this paper.
- After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.
Accept
- What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).
11
Meta-review #2
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
This paper is well written and organized, and the rebuttal addressed most of the concerns raised by the reviewers. However, the proposed enhancement approach is not well verified: it is only compared with three older enhancement methods, and no DL-based enhancement model was studied. I would like to accept this paper at this stage.
- After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.
Accept
- What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).
10
Meta-review #3
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
The authors’ response reduces my concerns on the two major issues summarized by the primary AC. The method is basically correct and the experiments are adequate, especially the improvement on the CHASE_DB1 dataset (see Table 2), which is more challenging with many disease cases. The most challenging issues in vessel segmentation research are annotation (pixel-level quality and quantity) and the influence of diseases. This work is self-contained with certain novelty, thus I recommend acceptance.
- After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.
Accept
- What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).
8