
Authors

Theodoros Pissas, Claudio S. Ravasio, Lyndon Da Cruz, Christos Bergeles

Abstract

Our work proposes neural network design choices that set the state-of-the-art on a challenging public benchmark on cataract surgery, CaDIS. Our methodology achieves strong performance across three semantic segmentation tasks with increasingly granular surgical tool class sets by effectively handling class imbalance, an inherent challenge in any surgical video. We consider and evaluate two conceptually simple data oversampling methods as well as different loss functions. We show significant performance gains across network architectures and tasks especially on the rarest tool classes, thereby presenting an approach for achieving high performance when imbalanced granular datasets are considered. Our code and trained models are available at https://github.com/RViMLab/MICCAI2021_Cataract_semantic_segmentation and qualitative results on unseen surgical video can be found at https://youtu.be/twVIPUj1WZM.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87202-1_49

SharedIt: https://rdcu.be/cyhQ4

Link to the code repository

https://github.com/RViMLab/MICCAI2021_Cataract_semantic_segmentation

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors present an ablation study for different architectures, loss functions and sampling strategies to approach the problem of tool detection on video footage of cataract surgery. Their work employs the open CaDIS dataset, for which three different tasks with varying class granularity (and thus varying class imbalance) were proposed in previous work. Results show that, with an adequate oversampling strategy and loss function, a new baseline can be set for the CaDIS dataset that significantly outperforms the current approach.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors present a thorough evaluation of different architectures and sampling strategies, which gives the reader a good idea of an adequate training setup for the CaDIS dataset.

    Qualitative results presented in supplemental material look nice and convincing.

    Quantitatively, the best proposed approaches beat SOTA significantly.

    The authors identified “179 at least partially mislabelled frames” in the dataset, which contributes to the quality of the dataset.

    The paper sets a new baseline for the open CaDIS dataset. This new baseline can be identified as the main novelty of the paper.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    This paper has no major weaknesses.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Since the manuscript documents hyperparameters and model architecture very well, the approach should be reproducible with considerable effort. Using the open CaDIS dataset also contributes to reproducibility. Even though code does not seem to be available, the information provided in the paper should suffice to re-implement this approach.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • Section 2.2: “The UPerNet head [14] consists of a Feature Pyramid Network…”
  • Please state your overall opinion of the paper

    strong accept (9)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well-written, clearly described and thoroughly evaluated in an extensive ablation study. Qualitative and quantitative results show that the authors’ approach outperforms the previous baseline.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    3

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    This paper proposes a grid search over state-of-the-art methods for the three semantic segmentation tasks in cataract surgeries in the public CaDIS dataset. This work considers three network architectures for the decoder, three loss functions, two learning rate schedules, and two data oversampling strategies. Additionally, it analyzes the effect of using two data oversampling strategies under high class imbalance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strength of this work is the extensive experimental setup.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The principal weaknesses of the paper are listed below:

    • This work does not present technical novelty. All the individual components used in this work already exist and are already implemented. This paper presents just a compilation with almost no modification. Specifically, the three decoder architectures were taken from [12-14]. The three loss functions correspond to the traditional cross-entropy loss plus two additional loss functions taken from [21] and [23]. One data oversampling strategy was taken from [24] and the other corresponds to an adaptive sampling algorithm; this family of algorithms has previously been used for semantic segmentation in [29]. [29] Berger, L., Eoin, H., Cardoso, M. J., & Ourselin, S. (2018, July). An adaptive sampling scheme to efficiently train fully convolutional networks for semantic segmentation. In Annual Conference on Medical Image Understanding and Analysis (pp. 277-286). Springer, Cham.
    • Despite reporting results on the public benchmark dataset for the studied task, this paper modifies annotations and removes entries from the dataset. These changes to the dataset affect the reproducibility of the experiments and do not allow a fair comparison with state-of-the-art methods.
    • Poor experimental setup: in addition to the dataset modifications, the paper does not use the splits correctly. Specifically, all experiments are evaluated on the test set merged with the validation set. A proper split into training, validation, and test sets allows one to assess the model's ability to generalise. Optimising the architecture over both the validation and test sets might result in overfitting; without results on an independent set, that possibility cannot be ruled out.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    In the reproducibility checklist, it is reported that the code and pre-trained models will be made publicly available, which is very important. However, the data that was modified will not be released, which meaningfully impacts the reproducibility of the outcome.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Technical novelty is a relevant aspect for the acceptance of a paper. This paper addresses a relevant topic that still has many clear challenges. However, these challenges need to be addressed in a novel way, and the resulting work should follow a strict experimental methodology that guarantees reproducibility and a fair comparison of the results. Additionally, the text is clear, but there are some punctuation errors and misspellings that need to be corrected.

  • Please state your overall opinion of the paper

    strong reject (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Despite the relevance of the task and the exhaustive experiments reported, the paper does not contribute anything new, and the experiments lack a strict protocol, making them invalid since it is unclear how to reproduce them.

  • What is the ranking of this paper in your review stack?

    5

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #3

  • Please describe the contribution of the paper

    The authors propose a neural network method to segment videos acquired during cataract surgery. Every anatomical structure and instrument is segmented frame by frame. Some instruments appear rarely, so there is high class imbalance. The approach is evaluated using three existing architectures (UPerNet, OCRNet, DeepLabv3+), three different loss functions, two different ways to sample batches during training, two different weight initializations and two different learning rate schedules. For training and testing, a publicly available dataset, CaDIS, with 25 surgery videos is used. Overall, the best combination, addressing class imbalance with smart batch sampling, gives a better IoU than the state-of-the-art across all experiments (different numbers of classes to segment).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The main strength of the paper is that it tackles the class imbalance problem and improves on the state-of-the-art segmentation method on a public dataset, even though the proposed solutions are combinations of already existing methods.
    • The ablation study gives better insight into the impact of every part of the method.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • It is difficult to see whether mean IoU differences between the different methods are significant. For example, from a clinical point of view, does a mean IoU difference of 0.01 or 0.02 change much? It would have been nice to have statistical tests and results with min, max and standard deviation, and to discuss some qualitative results with failed segmentations.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • It is said that the code will be provided
    • The dataset is publicly available
    • All the parameters are given for the implementation of the method (even though the neural network architecture can be tough to understand without a figure, references to the previous papers are given)
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Page 1 Abstract

    • “Our methodology achieves strong performance”, “significant performance gains”, “high performance” -> I find this too vague; can you please give quantitative results or comparisons instead?

    Page 2

    • “We report strong performance on all sub-tasks of the CaDIS dataset” -> Same here

    Page 3

    • “We use the same training set as [15] and use the merged validation and test sets of [15] for evaluation” -> Why did you remove the test set? What do you mean then by “evaluation” in your paper? Test?
    • “and passing intermediate-resolution latent features to the network head as required” -> This sentence is not clear to me. Do you mean skip connection? What is ‘required’? What is a network head?
    • “which instantiate three different mechanisms for enhancing and decoding the encoder’s feature maps into dense semantic labelling” -> Maybe it would be nice to have a figure summarizing the 3 different networks. Text alone makes it hard to understand.
    • Figure 1: the eye retractor is more purple than dark blue?

    Page 4

    • “2.3 Loss Functions” -> It is not completely clear here that you are discussing the three loss functions you are going to evaluate. Maybe rephrase a bit to have something more structured, like in the “2.2 Network Architectures” section.

    Page 5

    • “Each of these records is chosen as the one with the highest number of pixels labelled with the class of interest out of 10 random samples from the dataset.” -> So, for every frame in the batch, you take 10 random samples from the dataset and the ‘most suitable’ sample among the 10 is chosen. When I look at the class distribution of the CaDIS training set, I see that 4 classes appear in fewer than 40 frames. I have the feeling that with the selection of 10 random samples, those rare frames will never be selected and so those classes will never be trained more. What is wrong with my reasoning? Do you at least ensure that the 10 random samples are picked such that all frames are chosen at some point? Also, if you have never seen a specific class, what is the IoU moving average of that class, 1 or 0? If it is 1, does that mean the rare class won’t be prioritized?
    • “Adam optimiser [25] for a maximum of 50 epochs” -> Since you are using different ways to sample batches (repeat factor and adaptive -> the same frame can occur multiple times in one epoch), what does an epoch mean here? Still the number of frames in your training set? (For reference, a sketch of repeat factor sampling follows this list.)
    • “a learning rate starting at 10−4 and exponentially decaying at a multiplicative rate of 0.98 per epoch” -> The decay part conflicts with the following paragraph about the two learning rate functions. Which function did you evaluate in the end? (It is clear in the results section.)
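
    For reference, a minimal sketch of how repeat factor sampling (following the LVIS formulation of Gupta et al.) could define an oversampled "epoch" for semantic segmentation frames: frames containing rare classes receive repeat factors above one, so an epoch contains more entries than there are training frames. The threshold `t`, the function names, and the stochastic rounding are illustrative assumptions, not necessarily the paper's exact implementation.

```python
import math
import random
from collections import defaultdict

def repeat_factors(frame_classes, t=0.001):
    """frame_classes: one set of class ids per training frame.
    Returns a repeat factor per frame, as in repeat factor sampling (LVIS),
    here driven by the classes present in each segmentation mask."""
    n = len(frame_classes)
    freq = defaultdict(int)
    for classes in frame_classes:
        for c in classes:
            freq[c] += 1
    # class-level factor: classes rarer than the threshold t get a factor > 1
    r_class = {c: max(1.0, math.sqrt(t / (freq[c] / n))) for c in freq}
    # frame-level factor: governed by the rarest class present in the frame
    return [max((r_class[c] for c in classes), default=1.0)
            for classes in frame_classes]

def build_epoch(frame_ids, r_frame):
    """Stochastically round each repeat factor to decide how many copies of a
    frame appear in one (oversampled) epoch, so an 'epoch' is longer than the
    raw number of training frames."""
    epoch = []
    for fid, r in zip(frame_ids, r_frame):
        copies = int(r) + (1 if random.random() < r - int(r) else 0)
        epoch.extend([fid] * copies)
    random.shuffle(epoch)
    return epoch
```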

    Page 6

    • Table 1: It is difficult to see whether mIoU differences between the different methods are significant. For example, from a clinical point of view, does a mean IoU difference of 0.01 or 0.02 change much? It would have been nice to have statistical tests and results with min, max and standard deviation, and to discuss some qualitative results with failed segmentations.
  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Achieves state-of-the-art results and includes an ablation study.
    • It is difficult to see whether mIoU differences between the different methods are clinically significant.
  • What is the ranking of this paper in your review stack?

    4

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Recommendations vary widely for this submission, ranging from strong accept to strong reject. All reviewers agree on the completeness of the experimental setup. However, R2 identifies significant drawbacks in the paper, which should be thoroughly addressed in the rebuttal. In particular: (1) The unjustified modification of the standard evaluation protocol and of the annotations of a public dataset, which directly impacts the fairness of all comparisons against the state-of-the-art and prevents reproducibility of the results. (2) The lack of technical novelty of the approach.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    8




Author Feedback

We thank the reviewers for their encouraging and constructive comments. Suggestions to improve clarity, as those provided by R4, will be incorporated in a potential camera-ready manuscript. We address here the key concerns, as also summarised by the meta-reviewer.

(1) While both R1 and R4 are in favour of our contributions, R2 complains about a lack of novelty, a statement with which we strongly disagree. R2 provides a list of references for the various workflow components, but this is of unclear value as all references are clearly stated in the manuscript. In addition to the broad technical innovation in the form of semantic segmentation algorithm design and evaluation, we note the following:

  • We tailor Repeat Factor Sampling, up to now only used on instance segmentation, to the task of semantic segmentation. We are not aware of prior research on the topic, and should have highlighted it better in the manuscript.
  • Our Adaptive Sampler is a novel implementation of the concept compared to the reference provided by R2: it is biased towards selecting whole images with a high incidence of globally underperforming classes, using a momentum update, rather than selecting patches with high error values within given images (an illustrative sketch of this idea follows the list below). This contribution will also be highlighted.
  • We evaluate what matters in semantic segmentation with thorough experiments, reaching state-of-the-art results on a challenging benchmark across three increasingly granular tasks.
  • We showcase that network architecture choice, despite being an extensively studied topic in semantic segmentation, has a negligible effect on CaDIS benchmark performance compared to the choice of loss function and data oversampling methods.
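
A minimal sketch of the adaptive sampling idea described above, under stated assumptions: per-class IoU running averages are updated with momentum, an underperforming class is drawn with probability proportional to its IoU deficit, and a whole frame containing that class is chosen from a small pool of random candidates. The class and attribute names, the candidate-pool size, and the use of class presence instead of the per-class pixel counts mentioned in the paper are illustrative simplifications, not the authors' exact implementation.

```python
import random

class AdaptiveSampler:
    """Illustrative sketch: bias frame selection towards whole images that
    contain classes whose running (momentum-updated) IoU is low."""

    def __init__(self, frame_classes, num_classes, momentum=0.9, candidates=10):
        self.frame_classes = frame_classes   # one set of class ids per frame
        self.iou = [1.0] * num_classes       # optimistic start: every class gets explored
        self.momentum = momentum
        self.candidates = candidates

    def update(self, per_class_iou):
        # momentum update of the running per-class IoU; None = class not evaluated
        for c, v in enumerate(per_class_iou):
            if v is not None:
                self.iou[c] = self.momentum * self.iou[c] + (1 - self.momentum) * v

    def sample(self):
        # draw a target class with probability proportional to its IoU deficit...
        weights = [1.0 - v + 1e-6 for v in self.iou]
        target = random.choices(range(len(self.iou)), weights=weights)[0]
        # ...then pick, among a few random frames, one that actually contains it
        pool = random.sample(range(len(self.frame_classes)),
                             min(self.candidates, len(self.frame_classes)))
        hits = [i for i in pool if target in self.frame_classes[i]]
        return random.choice(hits) if hits else random.choice(pool)
```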

(2) R2 criticises our merging of the CaDIS validation and test set (each comprising just three videos). We adopt this protocol as, contrary to R2’s suggestion, we do not optimise network architectures, nor do we fine-tune over state-of-the-art methods using a grid search: instead, we use a single representative data subset to identify factors that govern performance in this challenging and highly imbalanced dataset. The test set in [15] is overly limited in size for this purpose, which further skews the class distribution. Some of the rarest classes occur in just one of the three videos in the test set, with as few as seven highly correlated sequential instances in the 587 frames overall. No robust conclusions on performance can be drawn from such a limited occurrence. Using the larger merged set enables evaluation on more representative data.

The validity of this approach was also borne out in the MICCAI 2020 Endovis CATARACTS Segmentation Challenge on the CaDIS dataset, where our work yielded very competitive results on the held-out server test set. Anonymity rules unfortunately preclude us from revealing our ranking. Using a training / validation split over the dataset is common practice in many benchmark datasets for semantic segmentation and beyond, for example PASCAL-Context and COCO-Stuff, for which the performance on the validation split is reported [13].

(3) R1 points out that the data introspection we carried out will enable fairer future comparisons and is a contribution in and of itself. R2 incorrectly assumes that we will withhold this information; as a matter of fact we have communicated these findings to the team that published the dataset, and plan to publish them in full detail, alongside our code and pretrained model weights. This will ensure complete reproduction of our work, while the cleaned up dataset will help with teasing out the maximum performance of any developed network and pipeline.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Although the rebuttal clarifies some of the reviewers’ comments, significant concerns remain, in particular the lack of novelty and of a clear design principle in the proposed method, which reduces to trying different backbones, losses, and sampling strategies. The rebuttal also falls short of justifying the change in the standard setup of the CaDIS dataset. The paper may propose an additional setup to assess performance, but results on the complete standard splits are still necessary to accurately assess the contributions of the proposed method with respect to the state-of-the-art.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    21



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    After reading the paper, the reviewers’ comments and the authors’ rebuttal, I am inclined to accept this paper at MICCAI.

    While the claim of limited novelty is true to some extent, as the major novelty lies in the modifications from detection to segmentation, the strong performance on cataract surgery outweighs this limitation.

    In addition, given that the authors have clearly explained their experimental setup and will provide their code, I view reproducibility favorably in this case.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    7



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    According to the reviews and AC comments, this is a typical borderline submission. The original reviews agree on the completeness of the experimental setup; however, the scores cover both extreme ends. The topic is interesting and somewhat new to the community; however, the technical contribution is limited. Overall, the paper is self-contained with proper clinical value and no obvious defects. Therefore, I agree to accept this paper for a wide discussion at MICCAI.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    9


