
Authors

Ricardo Sanchez-Matilla, Maria Robu, Imanol Luengo, Danail Stoyanov

Abstract

Computer vision based models, such as object segmentation, detection and tracking, have the potential to assist surgeons intra-operatively and improve the quality and outcomes of minimally invasive surgery. Different work streams towards instrument detection include segmentation, bounding box localisation and classification. While segmentation models offer much more granular results, bounding box annotations are easier to annotate at scale. To leverage the granularity of segmentation approaches with the scalability of bounding box-based models, a multi-task model for joint bounding box detection and segmentation of surgical instruments is proposed. The model consists of a shared backbone and three independent heads for the tasks of classification, bounding box regression, and segmentation. Using adaptive losses together with simple yet effective weakly-supervised label inference, the proposed model uses weak labels to learn to segment surgical instruments with a fraction of the dataset requiring segmentation masks. Results suggest that instrument detection and segmentation tasks share intrinsic challenges and jointly learning from both reduces the burden of annotating masks at scale. Experimental validation shows that the proposed model obtains comparable results to those of single-task state-of-the-art detection and segmentation models, while only requiring a fraction of the dataset to be annotated with masks. Specifically, the proposed model obtained 0.81 weighted average precision and 0.73 mean intersection-over-union in the Endovis2018 dataset with 1% annotated masks, while performing joint detection and segmentation at more than 20 frames per second.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87196-3_47

SharedIt: https://rdcu.be/cyl2Y

Link to the code repository

N/A

Link to the dataset(s)

N/A

Reviews

Review #1

Please describe the contribution of the paper

This paper proposed a multi-task model for joint bounding box detection and segmentation of surgical instruments using a shared backbone and three independent heads for the tasks of classification, bounding box regression, and segmentation. The work also develops an adaptive loss function with weakly-supervised label inference.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

a. Proposes a multi-task model with a shared backbone network (EfficientNet), a feature fusion module, and three separate heads for localisation, classification, and segmentation of surgical tools.

b. Develops a semi-supervised learning technique with a weakly supervised loss function, following [R1].

References: [R1] A. Vardazaryan, D. Mutter, J. Marescaux, and N. Padoy, “Weakly-supervised learning for tool localization in laparoscopic videos,” in Intravascular Imaging and Computer-Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

a. Limited technical contribution. The proposed model seems to be a combination of two works: EfficientDet [R1] and weakly supervised learning [R2].

b. The superiority of the model is not clear, as it is not compared with other state-of-the-art multi-task models for detection and segmentation such as Mask R-CNN [R3] and AP-MTL [R4].

c. The description of Table 1 in section 4.4 is not correct. For example “The performance drops from an IOU of 0.821 to 0.651 when reducing the masks to 20%, and to 0.544 when reduced to 1%. ”

d. I am curious to know how the authors obtained the segmentation annotations for the test set. Section 4.1 mentions that the work follows the challenge's training and testing data split, but as far as I know the challenge did not provide annotations for the test set. Moreover, the same section mentions the use of the annotation masks provided by [R5]. However, [R5] splits the training dataset to validate their model (Seq 2, 5, 9, 15 are used for validation).

e. The segmentation results in Table 1 after reducing the annotated masks do not seem accurate to me. Obtaining 72.8% IOU with only 1% annotation is surprising to me in this dataset.

f. Section 4.4 mentions the accuracy of 0% annotated mask but it does not exist in the table.

g. The paper is missing the ablation study. To confirm the effect of each proposed module or loss function there should be an ablation study with proposed modules over baseline.

h. Any multi-task learning model should be validated with the performance of single vs multi-task on the same architecture.

References:
[R1] M. Tan, R. Pang, and Q. V. Le, “Efficientdet: Scalable and efficient object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2020.
[R2] A. Vardazaryan, D. Mutter, J. Marescaux, and N. Padoy, “Weakly-supervised learning for tool localization in laparoscopic videos,” in Intravascular Imaging and Computer-Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis
[R3] He, Kaiming, et al. “Mask r-cnn.” Proceedings of the IEEE international conference on computer vision. 2017.
[R4] Islam, Mobarakol, V. S. Vibashan, and Hongliang Ren. “AP-MTL: Attention Pruned Multi-task Learning Model for Real-time Instrument Detection and Segmentation in Robot-assisted Surgery.” 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020.
[R5] González, Cristina, Laura Bravo-Sánchez, and Pablo Arbelaez. “ISINet: An Instance-Based Approach for Surgical Instrument Segmentation.” International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2020.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The work uses an open dataset and most of the modules are well-established in computer vision problems.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

a. Wrong and missing data in Table 1 and inconsistent description in section 4.4. b. The dataset split and the annotations used for training and testing need to be made clear.
Please state your overall opinion of the paper

probably reject (4)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

a. Limited technical contribution. b. The model is poorly validated and needs to be compared with recent multi-task models. c. The results table and its description are inconsistent. d. The training and testing dataset split and annotations are not clear.
What is the ranking of this paper in your review stack?

4
Number of papers in your stack

5
Reviewer confidence

Very confident

Review #2

Please describe the contribution of the paper

The authors propose a joint bounding-box detection and segmentation model for surgical instruments on the Endovis-2018 dataset. The authors extended EfficientDet, a bounding-box detection model, with a segmentation head for multi-task detection and segmentation. The main contribution of the paper lies in using binary instrument presence labels as weak supervision to train the segmentation head along with x% (x = 1, 5, 20, 100) of annotated masks. The results show the effectiveness of the binary presence labels as weak supervision signals to achieve good performance with as few as 1% of the mask annotations.
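As a rough illustration of how binary presence labels can supervise a segmentation head, the sketch below pools each class's per-pixel logits into a single presence score and penalises it with a binary cross-entropy against the presence label; a pixel-wise term is added only for frames that have a mask. This is a minimal sketch under assumptions, not the authors' implementation: the use of global max pooling, the unweighted sum of the two terms, the function names, and the class names `grasper` and `scissors` are all hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce(prob, target, eps=1e-7):
    """Binary cross-entropy for one probability and one 0/1 target."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))


def presence_from_logits(class_logits):
    """Assumed pooling: global max over a 2D grid of per-pixel logits."""
    return sigmoid(max(max(row) for row in class_logits))


def weak_loss(seg_logits, presence_labels):
    """Mean BCE between pooled presence scores and 0/1 presence labels."""
    losses = [
        bce(presence_from_logits(seg_logits[name]), label)
        for name, label in presence_labels.items()
    ]
    return sum(losses) / len(losses)


def pixel_loss(seg_logits, mask):
    """Mean per-pixel BCE, only usable for frames that have a mask."""
    total, count = 0.0, 0
    for name, grid in mask.items():
        for logit_row, mask_row in zip(seg_logits[name], grid):
            for logit, target in zip(logit_row, mask_row):
                total += bce(sigmoid(logit), target)
                count += 1
    return total / count


def frame_loss(seg_logits, presence_labels, mask=None):
    """Weak term for every frame; pixel term only when a mask exists."""
    loss = weak_loss(seg_logits, presence_labels)
    if mask is not None:
        loss += pixel_loss(seg_logits, mask)
    return loss


# Hypothetical 2x2 frame with two instrument classes.
logits = {
    "grasper": [[-4.0, 3.0], [-4.0, -4.0]],
    "scissors": [[-5.0, -5.0], [-5.0, -5.0]],
}
presence = {"grasper": 1, "scissors": 0}
print(frame_loss(logits, presence))  # frame without a mask
```

Because each class gets its own sigmoid, several instruments can be marked present in the same frame, which is the multi-label setting.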
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The main strength of the paper lies in effectively exploiting the binary presence labels for segmentation. The manuscript is well written and presented. Results show that the proposed approach can effectively segment the articulated instruments using fewer annotated masks.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- The weakly supervised formulation is limited to semantic segmentation and is not entirely adequate when multiple instruments of the same class appear in the input image.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
- Datasets: The authors have used the public EndoVis2018 dataset for the training and evaluation.
- Code: The authors have neither provided nor mentioned the availability of models or training/evaluation code upon acceptance.
- Experimental results: No results for different hyperparameter settings. The authors used fixed hyperparameters.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
The authors have shown promising results with their weakly-supervised approach for the joint bounding box detection and semantic segmentation using their multi-task model. However, the authors should address the following points.
- Weakly supervised formulation: The authors have used the binary presence of the instruments as weak supervision to train the segmentation head. However, when multiple instruments of the same class are present in the image, the formulation is not adequate. The baseline state-of-the-art approach ISINet [*1] used instance segmentation instead of the semantic segmentation used by the authors. Since the authors have used the same extended EndoVIS 2018 dataset as ISINet, the authors should explain the rationale behind using the weakly supervised formulation for semantic segmentation instead of instance segmentation.
- The authors have used the extended EndoVIS 2018 dataset from the ISINet [*1] that extended the dataset with the instrument label type. Since authors have used the extended dataset for the weakly supervised formulation, the result of ISINet in table 1 is not fair (0.71 IoU vs. 0.75 IoU ). The authors should clarify this, or I am missing something. Also, authors should provide the results of mean class IoU instead of IoU for the proper comparison as done in the ISINet paper.
- The explanation of the wAP metric is not there. Is it the COCO bounding box detection metric with a threshold from 0.5 to 0.95 or just at a single threshold of 0.5?
- The authors should mention that equation 6 corresponds to the Multi-Label Classification where a single image can contain multiple instruments.
- There is no improvement in the accuracy when the authors use 10%, and 20% annotated masks, as shown in the supplementary results. The authors should discuss this point.
[*1] C. González, L. Bravo-Sánchez, and P. Arbelaez, “Isinet: An instance-based approach for surgical instrument segmentation,” in MICCAI 2020
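The request above for mean class IoU in addition to IoU matters because the two can diverge when classes differ in size. The sketch below contrasts an IoU pooled over all instrument pixels with an IoU averaged per class on a toy frame. It is an illustration under assumptions: the function names are hypothetical, and the definitions are simplified and may not match the exact evaluation protocol of ISINet or of the paper.

```python
def class_counts(pred, target, classes):
    """Return {class_id: [intersection, union]} for two 2D label grids."""
    counts = {c: [0, 0] for c in classes}
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            for c in classes:
                if p == c and t == c:
                    counts[c][0] += 1
                if p == c or t == c:
                    counts[c][1] += 1
    return counts


def mean_class_iou(counts):
    """Average the per-class IoU; every class weighs the same."""
    ious = [inter / union for inter, union in counts.values() if union > 0]
    return sum(ious) / len(ious)


def pooled_iou(counts):
    """Pool intersections and unions first; large classes dominate."""
    inter = sum(i for i, _ in counts.values())
    union = sum(u for _, u in counts.values())
    return inter / union


# Toy frame: 0 is background, class 1 is large, class 2 is small.
target = [[1, 1, 1, 1, 1, 2],
          [1, 1, 1, 1, 1, 2]]
pred = [[1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 2]]

counts = class_counts(pred, target, classes=[1, 2])
print(pooled_iou(counts), mean_class_iou(counts))
```

In this toy frame the large class is segmented perfectly and half of the small class is missed, so the pooled value (11/12) is noticeably higher than the per-class average (0.75).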
Please state your overall opinion of the paper

Probably accept (7)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The manual annotation of segmentation masks is an expensive and time-consuming pursuit. The authors' use of weak labels increased the accuracy of the system when only a small percentage of annotated masks is available (0.663 -> 0.728 for 1%, 0.734 -> 0.791 for 5%, 0.732 -> 0.80 for 10%, 0.775 -> 0.80 for 20%). The proposed formulation can help design a better system for the less annotated data regime.
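To make the size of these improvements easier to compare, the short sketch below computes the absolute and relative gain for each mask percentage. The score pairs are exactly those quoted in the justification above; the variable and function names are hypothetical.

```python
# (score without weak labels, score with weak labels), keyed by the
# percentage of annotated masks, as quoted in the review.
SCORES = {1: (0.663, 0.728), 5: (0.734, 0.791), 10: (0.732, 0.80), 20: (0.775, 0.80)}


def gains(scores):
    """Return {percentage: (absolute gain, relative gain in percent)}."""
    result = {}
    for pct, (without, with_weak) in sorted(scores.items()):
        absolute = with_weak - without
        result[pct] = (round(absolute, 3), round(100.0 * absolute / without, 1))
    return result


for pct, (absolute, relative) in gains(SCORES).items():
    print(f"{pct:>3}% masks: +{absolute:.3f} absolute, +{relative:.1f}% relative")
```

The absolute gains are 0.065, 0.057, 0.068 and 0.025, so the benefit of the weak labels is smallest at 20% of annotated masks.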
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

5
Reviewer confidence

Confident but not absolutely certain

Review #3

Please describe the contribution of the paper

The contributions of this paper are twofold: (a) the development of a multi-head network with a custom segmentation head and (b) support for weak supervision.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

Very impressive and comparable detection and segmentation results with only 1% labeled data.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- Limited mathematical details about the model and loss etc.
- Limited network training details.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Moderate. The network is explained and illustrated in figures, but a few critical details could use more explanation.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
- The reviewer suggests that the authors add explanation of what Lreg, Lclf, Lws and Lseg are in the Fig 1 caption.
- The reviewer suggests that the authors provide sample image and sample outputs for each of the three heads in Fig1. It is a little confusing for the reviewer to imagine what the outputs look like and why Lws, Lclf are marked as Labels.
- Since self supervision is a big part of this paper, can the authors add more detailed explanations about how the model handles the sample images without label? There is a brief one sentence description in section 3.3 paragraph 2, but it seems very unclear to the reviewer.
Please state your overall opinion of the paper

Probably accept (7)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The proposed model's performance with few labeled data samples is impressive, but the article would provide much more value for future readers if the model and design details were elaborated further.
What is the ranking of this paper in your review stack?

2
Number of papers in your stack

3
Reviewer confidence

Very confident

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

Two reviewers recommend acceptance of the paper while the third one recommends rejection. All reviewers consider the technical contribution sufficient and the weakly-supervised results adequate. However, reviewers also express serious concerns about the paper, which should be thoroughly addressed in the rebuttal. In particular, (1): Lack of comparison against recent multi-task models (R1), (2): Missing technical and implementation details (R1, R3), (3): Conceptual limitations of the approach (R2).
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

6

Author Feedback

We thank the reviewers for acknowledging that (META) “the technical contribution [is] sufficient”; “the weakly-supervised results [are] adequate”; (R2) “the proposed formulation can help design a better system for the less annotated data regime” while obtaining (R3) “very impressive and comparable detection and segmentation results with only 1% labeled data” and that (R2) “the manuscript is well written and presented”.

We respond below to all major comments hoping to clarify the raised concerns.

R1: the proposed model seems a combination of EfficientDet and WSL

EfficientDet [4] showed good fully-supervised object detection and segmentation results and WSL [16] showed weakly-supervised object detection results in Cholec80. However, neither is multi-task, so neither can solve both problems simultaneously, nor can either handle very sparse annotations. Also, WSL was not evaluated for segmentation. Our work focuses on a combined multi-task approach and provides new value and contributions:

- a multi-task model that simultaneously performs object detection and segmentation (Sec. 3.1)
- a novel segmentation head designed for multi-task challenges (Sec. 3.2)
- a weak-supervision module and an adaptive loss for leveraging very sparse annotations for estimating highly-accurate segmentation (Sec. 3.3)

R1: the superiority of the model is not clear; comparison against multi-task models

To the best of our knowledge, current multi-task models cannot leverage very sparse segmentation annotations, which makes the contributions of our work of interest for the MICCAI community. Instead of comparing against Mask R-CNN we compare against ISINet, as the latter is based on and outperforms the former by using an improved temporal smoothing model [5]. To provide the reader with additional comparisons against multi-task models, we modify EfficientDet for joint detection and segmentation. This model (IOU 0.822, Table 1 in supplementary material (SP)) outperforms previous SOTA segmentation models such as ISINet (IOU 0.710, Table 1). We use the same open-source code for the metric computation. We agree with R1 that AP-MTL is very relevant; however, we are unable to compare against it, as its code is not publicly available and it only reports results in Endovis2017.

R2: ISINet results with extended dataset; R1: clarification on data splits

ISINet reports results when training in Endovis 2018 only, and when training in both Endovis 2017 and 2018 (referred to as “additional data”) [5]. For a fair comparison with our model, we report the ISINet results when training only on Endovis 2018 (IOU 0.710, Table 1). We will add the mean class IOU results in the final manuscript. We use the same train/test splits as in [5] with sequences 5, 9, and 15 for testing and the remaining ones for training. We will update the manuscript to clarify this point.
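The sequence-level split stated in this response can be sketched as follows. Only the test sequence numbers (5, 9 and 15) come from the rebuttal; the function name, the `(sequence_id, frame_name)` record layout and the example frame names are hypothetical.

```python
TEST_SEQUENCES = frozenset({5, 9, 15})  # stated in the rebuttal


def split_by_sequence(frames, test_sequences=TEST_SEQUENCES):
    """Split (sequence_id, frame_name) records into train and test lists.

    Whole sequences go to one side, so frames from the same video never
    appear in both sets.
    """
    train, test = [], []
    for sequence_id, frame_name in frames:
        bucket = test if sequence_id in test_sequences else train
        bucket.append((sequence_id, frame_name))
    return train, test


# Hypothetical frame records, for illustration only.
frames = [(1, "frame000"), (5, "frame000"), (9, "frame003"),
          (2, "frame010"), (15, "frame001")]
train, test = split_by_sequence(frames)
print(len(train), len(test))
```

Splitting by sequence rather than by frame keeps near-identical neighbouring frames of one video from leaking between training and testing.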

R1: the description of Table 1 in section 4.4 is not correct; accuracy of 0% annotated mask does not exist; lack of ablation study

That descriptive text in Sec. 4.4 is related to the segmentation-only model, whose results are reported in Table 1 SP. Results using 0% annotated masks are in Table 1 SP. Ablation studies on single-task (detection only), multi-task (detection and segmentation), and weak supervision are in Table 1 SP.

R2: weakly-supervised formulation limited to semantic segmentation; instance-segmentation formulation

We agree with R2 that an instance segmentation formulation is an interesting research direction; however, the Endovis 2018 dataset does not provide instance segmentation annotations. Note that ISINet managed to train an instance segmentation model in this dataset by assuming that there is only one instance per class in each frame, which is not ideal. Our future work will focus on extended datasets and models that allow for specific instance-based formulations.
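The one-instance-per-class assumption mentioned here can be probed on a semantic mask by counting connected regions per class. The sketch below is only a heuristic illustration, not part of the paper or of ISINet: the function name and the toy mask are hypothetical, and connected components are an unreliable proxy for instances, since occlusion can split one instrument and touching instruments can merge.

```python
from collections import deque


def count_components(mask, class_id):
    """Count 4-connected regions of class_id in a 2D grid of class ids."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != class_id or seen[r][c]:
                continue
            count += 1  # found an unvisited region: flood-fill it
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and mask[ny][nx] == class_id):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


# Toy semantic mask: class 1 appears as two separate regions.
mask = [[1, 1, 0, 0, 1],
        [1, 0, 0, 0, 1],
        [0, 0, 2, 0, 1]]
print(count_components(mask, 1), count_components(mask, 2))
```

A frame where a class yields more than one region is a candidate for violating the single-instance assumption, which a semantic label alone cannot resolve.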

We appreciate the remaining constructive suggestions, which can be addressed in the revised manuscript.

Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The rebuttal responds to the main concerns expressed by the reviewers. The final version should include all reviewer suggestions and requested clarifications.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

10

Meta-review #2

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

After carefully reading the paper, the reviewers' comments, and the rebuttal, I believe many of the problematic comments raised by the reviewers have been adequately answered by the authors. I would recommend that the authors integrate as many of the comments as possible and refer to the supplementary material more thoroughly throughout the paper, as many important results are there.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

4

Meta-review #3

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This paper proposes a multi-task model for surgical instruments (bounding-boxes, classification, segmentation), based on a semi-supervised learning technique with a weakly supervised function. The method was evaluated on a public dataset. The initial reviews were quite mixed and major criticism included that this may be a combination of two existing approaches and that the evaluation was insufficient (missing ablation study, missing technical and implementation details). Some of the concerns were addressed in the rebuttal (e.g. ablation study is in the supplemental material) and the missing details will be added. I believe the method still has a major limitation as it probably cannot handle multiple instruments of the same class present in the image, but overall I do see the value of this submission and am inclined to vote for “accept”.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

9


Scalable joint detection and segmentation of surgical instruments with weak supervision