
Authors

Kun Yuan, Matthew Holden, Shijian Gao, Won-Sook Lee

Abstract

Surgical workflow anticipation, including surgical instrument and phase anticipation, is essential for an intra-operative decision-support system. It deciphers the surgeon’s behaviors and the patient’s status to forecast surgical instrument and phase occurrence before they appear, providing support for instrument preparation and computer-assisted intervention (CAI) systems. We investigate an unexplored surgical workflow anticipation problem by proposing an Instrument Interaction Aware Anticipation Network (IIA-Net). Spatially, it utilizes rich visual features about the context information around the instruments, i.e., instrument interaction with their surroundings. Temporally, it allows for a large receptive field to capture long-term dependencies in long, untrimmed surgical videos through a causal dilated multi-stage temporal convolutional network. Our model enforces online inference with reliable predictions even with severe noise and artifacts in the recorded videos. Extensive experiments on the Cholec80 dataset demonstrate that the performance of our proposed method exceeds the state-of-the-art method by a large margin (1.40 vs. 1.75 for inMAE and 2.14 vs. 2.68 for eMAE). The code is published at https://github.com/Flaick/Surgical-Workflow-Anticipation.
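The temporal backbone referred to above is a causal dilated multi-stage temporal convolutional network. Below is a minimal, illustrative sketch of one causal dilated stage of this kind; the layer widths, dilation schedule, number of layers, and output dimension are assumptions for illustration, not the IIA-Net configuration.

```python
# Minimal sketch of a single causal dilated TCN stage (illustrative only; the
# widths, dilation schedule, and number of layers are assumptions, not the
# authors' exact IIA-Net configuration).
import torch
import torch.nn as nn


class CausalDilatedStage(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim, num_layers=10):
        super().__init__()
        self.inp = nn.Conv1d(in_dim, hidden_dim, kernel_size=1)
        self.layers = nn.ModuleList([
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, dilation=2 ** i)
            for i in range(num_layers)               # receptive field grows exponentially
        ])
        self.out = nn.Conv1d(hidden_dim, out_dim, kernel_size=1)

    def forward(self, x):                            # x: (batch, in_dim, time)
        h = self.inp(x)
        for conv in self.layers:
            pad = (conv.kernel_size[0] - 1) * conv.dilation[0]
            # left-pad only, so frame t never sees frames after t (causal, online inference)
            z = conv(nn.functional.pad(h, (pad, 0)))
            h = h + torch.relu(z)                    # residual connection
        return self.out(h)                           # per-frame anticipation outputs


# Example: a 2000-frame video with 512-d per-frame features and 7 regression targets.
stage = CausalDilatedStage(in_dim=512, hidden_dim=64, out_dim=7)
print(stage(torch.randn(1, 512, 2000)).shape)        # torch.Size([1, 7, 2000])
```

Stacking several such stages, each refining the previous stage’s output, gives the multi-stage design; the causal padding is what allows online inference during surgery.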

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87202-1_59

SharedIt: https://rdcu.be/cyhRe

Link to the code repository

https://github.com/Flaick/Surgical-Workflow-Anticipation

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    A surgical workflow anticipation method is proposed, which uses a combination of an instrument interaction module and a temporal model with a multi-stage temporal convolutional network (MSTCN). The method is evaluated on the Cholec80 dataset in terms of mean absolute error (MAE).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The method is evaluated on a public dataset.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The motivation of the work is not entirely clear. Why is the addressed problem important for clinicians, and how does it differ from works on remaining surgery time estimation? This work seems to be an incremental improvement over remaining time estimation with a slightly different target.
    • The description of the method is confusing, since it is not clear from the beginning, what type of neural network is used for instrument interaction detection.
    • The evaluation method is not complete. Instead of evaluating the method only with MAE, it should be evaluated also with the actual users, i.e., the surgeons, for the achieved benefit.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The presented approach should be largely reproducible, since a public dataset is used for evaluation. I am not fully sure about the technical details of the method though - it would need further investigation to see whether it is fully reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    One of the problems with this work is the motivation: why is surgical workflow anticipation important for clinicians and how does it differ from surgical workflow analysis? To me the presented work is kind of incremental and I do not see a big difference to the works in references [20] and [26] (and other similar works). This is also confirmed by Fig. 1 which shows that the anticipation task is a real-time remaining time prediction task.

    Section 2.1 seems to be incorrect: in my opinion it should not read x_1 but rather i_1, since the symbol i represents a frame. Also, tau and alpha are not clearly defined.

    Another issue is the fact that the description of the network architecture does not clearly express whether an object detection network (or an R-CNN) is used. It is only mentioned that CNNs are used for the instrument interaction module. This is confusing, since YOLOv3 is then mentioned in Section 3.1 (Experiment Setup) but not mentioned again in Section 4.1, which only states ResNet50. It is also unclear why an old version of YOLO has been used, while v4 has already been available for about a year, and why it has been favored over Faster R-CNN or Mask R-CNN, which provide better performance.

    Finally, although the evaluations show promising performance in terms of MAE, it remains unclear how fast the method works (there is no evaluation of run-time performance) and what benefit it really brings to surgeons.

  • Please state your overall opinion of the paper

    strong reject (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The major factors for my overall rating are: unclear motivation, unclear description of the method, and unclear clinical benefit.

  • What is the ranking of this paper in your review stack?

    3

  • Number of papers in your stack

    3

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    The authors propose a deep learning method, called IIA-Net, to predict phases and instrument utilization. IIA-Net takes five different inputs for each frame: the video frame, a semantic map, bounding boxes, instrument presence, and the phase. First, they extract features with different networks: visual features with ResNet50, and instrument-instrument and instrument-surrounding interactions with a new module called IIM. In this paper, instrument presence and phase are taken from the ground truth; the authors consider that this information could be provided by other types of models. These features are used to anticipate the phase with an MSTCN model. Finally, the features and the anticipated phase are used to make the instrument prediction. IIA-Net is compared to state-of-the-art models on the Cholec80 dataset.
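To make the per-frame input described above concrete, the following is a rough, hypothetical sketch of fusing the listed sources (visual features, IIM interaction features, instrument presence, and phase) into a single vector per frame before the temporal model; the feature dimensions and the fusion-by-concatenation choice are assumptions for illustration, not taken from the paper.

```python
import numpy as np


def fuse_frame_features(visual_feat, iim_feat, instrument_presence, phase_id,
                        num_phases=7):
    """Hypothetical per-frame fusion: concatenate the feature sources listed in
    the review before feeding them to the temporal (MSTCN) model. The
    dimensions and one-hot phase encoding are illustrative assumptions."""
    phase_onehot = np.zeros(num_phases, dtype=np.float32)
    phase_onehot[phase_id] = 1.0
    return np.concatenate([
        visual_feat,            # e.g. a 2048-d ResNet50 embedding
        iim_feat,               # instrument-interaction features from IIM
        instrument_presence,    # binary vector, one entry per tool
        phase_onehot,           # current phase (ground truth in this paper)
    ])


frame_vec = fuse_frame_features(
    visual_feat=np.random.randn(2048).astype(np.float32),
    iim_feat=np.random.randn(4).astype(np.float32),
    instrument_presence=np.array([1, 0, 0, 1, 0, 0, 0], dtype=np.float32),
    phase_id=2,
)
print(frame_vec.shape)  # (2066,) -> one input column per frame for the temporal model
```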

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strength of the paper is the introduction of IIM, a novel instrument interaction module using a semantic map and bounding boxes as input. The second is the extraction of features from different models as input to perform phase and instrument prediction. Finally, the proposed method is compared with state-of-the-art methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weakness is the need to compute the semantic map, bounding boxes, instrument presence, and phase for each frame. So, it is difficult to ensure that the complete method is suitable for clinical use, especially if the computation time needed to create these inputs is too high.

    The second weakness is, for the validation, the mix between computed (semantic map and bounding boxes) and ground-truth (instrument presence and phase) inputs. It is difficult to know how the quality of the computed inputs influences the prediction, and whether computed instrument presence and phase recognition would impact the results. However, this is taken into account by the authors and noted as future work.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    According to the information provided in the paper and the checklist, this paper seems to be reproducible, especially with the release of the code upon acceptance. Data are publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Except for the main weakness, which could only be addressed in an additional journal paper, I only have minor comments. The authors use three metrics: inMAE, eMAE, and pMAE. The last one is defined in [22]; to help readability, it would be good to recall the definition of pMAE. The Table 1 caption indicates that the metrics are inMAE/eMAE, but the corresponding text mentions inMAE and pMAE. Which one is the correct metric for this table, eMAE or pMAE?
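As a brief reminder of the metric family discussed here: inMAE, eMAE, and pMAE are all mean absolute errors of the predicted remaining time, computed over different subsets of frames relative to the anticipation horizon. The helper below only illustrates this shared pattern; the masks shown are assumptions for illustration, and the exact subset rules should be taken from [22].

```python
import numpy as np


def masked_mae(pred_minutes, gt_minutes, mask):
    """MAE restricted to a subset of frames. inMAE, eMAE, and pMAE follow this
    pattern and differ only in how the frame mask is chosen (see [22])."""
    mask = np.asarray(mask, dtype=bool)
    return float(np.abs(pred_minutes[mask] - gt_minutes[mask]).mean())


horizon = 5.0                              # anticipation horizon in minutes
gt = np.array([5.0, 4.0, 2.5, 1.0, 0.0])   # ground-truth remaining time, clipped at the horizon
pred = np.array([5.0, 3.5, 2.0, 1.5, 0.2])

# Illustrative masks (assumptions, not the exact definitions from [22]):
inside_horizon = gt < horizon              # frames where the event lies within the horizon
shortly_before = gt < 0.1 * horizon        # frames shortly before the event occurs
print(masked_mae(pred, gt, inside_horizon), masked_mae(pred, gt, shortly_before))
```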

  • Please state your overall opinion of the paper

    strong accept (9)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Despite the main weaknesses, which could not be addressed in a conference paper, the paper is clear, the method innovative, the validation complete, and the limitations highlighted.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    7

  • Reviewer confidence

    Very confident



Review #3

  • Please describe the contribution of the paper

    The authors propose an MS-TCN model for anticipating both surgical phases and instrument usage before they occur, based on laparoscopic video. Their main contribution lies in extracting information about the observed surgical scene (such as the current phase, instrument interactions, and anatomical structures) and using it as additional features for the temporal model. Also, phase anticipation is newly introduced, as prior work only anticipated instruments. Extensive experiments on the Cholec80 dataset show that they achieve superior results on both anticipation tasks compared to prior work, which only uses visual features.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The experimental results are very strong as they clearly outperform prior work.
    • Although the model relies on rich annotation on the dataset, it shows the potential of anticipating surgical events when the model is provided with adequate information. Since the performance ceiling of anticipation is not known, this model is also very informative as an upper bound for less annotation-intensive approaches.
    • Even without non-visual features, the 2-step ResNet+TCN might outperform the end-to-end AlexNet+LSTM model, which is very interesting for future work. However, this would be more conclusive if the pMAE were also reported in the ablation study, since inMAE and eMAE do not capture “false positives”.
    • Evaluation is done on a large dataset with several insightful metrics and an extensive ablation study.
    • The authors introduce inMAE and eMAE, which are insightful new metrics for anticipation.
    • Section 4.3 discusses the limitations of the method, which is very valuable.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • By relying on the availability of phase and instrument information as well as trained networks for instrument detection and scene segmentation, the applicability of the model is strongly limited to academic datasets like Cholec80 (for now). This stands in contrast to the motivation of the baseline method of Rivoir et al., which only requires image-level labels for the events of interest. Nevertheless, for settings with rich annotation, this is a very strong model and can serve as an upper bound for further research (see ‘Strengths’).
    • In the ablation study (Table 1), only inMAE and eMAE are reported. Due to the analogy of pMAE and inMAE to the classification metrics precision and recall, they might provide a more complete evaluation of the ablated models. These metrics were also chosen in Table 2 and would make the overall evaluation more consistent.
    • Although the authors promise to publish the code, a few more details regarding the training setup would be helpful. There is no information on loss functions or hyperparameters of the two-stage training setup.
    • The simulated images used to train the segmentation model were generated by training a GAN on the Cholec80 dataset (including the test set of this work). This could cause a minor data leak, as the segmentation network was trained to recognize anatomical textures that were themselves generated to resemble textures in the test data. However, the leaked data is only indirectly “visible” to the model and doesn’t leak ground truth for anticipation. I believe it is not a major flaw.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors promise to publish their code, which makes the method reproducible. However, the training process could be described in more detail in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • A few more details on the training process would benefit the clarity of the paper.
    • Using metrics inMAE and pMAE for the ablation study would make the evaluation more consistent. Alternatively, it could be elaborated why eMAE was chosen here.
    • The possible data leak regarding the segmentation network could be mentioned in the limitations sections as it is basically a weaker form of the ground truth used for phase and instrument presence.
    • In section 3.1, is the segmentation model really from [5] (YOLOv3) or is that an incorrect citation? If I understand correctly, this model can only do bounding-box detection.
    • The eMAE improves drastically when adding the IIM module. Is there an explanation/hypothesis for this?
    • In the instrument-surrounding module in Fig. 3: Why are the instrument bounding boxes drawn on the feature map learned from the segmentation map? If I understand correctly, bounding boxes and the distinction between different instrument types are not utilized here.
    • Why is the name “instrument-surrounding” chosen here? I find it a little bit misleading, since there is no mechanism that explicitly analyzes structures that are close to instruments (if I understand correctly) and most spatial information is probably lost in the subsequent pooling operation anyway (see the sketch after this list).
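Regarding the last two points, below is a minimal sketch of the general pattern being questioned (encoding the semantic map with a small CNN and pooling it into a fixed-length descriptor); the channel sizes and the pooling choice are assumptions, not the paper’s exact instrument-surrounding module.

```python
import torch
import torch.nn as nn

# Illustrative only: encode the semantic segmentation map and global-average-
# pool it into a fixed-length "surrounding" descriptor. The channel sizes and
# pooling choice are assumptions; the example also shows why pooling discards
# most of the spatial layout, as noted above.
encoder = nn.Sequential(
    nn.Conv2d(in_channels=8, out_channels=32, kernel_size=3, padding=1),  # 8 semantic classes
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),   # collapse the spatial dimensions
    nn.Flatten(),              # -> (batch, 32) per-frame descriptor
)

semantic_map = torch.randn(1, 8, 120, 214)   # per-pixel class scores for one frame
print(encoder(semantic_map).shape)           # torch.Size([1, 32])
```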
  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors provide a strong new model for anticipating events during surgery. The task is very relevant for CAS and the model clearly outperforms previous work. Although the model requires more supervision / pretrained models, the method is still interesting as it shows that events can be anticipated more accurately when more explicit information (e.g. current phase, visible instruments/organs) is known.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    6

  • Reviewer confidence

    Very confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes a novel DL method for anticipating events during surgery. Overall, the reviews are very positive, and the strengths of the paper are the novelty, the experimental setup, and the results. However, there are some issues raised related to the motivation, the description of the method, and the clinical benefit. I would like to encourage the authors to clarify these as well as state the run-time performance (is it actually applicable in a real-world setup?).

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    10




Author Feedback

Review 1:

Running time: Our experiments show that the running time per frame is Max(YOLO, UNet, ResNet50) + MSTCN = Max(0.0090, 0.0087, 0.0142) + 0.0151 = 0.0293 s, 10% faster than [22], which takes 0.0328 s. The Max operation is used because these models can run in parallel. This speed shows our model is applicable in a real-world setup.

Motivation: Our work benefits three areas (Maier-Hein et al., Nat Biomed Eng, 2017; https://doi.org/10.1038/s41551-017-0132-7). First, tool anticipation offers a useful reference for decision making in a robotic assistance system: it helps to identify when tool usage is triggered, so that a robotic system can decide when to intervene. Second, for context-aware assistance, anticipating tools such as the irrigator can help early detection and prevention of potential complications, e.g., massive haemorrhage. Third, phase anticipation allows real-time instruction for automated surgical coaching.

Difference from prior work: Our work is one of the papers on surgery time estimation. However, one prominent distinction compared to previous papers is that our work deals with tool/phase-wise remaining time estimation, while the others only predict the terminating instant; therefore, our work enjoys finer time granularity. [20] aims to indicate the current tool/phase, but ours forecasts which tool/phase will come into play. [26] merely cares about the surgery’s completion, but ours can indicate each tool/phase individually throughout the surgery. Also, our work outperforms prior works by inferring the surgeon’s action through IIM.

Method description: The instrument interaction detection does not involve any network. Instead, we represent the instrument interaction feature as a length-4 feature vector in Eq. 1 and calculate it by combining the coordinates of two tool-detection bounding boxes.

Network description: Thank you for pointing this out. The unclarity comes from our mistake in the citation: we use UNet rather than YOLO to extract the segmentation. In Fig. 3, the detections and segmentations from YOLO and UNet are used in IIM. We favor YOLO over Faster R-CNN because the former better meets the real-time requirement. YOLOv4 was not explored because YOLOv3’s performance was good enough for our use. We will correct the citation and update the experiment with YOLOv5.

Evaluation: We agree that a real-world test is needed and consider it as a next step. The current study shows the feasibility of the proposed method, which is a necessary step prior to real-world testing. Real-world testing involves many sub-steps, including the hospital’s ethics approval, patients’ consent, surgeons’ collaboration, availability of extra funding, etc. For now, we have added qualitative results in the supplementary material. Alongside MAE, we hope these promising results will encourage surgeons’ collaboration.

Review 2:

Running time: Please refer to the response to Reviewer 1 for details.

Phase & tool signals: Thanks for raising this weakness. Prior works on phase/tool detection on Cholec80 are reasonably accurate, meaning that real-world signals would not degrade the results much. We plan the relevant analysis as future work and will consider realistic datasets where these signals might not be available.

Review 3:

Limit on dataset: Please refer to the response to Reviewer 2 for details.

eMAE hypothesis: Our hypothesis is that the interaction feature from IIM is better at short-horizon anticipation. The feature bears the form [tool, action, anatomy]. We used the Surgical Action Triplet Recognition 2021 challenge at MICCAI and computed a histogram to confirm that the action is an essential clue for forecasting the next minute’s future.

Instrument-surrounding: Sorry for the ambiguity in our figure. The extra bounding boxes are removed in the new version. As you said, we do not model the instrument-surrounding feature explicitly. Instead, the semantic map contains the tool location (blue part in Fig. 3), through which the surrounding features are modeled implicitly.
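The rebuttal states that the instrument-instrument interaction feature is a length-4 vector computed from the coordinates of two detected tool bounding boxes (Eq. 1 of the paper). As a purely illustrative guess at the flavor of such a geometric feature (the actual Eq. 1 may differ), one possible construction is the normalized center offset and relative size of the second box with respect to the first:

```python
def interaction_feature(box_a, box_b, eps=1e-6):
    """Hypothetical length-4 interaction feature between two tool bounding
    boxes given as (x1, y1, x2, y2). This is a sketch of one plausible
    geometric encoding, not the paper's exact Eq. 1."""
    ax, ay = (box_a[0] + box_a[2]) / 2.0, (box_a[1] + box_a[3]) / 2.0
    bx, by = (box_b[0] + box_b[2]) / 2.0, (box_b[1] + box_b[3]) / 2.0
    aw, ah = box_a[2] - box_a[0], box_a[3] - box_a[1]
    bw, bh = box_b[2] - box_b[0], box_b[3] - box_b[1]
    return [(bx - ax) / (aw + eps),   # horizontal offset, scaled by box A width
            (by - ay) / (ah + eps),   # vertical offset, scaled by box A height
            bw / (aw + eps),          # relative width
            bh / (ah + eps)]          # relative height


print(interaction_feature((10, 10, 50, 60), (40, 30, 90, 80)))
# ≈ [0.875, 0.4, 1.25, 1.0] -> no network needed, consistent with the rebuttal
```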




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposes a DL-based method to predict phases and instrument utilization during surgery. The reviews are very positive, and the authors convincingly addressed the concerns that were raised, in particular the motivation and the relation to existing methods. I would like to encourage the authors to double-check the real-time performance comparison to [22], as this can depend on many factors (e.g., hardware, framework version, …), but to mention their inference performance in the final manuscript, as this is a major factor for this application. I do agree with the authors that for a feasibility study a real-world setup, as recommended by R1, may be out of scope. The proposed method has been evaluated on a public dataset and compared to other state-of-the-art methods.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    3



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper proposes a new architecture (IIA-Net) for anticipating surgical phase and instrument usage from video data of the Cholec80 dataset, taking instrument interactions into account. Surgical workflow anticipation is a topic of interest in the CAI community, the paper is well-written, the methodology is innovative with strong validation and experimental results. Concerns from reviewers (including regarding run-time performance and motivation of the work) have been addressed by the rebuttal.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    1



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have addressed all major concerns highlighted by the meta reviewer in the rebuttal and have provided the run-time performance. The motivation, method description and clinical benefits have been clarified. These justifications must be included in the camera-ready.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    3


