
Authors

Adam Schmidt, Aidean Sharghi, Helene Haugerud, Daniel Oh, Omid Mohareri

Abstract

Automatic surgical activity detection in the operating room can enable intelligent systems that potentially lead to more efficient surgical workflow. While real-world implementations of video activity detection in the OR most likely rely on multiple video feeds observing the environment from different view points to handle occlusion and clutter, the research on the matter has been left under-explored. This is perhaps due to the lack of a suitable dataset, thus, as our first contribution, we introduce the first large-scale multi-view surgical action detection dataset that includes over 120 temporally annotated robotic surgery operations, each recorded from 4 different viewpoints, resulting in 480 full-length surgical videos. As our second contribution, we design a novel model architecture that can detect surgical actions by utilizing multiple time-synchronized videos with shared field of view to better detect the activity that is taking place at any time. We explore early, hybrid, and late fusion methods for combining data from different views. We settle on a late fusion model that remains insensitive to sensor locations and feeding order, improving over single-view performance by using a mixing in the style of attention. Our model learns how to dynamically weight and fuse information across all views. We demonstrate improvements in mean Average Precision across the board using our new model.
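
The attention-style late fusion described in the abstract can be illustrated with a minimal sketch. This is not the authors' model: the function name `fuse_views` is hypothetical, and the per-view scores are passed in directly, whereas the paper's model learns how to weight the views. The sketch only shows why a softmax-weighted sum over views does not depend on the order in which the views are fed.

```python
import math


def fuse_views(view_features, view_scores):
    """Fuse per-view feature vectors with softmax attention weights.

    view_features: list of equal-length lists of floats, one per camera view.
    view_scores:   list of floats, one relevance score per view (assumed given
                   here; a real model would predict them from the features).
    Returns (fused_feature, weights).
    """
    if not view_features or len(view_features) != len(view_scores):
        raise ValueError("need one score per view and at least one view")

    # Numerically stable softmax over the view scores.
    peak = max(view_scores)
    exps = [math.exp(s - peak) for s in view_scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum over views, computed independently for each feature dimension.
    dim = len(view_features[0])
    fused = [
        sum(w * feat[d] for w, feat in zip(weights, view_features))
        for d in range(dim)
    ]
    return fused, weights


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # four toy views
    scores = [0.5, 2.0, -1.0, 0.0]
    fused, weights = fuse_views(feats, scores)
    print("weights:", weights)
    print("fused:  ", fused)
```

Because each weight is tied to its own view and the sum is commutative, reordering the views (together with their scores) gives the same fused feature up to floating-point rounding. This is the order-insensitivity the abstract attributes to the late fusion design.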

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87202-1_60

SharedIt: https://rdcu.be/cyhRf

Link to the code repository

N/A

Link to the dataset(s)

N/A

Reviews

Review #1

Please describe the contribution of the paper

This paper proposes a new multi-view surgical action detection dataset that includes 120 surgical operations, each recorded from four viewpoints. This is a valuable dataset for the field of surgery analysis. The other contribution is the proposed model for action detection based on the multi-view videos.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

Both the dataset and action detection model are important. The dataset is more valuable since it is the first large-scale multi-view dataset for surgical action detection in this area.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

(1) It would be nice if the dataset also provided videos captured by normal cameras. Currently the dataset only provides videos captured by ToF sensors, which have some advantages. If the authors think the normal videos cannot be provided, please clarify the reason.
(2) From Table 2, it appears that the performance gain of the proposed method over [20] comes mainly from the single-view model. For example, in the first row, “OR”, the gap between “1-view” and [20] is almost 20 points, whereas the gain from “1-view” to “2-view” or from “2-view” to “4-view” is only 1-2 points. Therefore, it appears that adding more views does not help much. Can the authors explain this phenomenon?
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors do not claim that they will release the dataset or code in the paper.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

Please see the weaknesses listed above.
Please state your overall opinion of the paper

accept (8)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The multi-view dataset for action detection in surgery videos is new. I believe it will be a good research direction.
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

3
Reviewer confidence

Very confident

Review #2

Please describe the contribution of the paper

This paper addresses activity recognition in the operating room using a multi-view formulation. The major contribution is a large-scale, diverse multi-view video dataset, with results on this dataset showing the effectiveness of the multi-view formulation. The other contribution is an attention-based view fusion module, which gives slight improvements in the experiments.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The idea of multi-view is well-motivated to handle occlusion and clutter in clinical situations.
2. A large-scale multi-view dataset, with multiple surgeons, procedure types, and operating rooms included.
3. Experiments clearly show that multi-view performance is better than single-view performance.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. No proper discussion on previous multi-view datasets. The authors claim that there is no existing suitable dataset for multi-view surgical activity recognition. But this reviewer thinks the CATARACTS dataset of the EndoVis challenge could be relevant. In the EndoVis 2018 CATARACTS dataset, there are two views provided, i.e., microscope videos and tray videos. And later on, EndoVis 2020 CATARACTS is for surgical workflow recognition, which is relevant to the activity recognition in the paper.
2. Results in Table 3 show that the proposed fusion module does not significantly outperform baselines (90.26 vs. 89.13 / 90.19), which undermines the second contribution of the paper.
3. Parts of the experimental setup are unclear.
  - In section 5, paragraph 4, last line: why is the model tested in a bidirectional manner? Does this contradict the stated goal of learning a model that can run online (section 4, paragraph 3)?
  - The authors say that the single-view model outperforms [20] in Table 2 because new videos were added to the dataset. But in that case the results cannot be compared directly, because different data is used.
4. The writing and clarity in the method section need to be improved. See below for details.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The reproducibility is good in general, but the clarity of the method section needs to be improved for better reproducibility. This reviewer also encourages the authors to release the data to make the results easier to reproduce.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
1. The method section is not as clear as the rest of the paper.
  - In sections 4 and 4.1, the process of video synchronization and alignment is not clear.
  - Why do Eqs. 1, 2, and 4 keep the superscript ‘i’? Are there multiple mixers, one for each view, or a single mixer for all views? This question might be a consequence of the unclear description of the video alignment process.
  - The mixer in Fig. 2, although conceptually understandable, does not match the text well. Looking only at the figure, the views seem to be directly weighted by some transformation of the fused features. The attention mechanism in Eq. 3 is not illustrated in the figure.
  - Why not use the already defined symbol ‘g_multi’ in Eq. 4?
  - In abstract line 2, “more efficient surgical workflow and efficiency” is wordy.
2. It would help to include some visualizations showing how well the model handles occlusions.
Please state your overall opinion of the paper

Probably accept (7)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This reviewer thinks the use of multi-view data and the proposed large-scale dataset are clinically meaningful. Given that the weaknesses can hopefully be addressed by a careful revision, this reviewer recommends ‘Probably accept’.
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

3
Reviewer confidence

Very confident

Review #3

Please describe the contribution of the paper

This paper studies the problem of multi-view surgical video action detection. A multi-view action detection framework is proposed. A new dataset is introduced.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

1) This paper introduces a new dataset for action detection in multi-view surgical videos.
2) An attention mechanism is leveraged to fuse multi-view video information to detect actions.
3) Experiments on the new dataset show that the proposed method outperforms the existing method for surgical action detection in operating rooms.
4) The ablation study shows that the proposed fusion method is effective.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

1) The proposed mixer in Fig. 2 appears to be more than just a weighted sum, followed by a fully-connected layer, which is not the same as Equation (4). This needs to be clarified.
2) The paper does not report precision, recall, and F1 score as in [20]. Performance on the different surgical procedures is also not reported.
3) More details about the model architecture and implementation should be provided, such as the number of units in the GRUs, the number of layers, etc.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The proposed method is clear, but details are needed to reproduce the results.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

In Table 1, the numbers are not well formatted.
Please state your overall opinion of the paper

borderline accept (6)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper introduces a new dataset for multi-view surgical action detection. A new multi-view fusion method is proposed for this problem. The experimental results also verify the effectiveness of the method.
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

5
Reviewer confidence

Somewhat confident

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The paper received collectively positive reviews. However, some aspects of the work need further clarification. Please consider addressing them in the final version. These include methodological details; a review of the literature and previous multi-view datasets for similar applications; justification for adding more views based on the experimental results; and clarification of the experimental setup.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

2

Author Feedback

N/A


Multi-View Surgical Video Action Detection via Mixed Global View Attention