
Authors

Xiaojie Gao, Yueming Jin, Yonghao Long, Qi Dou, Pheng-Ann Heng

Abstract

Real-time surgical phase recognition is a fundamental task in modern operating rooms. Previous works tackle this task relying on architectures arranged in spatio-temporal order, however, the supportive benefits of intermediate spatial features are not considered. In this paper, we introduce, for the first time in surgical workflow analysis, Transformer to reconsider the ignored complementary effects of spatial and temporal features for accurate surgical phase recognition. Our hybrid embedding aggregation Transformer fuses cleverly designed spatial and temporal embeddings by allowing for active queries based on spatial information from temporal embedding sequences. More importantly, our framework processes the hybrid embeddings in parallel to achieve a high inference speed. Our method is thoroughly validated on two large surgical video datasets, i.e., Cholec80 and M2CAI16 Challenge datasets, and outperforms the state-of-the-art approaches at a processing speed of 91 fps.
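As a rough illustration of the querying idea described in the abstract, the sketch below lets one spatial embedding act as the query over a short sequence of temporal embeddings, using plain scaled dot-product attention. It is not the authors' implementation: the toy vectors, the function names `softmax` and `attend`, and the omission of learned projections, multiple heads, and residual connections are all simplifying assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    query:  one spatial embedding (list of floats)
    keys:   sequence of temporal embeddings used for matching
    values: sequence of temporal embeddings that get averaged
    """
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    weights = softmax(scores)
    fused = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return fused, weights


# Hypothetical toy data: a 3-dimensional spatial embedding for the current
# frame and temporal embeddings for three frames (not taken from the paper).
spatial_query = [1.0, 0.0, 0.0]
temporal_seq = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

fused, weights = attend(spatial_query, temporal_seq, temporal_seq)
print("attention weights:", weights)
print("fused embedding:", fused)
```

In this toy case the temporal embedding most similar to the spatial query receives the largest weight, which is the intuition behind using spatial information to select from temporal context.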

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87202-1_57

SharedIt: https://rdcu.be/cyhRc

Link to the code repository

https://github.com/xjgaocs/Trans-SVNet

Link to the dataset(s)

N/A

Reviews

Review #1

Please describe the contribution of the paper

The authors propose a novel method to fuse spatial and temporal features in order to improve phase and action recognition accuracy in spatio-temporal networks.

To do so, the authors propose using transformers to encode a sequence of temporal embedded features and combine them with spatial features.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The approach is novel and interesting. It references all the relevant works in the area and compares them to very recent work in the field (e.g. TeCNO).

The paper fuses 3 different state-of-the-art networks in an intuitive way that makes sense. CNN network to extract spatial features, temporal convolutional network to create a temporal embedding from them, a transformer network to summarize temporal features and a fusion strategy to combine them all.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The paper is generally technically thorough and novel, but it would benefit from further analysis of the impact of the different building blocks.

The performance of the proposed model is not far from that of TeCNO, so it is unclear to me whether the authors obtained better convergence from the same TeCNO model or whether their temporal aggregation really had an impact. It is also unclear whether the authors report baseline results from their own TeCNO implementation or results from the TeCNO paper.

A standard table with incremental improvements as the building blocks are added one by one would shed some light on the relevance of each block.

The authors report 97 fps performance… and while this might be true for the decoder part, I find it very hard to believe for the end-to-end pipeline, as ResNet50 alone would have to be very, very well optimized (e.g. TensorRT) to achieve such a frame rate.
Please rate the clarity and organization of this paper

Excellent
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors use public datasets and will make their code available soon.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

See above. The paper is well written, interesting and novel. It would benefit from further comparative tables showcasing the relative influence of the different building blocks.
Please state your overall opinion of the paper

strong accept (9)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Good paper, with interesting additions to the rapidly growing field of phase detection, which is adapting state-of-the-art technology from CVPR at a very fast pace.
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

4
Reviewer confidence

Very confident

Review #2

Please describe the contribution of the paper

This paper proposes a new network architecture for surgical workflow segmentation. The main difference relative to prior work consists in aggregating temporal and spatial features for the final classification decision, as opposed to the more common approach of performing classification on temporal features alone (after extracting them from spatial features). This is accomplished via a transformer sequential model and results slightly improve over the state-of-the-art in common benchmark datasets.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Paper is well written. Motivation and contributions are clear.
- Experiments are well detailed and consider reasonable ablation scenarios that justify architecture choices
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- Improvements to state-of-the-art (vs. TeCNO, which should have similar time performance) seem relatively incremental, but it might just be that current benchmarks are reaching their natural plateau. In this sense, some expressions such as “significantly outperforms the state-of-the-art” and “outperforms the state-of-the-art methods by a large margin” sound like overstatements.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

No code was provided, but both the algorithm and the experimental set-up are well detailed and seem clear enough. Results are tested on publicly available datasets.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
Table 1 - Can you comment on TeCNO and Trans-SVNet having the same number of parameters (24.7M)? I would assume the proposed method would have more, i.e. the TeCNO parameters (embedding model) plus all parameters in the Aggregation model. Possibly a relatively small increase due to dimensionality reduction?

“Though a multi-step solution like OHFM, our approach” - grammar in this sentence looks odd

On feature dimensionality reduction: In the embedding, spatial features are reduced to l’ with length 32 as input to TCN. In aggregation these are reduced to tilde(l) with an unspecified length N.
- What is this value N?
- How were these values chosen, and how fundamental is their tuning for model convergence?
Please state your overall opinion of the paper

Probably accept (7)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Paper presents an incremental improvement to an established problem. Improvement is based on a novel idea, it is clearly presented and well justified, and overall seems reproducible - and therefore a useful contribution.
What is the ranking of this paper in your review stack?

2
Number of papers in your stack

3
Reviewer confidence

Confident but not absolutely certain

Review #3

Please describe the contribution of the paper

This paper utilizes a Transformer layer to introduce self-attention and spatial-temporal feature fusion for surgical phase recognition. The proposed method shows better performance compared to several state-of-the-art methods. The ablation studies reflect the reasoning behind introducing a Transformer layer to solve the problem.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The authors propose a reasonable way of introducing Transformer into the surgical phase recognition problem. It makes the most of the spatial and temporal features.

The ablation studies contain convincing evidence for the motivation and assumptions of the proposed method.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The proposed method is not novel. Transformer has been used in different medical-related problems for a few years.

The paper does not contain all the explanation or data needed to support its conclusion. In the abstract, the authors highlight that the framework is lightweight and processes embeddings in parallel to achieve a high inference speed. I could not find anything that supports this, such as inference speed or parallel inference applying only to the proposed method.

The presentation could be improved. I notice the authors use different symbols for explaining the Transformer layer (q, s) and the hybrid embedding aggregation (l, g). Attn() is mentioned for the Transformer layer, but Trans() is used when describing the hybrid embedding aggregation. It would help if the authors could resolve this inconsistency.

The point that concerns me most is the experiment. I would totally agree with the authors that M2CAI16 and Cholec80 were large surgical video datasets when they were first used around 2015 and 2016. Half a decade later, we have kept up with state-of-the-art deep learning models with millions of parameters, but the datasets have not followed.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

No concern for reproducibility.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

Any improvement or clarification for the points mentioned in the weaknesses part is appreciated, including: It would be nice if the authors could provide details on how the framework is lightweight compared to other frameworks and how fast the inference process is. Or maybe remove the conclusion from the abstract.

It would be nice if the authors could unify the functions/symbols/terms used in paper presentation.

It would be nice if the authors could try their proposed methods on large (in the year of 2021) datasets and compare with state-of-the-art methods. Though I understand how M2CAI16 and Cholec80 are convenient for researchers since everyone uses the same data and evaluation protocol to get the metrics.
Please state your overall opinion of the paper

borderline reject (5)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The rating is given based on the lack of novelty, could-be-improved presentation, and using not-really-large datasets.
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

1
Reviewer confidence

Very confident

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The paper proposes a transformer-based method for surgical phase recognition that aggregates both spatial and temporal features to improve the classification decision. The paper is well written in general but requires improved presentation in some sections, as highlighted by the reviewers. The novelty of the paper needs to be highlighted, as in the current presentation it merely appears to be an integration of 3 SOTA methods. Moreover, the reviewers have raised several concerns: the performance comparison with TeCNO is unclear, incremental improvements are not reported, a computational comparison is missing (no justification is provided for the model being lightweight), and the parameter comparison between TeCNO and the proposed method is not justified. These concerns should be addressed in the final version.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

3

Author Feedback

We thank the reviewers and AC for their efforts in evaluating our manuscript. All reviewers appreciate our proposed Transformer-based method to conduct hybrid embedding aggregation for surgical phase recognition. We respond to the concerns raised during review and describe changes to the manuscript to address them.

Reply to Reviewer 2

R2 is confused about the TeCNO baseline. We reported the re-implemented results of TeCNO because, unlike the TeCNO paper, we did not use tool presence labels. This re-implemented TeCNO model directly generates the temporal embeddings we need without changing parameters. R1 finds it hard to believe the reported processing speed. We agree that ResNet50 accounts for most of the processing time. All the vectors to be processed after ResNet50 have a very small dimension (less than 32), which makes the computing time negligible compared to the spatial feature extraction.

Reply to Reviewer 4

R4 is confused about the number of parameters. In fact, the parameter increment of our Trans-SVNet over TeCNO is 29,568, which is negligible relative to 24.7M. N is the dimension of the one-hot phase label, i.e., 7 for Cholec80 and 8 for M2CAI16. As for the hyperparameters (2048, 32, N, etc.), we employ the same settings as the TeCNO official code without further tuning.
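To put the size of that increment in perspective, the short calculation below relates the stated 29,568 extra parameters to the 24.7M reported for TeCNO. The variable names are illustrative, and 24.7M is itself a rounded figure, so the result is only approximate.

```python
# Figures quoted in the author feedback; 24.7M is a rounded count.
extra_params = 29_568
tecno_params = 24.7e6

# Relative size of the aggregation module compared with the TeCNO baseline.
relative_increase = extra_params / tecno_params
total_params = tecno_params + extra_params

print(f"relative increase: {relative_increase:.2%}")
print(f"total parameters: {total_params / 1e6:.1f}M")
```

The increase is on the order of 0.1%, so, assuming the TeCNO count is close to 24.7M, both models would plausibly round to the same figure in a table with one decimal place.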

Reply to Reviewer 5

R5 is concerned about the parallel inference advantages of our method. We argue that this speed boost is mainly relative to the LSTM-based methods employed by many previous works. R5 is not satisfied with the public datasets we used. In fact, these two datasets essentially form a standard benchmark and are widely used in the phase recognition task to validate methods. In the future, we will further validate our method if a new dataset becomes publicly available, and possibly on our own collected dataset. In addition, our framework is lightweight because of the negligible parameter increment stated above. With a well-trained ResNet50, i.e., with spatial embeddings available, our method needs only several minutes to converge on one GPU.

We again thank all reviewers for their invaluable comments and will revise our final version accordingly.


Trans-SVNet: Accurate Phase Recognition from Surgical Videos via Hybrid Embedding Aggregation Transformer