
Authors

Tobias Czempiel, Magdalini Paschali, Daniel Ostler, Seong Tae Kim, Benjamin Busam, Nassir Navab

Abstract

In this paper we introduce OperA, a transformer-based model that accurately predicts surgical phases from long video sequences. A novel attention regularization loss encourages the model to focus on high-quality frames during training. Moreover, the attention weights are utilized to identify characteristic high attention frames for each surgical phase, which could further be used for surgery summarization. OperA is thoroughly evaluated on two datasets of laparoscopic cholecystectomy videos, outperforming various state-of-the-art temporal refinement approaches.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87202-1_58

SharedIt: https://rdcu.be/cyhRd

Link to the code repository

https://github.com/tobiascz/OperA

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a new algorithm for surgical workflow segmentation. The main novelty lies in using a transformer network for the temporal model instead of the state-of-the-art LSTM / TCN. The key contribution in making this model work is an attention regularizer that weights CNN input quality during training of the sequential model.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The visualisation of highest/lowest attention frames for each phase is interesting and expands on the algorithm contribution
    • Well written and clearly presented
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Algorithmic novelty is relatively small

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Code is not available, but the method and relevant parameters are clearly presented. Results on the public benchmark are reproducible; results on the in-house dataset may not be reproducible unless it is made available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Cholec80 is typically split into fixed training/test data. Using a 5-fold cross validation is perfectly reasonable, but makes results in this paper slightly less comparable to the relevant literature.

    Also, it’s not entirely clear how the 5-fold split was done, given other statements about 20 test videos out of 80 for Cholec80 (4-fold?) and 20 out of 85 videos for the new dataset.

  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Relatively incremental accuracy improvements to an established problem, but visualisation of highest/lowest attention frames adds interest to the paper and potentially promotes discussion into new ideas in surgical workflow segmentation.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    3

  • Reviewer confidence

    Confident but not absolutely certain



Review #2

  • Please describe the contribution of the paper

    The paper presents a novel attention-based deep architecture for video-based surgical phase recognition, consisting of 11 stacked transformer encoder blocks that process frame-wise video features, which in turn are extracted with a ResNet-50 (trained on phase recognition and, if labels are available, tool presence detection as well). Moreover, the authors propose to add a regularization loss so that more attention is being paid to those frames that correspond to confident phase predictions. Evaluation is performed on the publicly available Cholec80 dataset and a comparably large in-house dataset. Moderate improvements over the typical CNN-LSTM baseline are observed.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • A temporal model based on self-attention has not been used for surgical phase recognition before and can thus be considered novel, innovative, and interesting for the community
    • The attention regularization loss is another novelty that extends the well-known transformer encoder architecture
    • Paper is well written and includes a concise but comprehensive overview of related work as well as appealing visuals
    • Comprehensive evaluation including both ablation experiments and comparisons to the state of the art
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors deviate from the common 40 + 40 (or 40 + 8 + 32) data split on Cholec80 in order to use more of the available data for training. This unfortunately makes it difficult to compare the results to prior and concurrent work. Can the authors add the results on the 40 + 8 + 32 data split for backwards compatibility?
    • A few other things regarding the evaluation seem strange and need to be clarified or fixed, see detailed comments.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • Source code will be made available
    • Evaluation was performed on at least one publicly available dataset
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Evaluation

    • How is it possible to perform 5-fold cross-validation on a dataset with 80 samples when each fold (i.e., each test set) consists of 20 samples? Do the folds overlap?
    • The reported standard deviations are very small because they refer to the variation of the means calculated for every fold. However, the variation within the samples of the data set (or each data fold) is much more interesting and should be reported instead.
    • Looking at the supplementary material, I suspect that the F1 score is calculated at the very end as the harmonic mean of average precision and average recall. Please check this, because this approach leads to overly high F1 scores – you should calculate F1 individually for every phase and every sample (= video) in the dataset and average over phases and samples afterwards.
    • I think it would be good to clearly state that you reproduced the results of the previous methods on your dataset / data split (this kind of benchmarking is also an important contribution) and add some more details on how you reproduced the results. The results of MTRCNet-CL are considerably worse than in the original paper – can you comment on reasons for that?
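    The F1 concern above can be made concrete with a small sketch. The per-video counts below are hypothetical numbers chosen only to illustrate that taking the harmonic mean of *averaged* precision and recall (variant A) can yield a higher score than averaging per-video F1 values (variant B), which is the effect the reviewer describes:

    ```python
    import numpy as np

    def f1_from_counts(tp, fp, fn):
        p = tp / (tp + fp)
        r = tp / (tp + fn)
        return 2 * p * r / (p + r)

    # Hypothetical per-video counts for one phase in two videos.
    videos = [dict(tp=90, fp=10, fn=50), dict(tp=40, fp=50, fn=10)]

    # Variant A (criticized): harmonic mean of the *averaged* precision and recall.
    avg_p = np.mean([v["tp"] / (v["tp"] + v["fp"]) for v in videos])
    avg_r = np.mean([v["tp"] / (v["tp"] + v["fn"]) for v in videos])
    f1_of_averages = 2 * avg_p * avg_r / (avg_p + avg_r)

    # Variant B (recommended): compute F1 per video, then average.
    f1_per_video = np.mean([f1_from_counts(**v) for v in videos])

    # With these counts, variant A comes out higher than variant B.
    ```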

    Can you elaborate more on why adding the positional encoding deteriorates results? I feel that including information regarding the order of the video features should actually be important for a problem such as phase recognition.

    I think it would be interesting to additionally visualize the attention weights (complete matrix) in order to see which frames attend to which other frames specifically.

    I had to re-read section 2.2 a few times in order to understand that the “normalized” attention for frame j simply quantifies how much attention is being paid to this frame on average, by all frames i that are “allowed” to attend to this frame (i.e. M_ij = 1, which is the case if i >= j). I think it would help to clearly define A_ij and M_ij so that the calculation of n_j is easier to understand – for example: A_ij quantifies how much attention is being paid by frame i (query) to frame j (key) and M_ij = 0 if j > i else 1 (because frame i can only attend to frame j if j <= i)
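    To make the suggested definitions concrete, here is a minimal NumPy sketch of a causally masked attention matrix and the normalized frame-wise attention n_j as understood from the review above; the attention logits are random placeholders, not the paper's actual values:

    ```python
    import numpy as np

    T = 5  # number of frames (toy example)
    rng = np.random.default_rng(0)

    # Raw attention logits; A[i, j] = attention paid by frame i (query) to frame j (key).
    logits = rng.normal(size=(T, T))

    # Causal mask: M[i, j] = 1 if j <= i else 0 (a frame may only attend to the past).
    M = np.tril(np.ones((T, T)))

    # Masked softmax over keys, as in a causal transformer.
    logits = np.where(M == 1, logits, -np.inf)
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)

    # Normalized frame-wise attention n_j: average attention received by frame j
    # over all frames i that are allowed to attend to it (those with i >= j).
    n = A.sum(axis=0) / M.sum(axis=0)
    ```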

    Typo: “potentially potentially” (page 6)

  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    novelty of the method and convincing evaluation results

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #3

  • Please describe the contribution of the paper

    The paper proposes to apply a transformer to the surgical phase recognition task. Attention regularization is proposed to avoid relying too heavily on low-confidence frames. Experiments on two datasets (one public, one in-house) are performed to validate the effectiveness of the method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Applying the transformer to surgical phase recognition task
    • Extensive experiments and detailed ablation studies are performed.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The performance improvement is sort of marginal.
    • The results of two comparison methods [14] [15] are not consistent with their original papers. The fairness of such comparison needs to be verified.
    • The method novelty is sort of weak, which mostly relied on the well-established transformer framework.
    • Some notations are not clearly described in the Method section, leading to difficulty in understanding the method.
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of the paper is satisfactory.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Major:

    • The method novelty is somewhat weak. The paper is mostly based on the transformer architecture, a well-established framework that has recently been widely adopted for various computer vision tasks. The other main contribution pointed out by the authors, i.e., attention regularization, is also of limited novelty and cannot raise the paper above the standard of MICCAI acceptance.
    • Some notations are not clearly described, leading to difficulty in understanding the method, such as the meaning of M_ij in Sec 2.2 and the definition of L_c in Sec 2.3.
    • Inconsistent results for the comparison methods ‘ResLSTM’ and ‘MTRCNet-CL’. The results shown in Table 2 on the Cholec80 dataset are not consistent with (quite a bit lower than) the results reported in the original papers [14] and [15].
    • For the CSW dataset, why are results listed for ‘ResLSTM’ but not for ‘MTRCNet-CL’?
    • The improvement is somewhat incremental, with only a 0.4%–1.2% increase in Acc and F1.
    • How does the frame-wise attention regularization (Sec 2.2) affect the final performance? The validation of this key component is missing from the experiment part.

    Minor:
    • What does the ‘causal masking’ mean?
    • What does VRAM mean?
    • The language expression should be improved.
  • Please state your overall opinion of the paper

    probably reject (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Marginal improvement with poor expression.

  • What is the ranking of this paper in your review stack?

    3

  • Number of papers in your stack

    4

  • Reviewer confidence

    Very confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper proposes to use transformer networks for the task of surgical phase recognition. The reviewers have raised several concerns related to the use of 5-fold cross-validation instead of the standard Cholec80 split, which makes comparison of the proposed method to other SOTA methods difficult. Moreover, details of the 5-fold split are missing. These need to be clarified and justified explicitly. The reproduced 5-fold results of ResLSTM and MTRCNet-CL are considerably inconsistent with the original papers. Why is that the case? The authors should address the reviewers’ comments and should also highlight the key novelty of this work.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    8




Author Feedback

We would like to thank the reviewers for their supportive and constructive feedback. We would also like to thank them for acknowledging the attention regularization loss as a “novelty that extends the well-known transformer” architecture (R2). Further, we are pleased that our “visualization of the highest/lowest attention frames adds interest to the paper” (R1) which is solidified by a “comprehensive evaluation” (R2) through “extensive experiments” (R3).

1) MR1, R1, R2 commented that details and motivation regarding the 5-fold cross validation could be provided as it is deviating from the standard Cholec80 split. Since this dataset only has 80 videos, we believe it is important to perform cross-validation to ensure generalizability over data splits which we also think is a ‘reasonable’ (R2) decision. The paper introducing Cholec80 [12] also deployed 4-fold cross validation. In this work we increased the number of videos in the validation set from 8 to 12, as we believed that 8 videos is not a big enough sample size to find the models that generalize best to the test set. For the same reason we also increased the number of training videos from 32 to 48. Among those 48 training and 12 validation videos we performed 5-fold cross validation. Following all previous works [12,14,15,16] we kept an unseen test-set (in our case 20 videos) that was not part of the cross-validation and was used only for testing. In the test phase we used the best model of each cross-validation split and averaged the results of those 5 models on the unseen test set. All the results we report in this work, including the baselines, were created using the same split and cross-validation approach. We will make sure to discuss this data split and motivation in the final version. Our code will also be made publicly available to allow easy reproducibility.
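    The protocol described above can be sketched as follows. This is an illustrative reconstruction under the stated assumption of a fixed 20-video held-out test set and five 48/12 train/validation folds over the remaining 60 videos; the video indices are placeholders, since the actual split assignment is not given in the rebuttal:

    ```python
    # Hypothetical sketch of the described protocol on Cholec80 (80 videos).
    videos = list(range(80))
    test = videos[:20]   # unseen test set, never part of cross-validation
    dev = videos[20:]    # 60 videos used for 5-fold cross-validation

    folds = []
    for k in range(5):
        val = dev[k * 12:(k + 1) * 12]          # 12 validation videos per fold
        train = [v for v in dev if v not in val]  # remaining 48 for training
        folds.append((train, val))

    # At test time, the best model from each of the 5 folds is evaluated on
    # the same held-out 20 videos and the results are averaged.
    ```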

For backward compatibility (R2), we ran the original (32-8-40) data split of Cholec80, and the results are consistent: MTRCNet-CL: Accuracy (ACC): 84.53, Precision: 77.28, Recall: 78.10; OperA: ACC: 89.96, Precision: 82.75, Recall: 84.45.

2) In line with previous work published at MICCAI 2020 [16], which reports 4% worse ACC for MTRCNet [15] in its evaluation, we also find MTRCNet-CL to be 4% lower than reported in [15], as correctly indicated by MR1, R2, R4. For the reproduction of MTRCNet-CL, we used the authors’ code available at [G1] to train a model with our data split along with all hyperparameters in [15]. For the calculation of the metrics we followed the standard paradigm in [12]. Note that the results reported for ResLSTM are consistent with, and even higher than, the original paper [14] (original ACC in [14]: 86.4, reproduced ACC: 87.94). Hence, we believe that our reproduction is fair.

[G1] https://github.com/YuemingJin/MTRCNet-CL

3) As R2 mentioned, the F1 we reported is the harmonic mean of the Precision and Recall, calculated following [11,12]. This will be clarified in the final version.

4) We thank R2 for the valuable suggestion to improve the clarity of the normalized frame-wise attention in Section 2.2. R3 also commented on the clarity of this section. Hence, we will add the following clarification to the paper: A_ij quantifies how much attention is being paid by frame i (query) to frame j (key). M_ij is zero if the key index j is larger than the query index i, as we want to restrict our model to only consider previous events in the video: M_ij = 0 if j > i else 1.

We believe that our work is a valuable contribution to the MICCAI community: the visualisation of the highest and lowest attention frames, enabled by the proposed attention regularization extending the transformer architecture, opens a novel direction for surgical phase recognition.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal addressed the main concerns. The provided experimental details and justifications must be included in the camera ready. All minor comments must all be addressed in the camera ready.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    8



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper studies an interesting problem of video-based surgical phase recognition, the technical novelty is relatively limited and the improvement over the existing method is not significant. In the authors’ rebuttal, the response to 5-fold cross-validation is not convincing. I believe in cross-validation, in each fold, one should use a different test set. This major concern has not been adequately addressed in the rebuttal. While I see some merits such as the visualization of the highest and lowest attention frames, I think the paper still needs to be further improved before publication.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    12



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Agreeing with reviewer #4, the AC did not see sufficient technical novelty. The transformer architecture in Fig. 1 is an ad-hoc assembly, as described in Sec 2.1. The attention regularization in Eq. 3 is not significant enough. From the experiments in Table 2, the improvement (+2.21% acc, +0.45% F1 on Cholec80; +0.68% acc, +1.26% F1 on CSW) seems quite minor. The variances across the test folds are even higher than some of these improvement margins.

    All reviewers raised questions about the experimental comparisons. The same setup as the original papers should be adopted for a fair comparison. The results reported in this paper being inconsistent with the original papers is a concern. New cross-fold comparisons with re-implemented code could be added.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    12



Meta-review #4

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    ACs had conflicting remarks for this paper. Therefore, the PCs assessed the paper reviews including meta-reviews, the rebuttal and the submission in more detail. It seems that remaining concerns on cross-validation may be based on a misunderstanding. PCs agree with the primary AC that the main concerns which were related to the cross-validation split and to the comparisons with state-of-the-art were adequately addressed in the rebuttal.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    -


