
Authors

Jiacheng Wang, Yueming Jin, Liansheng Wang, Shuntian Cai, Pheng-Ann Heng, Jing Qin

Abstract

Performing a real-time and accurate instrument segmentation from videos is of great significance for improving the performance of robotic-assisted surgery. We identify two important clues for surgical instrument perception, including local temporal dependency from adjacent frames and global semantic correlation in long-range duration. However, most existing works perform segmentation purely using visual cues in a single frame. Optical flow is just used to model the motion between only two frames and brings heavy computational cost. We propose a novel dual-memory network (DMNet) to wisely relate both global and local spatio-temporal knowledge to augment the current features, boosting the segmentation performance and retaining the real-time prediction capability. We propose, on the one hand, an efficient local memory by taking the complementary advantages of convolutional LSTM and non-local mechanisms towards the relating reception field. On the other hand, we develop an active global memory to gather the global semantic correlation in long temporal range to current one, in which we gather the most informative frames derived from model uncertainty and frame similarity. We have extensively validated our method on two public benchmark surgical video datasets. Experimental results demonstrate that our method largely outperforms the state-of-the-art works on segmentation accuracy while maintaining a real-time speed.
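As a rough illustration of the active global memory idea described in the abstract, the following Python sketch admits a frame only when the model is confident about it and it differs enough from the most recently stored frame. Everything here is an assumption made for illustration and not the authors' implementation: the function names `cosine` and `select_global_frames`, the use of flat feature vectors with cosine similarity, and the thresholds `alpha` and `beta` taken as means over all frames.

```python
import math
from statistics import mean


def cosine(a, b):
    """Cosine similarity between two flat feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def select_global_frames(features, confidences):
    """Return indices of frames admitted to a (hypothetical) global memory.

    features:    one flat feature vector per frame, in temporal order
    confidences: one scalar per frame, higher meaning more confident
    """
    # Assumed thresholds: mean confidence, and mean similarity between
    # consecutive frames. Both choices are illustrative only.
    alpha = mean(confidences)
    consecutive = [cosine(features[i - 1], features[i])
                   for i in range(1, len(features))]
    beta = mean(consecutive) if consecutive else 1.0

    memory = []  # indices of admitted frames, in temporal order
    for t, (feat, r) in enumerate(zip(features, confidences)):
        if r < alpha:
            continue  # skip frames the model is unsure about
        if memory and cosine(features[memory[-1]], feat) > beta:
            continue  # too similar to the latest stored frame
        memory.append(t)
    return memory


# Toy example: frames 0/1 look alike, frames 2/3 look alike,
# and frame 3 has low confidence.
toy_features = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
toy_confidences = [0.9, 0.95, 0.8, 0.2]
selected = select_global_frames(toy_features, toy_confidences)
```

In this toy run the near-duplicate frame 1 and the low-confidence frame 3 are skipped, so only frames 0 and 2 are kept.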

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87202-1_33

SharedIt: https://rdcu.be/cyhQv

Link to the code repository

https://github.com/jcwang123/DMNet

Link to the dataset(s)

N/A

Reviews

Review #1

Please describe the contribution of the paper
The authors propose a novel architecture for incorporating temporal information into a semantic segmentation scheme for surgical instruments. They propose several architectural features:
- ConvLSTM to aggregate features computed for prior frames
- An attention mechanism to process local features
- A global memory mechanism to store feature maps from confident frames that are more distant in time
- Global and local features are combined in attention mechanism to produce final segmentation
- Comparative evaluation on a range of recent segmentation works, showing a slight improvement in performance and a fast runtime.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The method is novel, and the authors explain their design in great detail which is appreciated. The evaluation and ablation study demonstrate that the method proposed does improve results over some prior methods and over the baseline architecture. This evaluation is done on public datasets.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The main weakness of the paper is that the motivation behind the design doesn’t really convince me. I don’t see why temporal information should be so helpful for semantic segmentation of known objects, and there also isn’t really any prior work using this type of data in the mainstream CV community to fall back on. The authors reference TDNet as also using temporal information, but this logic does not really follow: TDNet composes features from shallow networks computed on previous frames as an approximation of a deeper network at the current frame, and also includes a compensation for geometric deformation. It does this for performance reasons, arguing that you don’t need to compute full feature maps at every frame.

The results do show an improvement over prior methods; however, these prior methods were mostly re-implemented by the authors. Although the authors cannot be faulted when public code is not available, it does weaken the comparison, as the selection of hyperparameters or small errors in the implementation can have large effects on the results. Additionally, I think the evaluation would have been much stronger if the same instrument-part metrics used in the EndoVis 2018 challenge had been reported, so that a direct comparison could have been made against the methods in that challenge.

The ablation study is also only performed on EndoVis 2017, which is less convincing than if it had been performed on both the 2017 and 2018 datasets.

I also don’t think the comparison with TDNet is entirely fair. The frame rate in the MICCAI datasets is far lower than the typical frame rates in the original paper, which would understandably impact TDNet’s performance since it wasn’t designed to run on subsampled video.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors state that code will be released after publication, and the benchmarking is done on public datasets, so the method should be mostly reproducible. The comparison methods were re-implemented by the authors and should also be released for a full comparison and for verification of their correctness.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

The most impactful improvement the authors could make to this work would be to provide a closer comparison to the EndoVis 2018 results. These are more modern and powerful methods than the UNet used in TernausNet and would provide a better benchmark. Re-implementations of other papers’ methods are not very convincing.
Please state your overall opinion of the paper

borderline accept (6)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

I don’t feel strongly about accepting or rejecting this paper. It was the best paper I reviewed, but I don’t think the solution adds much to the state of the art, and I don’t expect much follow-on work using this type of technique. The method is not very elegant and doesn’t follow a very clear logic as to why it should improve things. That being said, it does have better results than some other recently published work.
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

3
Reviewer confidence

Confident but not absolutely certain

Review #2

Please describe the contribution of the paper

The manuscript addresses the problem of surgical-tool segmentation in endoscopic videos relying on spatio-temporal analysis.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper is relatively well written
- Several experiments are performed
- Figures and tables are clear and useful
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- The state of the art is not comprehensive. There is work on surgical-tool analysis in endoscopic videos using spatio-temporal features that is not discussed.
- It is not clear why the authors select frames randomly from Gt for aggregation.
- No ablation study relevant to tau and n is performed.
- The choice of the comparison methods is not fully clear (e.g., why was [9] chosen?)
- The description of the ablation study could be improved to be clearer.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

An effort could be made to add more methodological details and foster reproducibility.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
- The state of the art and contribution section could be summarized to convey the information in a clearer and more structured way. This would leave more space for the discussion section.
- Fig. 1 should be improved to be clearer. For example, why is there only one frame under “global frames”? All the symbols in the figure should be defined.
- The motivation behind the choice of the comparison methods should be justified better.
- From Table 2a, it seems that the major contribution is given by ELA. A discussion on this is missing.
- What is the model without ELA and AGA?
Please state your overall opinion of the paper

Probably accept (7)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Despite the manuscript bringing some novelties, the experimental analysis and the discussion of the results should be improved. Some methodological choices (especially for the AGA part) are not justified appropriately.
What is the ranking of this paper in your review stack?

2
Number of papers in your stack

5
Reviewer confidence

Very confident

Review #3

Please describe the contribution of the paper

This paper proposed a novel approach for real-time instrument segmentation of robotic surgical video by introducing a dual-memory network (DMNet) - a local memory by taking the complementary advantages of convolutional LSTM and non-local mechanisms towards the relating reception field, and a global memory to gather the global semantic correlation in long temporal range. By utilizing both global and local spatio-temporal knowledge to augment the current features, the paper reports improved segmentation performance and capability of the real-time segmentation on two public data sets: EndoVis17 Instrument Challenge and EndoVis18 Scene Segmentation Challenge.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The goal of surgical instrument segmentation is to yield a segmentation map from each frame of the video in real time, and this is an important component to facilitate robot or human manipulation. This paper proposed a novel dual-memory architecture utilizing local temporal dependence and global semantic information to enable holistic aggregation in a real-time setting. The architecture design is novel and the rationale driving the design is intuitive from the perspective of human perception.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
In this paper the author tried to address the problem of real-time surgical instrument segmentation. The offline experiment results look promising on the EndoVis data sets. The paper would be further enhanced if the author could show results from one of the following:
1. A real-world application using the proposed mechanism to show the real benefit of the proposed method;
2. A dataset that is different from EndoVis, to show the generalizability of the proposed method.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The implementation details are in good form to ensure the reproducibility of the experimental results in the paper. The paper states that “Codes will be publicly available”, which will expedite reproducing the results and facilitate future research in related fields.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
This paper discussed an interesting and relevant topic: real-time surgical instrument segmentation, and proposed a novel approach to utilize both local and global information to achieve a tradeoff between computational efficiency and segmentation accuracy. In general the paper is in good form and the rationale of the proposed method is clearly written.

The reviewer has a few questions:
1. In the proposed active global aggregation criteria (page 5), is the prediction confidence r a negative number? The same notation was later defined as “prediction entropy”, which is missing a negative sign compared to the conventional definition of “entropy” in information theory. Could the authors please clarify the criterion?
2. Was the global memory built in temporal order? How was the latest queued feature map able to provide enough information to promote diversity?
3. How was the size of the local and global memory determined? Why are the parameters alpha and beta set to the averages of r and s over all frames?
4. In Table 1, in the comparison between the proposed DMNet and TDNet [9], why does the proposed method have lower FLOPS but a higher runtime?
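The sign question in point 1 can be made concrete with a small Python sketch. It compares the conventional Shannon entropy, H = -sum_c p_c log p_c, with the same sum written without the leading minus sign. The function names `entropy` and `unsigned_sum` and the two-class toy probabilities are assumptions chosen for illustration; the paper's exact definition of r is not reproduced here.

```python
import math


def entropy(probs):
    """Conventional Shannon entropy: H = -sum_c p_c * log(p_c), always >= 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def unsigned_sum(probs):
    """The same sum without the leading minus sign, always <= 0."""
    return sum(p * math.log(p) for p in probs if p > 0.0)


confident = [0.99, 0.01]   # model is nearly certain about this pixel
uncertain = [0.50, 0.50]   # model cannot decide between the two classes

h_conf, h_unc = entropy(confident), entropy(uncertain)
r_conf, r_unc = unsigned_sum(confident), unsigned_sum(uncertain)
```

With the minus sign, the value is non-negative and grows with uncertainty. Without it, the value is non-positive and grows with confidence, which is one possible reading under which a quantity defined this way could be called a "confidence" while still being a negative number.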
Please state your overall opinion of the paper

accept (8)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The topic of the paper is interesting and very relevant to CAI. The method proposed in the paper is novel and based on clear intuitions. The experiments are well designed and the experimental results look promising. This paper is clearly structured and well written.
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

2
Reviewer confidence

Confident but not absolutely certain

Review #4

Please describe the contribution of the paper

This paper proposes a dual-memory network (DMNet) by holistically and efficiently aggregating spatio-temporal knowledge for instrument segmentation from surgical videos. The model incorporates RNN, self-attention mechanisms, and an active learning strategy to capture local long-range information and global longer-span content. The method is validated in two public benchmark surgical video datasets and obtained superior performance.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

a. Develops a dual-memory network (DMNet) with RNN, self-attention mechanisms and an active learning strategy to capture local long-range information and global longer-span content. b. Proposes Efficient Local Aggregation (ELA) and Active Global Temporal Aggregation (AGA), which boost the model performance. c. The model is well validated and obtains superior performance over existing instrument segmentation models for surgical videos.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

a. The paper fails to discuss related work on dual memory networks for video segmentation or detection. The authors should clarify the novelty of their dual memory network over existing dual memory networks such as [R1, R2, R3].

b. The authors propose an efficient local aggregation (ELA) module, but memory enhanced global-local aggregation has already been proposed in [R3]. That paper should be cited, and the differences in aggregation should be specified.

c. The Methodology section seems to formulate all the basics instead of only the novelties. The basic formulation could be explained at the beginning in a preliminaries/background section. It is hard to distinguish what is novel from what is an existing technique.

d. How was the test data performance obtained, given that annotation for the test set was not provided?

e. How is the model validated without annotation of the test set if all the provided training sets are used in training? How was the best epoch chosen?

References:
[R1] Zhang, Kaihua, et al. “Dual Temporal Memory Network for Efficient Video Object Segmentation.” Proceedings of the 28th ACM International Conference on Multimedia. 2020.
[R2] Shi, Zhenmei, et al. “DAWN: Dual Augmented Memory Network for Unsupervised Video Object Tracking.” arXiv preprint arXiv:1908.00777 (2019).
[R3] Chen, Yihong, et al. “Memory enhanced global-local aggregation for video object detection.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The datasets are publicly available and most of the training information is provided.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

a. The authors should clarify the novelty of their dual memory network over existing dual memory networks. b. It is hard to distinguish what is novel from what is an existing technique in the method section. c. The validation process requires clarification. Validation performance could also be provided.
Please state your overall opinion of the paper

borderline accept (6)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

a. Develops a novel dual-memory network (DMNet) with Efficient Local Aggregation (ELA) and Active Global Temporal Aggregation (AGA), which boost the model performance.

b. A clear explanation is required to highlight the novelties of the proposed dual memory network over existing dual memory networks.

c. The model is well validated and outperforms existing methods.
What is the ranking of this paper in your review stack?

2
Number of papers in your stack

7
Reviewer confidence

Very confident

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.
The reviewers agree that the proposed method for semantic segmentation of surgical instruments in endoscopic videos is interesting and well explained. In particular, the key strengths are the idea of integrating spatio-temporal information in a deep learning architecture and the improved performance shown via the ablation study on public datasets. However, several issues were raised by the reviewers, and I would like to encourage the authors to integrate this feedback into the paper. In particular:
- Please include the metrics that were used in EndoVis 2018 to allow for a fair comparison
- Clarify the concern raised regarding the comparison to TDNet and other state-of-the-art methods that include spatial-temporal information
- Experimental analysis and discussion of the results could be improved
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

4

Author Feedback

N/A


Efficient Global-Local Memory for Real-time Instrument Segmentation of Robotic Surgical Video