
Authors

Jinglu Zhang, Yinyu Nie, Jian Chang, Jian Jun Zhang

Abstract

Automatic surgical instruction generation is a prerequisite for intra-operative context-aware surgical assistance. However, generating instructions from surgical scenes is challenging, as it requires jointly understanding the surgical activity of the current view and modelling relationships between visual information and textual description. Inspired by neural machine translation and image captioning tasks in the open domain, we introduce a transformer-backboned encoder-decoder network with self-critical reinforcement learning to generate instructions from surgical images. We evaluate the effectiveness of our method on the DAISI dataset, which includes 290 procedures from various medical disciplines. Our approach outperforms the existing baseline on all caption evaluation metrics. The results demonstrate the benefits of the transformer-backboned encoder-decoder structure in handling multimodal context.
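
To make the setup concrete, here is a minimal sketch of such a transformer-backboned encoder-decoder captioner in PyTorch; the layer sizes, class name, and the 2048-d CNN-feature projection are illustrative assumptions, not the paper's exact architecture:

    import torch
    import torch.nn as nn

    class CaptionTransformer(nn.Module):
        # CNN region features feed a transformer encoder; the decoder attends
        # to them while generating instruction tokens autoregressively.
        # (Positional encodings are omitted for brevity.)
        def __init__(self, vocab_size, d_model=512):
            super().__init__()
            self.visual_proj = nn.Linear(2048, d_model)  # project CNN features
            self.embed = nn.Embedding(vocab_size, d_model)
            self.transformer = nn.Transformer(
                d_model, num_encoder_layers=3, num_decoder_layers=3,
                batch_first=True)
            self.head = nn.Linear(d_model, vocab_size)

        def forward(self, feats, tokens):
            src = self.visual_proj(feats)  # (B, regions, d_model)
            tgt = self.embed(tokens)       # (B, T, d_model)
            mask = self.transformer.generate_square_subsequent_mask(
                tokens.size(1)).to(tokens.device)
            out = self.transformer(src, tgt, tgt_mask=mask)
            return self.head(out)          # next-token logits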

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87202-1_28

SharedIt: https://rdcu.be/cyhQq

Link to the code repository

https://github.com/lulucelia/surgical-instruction

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper describes an application of transformer neural networks coupled with reinforcement learning for surgical video captioning, trained and evaluated on the DAISI public dataset. The authors show that transformer networks outperform state-of-the-art video captioning techniques in the context of surgical videos on all tested metrics (BLEU, CIDEr, METEOR, ROUGE-L and SPICE). The results open new research avenues in surgical video understanding.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Very well written paper, with clear explanations of the approach and results
    • To the best of my knowledge, this work is one of the first successful applications of transformer neural networks in surgical video captioning
    • Very comprehensive evaluation of the algorithm against various state-of-the-art approaches (Bi-RNN, LSTM, LSTM + soft-attention), using 5 different types of metrics
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    From a methodological standpoint, the paper mostly adapts a transformer network with reinforcement learning-based training, an approach already published in the computer vision domain, to the use case of surgical videos. Nonetheless, this is not necessarily a limitation: transferring AI techniques from one domain to another often requires a significant amount of work even when the overall methodological concept stays the same, and the results are worth presenting.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors will provide the source code should the paper be accepted. The algorithm is also trained and tested on a public dataset. Training parameters are detailed. Results should therefore be reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • Introduction and methods are well written and clear
    • Section 3.1: ++ Based on which criteria did the authors curate the database? Who did the curation, was she an expert? Do the authors plan to share the curated database? ++ Is the data split per video or per image? Is the split evenly distributed throughout video type? ++ Typos: “upon original authors” -> “upon request ..”; “is consisted” -> “consists”;
    • Results are convincing and comprehensive. The discussion on current limitations is appreciated.
  • Please state your overall opinion of the paper

    strong accept (9)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is very well written. The method is sound. The experiments are well defined and executed, with comprehensive evaluation against state-of-the art algorithms. The paper clearly demonstrates the potential of transformer networks for surgical video captioning. For all these reasons, I recommend acceptance of the paper.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    2

  • Reviewer confidence

    Confident but not absolutely certain



Review #2

  • Please describe the contribution of the paper

    The manuscript presents a novel structure for surgical instruction prediction. The proposed approach includes a transformer-backboned encoder-decoder network with self-critical reinforcement learning.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The novel application of an encoder-decoder structure fully backboned by transformers.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The clinical contribution of the paper is not well explained.
    2. It is unclear how some main parameters are defined.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    A publicly available dataset is used, but the range of the main parameters in the tuning process is not mentioned.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    1. There are some grammatical issues in the text; for example, this sentence in the Introduction section is confusing: “One of the earliest medical report generation works is based on natural language is [12]”. Furthermore, the complete form of most abbreviations is not presented.
    2. Among all the available structures, what was the reason for using the ResNet-101 module for visual feature extraction? Please clarify this.
    3. In section 3.1, it is unclear whether the images of each procedure were exclusively tied to separate sets.
    4. In sections 3.1 and 3.2, only fixed values of the parameters are mentioned. Did you perform any hyperparameter tuning, or did you follow an empirical approach to set these values?
    5. Section 4.1 is a little confusing. Why did you need to re-implement the code of reference 20? As I understood it, you used the same dataset as reference 20. Why can’t you compare your results with their presented results? Please clarify this.
    6. As there is no information regarding the computation time, there is no evidence of whether the proposed approach is efficient or not.
    7. In section 4.3, the small dataset is listed as one of the limitations. I understand the explanation, but this is one of the main drawbacks of all deep-learning-based approaches. In the medical field, having an acceptable number of images is always challenging, but I think a dataset with 16,000 images is sufficient.
  • Please state your overall opinion of the paper

    probably reject (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the paper presents an interesting method, the argument for deploying such a system in a real clinical environment is weak. No information about the computation time is included in the manuscript. Hence, it is difficult to evaluate the efficiency of the proposed approach in the clinical field.

  • What is the ranking of this paper in your review stack?

    3

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain



Review #3

  • Please describe the contribution of the paper

    The article introduces a transformer-backboned encoder-decoder network with self-critical reinforcement learning to automatically caption surgical images. The technique is validated with the DAISI dataset and compared against the state-of-the-art techniques with ablation studies.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed method is novel and the validation is very thorough.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Although the proposed method is superior metric-wise, some image caption results show that the caption may interpret the image in different directions (e.g., the needle cap image), and this may not be fully captured by the metrics provided. However, it is a complex and challenging problem to solve.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    There are sufficient details for reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    1. A few abbreviations were not properly introduced before use, such as CIDEr, SPICE, BLEU, and METEOR. Although references were added, it would be good to briefly describe what they measure.

    2. The word “prediction” is used in the title and article. However, as the algorithm is not foreseeing the next action of surgical tasks, it is just image captioning, not “prediction”. This should be revised accordingly.

  • Please state your overall opinion of the paper

    strong accept (9)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Novel algorithm, good results, and thorough validation.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    4

  • Reviewer confidence

    Confident but not absolutely certain




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The reviewers commented on the clinical significance of this work, its novelty, and its comprehensive experiments. One reviewer had some concerns over clarity issues.

    Strengths:

    • One of the first successful applications of transformer neural networks in surgical video captioning
    • Well-written paper, with clear explanations of the approach and results
    • Comprehensive evaluation of the algorithm against various state-of-the-art approaches (Bi-RNN, LSTM, LSTM + soft-attention), using 5 different types of metrics

    Weaknesses:

    • Some details need to be clarified (see Reviewer #1 and Reviewer #2 comments)
    • The explanation of the clinical contribution of this paper can be improved
    • Some typos and grammatical issues need to be fixed (see Reviewer #1 and #2 comments)
  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    2




Author Feedback

We thank the reviewers for their valuable reviews, and we are happy that they found our work “one of the first successful applications of transformers” (R1, AC), a “novel method” (R2, R3), one that “open[s] new research avenues” (R1), “well-written” (R1, AC), and backed by a “thorough and comprehensive evaluation” (R1, R3, AC).

Contributions. We provide a transformer-backboned architecture to generate instructions from surgical images. The encoder-decoder framework jointly encodes the surgical content of the current view and perceives the dependencies between spatial information and its corresponding text description. In addition, we apply self-critical reinforcement learning to improve model performance after the general cross-entropy (XE) training. The evaluation results demonstrate the ability of transformers to handle multi-modal context while introducing new research avenues in surgical video understanding.
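
For illustration, a minimal sketch of a self-critical objective of this kind (following Rennie et al.'s self-critical sequence training; the function and tensor names are illustrative, not the implementation used in the paper):

    import torch

    def self_critical_loss(sample_logprobs, sample_reward, greedy_reward, mask):
        # sample_logprobs: (B, T) log-probabilities of the sampled caption tokens
        # sample_reward / greedy_reward: (B,) sentence-level scores, e.g. CIDEr
        # mask: (B, T) 1.0 for real tokens, 0.0 for padding
        # The greedily decoded caption's reward acts as the baseline.
        advantage = (sample_reward - greedy_reward).unsqueeze(1)
        return -(advantage * sample_logprobs * mask).sum() / mask.sum()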

Potential clinical contribution (R2). The potential clinical contribution is two-fold: 1. There has been growing interest in building context-aware surgical systems that utilize the information available inside the operating room (OR) to provide clinicians with contextual support at the appropriate time. Understanding the surgical activity and generating its description is a prerequisite for building such a system. 2. Providing intra-operative surgical assistance is imperative when on-site mentoring is unavailable or a rare case is encountered. Previously, this could only be achieved by telementoring, which exchanges medical information through video and audio in real time. We will add more details to explain the clinical contribution in the final version.

Data selection criteria (R1). According to the original authors [20], the image-text pairs in the dataset were compiled from the input of expert surgeons at five different medical facilities. The dataset is available upon request from the original authors.

Data split (R1, R2). The dataset includes 16413 color images from 290 surgical procedures, and some types of surgical procedures have only one sample due to the limited dataset scale. Therefore, we split the data per image, with a train/validation/test split of 8:1:1. We will make this clearer in the revised version.
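
For illustration, a per-image 8:1:1 split could look as follows (the helper and seed are hypothetical, not taken from the released code):

    import random

    def split_indices(n_images, seed=0):
        # Shuffle image indices, then cut 80% / 10% / 10% for train/val/test.
        idx = list(range(n_images))
        random.Random(seed).shuffle(idx)
        n_train, n_val = int(0.8 * n_images), int(0.1 * n_images)
        return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

    train_idx, val_idx, test_idx = split_indices(16413)  # dataset size stated above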

Small dataset (R2). We regard the DAISI dataset as small for two reasons. First, the dataset contains 290 surgical procedures, while some types of procedures appear only once. Second, surgical instruction generation is a multi-modal problem that relates visual content, text, and the relationship between them; the solution space is therefore much larger than in other tasks (e.g. classification, segmentation).

ResNet-101 for visual feature extraction (R2). We chose ResNet-101 to extract the visual features because it delivers representative deep features and strong signals without vanishing-gradient issues for downstream tasks [7, 19].
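
For concreteness, a common way to obtain such features with torchvision (a sketch assuming an ImageNet-pretrained trunk with the classifier head removed; not necessarily the exact setup used in the paper):

    import torch
    from torchvision import models

    # ResNet-101 without its average-pooling and classification layers,
    # leaving a spatial grid of 2048-d deep features per image.
    backbone = models.resnet101(pretrained=True)  # older torchvision API; newer versions use weights=
    extractor = torch.nn.Sequential(*list(backbone.children())[:-2])
    extractor.eval()

    with torch.no_grad():
        feats = extractor(torch.randn(1, 3, 224, 224))  # -> (1, 2048, 7, 7)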

Code re-implementation (R2). We cleaned the original dataset of [20] by removing noisy and incorrect image-text pairs, so a new benchmark is required. We re-implemented the code of [20] because their code is not publicly available.

Hyperparameter tuning (AC, R2). We follow the hyperparameter settings of related works [8, 19]. For our layer specification and training hyperparameters (e.g. learning rate and number of epochs), we tune them based on the results on the validation set.

Computation time (R2). We trained all the models with a single NVIDIA GTX 1080 graphics card. The training process takes around 30 hours (30 epochs with the general XE loss, followed by 30 epochs of reinforcement learning). We will clarify this in the final version.

Abbreviation introduction (R2, R3). Thanks for the corrections. We will introduce the abbreviations before using them in our revised version.

Typos and grammatical issues (AC, R1, R2, R3). We will carefully check and fix the typos and grammatical issues in our revised version.


