
# Authors

Ge-Peng Ji, Yu-Cheng Chou, Deng-Ping Fan, Geng Chen, Huazhu Fu, Debesh Jha, Ling Shao

# Abstract

Existing video polyp segmentation (VPS) models typically employ convolutional neural networks (CNNs) to extract features. However, due to their limited receptive fields, CNNs cannot fully exploit the global temporal and spatial information in successive video frames, resulting in false positive segmentation results. In this paper, we propose the novel PNS-Net (Progressively Normalized Self-attention Network), which can efficiently learn representations from polyp videos with real-time speed (~140fps) on a single RTX 2080 GPU and no post-processing. Our PNS-Net is based solely on a basic normalized self-attention block, dispensing with recurrence and CNNs entirely. Experiments on challenging VPS datasets demonstrate that the proposed PNS-Net achieves state-of-the-art performance. We also conduct extensive experiments to study the effectiveness of the channel split, soft-attention, and progressive learning strategy. We find that our PNS-Net works well under different settings, making it a promising solution to the VPS task.

SharedIt: https://rdcu.be/cyhLG


# Reviews

### Review #1

• Please describe the contribution of the paper

This paper proposes the PNS-Net (Progressively Normalized Self-attention Network), which can efficiently learn representations from polyp videos at real-time speed. An attention mechanism is effectively introduced for polyp segmentation.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The main strength of this paper is the introduction of a self-attention mechanism to segment colonic polyp regions from colonoscopic video images. The simple normalized self-attention block is a new idea.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The proposed method was evaluated using only public databases. I think this is OK for segmentation performance comparison, but I would expect more real data to be used for evaluation to reflect real clinical problems. However, this comment does not affect my overall decision.

• Please rate the clarity and organization of this paper

Very Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Detailed parameters are shown in the supplemental file. I think reproducibility is fine.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

This paper shows a method to segment polyp regions from video colonoscopic images. The paper is well written and clear to understand.

1. My clinical-side question is how you treat the vague boundary of a polyp. How can you segment flat or concave polyp regions from videos? How is the segmentation performance for such polyps? By the nature of medical imaging, the boundary cannot be defined clearly: if you ask several medical doctors, they return different answers. How do you treat such cases?

2. 'Progressive' NS: Can you demonstrate the effect of 'Progressive' NS?

3. Illustration of the NS block: I think the arrow of X is reversed in the magnified illustration of the NS block. I think that is a simple mistake in the drawing.

Probably accept (7)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The method and the results shown in the paper are very convincing. The technical background is OK. These are the basis of my judgement.

• What is the ranking of this paper in your review stack?

2

• Number of papers in your stack

5

• Reviewer confidence

Very confident

### Review #2

• Please describe the contribution of the paper

A novel Progressively Normalized Self-attention Network is proposed, for video polyp segmentation in polyp videos in real time. The critical component is a normalized self-attention block to learn efficient spatio-temporal representations of polyps. Experiments on multiple datasets, reporting on multiple metrics and good ablation tests are presented.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

Interesting training protocol: pre-training on still images + video frames, after removing the normalized self-attention block, then fine-tuning with the NS block and video data. Very impressive results on multiple metrics and datasets, different SOTA comparisons, and good ablation experiments.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The entire subsection 2.1 on normalized self-attention is math heavy, with no intuitive explanation of the approach. This is the only section that I would recommend rewriting: it is very dense and gives the reader no help on why it makes sense and is worth a novelty tick.

• Please rate the clarity and organization of this paper

Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Appears to be reproducible

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

see section 4

Probably accept (7)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

see section 3

• What is the ranking of this paper in your review stack?

2

• Number of papers in your stack

2

• Reviewer confidence

Confident but not absolutely certain

### Review #3

• Please describe the contribution of the paper

This paper introduces a video polyp segmentation model, PNS-Net. It contains (1) a normalized self-attention (NS) block, consisting of non-local attention with channel splits, restricted attention capturing spatial-temporal relationships, and a normalized query Q, and (2) a progressive learning strategy to maintain high-level semantic features from the proposed NS blocks. Detailed quantitative, qualitative, and ablative studies are given on 3 datasets. The proposed method achieves state-of-the-art performance, including a much higher inference speed of 140 fps vs. 108 fps.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

(1) The proposed method is new in this task. The authors adopt and improve the widely used self-attention mechanism for this video segmentation task with the two proposed mechanisms, and prove its effectiveness through ablative studies and comparisons with previous methods.

(2) The proposed method not only achieves state-of-the-art performance on segmentation measurements, e.g., IoU, but also achieves a faster speed of 140 fps. This could be beneficial for other similar video segmentation tasks, especially in the medical domain where training data is usually limited.

(3) The paper is well written and demonstrated clearly.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

(1) Channel split vs. multi-head. One of the contributions introduced in this paper is the channel split mechanism. It is very similar to the multi-head attention originally proposed in the Transformer paper (cited in this paper as [21]). The authors should discuss the difference between them.

(2) Query-dependent rule: How is the kernel size k selected? The authors mention that various k are selected, but in the paper only k=3 is used and there are no ablative studies for k. The authors may include an ablative study on k.

(3) How long does it take to train the model including pre-training and fine-tuning?

• Please rate the clarity and organization of this paper

Very Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors mentioned that 'all the training data, models, results, and evaluation tools will be released.'

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

For the pre-training part, a common approach in video segmentation in natural images is to train the Non-local blocks as well as the backbone, e.g. STM, Video Object Segmentation using Space-Time Memory Networks by Wug Oh et al. The authors may give it a try.

Strong accept (9)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The method proposed in this paper is new in this task. Qualitative and quantitative results show the effectiveness of the proposed method.

• What is the ranking of this paper in your review stack?

2

• Number of papers in your stack

5

• Reviewer confidence

Very confident

# Primary Meta-Review

• Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The paper proposes a new method for polyp segmentation based on self-attention network. The paper is well-written, experimental details and comparison are thorough and convincing showing the effectiveness of the proposed method. All minor comments raised by the reviewers should be addressed in the camera-ready.

• What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

1

# Author Feedback

We thank the meta-reviewer for handling our submission and the reviewers for their unanimously positive ratings (9, 7, 7). We are encouraged that they find our model technically sound (R1), our training protocol interesting (R2), and our paper well written with convincing results (R1, R4). We respond (Re) to each comment below.

Reviewer #1 (R1) Q1. Real data for a real clinical problem. Re. Thanks. We plan to construct a large-scale, densely annotated dataset for the VPS task in the journal extension, containing diverse clinical data from multiple centers.

Q2. How to treat vague boundaries and flat or concave polyp regions? How is the performance? Re. We address these problems by mining spatial-temporal correlations; thus, unnoticeable/concealed polyps are detected through their movement patterns. We achieve 0.801 Dice for vague (5 clips) and 0.846 Dice for flat (3 clips) cases.

Q3. Different medical doctors return different answers. Re. In general medical datasets, each polyp is labeled by multiple clinicians, and the final mask is determined by voting. Our model is trained using the standard ground-truth labels provided by the public datasets.

Q4. Effectiveness of 'Progressive' NS. Re. We have discussed the effectiveness of 'progressive' in Sec. 3.3 (last part) and Table 2 (#7, #8, #9). The concept of 'progressive' is equivalent to a re-optimization process (coarse-to-fine).

Reviewer #2 (R2) Q1. No intuitive explanation of the approach. Re. We apologize for the confusion caused by page limitations. Per your suggestion, we will add more intuitive visualizations to our journal version as well as the website.

Reviewer #4 (R4) Q1. Channel split vs. multi-head. Re. They share the same spirit to an extent. Indeed, multi-head attention in the Transformer is implemented via channel split. This operation can ensemble different attention regions in the network, because different heads may focus on different regions in the feature map.
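This equivalence can be illustrated with a minimal NumPy sketch (an illustration only, not the authors' implementation; the function names are hypothetical): multi-head attention is realized by splitting the channel dimension, with each head attending within its own channel slice.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def split_head_attention(Q, K, V, num_heads):
    """Multi-head attention realized as a channel split:
    each head attends within its own slice of the channel dimension."""
    L, C = Q.shape
    d = C // num_heads                     # channels per head
    out = np.empty_like(Q)
    for h in range(num_heads):
        s = slice(h * d, (h + 1) * d)      # this head's channel slice
        attn = softmax(Q[:, s] @ K[:, s].T / np.sqrt(d))  # (L, L) per-head affinity
        out[:, s] = attn @ V[:, s]
    return out
```

With `num_heads=1` this reduces to plain single-head attention; with more heads, each head forms its own affinity map, which is what allows different heads to attend to different regions.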

Q2. How is the kernel size k selected? Re. We adopt dilated convolution with a 3x3 kernel size due to the performance-efficiency trade-off. The motivation is that a large kernel with large dilation rates may damage the integrity of the spatial-temporal representation. In our preliminary attempts, this degraded the performance and increased the computational burden. Our journal extension will further investigate the synergy between the kernel size and dilation rate.
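For intuition on this trade-off, the spatial extent covered by a dilated kernel follows standard receptive-field arithmetic, k + (k - 1)(d - 1). A small illustrative sketch (not code from the paper):

```python
def dilated_extent(kernel_size: int, dilation: int) -> int:
    """Spatial extent spanned by one dilated convolution:
    effective size = k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

# A 3x3 kernel stays compact even with moderate dilation...
print(dilated_extent(3, 1))  # 3
print(dilated_extent(3, 2))  # 5
# ...while a larger kernel with a large dilation rate spans a wide,
# sparsely sampled neighborhood, which can fragment local detail.
print(dilated_extent(7, 4))  # 25
```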

Q3. The pre-training/fine-tuning time. Re: It takes about 10 hours (pre-training) on Kvasir and the positive part of ASU-Mayo, and 35 minutes (fine-tuning) on the training sets of CVC-300 and CVC-612, respectively.

Q4. The improvement of the pre-training strategy. Re: We will be happy to try this training strategy with our model. We will cite this work in the final version.