
Authors

Lingyun Wu, Zhiqiang Hu, Yuanfeng Ji, Ping Luo, Shaoting Zhang

Abstract

Precise localization of polyps is crucial for early cancer screening in gastrointestinal endoscopy. Endoscopy videos provide richer contextual information than still images but also pose greater challenges. The camera-moving situation, as opposed to the common camera-fixed-object-moving one, leads to significant background variation between frames. Severe internal artifacts (e.g., water flow in the human body, specular reflection by tissues) can make the quality of adjacent frames vary considerably. These factors hinder a video-based model from effectively aggregating features from neighboring frames and giving better predictions. In this paper, we present Spatial-Temporal Feature Transformation (STFT), a multi-frame collaborative framework to address these issues. Spatially, STFT mitigates inter-frame variations in the camera-moving situation through feature alignment by proposal-guided deformable convolutions. Temporally, STFT proposes a channel-aware attention module to simultaneously estimate the quality and correlation of adjacent frames for adaptive feature aggregation. Empirical studies and superior results demonstrate the effectiveness and stability of our method. For example, STFT improves the still-image baseline FCOS by 10.6% and 20.6% on the comprehensive F1-score of the polyp localization task in the CVC-Clinic and ASUMayo datasets, respectively, and outperforms the state-of-the-art video-based method by 3.6% and 8.0%, respectively. Code is available at https://github.com/lingyunwu14/STFT.
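To make the temporal side of the abstract concrete, here is a minimal PyTorch sketch of a channel-aware, attention-style aggregation across aligned support frames. The cosine-similarity scoring, tensor shapes, and function name are illustrative assumptions for exposition, not the authors' exact implementation.

    import torch
    import torch.nn.functional as F

    def channel_aware_aggregate(target_feat, support_feats):
        """Aggregate aligned support-frame features into the target frame.

        target_feat:   (C, H, W) feature map of the target frame.
        support_feats: (T, C, H, W) aligned feature maps of T support frames.
        """
        t = target_feat.flatten(1)                            # (C, H*W)
        s = support_feats.flatten(2)                          # (T, C, H*W)
        # Per-frame, per-channel cosine similarity to the target acts as a
        # quality/correlation score (assumption; the paper learns this module).
        sim = F.cosine_similarity(s, t.unsqueeze(0), dim=2)   # (T, C)
        weights = sim.softmax(dim=0)                          # normalize over frames, per channel
        return (weights[..., None, None] * support_feats).sum(dim=0)  # (C, H, W)

    # Usage: 10 support frames, 256 channels, 32x32 feature maps.
    out = channel_aware_aggregate(torch.randn(256, 32, 32),
                                  torch.randn(10, 256, 32, 32))

The softmax over the frame axis is the key design choice: low-quality neighbors (e.g., frames degraded by specular reflection) receive low per-channel weights and contribute little to the aggregated feature.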

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87240-3_29

SharedIt: https://rdcu.be/cyl53

Link to the code repository

https://github.com/lingyunwu14/STFT

Link to the dataset(s)

https://giana.grand-challenge.org/PolypDetection/

https://polyp.grand-challenge.org/AsuMayo/


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes to improve the localisation of polyps in colonoscopy videos by exploiting neighbouring frames through the proposed spatial-temporal feature transformation framework, which enables multi-frame collaboration between networks. The authors try to address unavoidable challenges in endoscopy videos such as occlusions and artefacts.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The motivation is clear; however, the problem is well known and this part could be shortened
    • The proposed proposal-guided spatial transformation to recover missed localisations is interesting
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Lacks comparison with the most recent SOTA detection methods. I would at least want to see YOLO-v5, RetinaNet, and PraNet
    • Standard PASCAL VOC detection metrics should be used. mAP at different IoU thresholds and a performance comparison across different polyp sizes are missing.
    • Are the training splits the same for each method? This is not clear, so it is not clear whether the comparisons are valid.
    • What happens when the target and support frames are completely different?
    • Inference time is not reported. Real-time performance is an essential requirement for the clinical usability of these networks.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • Please provide a clear train-val and test split in experiments
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • The introduction is too lengthy. The authors could benefit from shortening it and reviewing more of the deep learning methods that have been used to tackle such problems, e.g.: 1) Understanding the effects of artifacts on automated polyp detection and incorporating that knowledge via learning without forgetting, arXiv, 2020. 2) Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy, Medical Image Analysis, 2021.
    • Both datasets are well explored. It would be interesting to incorporate more recent datasets.
    • Fig. 4 should include a ground truth row.
    • A generalisability test is required, as claimed in the conclusion. Such a test would be training on the CVC-Clinic data and testing on the ASUMayo dataset.
  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • SOTA comparison can be improved
  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    This paper presents an interesting Spatial-Temporal Feature Transformation to aggregate features effectively from neighboring frames. Spatially, it mitigates inter-frame variations in the camera-moving situation; temporally, its channel-aware attention module estimates the quality and correlation of adjacent frames for adaptive feature aggregation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Well written paper with good organization
    2. Interesting ablation study shows the ability of the proposed STFT
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. What is the range of neighboring frames chosen in your study? You only mention “sampling from a large range neighborhood”. I am also curious whether randomly selecting only 2 frames during training is enough to model the neighboring information.
    2. The comparison methods seem insufficient. You only compare to image-level object detection methods, which is an unfair comparison. Video-level object detection methods should be considered, such as “STEP: Spatio-Temporal Progressive Learning for Video Action Detection”.
    3. In the subsection “Effectiveness of Channel-Aware”, why compare to point-wise and channel-wise attention only separately? How do the results compare to their combination, as in the original paper?
    4. It would be great to visualize the attention from other frames/positions, i.e., to show how other frames/spatial positions contribute to the final detection with the help of the proposed STFT layer.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Seems reproducible: the paper uses public data and provides detailed implementation information.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    See 4

  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Interesting STFT for video-level polyp detection, but the paper lacks experiments demonstrating its effectiveness against existing spatial-temporal methods.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    2

  • Reviewer confidence

    Very confident



Review #3

  • Please describe the contribution of the paper

    This paper proposes a multi-frame framework with a proposal-guided spatial transformation to enhance the feature alignments and a channel-aware attention module for the task of polyp detection and localization in endoscopic videos.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • novel spatial-temporal feature alignments for polyp detection
    • good ablation study to show the effects of different components
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • the datasets used were relatively small
    • performance is not very good on one of the datasets evaluated
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper provides sufficient technical details for reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • For the temporal target alignments, will the module depend on the frame rate?
    • Related to the previous item: are the support frames uniformly sampled around the target frame? Would other motion-adapted sampling help?
    • The multiple support-frame inferences for a single target-frame prediction look quite heavy; are they really necessary, given that neighbouring frames share some similarity?
  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    this is a novel application of proposal-based detection and spatial-temporal feature utilisation.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    3

  • Reviewer confidence

    Very confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper proposes a novel method for polyp detection in endoscopic video but did not fully demonstrate its effectiveness compared to other methods.
    The reviewers raised several concerns, including the small datasets used, incomplete comparisons, missing comparisons with SOTA spatio-temporal methods, and not using the standard metrics for object detection. These concerns should be addressed in the rebuttal.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    5




Author Feedback

We sincerely thank the meta-reviewer and all reviewers for the constructive comments. We will release source codes and models for reproducibility.

Q1(R2, R4): Relatively small datasets used; requests for more recent datasets and for inference time. A1: Thanks for your suggestions. To the best of our knowledge, the two datasets used are the only publicly available benchmarks for polyp detection in video format with complete annotations. To further demonstrate generalization and capacity, we conduct external experiments on the large-scale ImageNet VID dataset (30 categories; 3862 training videos and 555 validation videos) and report results in terms of the standard mAP metric. Compared with SOTA methods, including FGFA and RDN, STFT achieves higher performance and lower inference time with fewer parameters using the same ResNet50 backbone.

          mAP (%)   Params (M)   Infer. time (ms)
  FGFA      74.0        89             647
  RDN       76.2        53             163
  STFT      76.4        43             136

Q2(R2, R3): Lack of comparison with the most recent SOTA detection methods and with existing spatio-temporal methods; comparing only image-level methods is unfair, and video-level methods should be considered. A2: As described in Table 1, we compare STFT with 4 video-level methods and evaluate the gains of each method relative to its image-level counterpart. Among them, AIPDT (MICCAI 2020) is the recent SOTA method on the CVC-Clinic dataset, and both FGFA (ICCV 2017) and RDN (ICCV 2019) are spatio-temporal methods. Further comparisons with the most recent image-level and video-level spatio-temporal SOTAs are explored below, including YOLOv5, RetinaNet (ICCV 2017), PraNet (MICCAI 2020), and MEGA (CVPR 2020). Under the localization F1-score metric on both datasets, STFT performs best.

               YOLOv5   RetinaNet   PraNet   MEGA   STFT
  CVC-Clinic     81.8      85.4      88.8    87.8   91.4
  ASU-Mayo       82.0      87.6      87.9    88.1   98.3

Q3(R2): Standard PASCAL VOC metrics (mAP) should be used. A3: Precision, recall, and F1-score are standard performance metrics for polyp detection and localization tasks and are widely used in the literature. This is because mAP evaluates across varying thresholds, whereas a fixed threshold is required in clinical practice. Please refer to Q1&A1 for mAP performance on the ImageNet VID dataset.
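For reference (standard definition, added here for clarity): the comprehensive F1-score used throughout is the harmonic mean of precision P and recall R at the fixed operating threshold, F1 = 2PR / (P + R). For example, precision 0.90 and recall 0.85 give F1 = 2(0.90)(0.85)/1.75 ≈ 0.874.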

Q4(R4): Performance is not very good on one of the datasets. A4: Please refer to Table 1: STFT performs best on the comprehensive F1-score metric on both datasets and both tasks. Although OptCNN [21] achieves higher recall on the CVC-Clinic dataset, its lower precision means a higher false-positive rate, which is not acceptable in clinical practice.

Q5(R3, R4): Are 2 support frames enough for training? What is the impact of the number of support frames at inference? A5: Thanks. We further investigate different numbers of support frames in training and inference (* indicates the default setting: training with 2 frames and inference with 10 frames, cf. A6). Under the localization F1-score metric on both datasets, training with 2 frames achieves better accuracy (training with 6 frames hits the memory cap). For inference, as expected, performance improves slowly as more frames are used and then stabilizes. Combining Tables 1 and 4, STFT always achieves the highest localization F1-score and is insensitive to the number of frames.

  Training frames            2*                             6
  Inference frames    2     6     10*   14    18     2     6     10    14    18
  CVC-Clinic         91.1  91.3  91.4  91.5  91.4   90.3  90.5  90.7  90.6  90.7
  ASU-Mayo           98.2  98.2  98.3  98.3  98.3   95.6  95.7  95.8  95.8  95.8

Q6(R3, R4): What is the range of neighboring frames? How are the support frames sampled? A6: The range is 18, i.e., [-9, 9] relative to the target frame. As described in Section 3.1, we adopt a temporal dropout strategy that discards random temporal frames. During training, 2 support frames are randomly sampled from [-9, 0] and [0, 9], respectively. During testing, 10 support frames are uniformly sampled from [-9, 9].
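The sampling scheme in A6 is simple enough to state in code. Below is a minimal Python sketch under the ranges stated above; the helper name and the clamping of out-of-range indices to the video bounds are our assumptions.

    import random

    def sample_support_frames(t, num_frames, training, radius=9, n_test=10):
        """Pick support-frame indices for target frame t (per A6)."""
        if training:
            # Temporal dropout: one random support frame from each side
            # of the target, drawn from [-9, 0] and [0, 9] respectively.
            idx = [t + random.randint(-radius, 0), t + random.randint(0, radius)]
        else:
            # Testing: 10 support frames sampled uniformly from [-9, 9].
            step = 2 * radius / (n_test - 1)
            idx = [t + round(-radius + i * step) for i in range(n_test)]
        # Clamp to valid frame indices (assumption for frames near video ends).
        return [min(max(i, 0), num_frames - 1) for i in idx]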

Q7(R2): Are the training splits the same for each method? A7: Yes, for a fair comparison we adopt the same data partitioning strategy for all methods (more details are described in Section 3.1).




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have addressed all major concerns highlighted by the meta-reviewer and reviewers. These updates and additional results must be included in the camera-ready version.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    5



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper studies the problem of polyp localization in colonoscopy videos and proposes a spatio-temporal transformation strategy for feature aggregation across neighboring frames. Although all reviewers recommend acceptance of this paper, they also raise significant concerns about the experiments. The rebuttal promises a code release, which would benefit reproducibility. It also includes comparisons against recent detectors with standard metrics and a spatio-temporal model requested by R2 and MR1. The final version of the paper should include those experiments in addition to other comments by the reviewers.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    2



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Spatial/temporal transformers have been used in the video object detection community in computer vision for a while. As questioned by the reviewers, in addition to showing a spatio-temporal transformer application in video polyp detection, it is important to demonstrate the technical novelties. In the authors’ response, the performance of the proposed method on the ImageNet VID dataset is 76.4%, which is lower than many recent methods such as HVR (83.2%), MEGA (82.9%), SELSA (80.3%), etc. The best performance of FGFA (80.1%) and RDN (83.8%) can be higher than what is reported by the authors in the rebuttal. The authors did not show the result of MEGA on ImageNet VID, but they showed the result of MEGA on the CVC-Clinic and ASUMayo datasets. Whether MEGA and the other compared methods are well optimized on these two datasets for a fair comparison is unknown. In the rebuttal, more support frames decrease the performance, which needs some analysis: if the query-key-value attention works well in the transformer, less related support frames will not contribute much to the feature aggregation, so more support frames (within the memory cap) should not affect the performance much.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    8


