
Authors

Kai Cheng, Yiting Ma, Bin Sun, Yang Li, Xuejin Chen

Abstract

Depth estimation in colonoscopy images provides geometric clues for downstream medical analysis tasks, such as polyp detection, 3D reconstruction, and diagnosis. Recently, deep learning technology has made significant progress in monocular depth estimation for natural scenes. However, without sufficient ground-truth dense depth maps for colonoscopy images, it is significantly challenging to train deep neural networks for colonoscopy depth estimation. In this paper, we propose a novel approach that makes full use of both synthetic data and real colonoscopy videos. We use synthetic data with ground-truth depth maps to train a depth estimation network with a generative adversarial network model. Despite the lack of ground-truth depth, real colonoscopy videos are used to train the network in a self-supervised manner by exploiting temporal consistency between neighboring frames. Furthermore, we design a masked gradient warping loss to ensure temporal consistency with more reliable correspondences. We conducted both quantitative and qualitative analysis on an existing synthetic dataset and a set of real colonoscopy videos, demonstrating the superiority of our method in producing more accurate and consistent depth estimation for colonoscopy images.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87231-1_12

SharedIt: https://rdcu.be/cyhUD

Link to the code repository

https://github.com/ckLibra/Self-Supervised-Depth-Estimation-for-Colonoscopy

Link to the dataset(s)

https://github.com/ckLibra/Self-Supervised-Depth-Estimation-for-Colonoscopy


Reviews

Review #1

  • Please describe the contribution of the paper

    In this paper, the authors propose a method to estimate depth from real colonoscopy video frames using both synthetic and real colonoscopy videos. First, they trained a pix2pix network (DepthNet) in an adversarial manner to learn depth from synthetic data only; second, they used temporal consistency as a constraint to fine-tune their DepthNet on real colonoscopy videos.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    -Using temporal consistency as a constraint on depth estimation from real colonoscopy videos

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    - PWC-Net is designed and trained on outdoor video frames, and thus can introduce extra error when computing optical flow from real colonoscopy video frames.
    - The depth estimation method could also be compared with geometric self-supervised deep learning methods.
    - It is not clear whether the synthetic data has occlusion maps and, if so, why they have not been used in training, as they could help as an extra constraint.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Considering the explanation of the method in the paper, it seems reasonably reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • In Fig. 1, it would be better to replace FlowNet with PWC-Net; FlowNet itself is an optical flow method.
    • Using the gradient of depth can smooth the effect of reflection, but as can be seen in the last two columns of Fig. 2, the method introduces some noise in the top left of the depth map. This is common when using generative networks to map from RGB to normalized depth.
    • Instead of two frames, a sequence of 5 or even 10 frames (depending on hardware capacity) could be used; this can help deal with reflection noise, which appears and disappears with camera movement inside the colon.
  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Estimating depth to determine the size and location of polyps can be a helpful assistive tool for clinicians. Even though the authors have tried to address this problem, I believe more experiments and comparisons with the current SOTA computer vision methods are necessary, which are lacking in this paper.

  • What is the ranking of this paper in your review stack?

    5

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    The paper presents a novel strategy to estimate depth on colonoscopy images using a deep learning framework. The depth estimation is useful for obtaining polyp clues that thereafter help determine the aggressiveness of the disease. The main contribution of the proposed strategy is the training of a deep architecture over synthetic and real colonoscopy data, which also exploits temporal consistency as a constraint to achieve better depth map representations.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed approach contributes novel components to depth estimation in colonoscopy, for instance, the use of temporal consistency to avoid artifacts in the resulting depth maps. Although the use of synthetic data is widely implemented in the literature, the configuration and transfer scheme may be new for this application.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main drawback of the paper lies in the evaluation and results. I summarize the weaknesses in the following lines:

    • There is no clear statistical difference between the approaches “our baseline” and “ours” in Table 1. The authors argue that real data does not yield further improvements over the synthetic representation. I consider that a deeper discussion should be provided, also illustrating the training results to rule out possible overfitting.

    • There is no ablation study to demonstrate the contribution of the temporal component.
    • I am surprised that the results of the other baseline approaches are not illustrated on the real sequences. The differences in performance on that dataset should highlight the main advantages of the proposed approach. I think a state-of-the-art comparison is mandatory in this work, since the literature contains many approaches with the same objective.
    • With respect to other works, I found that the dataset is limited, and other ground-truth strategies should be considered. In fact, there exist tools during colonoscopy to obtain a better approximation of polyp size.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper has the potential to be reproduced because the description of the methodology is clear. The code and dataset do not seem to be publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    The paper introduces a novel approach to depth estimation in colonoscopy. A main and interesting contribution is the integration of temporal consistency to achieve better depth estimates. The use of synthetic data and retraining with real data is also interesting and could be a potential alternative for achieving robust 3D estimations.

    Nonetheless, I consider that the evaluation of the proposed approach should be enriched with respect to the state of the art and also with respect to the contribution of each component in the strategy. For instance, an ablation study may determine the contribution of temporal information.

  • Please state your overall opinion of the paper

    borderline reject (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the paper is interesting, the idea is novel, and the application is within the domain of the conference, I consider that further evaluation is needed to understand the advantages and limitations of the proposed approach.

  • What is the ranking of this paper in your review stack?

    6

  • Number of papers in your stack

    2

  • Reviewer confidence

    Confident but not absolutely certain



Review #3

  • Please describe the contribution of the paper
    • Uses both synthetic and real colonoscopy data to train a network for depth estimation from colonoscopy videos.

    • Uses temporal consistency and introduces a masked gradient warping loss to train the network in a self-supervised manner.

    • Achieves smoother and more consistent depth estimation than other methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Exploiting both synthetic and real data, and training the network in a self-supervised manner: Given the lack of real ground truth data which is the main challenge in the medical field, it is very important to develop a method that works around this problem. This paper attempts it by using synthetic data and self-supervision.

    • Masked gradient warping loss: This paper introduces a new loss to enforce temporal consistency between neighbouring frames, which leads to smoother and more consistent depth estimation (but with its own disadvantage - see comments below).

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Depth estimation at sharp boundaries: I can see that the proposed method generates smoother depth at locations with specularity, but it also makes the depth at sharp boundaries much smoother, which is not ideal. In the supplementary video, this depth bleeding at boundaries is more noticeable than in the figures. I think this might come from the gradient warping loss, which could make the gradients at the boundaries small.

    • No discussions on failure cases/limitations: I believe this will improve the quality of paper in general as well as help other researchers think about how to build upon the proposed method.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It looks fine.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • I have a question about Eq 3. Is this loss for pretraining on synthetic data as well as for finetuning on real data? If so, how is the GAN loss computed for finetuning? Unless I misunderstood it, this loss is for training pix2pixHD which requires paired images. So I am wondering what is the paired image for a real image here.

    • As mentioned above, it would be great if the authors can add failure cases and discussions on the limitations of the method.

    • Now, some minor comments:
    • pp2: In the 3rd paragraph, ‘train a image translation’ -> ‘train an image translation’.
    • Sec 3.2: CGAN is not defined before.
    • pp7: In line 2, ‘This is reasonable because that the self-supervision’ -> remove ‘that’.
    • pp7: in line 4, ‘it does not increase more information for the synthetic data.’ -> ‘it does not increase the amount of information…’.
  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper addresses one of the main challenges in medical imaging, i.e., how to work around the lack of ground-truth data in the real domain. This paper does so by exploiting both synthetic and real data in a self-supervised fashion with a novel masked gradient warping loss. This new loss enforces temporal consistency between neighbouring frames, which results in smoother depth estimation. However, there is an adverse effect in this approach: the estimated depth loses fidelity around sharp boundaries. I think this might hinder, e.g., reconstructing the correct 3D geometry of the colon.

  • What is the ranking of this paper in your review stack?

    3

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper describes a method to train a 3D depth estimation network from both simulated and real data. The contribution is a novel loss based on temporal consistency. Training such a network is a currently open problem given the lack of labeled data. This is hence a worthy contribution to publish. The paper should be revised according to the reviews for the final version.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    3




Author Feedback

We thank the reviewers for their constructive comments. Here we would like to address some specific concerns and comments.

[R1, R2] Comparison with SOTA methods. [Reply] We had actually tried several SOTA geometric self-supervised learning methods, such as Zhou et al. [23], but their results are poor because they suffer from occlusions and non-Lambertian surfaces. The same conclusion can be found in Liu et al.’s work [6]. W.r.t. real colonoscopy videos, since there is no ground truth, it is difficult to make a quantitative comparison. The main contribution of our paper is a novel strategy and loss that achieve more consistent and reasonable depth prediction for real colonoscopy videos without ground truth. The advantage is demonstrated by the reconstructed depth maps for real data. We will release our data and code for comparison.

[R1] Error from optical flow. [Reply] Due to the inevitable errors in optical flow estimation, we filter out most pixels that have high optical flow errors with a forward-backward consistency check (as described in the second paragraph of Sec. 2.2). Besides, the optical flow does not explicitly constrain the depth values but only provides extra temporal information for training, which is tolerant to small errors. Moreover, the optical flow is used only in the training stage, so it does not introduce extra errors in the inference stage.
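
For readers who want to see what such a filtering step looks like in practice, below is a minimal sketch (not the authors' released code) of a standard forward-backward optical flow consistency mask; the tensor layout (B, 2, H, W) and the error threshold are illustrative assumptions.

    # Minimal sketch of a forward-backward optical flow consistency check.
    # Layout and threshold are assumptions for illustration only.
    import torch
    import torch.nn.functional as F

    def backward_warp(tensor, flow):
        """Warp `tensor` (B, C, H, W) with `flow` (B, 2, H, W) via bilinear sampling."""
        _, _, h, w = flow.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        grid = torch.stack((xs, ys), dim=0).float().to(flow.device)   # (2, H, W), x then y
        coords = grid.unsqueeze(0) + flow                              # absolute sampling positions
        # Normalize to [-1, 1] for grid_sample (x along width, y along height).
        coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
        coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
        grid_norm = torch.stack((coords_x, coords_y), dim=-1)          # (B, H, W, 2)
        return F.grid_sample(tensor, grid_norm, align_corners=True)

    def fb_consistency_mask(flow_fwd, flow_bwd, thresh=1.0):
        """Keep pixels whose forward flow is (approximately) cancelled by the
        warped backward flow; the rest are treated as occluded or unreliable."""
        flow_bwd_warped = backward_warp(flow_bwd, flow_fwd)
        fb_error = (flow_fwd + flow_bwd_warped).norm(dim=1)            # (B, H, W)
        return (fb_error < thresh).float()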

[R1] Occlusion map of the synthetic data. [Reply] In the synthetic dataset released by Rau et al. [11], only the paired RGB images and depth maps are provided. The occlusion maps are not available.

[R2] Ablation study and analysis of the contribution of the temporal component. [Reply] According to our understanding, the temporal component refers to the elements of the depth gradient warping module, such as using the gradient of depth instead of the depth itself and the forward-backward warping distance check. We did conduct an ablation study, but due to the MICCAI page limit there may not be enough space to discuss it.
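
To make the components listed above concrete, the following is a hypothetical sketch of a masked gradient warping term assembled from them (depth gradients instead of raw depth, warping with optical flow, and a validity mask from the forward-backward check). It reuses backward_warp and fb_consistency_mask from the sketch above; the exact formulation and the L1 penalty are assumptions, not the paper's implementation.

    # Hypothetical sketch of a masked gradient warping term; names and the
    # exact penalty are assumptions rather than the authors' code.
    def depth_gradients(depth):
        """Finite-difference spatial gradients of a depth map (B, 1, H, W)."""
        dx = depth[:, :, :, 1:] - depth[:, :, :, :-1]
        dy = depth[:, :, 1:, :] - depth[:, :, :-1, :]
        return dx, dy

    def masked_gradient_warping_loss(depth_t, depth_t1, flow_fwd, mask):
        """Penalize differences between the depth gradients of frame t and the
        gradients of frame t+1 warped back to frame t, on reliable pixels only."""
        dx_t, dy_t = depth_gradients(depth_t)
        dx_t1, dy_t1 = depth_gradients(backward_warp(depth_t1, flow_fwd))
        mask_x = mask[:, :, 1:]    # crop mask to match gradient shapes
        mask_y = mask[:, 1:, :]
        loss_x = (mask_x * (dx_t - dx_t1).abs().squeeze(1)).sum() / (mask_x.sum() + 1e-6)
        loss_y = (mask_y * (dy_t - dy_t1).abs().squeeze(1)).sum() / (mask_y.sum() + 1e-6)
        return loss_x + loss_y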

[R2, R3] The limitations of the proposed approach. [Reply] Due to the page limit, we did not discuss the limitations of our method in detail. We did test our method on a large number of real colonoscopy videos. In general, our method produces reasonable depth estimation in a general colonoscopy environment, but it occasionally fails to recover the shape of small polyps that are difficult to distinguish from the background. Further discussion will be added to the paper.

[R3] The adverse effect of the gradient loss. [Reply] For the multi-view 3D reconstruction targeted in our paper, the temporal consistency of the predicted depth is much more important than sharp boundaries. Our framework can flexibly integrate existing solutions to strengthen boundaries, such as emphasizing the depth estimation error at boundaries on the synthetic data.

[R3] GAN loss for fine-tuning. [Reply] The GAN loss is computed with paired images from the synthetic dataset; it is not computed on the real data. During the fine-tuning stage, both synthetic and real data are used to compute the loss in Eq. 3: L_MGW is computed on real data, while L_FM and L_GAN are computed on synthetic data.
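
As an illustration of how the fine-tuning objective described above could be assembled, here is a hedged sketch: adversarial and feature-matching terms on paired synthetic batches, and the masked gradient warping term on unpaired real frames. The loss weights, dictionary keys, and the callables gan_loss_fn/fm_loss_fn are placeholders rather than names or values from the paper; masked_gradient_warping_loss and fb_consistency_mask refer to the sketches above.

    # Illustrative composition of the fine-tuning objective; weights, keys,
    # and loss callables are placeholders, not values from the paper.
    def finetune_loss(syn_batch, real_batch, depth_net,
                      gan_loss_fn, fm_loss_fn, w_fm=10.0, w_mgw=1.0):
        # Paired synthetic terms: adversarial + feature-matching (pix2pixHD-style).
        pred_syn = depth_net(syn_batch["rgb"])
        loss_gan = gan_loss_fn(syn_batch["rgb"], pred_syn, syn_batch["depth"])
        loss_fm = fm_loss_fn(syn_batch["rgb"], pred_syn, syn_batch["depth"])

        # Unpaired real term: masked gradient warping between neighboring frames.
        depth_t = depth_net(real_batch["frame_t"])
        depth_t1 = depth_net(real_batch["frame_t1"])
        mask = fb_consistency_mask(real_batch["flow_fwd"], real_batch["flow_bwd"])
        loss_mgw = masked_gradient_warping_loss(depth_t, depth_t1,
                                                real_batch["flow_fwd"], mask)

        return loss_gan + w_fm * loss_fm + w_mgw * loss_mgw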


