Paper Info Reviews Meta-review Author Feedback Post-Rebuttal Meta-reviews

# Authors

Tianfei Zhou, Liulei Li, Gustav Bredell, Jianwu Li, Ender Konukoglu

# Abstract

Despite recent progress in automatic medical image segmentation techniques, fully automatic results usually fail to meet clinical requirements and typically require further refinement. In this work, we propose a *quality-aware memory network* for interactive segmentation of 3D medical images. Given user guidance on an arbitrary slice, an interaction network is first employed to obtain an initial 2D segmentation. The quality-aware memory network subsequently propagates the initial segmentation estimate bidirectionally over the entire volume. Subsequent refinement based on additional user guidance on other slices can be incorporated in the same manner. To further facilitate interactive segmentation, a quality assessment module is introduced to suggest the next slice to segment based on the current segmentation quality of each slice. The proposed network has two appealing characteristics: 1) The memory-augmented network offers the ability to quickly encode past segmentation information, which can be retrieved for the segmentation of other slices; 2) The quality assessment module enables the model to directly estimate the quality of segmentation predictions, which allows an active learning paradigm where users preferentially label the lowest-quality slice for multi-round refinement. The proposed network leads to a robust interactive segmentation engine, which generalizes well to various types of user annotations (e.g., scribbles, boxes). Experimental results on various medical datasets demonstrate the superiority of our approach in comparison with existing techniques.

SharedIt: https://rdcu.be/cyl23


# Reviews

### Review #1

• Please describe the contribution of the paper

The paper presents an interactive segmentation framework based on a pre-trained neural network with a "memory augmentation" component that encodes a number of segmented patches as key-value pairs, which are fetched based on their similarity to the slice being segmented. The network first segments the slice with user interaction and then propagates the segmentation onto adjacent slices and so forth using the memory network until the volumetric image is segmented, using a quality estimate to identify the slice with the lowest-quality propagation, which is marked as the best locus for additional user interaction. (The benefit of this method is that it avoids run-time training while still allowing for adaptation by storing more slices in memory.)
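As a reading aid, the key-value retrieval described above can be sketched as soft attention over stored memory entries. This is a minimal NumPy illustration of the general mechanism, not the authors' exact architecture; all names are hypothetical:

```python
import numpy as np

def memory_read(query_key, mem_keys, mem_values):
    """Soft-attention read: weight each stored value by the similarity
    of its key to the query slice's key (illustrative sketch only)."""
    sims = mem_keys @ query_key          # (N,) similarity of query to each memory key
    weights = np.exp(sims - sims.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ mem_values          # weighted combination of stored values
```

The segmentation of a new slice is then conditioned on this retrieved value, so past segmentations influence the current one without any weight updates at run time.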

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The main strength of the paper is in showing the effect of the quality assessment metric for slice selection. The two alternative methods, random and oracle, pose meaningful lower and upper bounds on the expected performance.

The other main strength of the paper is the attention towards ensuring that the underlying learning mechanism can be used in near real time, i.e. the structure of the memory mechanism is designed for interactivity and for having the network adapt to the user without expensive weight re-training at run time. The application of the method to two vastly different anatomies illustrates this as well, although this could always be improved by looking at different imaging modalities too.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The authors really need to check their references. On the second page, the authors cite [10,18,20,24] as handling only 2D medical images, which is untrue for [18] as it uses 3D volumes via 33×33×3-voxel patches. (The remainder do use 2D slices.)

Table 1 should include some variability measurements and not just the average performance. Given how close the numbers generally are, it is unclear if the proposed method is offering a consistently improved segmentation or not.

It is unclear if a single f_IN is trained with respect to the interaction mechanisms. Does each one use a different f_IN? A shared f_IN? This could go a long way towards explaining the results in Table 1.

• Please rate the clarity and organization of this paper

Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The paper appears to be relatively reproducible in that the logic of the paper is easy to follow. There are a few aspects that could be improved, such as an exact description of the networks used, or code to that effect. The datasets are also publicly available, although not their scribble annotations.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

It is slightly unclear in Figure 3 exactly how the user annotations vary with the number of rounds. (Also, there is a typo in the Figure 3 legend.) One assumes different slices are used for the different time-points and that each slice is interacted with at most once, but this should be explicit and the ordering mechanism stated. The authors should also consider what would happen if the same slice is re-annotated (for correction purposes) and how that factors into their framework, especially given that the better performing interaction mechanisms are both designed more to give a notion of bounds and have less corrective capabilities.

The paper would benefit from some statistical testing in order to show whether the results can be readily explained by random perturbations. Assuming each method uses the same training-testing split, Wilcoxon signed-rank tests would likely be the appropriate choice.
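To make the suggestion concrete: with paired per-case Dice scores from two methods on the same test split, the test is a one-liner in SciPy. The numbers below are synthetic, for illustration only:

```python
from scipy.stats import wilcoxon

# Synthetic paired per-case Dice scores for two methods on the same test split.
dice_ours     = [0.81, 0.79, 0.85, 0.78, 0.90, 0.76, 0.88, 0.83]
dice_baseline = [0.78, 0.77, 0.84, 0.75, 0.88, 0.74, 0.85, 0.80]

stat, p = wilcoxon(dice_ours, dice_baseline)  # two-sided paired test
print(f"W = {stat}, p = {p:.4f}")
```

A small p-value indicates the per-case differences are unlikely to arise from random perturbation alone; being a rank test, it makes no normality assumption about the Dice distribution.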

On a cosmetic note, the citations should be arranged in numerical order to make it easier for the reader.

strong accept (9)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper presents an interesting framework and particular attention has been given to investigating different annotation types, which is appreciated for a work on interactive segmentation. The paper also presents a memory mechanism that one can easily envisage as improving interactive segmentation without requiring interactive network training. Ultimately, this paper appears to balance technical novelty with a clear underlying technical need, motivated by a reasonably foreseeable clinical need.

• What is the ranking of this paper in your review stack?

1

• Number of papers in your stack

5

• Reviewer confidence

Very confident

### Review #2

• Please describe the contribution of the paper

The authors propose an interactive method for 3D image segmentation based on memory and quality assessment. The method is fast and works on several public databases.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
• strong validation
• the method proposed is fast even if in 3D
• the description of the experiments is clear
• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
• I would have appreciated seeing the std in the quantitative results
• the conclusion is somewhat light
• the results of the challenges (KiTS, for example) are not really explained (is "competitor" the winner? what about newer submissions?)
• no comparison with ResNet-50 alone (backbone of the proposed method)
• Please rate the clarity and organization of this paper

Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The databases are public, so access to the data is not a problem. The authors will share the code. The details in the article seem sufficient to be able to reproduce the results.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

It would be interesting to see the confidence the network has with and without the memory component. A statistical study of the results is hence recommended. A mention of the current challenge results would also be appreciated. I would have appreciated a comparison with ResNet-50 without the memory module to really understand its role.

accept (8)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Even if the method is not fully automatic, results are interesting. The memory component is a real advantage when segmenting a 3D volume slice by slice.

• What is the ranking of this paper in your review stack?

1

• Number of papers in your stack

4

• Reviewer confidence

Confident but not absolutely certain

### Review #3

• Please describe the contribution of the paper

This paper presents a new interactive segmentation method which acts on volumetric medical images. Compared to the baseline approach based on nn-UNet, the proposed method consists of a memory-augmented network and a quality assessment module, which allows users to label lower-quality slices for refinement. The proposed method yields superior results compared with existing methods.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
• The paper presents a novel approach for interactive volumetric segmentation which enables quality assessment for multi-round refinement in the human-in-the-loop segmentation process.

• The approach accommodates different weak labels well, e.g., scribbles, points, etc.

• The approach yields superior results compared with existing methods, including the competition winner nnU-Net. A detailed ablation study on the quality assessment module is also provided to show its effectiveness.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
• Comparison with weakly-supervised methods (e.g., [a, b]) may also need to be provided.

• Missing references: the ROI-based multi-round refinement is very similar to a coarse-to-fine based volumetric segmentation approach [c]. Should be included in the discussion.

[a] Rajchl, Martin, et al. "DeepCut: Object segmentation from bounding box annotations using convolutional neural networks." IEEE Transactions on Medical Imaging 36.2 (2016): 674-683.
[b] Kervadec, Hoel, et al. "Constrained-CNN losses for weakly supervised segmentation." Medical Image Analysis 54 (2019): 88-99.
[c] Zhou, Yuyin, et al. "A fixed-point model for pancreas segmentation in abdominal CT scans." MICCAI 2017.

• Please rate the clarity and organization of this paper

Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors promise to release the code.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

Probably accept (7)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This paper presents an interactive volumetric segmentation tool which can well accommodate different weak labels, and also enables quality assessment for multi-round refinement. The authors perform evaluation on CT segmentation datasets and compare with a few strong competitors to demonstrate the clinical feasibility. The output might be a general tool to reduce clinicians’ workload.

• What is the ranking of this paper in your review stack?

3

• Number of papers in your stack

5

• Reviewer confidence

Confident but not absolutely certain

# Primary Meta-Review

• Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

This work presents a general framework for volumetric segmentation which can account for different types of weak labels. The method has been thoroughly validated and the paper is clearly written.

The reviewers recommend including standard deviations in the reported results and expanding the related work in the revised version with relevant literature that is currently missing.

• What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

1

# Author Feedback

We sincerely thank all the reviewers and AC for their time and constructive comments on our manuscript. Below we provide clarifications on the major comments, which will be integrated in the final version.

Reviewers #2 & #3 Q: Provide the std in Table 1. A: We summarize the standard deviations of the different algorithms (Interactive 3D nnU-Net / DeepIGeoS / Ours) in the table below. As shown, our method has the smallest deviations. For reference, the best solution on the MSD leaderboard has deviations of 18.9 for lung cancer and 35.1 for colon cancer.

| Interaction Type | Lung Cancer | Colon Cancer | Kidney Organ | Kidney Tumor |
|---|---|---|---|---|
| Scribbles | 16.8 / 13.5 / 9.2 | 34.7 / 19.5 / 11.7 | 4.0 / 3.4 / 1.9 | 15.9 / 14.3 / 7.5 |
| Bounding Boxes | 16.3 / 13.3 / 9.1 | 33.2 / 19.2 / 11.4 | 3.8 / 2.8 / 2.1 | 15.8 / 13.0 / 7.5 |
| Extreme Points | 15.5 / 12.6 / 8.8 | 31.0 / 19.1 / 11.2 | 3.1 / 2.4 / 1.7 | 14.4 / 11.4 / 7.4 |

Reviewers #2 & #3 & #4 Q: Incorrect citations and missing references. A: Thank you for pointing this out. We will correct the citation errors, expand the reference section to include the missing articles, and rearrange the references in numerical order.

Reviewer #2 Q1: Is the interaction network f_IN shared across different interaction types or not? A1: In the article, we use a separate interaction network f_IN for each interaction type to achieve better performance. Following your suggestion, we also trained a shared f_IN by combining the training samples of all interaction types. We found that performance degrades slightly with the shared f_IN; however, it would then support diverse interaction mechanisms. Below we summarize the DSC scores and the difference from the results reported in Tab. 1.

| Interaction Type | Lung Cancer | Colon Cancer | Kidney Organ | Kidney Tumor |
|---|---|---|---|---|
| Scribbles | 80.5 (-0.4) | 79.3 (-0.4) | 96.6 (-0.3) | 87.6 (-0.6) |
| Bounding Boxes | 81.2 (-0.3) | 78.8 (-0.5) | 96.8 (-0.2) | 87.9 (-0.5) |
| Extreme Points | 81.4 (-0.6) | 79.8 (-0.6) | 96.5 (-0.5) | 88.0 (-1.1) |

Q2: How do the user annotations vary with the number of rounds? A2: At each round, we determine the slice to be annotated based on the quality scores predicted by the quality assessment network, i.e., the slice with the lowest quality score will be chosen. In the current version, each slice will be chosen at most once. We agree with the reviewer that it is valuable to factor the correction procedure into the framework and will consider this in the future work. Thanks.
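The selection rule described above (lowest predicted quality first, each slice chosen at most once) can be sketched as follows; this is a hypothetical helper for illustration, not the authors' code:

```python
import numpy as np

def next_slice_to_annotate(quality_scores, annotated):
    """Return the index of the unannotated slice with the lowest
    predicted quality score (each slice is chosen at most once)."""
    scores = np.asarray(quality_scores, dtype=float).copy()
    scores[list(annotated)] = np.inf  # exclude already-annotated slices
    return int(np.argmin(scores))
```

After each round, the newly annotated slice's index would be added to `annotated`, so subsequent rounds move on to the next-worst remaining slice.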

Reviewer #3 Q1: More ablative experiments on the memory component. A1: To study the memory module, we conduct extra experiments to evaluate the model performance with different memory sizes (i.e., 0, 1, 5, 10, 15, 20). Here, 0 corresponds to the baseline model with ResNet-50 only, while 20 is the default size used in the paper. In the following table, we report the DSC scores on the three challenging test sets across all interactions (scribbles/bounding boxes/extreme points). We observe that without the memory module, the performance encounters a severe degradation for all interactions, especially on Lung Cancer and Colon Cancer. In addition, the performance improves with larger memory sizes, and the gain becomes marginal around the values of 15 and 20.

| Memory Size | Lung Cancer | Colon Cancer | Kidney Tumor |
|---|---|---|---|
| 0 | 58.2/59.3/58.9 | 54.7/54.7/54.8 | 79.7/79.7/79.8 |
| 1 | 76.2/75.6/77.0 | 67.3/67.1/68.0 | 83.7/83.7/84.2 |
| 5 | 79.6/79.8/80.9 | 72.9/73.1/73.9 | 86.1/85.8/86.9 |
| 10 | 80.9/81.4/81.8 | 75.2/75.3/75.8 | 87.2/87.3/87.9 |
| 15 | 81.0/81.5/82.1 | 78.7/79.2/80.4 | 88.3/88.3/89.0 |
| 20 | 80.9/81.5/82.0 | 79.7/79.3/80.4 | 88.2/88.4/89.1 |

Q2: Other minor issues (e.g., expand the conclusion section and include more challenge results) A2: Thank you for the comments. We will definitely revise them in the final version.