
# Authors

Ivan Zakazov, Boris Shirokikh, Alexey Chernyavskiy, Mikhail Belyaev

# Abstract

Domain Adaptation (DA) methods are widely used in medical image segmentation tasks to tackle the problem of differently distributed train (source) and test (target) data. We consider the supervised DA task with a limited number of annotated samples from the target domain. It corresponds to one of the most relevant clinical setups: building a sufficiently accurate model on the minimum possible amount of annotated data. Existing methods mostly fine-tune specific layers of the pretrained Convolutional Neural Network (CNN). However, there is no consensus on what layers are better to fine-tune, e.g. the first layers for images with low-level domain noise or the later layers for images with high-level domain noise. We propose SpotTUnet – a CNN architecture for supervised DA in medical image segmentation. On the target domain our method additionally learns the policy that indicates whether a specific layer should be fine-tuned or it should be reused from the pretrained network. We show that our method performs at the same level with the best non-flexible fine-tuning methods even under the extreme scarcity of annotated data. Secondly, we show that SpotTUnet policy provides a layer-wise visualization of a domain shift impact on the network, which could be further used to develop a more domain shift robust method. We use a publicly available dataset of brain MR images (CC359) with the explicit domain shift to extensively evaluate SpotTUnet performance and release a reproducible experimental pipeline.
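The per-layer choice described in the abstract can be illustrated with a minimal sketch. It assumes a toy "network" of scalar functions and a binary policy that is simply given; how the policy is learned is not shown, and the names `forward_with_policy` and `finetune_frequency` are hypothetical, not taken from the authors' code.

```python
from typing import Callable, List, Sequence

Block = Callable[[float], float]


def forward_with_policy(
    x: float,
    frozen_blocks: Sequence[Block],
    tuned_blocks: Sequence[Block],
    policy: Sequence[int],
) -> float:
    """Run x through the network, picking one copy of each block.

    policy[i] == 1 routes through the fine-tuned copy of block i,
    policy[i] == 0 reuses the frozen pretrained copy.
    """
    for frozen, tuned, use_tuned in zip(frozen_blocks, tuned_blocks, policy):
        x = tuned(x) if use_tuned else frozen(x)
    return x


def finetune_frequency(policies: Sequence[Sequence[int]]) -> List[float]:
    """Per-block share of inputs that were routed through the fine-tuned copy.

    This is the kind of layer-wise statistic a policy visualization could plot.
    """
    n_inputs = len(policies)
    n_blocks = len(policies[0])
    return [sum(p[i] for p in policies) / n_inputs for i in range(n_blocks)]


# Toy stand-ins for pretrained and fine-tuned blocks (assumptions for illustration).
FROZEN = [lambda v: v + 1, lambda v: v * 2]
TUNED = [lambda v: v + 10, lambda v: v * 3]

if __name__ == "__main__":
    print(forward_with_policy(1.0, FROZEN, TUNED, [1, 0]))
    print(finetune_frequency([[1, 0], [1, 1], [0, 0], [1, 0]]))
```

In this toy setup, a block with a high fine-tune frequency is one the policy rarely reuses from the pretrained network, which is how the abstract's "layer-wise visualization of a domain shift impact" can be read.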

SharedIt: https://rdcu.be/cyl3W

# Reviews

### Review #1

• Please describe the contribution of the paper
• This study applies a transfer learning (TL) approach, “SpotTune”, to a supervised domain adaptation (DA) setup and demonstrates the difference in the optimized fine-tuning policy between the TL and DA tasks.
• The authors proposed an L1 regularization parameter and reported its dependency on the number of annotated samples available.
• The authors presented a visualization of the layer-wise fine-tuning policy for the proposed supervised domain adaptation problem.
• Optimized strategies for different levels of data availability.
• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
• The experimental design is thorough, testing different cases with varying numbers of limited training labels.
• The proposed regularization parameter lambda proved effective across experiments with different numbers of annotated slices.
• The authors proposed a novel layer-wise visualization of the optimized fine-tuning policy, which showed that the blocks from the encoder part of U-Net are more likely to be fine-tuned.
• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
• On page 4, Section 3 (method): the regularization could be better explained. The authors mention the “Global-0 SpotTune case”, but there is no explanation of what this “Global-0 SpotTune case” is.
• On page 6, Section 4.1, the authors set aside one domain, “Siemens 1.5T”, for validation to “avoid overfitting”. It is not clear to me why this avoids overfitting. It is also not clear to me why one domain was left out as the validation set instead of choosing a subset of data from each domain for validation. That way, the validation would be more representative.
• In the introduction, the authors cite previous papers that fine-tune either the first or the last layers. In the conclusion, the authors hypothesize that the domain adaptation problem calls for fine-tuning early layers, and the “transfer learning” problem for fine-tuning later layers. While this conclusion makes sense intuitively, the authors only included the “fine-tuning first layers” experiment in the experimental design. It would be better to include the experiments to only fine tune the later or last layer, or at least to explain why that experiment is not necessary given the results shown in Figure 4.
• Please rate the clarity and organization of this paper

Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
• The description of the cohort and sample number is clear
• The code will be available upon acceptance.
• The description of the experiment hyperparameters and the training/validation split is clear. My comments about the split, however, are included in the “weakness” section above.
• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
• It would be good to add a brief introduction about the Surface Dice Score which is introduced in the reference [11], since it’s not as commonly used as other metrics such as Dice score (DSC) or Mean surface distance (MSD)

• In figure 3, it would be better to perform statistical analysis to indicate whether the improvements from proposed SpotTUnet is significant

• The English needs to be proofread:
• Section 4.1: “with learning rate reduce it to” should be rephrased.
• Page 7, Section 4.3: “Consequently, the later may indicate, that the previous to the frequently fine-tuned layers feature maps contain a large domain shift.” doesn’t read well and needs to be rephrased.
○ Maybe: “Consequently, the later may indicate that, the previous to the frequently fine-tuned layers feature maps contain a large domain shift.”

accept (8)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
• The proposed layer-wise visualization is novel to indicate the layer-wise fine-tuning policy optimization. It clearly shows that the blocks from encoder part of U-Net are more likely to be fine-tuned
• The experimental design is thorough, testing different cases with varying numbers of limited training labels.
• What is the ranking of this paper in your review stack?

1

• Number of papers in your stack

3

• Reviewer confidence

Confident but not absolutely certain

### Review #2

• Please describe the contribution of the paper

The paper presents a flexible fine-tuning model for segmentation adaptation in the context of limited (but still available) target domain annotations. The authors’ SpotTUnet architecture performs similarly to “Fine-tune first layers”

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The concept of learning which particular layers should be fine-tuned in a pretrained network is of note. The extension to segmentation of the SpotTune network is novel.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The results are mixed at best and don’t reflect any improvement over the “Fine-tune first layers” of [12], at least when there is some data scarcity in the target domain.

• Please rate the clarity and organization of this paper

Satisfactory

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

This paper provides high reproducibility, with code release and detailed instructions for reproducing the experiments.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

If there is less data scarcity, then fine-tuning more layers seems pretty intuitive, as does getting better scores the more target labelling is available. However, your visualizations in Fig 3, 4 seem inconclusive at best. You show that fine-tuning the first layers has better performance for 24, 90, 270, but worse for 8, 45, 800. Your explanation (“We further hypothesise, that SpotTUnet policy visualization…”) is difficult to read.

Small issues:

• providing therefore a ”domain shift stamp” – use  for left quote.
• “tackle domain shift problem differently” – “problems”, plural.
• “in case of annotated Target data scarcity” – “the case of annotated Target …”
• There are a lot of extra commas: “It is assumed, that low-level features”, “However, it is not clear, whether”, “We further hypothesise, that”, “We find, that the blocks from encoder”.
• “shows that the architecture variations nor the training procedure variations, e.g. augmentation, do not affect” – odd phrasing. How about “neither architecture nor training procedure variations (e.g. augmentation) affect…”?
• “pretrained baseline models with the correspondence to the current Source domain.” – how about “pretrained baseline models corresponding to the current Source domain.”?
• “Secondly, we search for optimal λ” – the sentences that follow are confusing or redundant. Is “amount of annotated data from Target domain” the same thing as “data scarcity”? If so, then just define it. Maybe “we determine λ through grid-search for each level of Target domain data scarcity”.
• “values of τ at one of the first steps” – “as one of the first steps”.
• “methods under the all Target data scarcity setups” – remove “the”.
• In Fig 4, looking at the original-resolution encoding layers, it’s unclear why the last convolution is never fine-tuned, but the skip connection is. I understand the skip connection includes a 1x1 kernel convolution.

probably reject (4)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

While the work is of some interest and includes novelty in design, the results do not appear significant, with similar performance to “Fine-tune first layers” with just more effort.

• What is the ranking of this paper in your review stack?

4

• Number of papers in your stack

5

• Reviewer confidence

Confident but not absolutely certain

### Review #3

• Please describe the contribution of the paper

The authors introduce SpotTUnet, an adaptation of SpotTune to the supervised domain adaptation problem in medical image segmentation. The proposed approach has been evaluated on a dataset of MR brain images with an explicit domain shift.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
• Nicely written paper with strongly formulated motivation and technical contributions
• The authors of the paper are first to use SpotTune for the segmentation task in medical imaging and study optimal fine-tuning strategies for different data scenarios
• A new interpretable regularization is introduced for the SpotTune
• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
• no major weaknesses
• Please rate the clarity and organization of this paper

Excellent

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Everything needed for reproducibility of the proposed approach is provided

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
• Overall Nicely written paper with strongly formulated motivation and technical contributions. This method would definitely be of interest for the community

• Would be interesting to see how well this approach would work on other imaging modalities and anatomies.

strong accept (9)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Nicely written paper with strongly formulated motivation and enough technical contributions in order to be accepted. The proposed approach is also nicely evaluated and shows that it is able to preserve the quality of the best fine-tuning methods.

• What is the ranking of this paper in your review stack?

1

• Number of papers in your stack

5

• Reviewer confidence

Very confident

# Primary Meta-Review

• Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

This work relies on a previously published method, SpotTune, for transfer learning capable of learning which layers should be fine-tuned in the transfer process. The contribution of this work lies in adapting SpotTune to the problem of domain adaptation by introducing a new interpretable regularization term, which according to the authors allows the proposed method to perform best when compared against other fine-tuning methods. This claim, however, is not supported by the presented results. As reported in Fig 3, SpotTUnet is not always the best method and, where it is, the results do not appear to be significant. The authors should comment on this and better justify or lower the claims.

As highlighted by one reviewer, while one of the main contributions of this work is the study of the policy behavior (of layer fine-tuning), the results (e.g. Fig 4) are in contradiction with what was observed in the original work (ref 12) and the provided explanations are rather vague. The authors are recommended to address this point.

Lastly, please explain what do they mean by surface Dice score. It is odd to introduce different measures from the standard ones for comparison. Also, please revise the references. Many are incomplete or are pointing to the arxiv rather than to the correct venue/journal.

• What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

12

# Author Feedback

We thank the reviewers for their thoughtful feedback! We find all the suggestions to be extremely useful for improving the overall quality and clarity of the paper. Below we address the most important reviewers’ concerns.

1. Most notably, @R2 claims, “The results do not appear significant, with similar performance to “Fine-tune first layers” […]”, which is echoed by @MR, “SpotTUnet is not always the best method and, where it is, the results do not appear to be significant”.

However, our message regarding SpotTUnet performance is different and was well appreciated by @R1 and @R3. We do NOT claim that SpotTUnet is the best method among the others. Instead, we state that SpotTUnet performs on the same level with the best of the other methods regardless of the degree of data scarcity, thereby providing trustworthy optimal fine-tuning policy. Given the flexibility of our method, we eliminate the need for manual switching between various methods for different scenarios (e.g., “fine-tune first” for 24 available slices, and “fine-tune all” for 800 slices). Our claims were outlined in the text on several occasions (8th sentence in Abstract, the second contribution in Sec. 1, and the last sentence of the second paragraph in Sec. 4.3). The method received positive feedback from @R1 (“The proposed layer-wise visualization is novel … The experimental design is thorough”) and @R3 (“The proposed approach is also nicely evaluated and shows that it is able to preserve the quality of the best fine-tuning methods”).

Moreover, evaluation of the results gives us statistically significant evidence confirming that SpotTUnet performs either better than or as well as the other methods (the results are provided in the same anonymous GitHub repository).
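The rebuttal does not name the statistical test that was used. As an illustration only, one common way to compare two methods evaluated on the same test images is a paired sign-flip permutation test on the per-image score differences; the sketch below assumes that setting, and the function name `paired_sign_flip_pvalue` is hypothetical.

```python
from itertools import product
from typing import Sequence


def paired_sign_flip_pvalue(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Exact two-sided sign-flip permutation test for paired scores.

    Under the null hypothesis that both methods perform the same, each
    per-image difference is equally likely to have either sign. The p-value
    is the share of sign assignments whose absolute summed difference is at
    least as large as the observed one.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    # Exhaustive enumeration: 2**n assignments, so only feasible for small n.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return hits / total


if __name__ == "__main__":
    # Made-up per-image scores, for illustration only.
    print(paired_sign_flip_pvalue([2.0, 2.0, 2.0, 2.0], [1.0, 1.0, 1.0, 1.0]))
```

A large p-value from such a test is compatible with the "performs on the same level" reading, although it does not by itself establish equivalence; for larger test sets the enumeration would need to be replaced by random sampling of sign assignments.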

Therefore, we argue that the claims of @R2 and @MR, outlined above, contradict our main message.

2. Furthermore, @MR claims, “The results (e.g. Fig 4) are in contradiction with what was observed in the original work (ref 12).”

Firstly, our guess is that perhaps @MR assumes (ref. 6, SpotTune) instead of (ref. 12) to stand for the “original work”. In this case, indeed, the results are different: authors of (ref. 6) show the later layers to be more frequently fine-tuned for Transfer Learning task in the natural images domain, while we show the earlier layers to be more frequently fine-tuned for Domain Adaptation task in the medical imaging domain. The related works discussed in Sec. 2 have already outlined the problem of the layers choice strategy and the diversity of possible solutions. In our work, we propose the automatic layers choice strategy that confirms the motivation of fine-tuning the earlier layers in the considered task.

Secondly, if @MR assumes (ref. 12) as the “original work”, we note that our findings exactly match the results of (ref. 12). Along with “fine-tune last layers” being the weakest method in (ref. 12), in our case the last layers are rarely fine-tuned (Fig. 4).

Also, @R1 fairly notes, “It would be better to include the experiments to only fine tune the later or last layer”. We have intentionally removed these results from the paper, since authors of (ref. 12) had already shown “last layers” to perform significantly worse, which we ensured by reproducing the experiments.

3. Finally, @R1 suggests, “It would be good to add a brief introduction about the Surface Dice Score […] since it’s not as commonly used as other metrics”, which @MR also highlights, “Please explain what do they mean by surface Dice score. It is odd to introduce different measures from the standard ones”.

We used the surface DSC (rather than volumetric DSC) as being more sensitive and representative in our task of brain segmentation. “The edge zones account for a small share of the brain volume, which makes dice score not sensitive enough to the delineation quality” (ref. 12).
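To make the distinction from volumetric Dice concrete, here is a simplified sketch of a surface Dice score at a given tolerance for 2D binary masks. It counts boundary pixels rather than weighting surface elements by area, uses pixel units, and relies on brute-force distances, so it is an approximation for illustration and not the exact metric used in the paper; `boundary` and `surface_dice` are hypothetical helper names.

```python
from math import hypot
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]


def boundary(mask: Mask) -> List[Tuple[int, int]]:
    """Foreground pixels with at least one 4-neighbour that is background or off-grid."""
    h, w = len(mask), len(mask[0])
    points = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if not (0 <= rr < h and 0 <= cc < w) or not mask[rr][cc]:
                    points.append((r, c))
                    break
    return points


def surface_dice(pred: Mask, ref: Mask, tol: float = 1.0) -> float:
    """Share of boundary pixels of each mask lying within `tol` of the other boundary."""
    bp, br = boundary(pred), boundary(ref)
    if not bp and not br:
        return 1.0  # both masks empty: treated as perfect agreement
    if not bp or not br:
        return 0.0

    def n_close(src, dst):
        # Brute-force nearest-boundary distance; fine for small toy masks.
        return sum(
            1 for (r, c) in src
            if min(hypot(r - r2, c - c2) for (r2, c2) in dst) <= tol
        )

    return (n_close(bp, br) + n_close(br, bp)) / (len(bp) + len(br))


# Toy masks: a 3x3 square and the same square shifted right by one pixel.
REF = [[0, 0, 0, 0, 0], [0, 1, 1, 1, 0], [0, 1, 1, 1, 0], [0, 1, 1, 1, 0], [0, 0, 0, 0, 0]]
SHIFTED = [[0, 0, 0, 0, 0], [0, 0, 1, 1, 1], [0, 0, 1, 1, 1], [0, 0, 1, 1, 1], [0, 0, 0, 0, 0]]

if __name__ == "__main__":
    print(surface_dice(SHIFTED, REF, tol=0.0), surface_dice(SHIFTED, REF, tol=1.0))
```

Because only boundary pixels enter the score, an error confined to the edge of a large structure changes this value far more than it changes a volumetric Dice score, which is the sensitivity argument made in the rebuttal.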

We mostly attribute the other reviewers’ concerns to the proofreading stage.

Yours sincerely, The Authors

# Post-rebuttal Meta-Reviews

## Meta-review # 1 (Primary)

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This work adapts an idea from the computer vision community for image classification (ref 6) to the problem of medical image segmentation. To this end, the authors use a U-Net in their formulation rather than a ResNet. The original work (ref 6) also proposes a visualization scheme, which here is adapted to be displayed in a U-Net topology. Based on the clarifications provided by the authors in the rebuttal, I consider the contributions of this work incremental because: 1) The methodological contribution is limited, this being an adaptation of ref 6 to the problem of image segmentation. For instance, the methods section is a summary of the framework, with all the key elements having been introduced by ref 6. The novelty lies in the use of the U-Net. 2) The experimental findings are a confirmation of what ref 12 has already presented.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Reject

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

13

## Meta-review #2

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This paper was rated highly by the reviewers, with a single low score (4) by R2. The authors wrote a strong rebuttal addressing many of R2’s points. In particular,

• it was adequately answered why the proposed method doesn’t always beat the baseline (was not the objective),
• whether the work contradicts prior work (in line with [12], but not with [6]. Additionally, the fact that tuning different layers is beneficial for different problems is part of the problem statement),
• and why the unusual Dice surface metric was used (more sensitive since mistakes mostly happen at the edges)

In my view, taking into account the original reviews and the rebuttal, this is a clear accept.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

1

## Meta-review #3

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The paper presents a flexible fine-tuning strategy that could potentially benefit many other medical image segmentation tasks where sufficient labelled data is not available. Most of the reviewers’ comments are well addressed in the rebuttal. I found many repetitions in the paper. The authors should remove them to improve the overall presentation of the paper.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

1