
Authors

Guodong Zeng, Till D. Lerch, Florian Schmaranzer, Guoyan Zheng, Jürgen Burger, Kate Gerber, Moritz Tannast, Klaus Siebenrock, Nicolas Gerber

Abstract

Unsupervised domain adaptation (UDA) for cross-modality medical image segmentation has shown great progress by domain-invariant feature learning or image appearance translation. Feature-level adaptation based methods learn good domain-invariant features in classification tasks but usually cannot detect domain shift at the pixel level and are not able to achieve good results in dense semantic segmentation tasks. Image appearance adaptation based methods translate images into different styles with good appearance, but semantic consistency is hard to maintain and results in poor cross-modality segmentation. In this paper, we propose intra- and cross-modality semantic consistency (ICMSC) for UDA and our key insight is that the segmentation of synthesized images in different styles should be consistent. Specifically, our model consists of an image translation module and a domain-specific segmentation module. The image translation module is a standard CycleGAN, while the segmentation module contains two domain-specific segmentation networks. The intra-modality semantic consistency (IMSC) forces the reconstructed image after a cycle to be segmented in the same way as the original input image, while the cross-modality semantic consistency (CMSC) encourages the synthesized images after translation to be segmented exactly the same as before translation. Comprehensive experiments on two different datasets (cardiac and hip) demonstrate that our proposed method outperforms other UDA state-of-the-art methods by a large margin.
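
The two consistency terms described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The names `icmsc_terms`, `g_s2t`, `g_t2s`, `seg_s`, and `seg_t` are hypothetical. The mean absolute difference between probability maps is an assumed distance, and the assignment of each image to a domain-specific segmenter is inferred from the abstract. The toy generators and segmenters only stand in for trained networks.

```python
import numpy as np

def softmax(logits, axis=0):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def consistency(p, q):
    """Assumed distance: mean absolute difference of two probability maps."""
    return float(np.mean(np.abs(p - q)))

def icmsc_terms(x_s, g_s2t, g_t2s, seg_s, seg_t):
    """Return (IMSC, CMSC) for one source-domain image x_s."""
    x_s2t = g_s2t(x_s)        # synthesized image in the target style
    x_cycle = g_t2s(x_s2t)    # reconstruction after a full cycle
    p_orig = seg_s(x_s)       # segmentation of the original image
    imsc = consistency(seg_s(x_cycle), p_orig)  # intra-modality term
    cmsc = consistency(seg_t(x_s2t), p_orig)    # cross-modality term
    return imsc, cmsc

# Toy stand-ins (assumptions): the "translation" inverts intensities, and
# each segmenter thresholds softly in the convention of its own domain.
def toy_g_s2t(x):
    return 1.0 - x

def toy_g_t2s(x):
    return 1.0 - x

def toy_seg_s(x):
    return softmax(np.stack([1.0 - x, x]) * 5.0)

def toy_seg_t(x):
    return softmax(np.stack([x, 1.0 - x]) * 5.0)

rng = np.random.default_rng(0)
x = rng.random((8, 8))
imsc, cmsc = icmsc_terms(x, toy_g_s2t, toy_g_t2s, toy_seg_s, toy_seg_t)
print(imsc, cmsc)  # both close to zero for this perfectly consistent toy setup
```

In this toy setup both terms vanish because translation and segmentation agree by construction. With real networks the terms would be non-zero and would be minimized alongside the CycleGAN and segmentation losses.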

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87199-4_19

SharedIt: https://rdcu.be/cyl3V

Link to the code repository

N/A

Link to the dataset(s)

N/A

Reviews

Review #1

Please describe the contribution of the paper

The paper describes a novel loss formulation for the unsupervised domain adaptation in the context of dense segmentation tasks. Specifically, the authors describe an intra- and cross-modality semantic consistency (ICMSC) loss to ensure that segmentation results are invariant to whether the original image (source domain), the reconstructed image (target domain) or the reconstructed image after a full cycle (source domain) is segmented. The authors employ two distinct segmentation networks (one for either domain) that can be trained end-to-end with the image translation network (CycleGAN).
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Results are convincing and improvements non-trivial. The method is evaluated on two datasets (MRI->CT, CT->MRI), with two relevant evaluation measures (Dice, ASD) and compared to relevant prior work (CyCADA, SIFA)
- Convincing ablation study confirms the benefit of each loss component
- Convincing quantitative but also qualitative results (Fig. 4)
- Method addresses an important research question, is intuitive and can be trained end to end
- The contribution is clearly described and visualised (Fig. 1), and the paper is generally well organized.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- It is insufficiently clear how the reference results were obtained, e.g. which methods were re-implemented and which were publicly available. It seems that, for SIFA at least, the results were taken from paper [10]?
- Given that results seem to be (at least partially) taken from the SIFA paper [10] it is unclear whether all reference methods were trained on the same images
- Lack of comparison to a relevant reference method [14] which was presented at MICCAI 2019.
- Lack of significance tests on results. If it is claimed that a method is significantly better, significance tests should be done for the best results shown in bold.
- Why are results on both datasets only reported for one ‘direction’? It would have been great to see summarized results for MRI->CT (on hip) and CT->MRI (on cardiac).
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The provided details in the supplementary material (parameters, network architecture, implementation …) are appropriate for reproducibility. However, reported results in the paper lack error bars.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
- Differences of the presented method to [10] and [14] should be further discussed
- It would have been great (and reassuring) to train the network for the CT->MRI task on the cardiac dataset to allow further comparison to SIFA and other state-of-the-art methods.
- The segmentation loss should be described more clearly (e.g. how are cross-entropy and Dice combined?)
- Please clarify why only 18 out of the 29 CT scans were used for training on the hip dataset.
- typo: CyADDA (p.7)
Please state your overall opinion of the paper

accept (8)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

see above. Main points: Intuitive and clear method that yields non-trivial, convincing performance improvements on two datasets and two performance measures.
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

3
Reviewer confidence

Confident but not absolutely certain

Review #2

Please describe the contribution of the paper

This paper introduces a new method for performing unsupervised domain adaptation which builds closely on previous works such as CyCADA. The resulting model uses at least five loss terms. The largest difference to CyCADA is the introduction of an intra-modality and a cross-modality consistency loss. The paper is validated using two separate medical imaging tasks: cardiac segmentation and hip joint segmentation. Results show that the methodological contribution leads to an improvement in both datasets, and an ablation study justifies the introduction of each loss.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The clarity and presentation of the paper are excellent, with a great methods figure that helps simplify a complex landscape of losses, generators, discriminators and segmentation networks.

There are results on two different datasets covering very different parts of the body, which is not often found in MICCAI papers.

There is an ablation study to justify the use of many loss functions and give the reader an understanding of the individual contributions of each one.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Training multi-loss models and comparing them is always tricky; unfortunately, the authors had to implement some of the competing methods when code was not available. A challenge in understanding whether there is a clear improvement over other methods is understanding to what extent the authors tried to make the competing methods work. The results for the vanilla CycleGAN on the cardiac dataset look qualitatively worse than results from other works such as “Translating and Segmenting Multimodal Medical Volumes with Cycle- and Shape-Consistency Generative Adversarial Network”, although it is hard to draw such comparisons without a quantitative measure.

I would have liked the authors to include a discussion around the stability of training. A network with 5 loss terms has many hyperparameters, so it would be good to understand how hard it was to find all the lambda values and the training schedule used. Some details are included in the supplementary material, but this section could be padded out with more detail.

A validation split is mentioned, but it is not clear whether the training was stopped based on performance on a validation set. This is a common pitfall of unsupervised domain adaptation algorithms, as they make use of target labels to train the models or find hyperparameters.
Please rate the clarity and organization of this paper

Excellent
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors used a private hip joint segmentation dataset but also included results using a publicly available cardiac segmentation dataset. The authors did not include code along with this submission, nor mention whether code would be provided at a future date. I hope that, if accepted, they do!
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

As mentioned in the weaknesses, it would be good to understand how stable and sensitive to hyperparameters the training is, so that researchers can adopt this method for their own data. Some of the competing methods might not have been trained optimally, but if this was very difficult then it can be mentioned. If it takes a larger amount of computational effort and time to find the right hyperparameters for a competing method, it is interesting for the reader to know.

I have no constructive comments on the overall presentation of tables, figures and text as it is excellent.
Please state your overall opinion of the paper

accept (8)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This paper provides a simple extension to a well established and understood method for unsupervised domain adaptation through cycle-consistency (CyCADA) and shows it works on two medical imaging datasets. The method is very well explained and thoroughly validated.
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

1
Reviewer confidence

Confident but not absolutely certain

Review #3

Please describe the contribution of the paper

The paper describes an unsupervised domain adaptation approach for a segmentation task. A CycleGAN (adaptation on image-level) is used and two new loss terms called IMSC and CMSC are proposed. The approach is evaluated on two different applications (cardiac and hip).
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper is well written and understandable. The approach was evaluated on two different applications. An unsupervised DA is proposed, which is clinically more relevant than a supervised solution.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- Lack of discussion of recent literature on UDA + segmentation, especially w.r.t the same dataset. This makes the approach not easily comparable to existing approaches.
- No crossvalidation was performed
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

One of the datasets used for evaluation is publicly available. The description of the evaluation is good; therefore I would rate the reproducibility as satisfactory. The authors should consider releasing their source code.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

For me, what is currently lacking in the manuscript is a discussion of papers which were published in the field of unsupervised domain adaptation (also together with the MMWHS dataset). There is a plethora of work on UDA with a segmentation task that goes beyond CycleGAN (e.g. Toldo et al., “Unsupervised Domain Adaptation in Semantic Segmentation: a Review”, provides a nice review). I cannot easily assess whether the proposed method/loss is actually novel. However, the method itself looks elegant.

Related recent papers that have been published on DA/UDA with the same dataset are not referenced and discussed:
- Bian et al., Uncertainty-aware domain alignment for anatomical structure segmentation, MedIA
- Li et al., Dual-Teacher++: Exploiting Intra-domain and Inter-domain Knowledge with Reliable Transfer for Cardiac Segmentation

No crossvalidation was performed.
Please state your overall opinion of the paper

borderline reject (5)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

There is already a lot of work published on UDA with a segmentation task. For me, the paper does not clearly compare to recent other works in this field and therefore it is a borderline paper. If the authors are able to present a good reasoning after revision, I would be willing to change the initial assessment. Furthermore, crossvalidation values should be presented to decrease test set selection bias.
What is the ranking of this paper in your review stack?

3
Number of papers in your stack

5
Reviewer confidence

Very confident

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

This paper received generally very good reviews, with reviewers commenting on the convincing results and the clarity of the manuscript. However, some reviewers comment negatively on the experimental setup (no cross-validation, no significance tests) and mention a lack of discussion of related works. In particular, R3 points out a large number of recent relevant works in the field of unsupervised domain adaptation which should be discussed or compared to.

Despite the shortcomings pointed out by R3, I follow the majority of the reviewers and suggest early acceptance. Please address the points raised by the reviewers (and in particular R3) as well as possible in the final submission.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

1

Author Feedback

We thank the AC and all reviewers for their comments.

R#1: Some of the reference results in Table 2 are taken from the SIFA paper [10]. Since our method was trained on the MMWHS dataset with the same official data split as in the SIFA paper, we directly used their reported performance for a fair comparison.

R#1: no experiment results for MRI->CT (on hip) and CT->MRI (on heart). In this conference paper, we only performed experiments in one direction of cross-modality on two different datasets, i.e., CT->MRI (on hip) and MRI->CT (on heart). The experiments for cross-modality in two directions (MRI->CT and CT->MRI) on both datasets (hip and heart) will be included in a future journal paper.

R#1, R#3: comparison with other SOTA methods. Our method compares favorably with several SOTA methods for unsupervised domain adaptation as presented in [3, 6, 10, 15]. The experimental results on two different datasets (hip and heart) demonstrated the effectiveness of our method.

R#1: why 18 out of the 29 CT scans were used for training on the hip dataset. On the hip dataset, we used a data split of 60% for training and the remaining 40% for blind testing.

R#1: no significance tests of the results. In the data split of the SIFA paper [10], 20% of the dataset (4 CT scans) was used as unseen test data on the heart dataset. The number of test cases is too small to perform meaningful calculation of significance statistics. But the remarkable improvement of the average Dice value (8%) shows that our method is much better than other SOTA methods.
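
The sample-size argument above can be made concrete with a short worked example. It assumes, purely for illustration, that a two-sided exact Wilcoxon signed-rank test on per-scan paired scores would be used. Neither the reviewers nor the authors name a specific test, and the function `min_two_sided_p` is a hypothetical helper. The sketch enumerates every sign pattern of the ranked differences and reports the smallest p-value that n paired cases can yield.

```python
from itertools import product

def min_two_sided_p(n):
    """Smallest two-sided p-value of an exact Wilcoxon signed-rank test
    with n paired cases, assuming no ties and no zero differences."""
    ranks = range(1, n + 1)
    # Under the null hypothesis every sign pattern is equally likely.
    sums = [
        sum(r for r, positive in zip(ranks, signs) if positive)
        for signs in product([0, 1], repeat=n)
    ]
    best = max(sums)  # most extreme outcome: all differences positive
    p_one_sided = sum(1 for w in sums if w >= best) / len(sums)
    return min(1.0, 2 * p_one_sided)

print(min_two_sided_p(4))  # 0.125
print(min_two_sided_p(6))  # 0.03125
```

Under these assumptions, four test scans cannot reach the conventional 0.05 level even if one method wins on every scan, while six paired cases would be the minimum for which this is possible.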

R#1: the definition of the combined cross-entropy and Dice loss. We have not given the specific equation because the combined cross-entropy and Dice loss is a very commonly used loss function. However, we have already referred the readers to paper [16], where the specific equation for the combined loss can be found.
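
For readers without access to [16], one common form of such a combined loss is sketched below. This is an assumption, not the formulation confirmed by the paper. The unweighted sum, the class-averaged soft Dice, and the names `cross_entropy`, `soft_dice_loss`, `combined_loss`, `w_ce`, and `w_dice` are all illustrative choices.

```python
import numpy as np

def cross_entropy(probs, onehot, eps=1e-7):
    """Pixel-averaged cross-entropy; class axis is axis 0."""
    log_p = np.log(np.clip(probs, eps, 1.0))
    return float(-np.mean(np.sum(onehot * log_p, axis=0)))

def soft_dice_loss(probs, onehot, eps=1e-7):
    """One minus the soft Dice coefficient, averaged over classes."""
    axes = tuple(range(1, probs.ndim))  # all spatial axes
    inter = np.sum(probs * onehot, axis=axes)
    denom = np.sum(probs, axis=axes) + np.sum(onehot, axis=axes)
    dice = (2.0 * inter + eps) / (denom + eps)
    return float(1.0 - dice.mean())

def combined_loss(probs, onehot, w_ce=1.0, w_dice=1.0):
    """Weighted sum of the two terms (equal weights are an assumption)."""
    return w_ce * cross_entropy(probs, onehot) + w_dice * soft_dice_loss(probs, onehot)

# Tiny two-class example on a 2x2 label map.
labels = np.array([[0, 1], [1, 1]])
onehot = np.stack([labels == 0, labels == 1]).astype(float)
uniform = np.full_like(onehot, 0.5)    # maximally uncertain prediction

print(combined_loss(onehot, onehot))   # perfect prediction, loss near zero
print(combined_loss(uniform, onehot))  # uncertain prediction, larger loss
```

The cross-entropy term drives per-pixel accuracy, while the Dice term is less sensitive to class imbalance, which is the usual motivation for combining them.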

R#1: Typo: CyADDA (p.7). We will correct it in the revision.

R#2: the performance of the vanilla CycleGAN on the heart dataset. We directly used the official code of the original CycleGAN and followed its training strategy. We did our best to train the model and reported the best performance that we could achieve.

R#2: sensitivity to the different lambda values in the loss terms. We did not do much fine-tuning of the lambda values for the loss terms. We used the default values given in the supplementary material so that each weighted loss term is in a similar range, and we found experimentally that the training was quite stable.

R#2: when to stop training and target labels used during training. We trained our model for a total of 120 epochs as illustrated in the supplementary material, and the target label was never seen during the training phase.

R#3: no cross-validation study was conducted. We will perform a k-fold cross-validation study in a future journal paper. For the cardiac MMWHS dataset, we used the same data split as in the SIFA paper [10] to have a fair comparison with other methods. The experimental results on the hip dataset also showed the efficacy of our method.

R#3: lack of references and discussion for current literature. We will add the references below in the revision, and we will have more discussion in the future journal paper because the space in this conference paper is limited.
- “Unsupervised domain adaptation in semantic segmentation: a survey.”
- “Bian et al, Uncertainty-aware domain alignment for anatomical structure segmentation, MedIA”
- “Li et al, Dual-Teacher++: Exploiting Intra-domain and Inter-domain Knowledge with Reliable Transfer for Cardiac Segmentation, TMI”

Semantic Consistent Unsupervised Domain Adaptation for Cross-modality Medical Image Segmentation