
Authors

Yunlong Zhang, Chenxin Li, Xin Lin, Liyan Sun, Yihong Zhuang, Yue Huang, Xinghao Ding, Xiaoqing Liu, Yizhou Yu

Abstract

This paper investigates the problem of pseudo-healthy synthesis, defined as synthesizing a subject-specific pathology-free image from a pathological one. Recent approaches based on Generative Adversarial Networks (GANs) have been developed for this task. However, these methods inevitably face a trade-off between preserving the subject-specific identity and generating healthy-looking appearances. To overcome this challenge, we propose a novel adversarial training regime, Generator versus Segmentor (GVS), that alleviates this trade-off with a divide-and-conquer strategy. We further address the deteriorating generalization performance of the segmentor throughout training by developing a pixel-wise weighted loss that mutes well-transformed pixels. Moreover, we propose a new metric to measure how healthy the synthetic images look. Qualitative and quantitative experiments on the public BraTS dataset demonstrate that the proposed method outperforms existing methods, and we further verify its effectiveness on the LiTS dataset. Our implementation and pre-trained networks are publicly available at https://github.com/Au3C2/Generator-Versus-Segmentor.
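
To make the training regime above concrete, the following is a minimal PyTorch-style sketch of the alternating Generator-versus-Segmentor objective. It is not the authors' released implementation (see the repository linked above for that); the generator G, segmentor S, the shape of the lesion masks, and the particular weighting used to mute well-transformed pixels are illustrative assumptions.

    # Minimal sketch of the Generator-versus-Segmentor alternation described
    # in the abstract. NOT the authors' released code; `G`, `S`, `x` (a batch
    # of pathological images), `mask` (float binary lesion masks shaped like
    # the segmentor output), and the weighting below are all assumptions.
    import torch
    import torch.nn.functional as F

    def segmentor_step(G, S, x, mask, opt_S):
        """S learns to detect residual lesion evidence in synthetic images,
        acting as a learned 'healthiness' critic."""
        with torch.no_grad():
            x_hat = G(x)                       # pseudo-healthy synthesis
        logits = S(x_hat)                      # (B, 1, H, W)
        prob = torch.sigmoid(logits)
        # Mute well-transformed pixels: where the prediction already disagrees
        # with the lesion label (the pixel looks healthy), down-weight the
        # loss so the segmentor does not overfit to those pixels.
        weight = 1.0 - (prob - mask).abs().detach()
        bce = F.binary_cross_entropy_with_logits(logits, mask, reduction='none')
        loss = (weight * bce).mean()
        opt_S.zero_grad(); loss.backward(); opt_S.step()

    def generator_step(G, S, x, mask, opt_G):
        """G is pushed to make S predict 'no lesion anywhere' while
        reconstructing normal regions faithfully (identity preservation)."""
        x_hat = G(x)
        logits = S(x_hat)
        adv = F.binary_cross_entropy_with_logits(logits, torch.zeros_like(mask))
        rec = (F.l1_loss(x_hat, x, reduction='none') * (1.0 - mask)).mean()
        loss = adv + rec                       # term weighting is arbitrary here
        opt_G.zero_grad(); loss.backward(); opt_G.step()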

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87231-1_15

SharedIt: https://rdcu.be/cyhUG

Link to the code repository

https://github.com/Au3C2/Generator-Versus-Segmentor

Link to the dataset(s)

https://www.med.upenn.edu/cbica/brats-2019/

https://competitions.codalab.org/competitions/17094


Reviews

Review #1

  • Please describe the contribution of the paper

A divide-and-conquer strategy is employed to address the trade-off between preserving the subject-specific identity and generating healthy-looking images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The idea itself is very interesting.
    2. The experimental results look promising, and the comparisons with state-of-the-art methods demonstrate the accuracy of the method.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The setup is reminiscent of a Siamese Network, though not exactly the same. I would expect the authors to compare their design with it.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    If the code is made publicly accessible upon acceptance, that will be very valuable for the community.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Experiments on a larger test set would strengthen the work when more data become available in the future.

  • Please state your overall opinion of the paper

    strong accept (9)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I particularly like the main idea, and for a conference paper, the experiments are solid.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    1

  • Reviewer confidence

    Confident but not absolutely certain



Review #2

  • Please describe the contribution of the paper

    The paper proposes a generator-versus-segmentor method for synthesizing pseudo-healthy images. The authors also introduce a new loss function based on pixel-wise weighted cross-entropy and a new image evaluation metric, S_dice, based on network convergence time.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main contribution of this paper is a new method for pseudo-healthy image generation. The overall framework appears to be novel and is demonstrated on two publicly available datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I have the following comments for the authors to consider:

    • One major concern is the segmentation accuracy of GVS. Visually, PHS-GAN appears to segment better than GVS; the authors should report segmentation accuracy in their results.
    • The newly introduced metric S_dice is not validated. Since the authors claim this metric as part of their contribution, there should be a validation experiment for it.
    • The generator G and segmentor S are only defined at a high level; no details are provided, which hampers reproducibility.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility is relatively poor due to the level of model details provided and missing pre-trained models and/or code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    I recommend the authors address my comments under point 4 above.

  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is interesting work, and the main limitation is the experimental validation of its claims. I recommend the authors strengthen the paper through improved experimental validation.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    2

  • Reviewer confidence

    Somewhat confident



Review #3

  • Please describe the contribution of the paper

    The authors designed a GAN-based model for pseudo-healthy synthesis, targeting both healthiness and subject identity. They verified their method on two public datasets, BraTS and LiTS.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. Fig. 1 illustrates the method clearly.
    2. Two public datasets, LiTS and BraTS, are used; however, I could find only visual results on LiTS, without comparison.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1. The compared methods, [3] and [12], are irrelevant to the context.
    2. The proposed model is trained with segmentation labels, whereas [3] and [12] are trained on input volumes only. The comparison with those methods is therefore unfair, since label data are provided to the proposed model during training, and annotated labels are not always available in the medical context.
    3. It is unclear whether the model is based on a 2D or 3D U-Net. I suspect the proposed model is 2D, yet the inputs and lesions are 3D; using a 2D model for 3D inputs is questionable.
    4. The proposed model has been applied to LiTS (CT volumes) and BraTS (MRI). It would be better to mention MRI or CT somewhere in the Abstract or Introduction to clarify the applicability of the method.
    5. The model details are not described throughout the paper.
    6. The use of the LiTS dataset alongside BraTS is not motivated. Why have the authors used these two datasets, and what is the difference between them?
    7. Comparison results are only provided on the BraTS dataset. What about LiTS?
    8. Providing segmentation maps during training means that healthy tissues are known to the model. The comparison with VA-GAN is not only unfair; VA-GAN was also adopted from a computer vision application.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    NA

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    1. The comparison results are not satisfactory. There are many MRI and CT synthesis methods based on the BraTS dataset; the proposed method should be compared with relevant ones.
    2. More details about the method would be appreciated.
    3. Consider a 3D model for 3D volumes and a 2D model for 2D inputs.
    4. The choice of datasets should be justified: why have the authors used LiTS and BraTS?

  • Please state your overall opinion of the paper

    reject (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    1. The experimental results are not sufficient and cannot justify the paper.
    2. Comparison with [3] and [12] would not be fair.
    3. Although figures can clarify the experimental results, quantitative results demonstrate the importance of the work more convincingly; the quantitative results and proper comparisons are limited.

  • What is the ranking of this paper in your review stack?

    4

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Please see strengths and weaknesses of the paper summarized below. Please try your best to address the items under weaknesses and answer reviewer questions in your rebuttal.

    Strengths:

    • Both reviewer #1 and reviewer #2 commented on the novelty and promising results of this paper.
    • The paper proposes a new generator-versus-segmentor method for pseudo-healthy image synthesis. The authors also introduce a new loss function based on pixel-wise weighted cross-entropy and a new image evaluation metric, S_dice, based on network convergence time.
    • Experimental results on two publicly available datasets demonstrate the potential of the proposed method and comparisons with state-of-the-art methods demonstrate the performance of the method.

    Weaknesses:

    • Please clarify the difference between the proposed design and the Siamese Network.
    • Visually, it seems PHS-GAN segments better than GVS. The authors should report the segmentation accuracy of GVS in their results.
    • The newly introduced metric S_dice is not validated.
    • Reproducibility is relatively poor due to the level of model details provided and missing pre-trained models and/or code.
  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    9




Author Feedback

Dear Area Chair,

We have carefully read the reviewers’ comments and your summary. Below, we offer explanations and supplementary results for the weaknesses listed in the meta-review. For the other comments made by the reviewers, we will make substantial efforts to accommodate them all in the revision.

W1: Please clarify the difference between the proposed design and the Siamese Network. A1: The GVS is an adversarial training framework between a generator and a segmentor, whereas the Siamese Network contains two neural networks with the same weights. They share a similar idea: replacing a hand-engineered loss with a CNN-based one. The GVS uses the segmentor to measure the differences between pathological and normal pixels, and the Siamese Network uses a twin neural network to compute the similarity between two inputs. Nevertheless, the GVS is quite different from the Siamese Network. First, they have different tasks, architectures, and optimization. Second, whereas the Siamese Network measures image-level similarity, our segmentor extends this to pixel-level similarity.

W2: Visually, it seems PHS-GAN segments better than GVS. The authors should report the segmentation accuracy of GVS in their results. A2: The reviewer may have drawn this conclusion from the difference maps (PHS-GAN vs. GVS: the 8th column vs. the 10th column) in Fig. 3 of the paper. We first stress that these maps are mainly used to compare the reconstruction quality of normal regions. The quantitative (iD in Table 1) and subjective (the 8th vs. the 10th column in Fig. 3) results show that our GVS achieves better reconstruction quality. Then, as the reviewer noted, difference maps can also be used to partition normal and lesion regions. Visually, compared to PHS-GAN, GVS achieves comparable performance in lesion regions and superior performance outside them. Furthermore, we use the AUPR metric to compare segmentation performance based on difference maps. The results are VA-GAN vs. ANT-GAN vs. PHS-GAN vs. GVS: 0.211 vs. 0.579 vs. 0.581 vs. 0.652, which indicates that our GVS significantly outperforms the existing methods. The related results and analysis will be added to Sec. 3.3 in the revision.
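
For concreteness, the following is a minimal sketch of how such a difference-map AUPR can be computed, assuming NumPy arrays and scikit-learn; the function and array names are illustrative, not taken from the paper.

    # Minimal sketch: pixel-level AUPR of lesion localization from a
    # difference map. Array and function names are illustrative assumptions.
    import numpy as np
    from sklearn.metrics import average_precision_score

    def difference_map_aupr(pathological, pseudo_healthy, lesion_mask):
        """Score each pixel by |input - synthetic| and compute the area
        under the precision-recall curve against the binary lesion mask."""
        diff = np.abs(pathological - pseudo_healthy)
        # Average precision over flattened pixels equals the AUPR here.
        return average_precision_score(lesion_mask.ravel().astype(int),
                                       diff.ravel())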

W3: The newly introduced metric S_dice is not validated. There should be a validation experiment for this. A3: The S_dice is based on the convergence speed of networks, which varies from trial to trial due to network initialization and optimization. Hence, its stability must be considered when it is used as a metric. Likewise, its ability to correctly assess the healthiness of synthetic images needs validation. We therefore validate the S_dice experimentally from these two aspects. Concretely, we calculate the S_dice of GVS(10%), GVS(40%), GVS(70%), GVS(100%), and the original images, where GVS(10%) denotes the synthetic images generated by the GVS trained on 10% of the training data. Generally, GVS(a) produces healthier-looking images than GVS(b) when a > b. The results for the original images, GVS(10%), GVS(40%), GVS(70%), and GVS(100%) are 26.23+/-0.51, 24.55+/-0.47, 23.07+/-0.55, 22.07+/-0.6, and 21.75+/-0.42, respectively. The standard deviations, calculated over 3 trials, are all less than 0.6, which suggests the high stability of S_dice. Meanwhile, the S_dice correctly measures the healthiness of pseudo-healthy images: the model trained on more data synthesizes images with a lower S_dice, and a lower S_dice denotes a healthier appearance. The related results and analysis will be added to Sec. 3.2 in the revision.
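
As a concrete illustration of the stability protocol described above (several trials per training-data fraction, summarized as mean +/- std), a small sketch follows; compute_s_dice is a hypothetical stand-in for the paper's metric, and all names and parameters are assumptions.

    # Sketch of the stability check reported above: for each fraction of
    # training data, run several trials and summarize S_dice as mean +/- std.
    # `compute_s_dice` is a hypothetical stand-in for the paper's metric.
    import numpy as np

    def s_dice_stability(compute_s_dice, fractions=(0.1, 0.4, 0.7, 1.0),
                         n_trials=3):
        summary = {}
        for frac in fractions:
            scores = [compute_s_dice(train_fraction=frac, seed=t)
                      for t in range(n_trials)]
            summary[frac] = (float(np.mean(scores)), float(np.std(scores)))
        return summary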

W4: Reproducibility is relatively poor due to the level of model details provided and missing pre-trained models and/or code. A4: To verify the reproducibility of the method, we provide an anonymous code link (https://anonymous.4open.science/r/GVS-204B/README.md), which contains the training code, a pre-trained model, and the synthetic results of one volume.

Best regards, Paper556 Authors




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors did not respond to R3’s concerns regarding comparison to other methods, especially [3] and [12]. Other concerns raised by reviewers were mostly addressed in the rebuttal. I recommend acceptance of the paper. Authors should try to address concerns in the final version if accepted.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    10



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    As all reviewers and the meta-reviewer pointed out, the paper presents a novel method for generating pseudo-healthy images, as far as this meta-reviewer can see. The proposed losses are well justified and explained. The new validation metric is quite interesting as well. R1 and R2 raise concerns regarding the evaluation and comparisons, which the authors address well in the rebuttal; the discussions presented there will certainly improve the quality of the article. R3’s main concern is that the compared methods are not directly relevant, as they were not trained with pixel-wise annotations. Unfortunately, the authors did not address this point in their rebuttal. This meta-reviewer partly agrees that those methods use different inputs. However, they are proposed as generic methods and can be applied to these problems as well. Therefore, I do not see a problem with the comparison.

    Overall, I think this is a solid contribution.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    2



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The concept of the paper is interesting and could yield a good discussion in the MICCAI context. However, despite the authors’ efforts to explain and justify their choices, aspects of the experimental design remain questionable after the rebuttal, especially regarding possible comparative baseline alternatives and the way they are trained.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    13


