
Authors

Rongguang Wang, Pratik Chaudhari, Christos Davatzikos

Abstract

Heterogeneity in medical data, e.g., from data collected at different sites and with different protocols in a clinical study, is a fundamental hurdle for accurate prediction using machine learning models, as such models often fail to generalize well. This paper leverages a recently proposed normalizing-flow-based method to perform counterfactual inference upon a structural causal model (SCM), in order to achieve harmonization of such data. A causal model is used to model observed effects (brain magnetic resonance imaging data) that result from known confounders (site, gender and age) and exogenous noise variables. Our formulation exploits the bijection induced by flow for the purpose of harmonization. We infer the posterior of exogenous variables, intervene on observations, and draw samples from the resultant SCM to obtain counterfactuals. This approach is evaluated extensively on multiple, large, real-world medical datasets and displayed better cross-domain generalization compared to state-of-the-art algorithms. Further experiments that evaluate the quality of confounder-independent data generated by our model using regression and classification tasks are provided.
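The three counterfactual steps named in the abstract (infer the exogenous noise, intervene, sample from the modified SCM) can be illustrated with a deliberately tiny sketch. It assumes a single scalar feature and a hand-set conditional affine map in place of a learned normalizing flow; `SITE_PARAMS`, `AGE_SLOPE` and all numbers are hypothetical and do not come from the paper.

```python
# Toy stand-in for a conditional flow: x = shift(site) + slope * age + scale(site) * eps.
# All parameter values below are illustrative assumptions, not values from the paper.
SITE_PARAMS = {"A": (0.0, 1.0), "B": (0.5, 1.2)}  # site -> (shift, scale)
AGE_SLOPE = -0.02  # assumed linear effect of age on the feature


def forward(eps, site, age):
    """Structural assignment x = f(eps; site, age); invertible in eps."""
    shift, scale = SITE_PARAMS[site]
    return shift + AGE_SLOPE * age + scale * eps


def inverse(x, site, age):
    """Recover the exogenous noise that explains the observation x."""
    shift, scale = SITE_PARAMS[site]
    return (x - shift - AGE_SLOPE * age) / scale


def counterfactual(x, site, age, new_site):
    eps = inverse(x, site, age)         # 1. abduction: infer the noise
    return forward(eps, new_site, age)  # 2. action (swap site) + 3. prediction


x_obs = 1.3
x_cf = counterfactual(x_obs, site="B", age=70.0, new_site="A")
```

Because the map is a bijection in the noise, asking for the counterfactual at the original site returns the observation unchanged, and the subject-specific noise is carried over intact when the site is swapped.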

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87199-4_17

SharedIt: https://rdcu.be/cyl3T

Link to the code repository

N/A

Link to the dataset(s)

N/A

Reviews

Review #1

Please describe the contribution of the paper

This work addresses an important problem of data harmonization of multi-site neuroimaging data. The relationship between the covariates (known and unknown) and the neuroimaging measures is structured as a structural causal model (SCM) which is learned using a normalizing flow model. The harmonization results are evaluated by comparing the site-wise hippocampal volume distributions and out-of-distribution prediction tasks.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The SCM framework is interesting and nicely motivated.
2. The work is overall well written.
3. Tackles an important issue of data harmonization.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. The combination of SCM and a normalizing flow seems to enable a conditional invertible formulation where the “conditions” are the confounders to control for. A relevant work which directly enables a conditional invertible neural network in a normalizing flow manner exists: Ardizzone, Lynton, et al. “Analyzing inverse problems with invertible neural networks.” arXiv preprint arXiv:1808.04730 (2018). This may achieve the conditional generation of the measures while controlling for the confounders.
2. The harmonization evaluations are weak. The hippocampal volume figure shows the minimal impact of the proposed work. What exactly are the “unknown” confounding factors, and how should they be preserved? What is the expected harmonization outcome? With many moving targets, it becomes hairy to properly assess harmonization.
3. The MAE results are well within the margin of error given the provided standard deviations. Even without considering the std, the MAE improvements of the best SCM setup over the other baselines seem insignificant overall. Similar issues arise in AD classification.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Sufficient information is provided in text but no code is provided.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

My comments are in the weaknesses section. I would greatly appreciate it if the authors responded to my questions.
Please state your overall opinion of the paper

probably reject (4)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The proposed work nicely motivates the approach and the problem, but a potential alternative is of concern. More importantly, it is difficult to assess the harmonization results based on the provided evaluations.
What is the ranking of this paper in your review stack?

3
Number of papers in your stack

5
Reviewer confidence

Very confident

Review #2

Please describe the contribution of the paper

The paper introduces a novel method to address the challenge of generalization in a scenario where there are differences between target and source data. This is an intriguing paper that leverages the recently introduced flow-based method and expands its usage for performing counterfactual inference. The generic approach is applied to the case of inferring age and Alzheimer’s Disease from MRI data. The selected use case fits the generic approach well and enables the creation of a meaningful structural causal model and testing on two different tasks, regression and classification.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper addresses the important challenge of generalisation of deep learning algorithms that are trained on homogeneous medical data (e.g. images coming from one medical center using devices from a single manufacturer). It leverages a wide set of open data of MRI images to estimate performance. Empirical results look good.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

I am aware that in the past years many algorithms have been used for transfer learning tasks in general, and in the case of brain MRI in particular. I do not feel comfortable with the selection of the specific two algorithms for comparison – IRM and ComBat. I suspect there could have been others that would have performed better. A very recent paper reviewed all the studies on this topic (see “Transfer Learning in Magnetic Resonance Brain Imaging: a Systematic Review” at https://arxiv.org/pdf/2102.01530.pdf) where 31 papers are reported to perform studies on the task of AD classification and 8 are reported to perform studies on the task of age regression. Given the high variance reported in Tables 1 and 2 it is not clear whether the differences in performance between the algorithms are actually noise or not. Statistical testing for the significance of the results is missing.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The data is freely available and the code will be made available
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
Please note that there are many typos in the paper, below are examples
1. The first sentence of the abstract has a typo: “… as machine as such models often fail to generalize well.”
2. Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) are mentioned in the text (within the related work) without providing the full name before switching to the initials. I suggest fixing that.
3. The following sentence seems to have an internal contradiction: “Although these methods perform well empirically, the GANs suffer from convergence issues, especially for 3D images.”; It is not clear whether the GANs perform well empirically or not.
4. The following sentence (in Methods) is unclear: “The second, interventions asks questions like “what happens if we do . . . “.”
5. In section 4.1 a word is missing in the sentence “…shown in Appendix”
6. In section 4.1 there is a typo “…and ComBat [14,27] on on two tasks…”
7. In the supplementary material, Table 3, I trust the letter ‘k’ in the table is a typo
Please state your overall opinion of the paper

accept (8)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

See above comments
What is the ranking of this paper in your review stack?

4
Number of papers in your stack

5
Reviewer confidence

Confident but not absolutely certain

Review #3

Please describe the contribution of the paper

This paper proposes a flow-based structural causal model (SCM) that yields a more robust model by performing counterfactual inference. The method is tested using 3D MRI data in density estimation, age prediction, and Alzheimer classification tasks.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The author proposed a novel normalizing-flow-based method to harmonize data between the source domain and target domain and showed better cross-domain generalization results.
- Method is tested on both regression and classification tasks.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- Although the proposed method achieves the best results in the target domain, the results in the source domain seem to be sacrificed.
- How to choose from different normalizing flows?
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

It should be reproducible.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

The method seems to work well. Two questions: How to select from different normalizing flows? How to maintain the high accuracy in the source domain?
Please state your overall opinion of the paper

Probably accept (7)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Method is novel, experiment results are good and compared with strong baselines.
What is the ranking of this paper in your review stack?

3
Number of papers in your stack

5
Reviewer confidence

Not Confident

Review #4

Please describe the contribution of the paper

In this work, the authors propose to harmonize brain MRI scans by answering the counterfactual question: “how would this scan look were it to come from another scanner?” To support this question, the authors model the acquisition as a structural causal model (SCM), then fit a normalizing flow to enable sampling from the distribution of scans. The methodology is evaluated on age prediction and AD classification tasks with iSTAGING and ADNI datasets.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The primary novelty of this work is the use of a normalizing flow to easily sample from a distribution defined by a SCM. Further, the quantitative results show improvement over the single comparative method w.r.t. both age prediction error after harmonization as well as AD classification.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

This work has several weaknesses.
1. Most prominently, the goal of the work is to harmonize scans such that they come from the same site. However, no qualitative images are shown (even in the supplementary material).
2. There is missing information about the design and implementation of the flows. Since the flows are parameterized by a neural network, it is vital to include at least some details of the network designs.
3. There are details missing regarding the counterfactual queries. The dimensionality of the data is not described but can be assumed to be 2D slices or 3D volumes; this implies that the exogenous x noise has the same distribution due to the bijection demanded of the flow. During abduction, how are samples drawn? It is not stated, but can be assumed that every element of x must be sampled independently from \epsilon_x and sent through the flow. Is a slice/volume generated in this way a cohesive, sensible volume? Again, this question would be answered by the inclusion of an image or some statement describing the consistency across pixels / slices.
4. There is a lack of statistical tests to judge whether the method is significantly better than the baseline. Standard errors are reported, but this is insufficient since the proposed method appears to have more variance than the baseline.
5. Some details are missing regarding the data split during training/validation.
6. There are several other disentangling harmonization techniques which can be compared to as a baseline (a simple search on Google Scholar turns up many recent results). It would be important to compare to those methods as a baseline, where images are harmonized and subsequently predicted upon for both tasks.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

This work is difficult to reproduce. There are not enough implementation details to replicate the modeling process, particularly the flow specification. Further, there is no description of the training/validation/testing split used in any experiments.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

In general, the idea presented has good foundations and is promising. However, shortcomings in presentation, a lack of many methodological details, and a lack of rigorous analysis are what mainly hold this paper back. Most importantly, some images should be shown (for an imaging conference!) to demonstrate the qualitative effect of harmonization.
Please state your overall opinion of the paper

reject (3)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The major flaws listed above led to the rejection decision. In sum, there are too many missing methodological details and analysis to recommend an accept.
What is the ranking of this paper in your review stack?

4
Number of papers in your stack

7
Reviewer confidence

Very confident

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.
Summary: Addresses the important problem of data harmonization of multi-site neuroimaging data. The relationship between covariates (known and unknown) and the neuroimaging measures is organised as a structural causal model (SCM), which is learned using normalizing flow. The method is tested using 3D MRI data in density estimation, age prediction, and Alzheimer classification tasks.

Positives:
- Tackles the important issue of data harmonization.
- The SCM framework is interesting and well motivated.
- Used a wide set of open MRI data to estimate performance.
- Clearly written and should be reproducible.
Negatives:
- Other baseline methods may have been more appropriate.
- Missing information about the design and implementation of the flows, as well as many other aspects of the method and data, such as how data were split into training/validation/testing. This could be addressed in a revision.
- Statistical tests may be needed to assess whether the proposed method offers real improvements. This could be addressed in a revision.
- No images shown to allow visual assessment of degree of variability across sites. This could be addressed in a revision.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

9

Author Feedback

We thank the reviewers and the meta-reviewer for their feedback. The reviewers agree that our work tackles an important problem (R2, R3), is a novel/interesting approach (R2, R3, R4, R5) with good empirical performance and diverse evaluation metrics (R3, R4). We next address concerns raised by the reviewers.

Show images to assess variability. We use ROI features (see §4.1), not images. Therefore some of R5’s concerns on how the flow is used to draw samples do not apply. A partial understanding of the variability of raw features can be obtained from Fig 2. We will add histograms of features in the supplementary material.

Other baseline methods. We have been diligent in picking baselines. We selected algorithms that can work without access to target labels. Doing so is the central point of the paper (see §1). IRM and ComBat (https://doi.org/10.1016/j.neuroimage.2019.116450) are widely used for harmonization. TarOnly and SrcOnly are two more baselines that help comparison. R3: the methods in the review paper on transfer learning use non-standard datasets/inclusion criteria, so numerical performance is difficult to compare with them directly. R5: can you please be more specific about what you would like us to compare to? We are not aware of any disentanglement-based harmonization method that has been evaluated on ROI features.

We implemented two more baselines: ComBat++ (https://doi.org/10.1016/j.media.2020.101879) and CovBat (https://doi.org/10.1101/858415); these will be added to Tables 1 and 2. Our method outperforms ComBat++ by 15% (p<1e-86) and CovBat by 4% (p<1e-69), on average, for AD classification. Our method is marginally better (0.2–0.5 in MAE, p<0.38) for age regression than both these methods.

Missing details. We will add a table that describes the architecture and hyper-parameters to the Appendix. In brief, we predict logits for sex and site and fit a normalizing flow for other structural assignments. Standard Gaussian is used as the density for all exogenous noise. A linear flow and a conditional flow (conditioned on activations of a fully-connected network that takes age, sex and scanner ID as input) are used as structural assignments for age and ROI features respectively. 145 ROI features are used (see §4.1). In the abduction step, samples come from training data, not simulation. We infer exogenous noise using the flow (see §3.1). As the caption of Tables 1 and 2 says, we did 5-fold cross-validation (80% train, 20% val).
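The combination of a standard Gaussian base density with an invertible structural assignment gives an exact likelihood through the change-of-variables formula. The sketch below shows this for a scalar conditional affine flow only; the rebuttal does not give the architecture in this detail, so `conditioner`, its made-up coefficients, and the numeric encoding of sex and site are all assumptions standing in for the fully-connected conditioning network.

```python
import math


def conditioner(age, sex, site):
    """Hypothetical stand-in for the conditioning network.

    Maps confounders to (shift, log_scale). Sex and site are assumed to be
    numerically encoded, and the coefficients are made up for illustration.
    """
    shift = 0.01 * age + 0.2 * sex + 0.3 * site
    log_scale = 0.1 * site
    return shift, log_scale


def standard_normal_logpdf(eps):
    """Log-density of the standard Gaussian base distribution."""
    return -0.5 * (eps * eps + math.log(2.0 * math.pi))


def conditional_affine_logpdf(x, age, sex, site):
    """log p(x | age, sex, site) for x = shift + exp(log_scale) * eps."""
    shift, log_scale = conditioner(age, sex, site)
    eps = (x - shift) * math.exp(-log_scale)        # inverse pass recovers the noise
    return standard_normal_logpdf(eps) - log_scale  # subtract log|dx/d eps|


def negative_log_likelihood(rows):
    """Mean NLL over (x, age, sex, site) rows; the quantity a fit would minimize."""
    return -sum(conditional_affine_logpdf(*row) for row in rows) / len(rows)
```

The same inverse pass that evaluates the likelihood is what recovers the exogenous noise for a training sample, which is why abduction needs no separate sampling procedure in this toy setting.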

Statistical tests to assess improvements. The following p-values will be added to Tables 1 and 2. The hypothesis that our Q-Spline based harmonization achieves a better accuracy than IRM, ComBat Linear and ComBat GAM for both ADNI-1 and ADNI-2 can be accepted with p<1e-5. Age prediction has higher p-values (0.06–0.41, UKBB has highest). However, pragmatically speaking, age is an easy-to-observe feature, so our results on age regression should be interpreted as sanity checks. We will modify the narrative to say so.
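The rebuttal does not state which test produced these p-values. As one possible way to check whether a difference between two methods exceeds fold-to-fold noise, the sketch below runs an exact paired sign-flip permutation test on per-fold scores. The function name and the accuracies are made up for illustration and are not results from the paper.

```python
from itertools import product


def paired_sign_flip_pvalue(scores_a, scores_b):
    """One-sided exact p-value for the hypothesis mean(scores_a - scores_b) > 0.

    Under the null, each paired difference is equally likely to have either
    sign, so every sign pattern is enumerated and compared with the observed sum.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs)
    count = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = sum(s * d for s, d in zip(signs, diffs))
        if flipped >= observed - 1e-12:  # small tolerance for float round-off
            count += 1
    return count / total


# Made-up per-fold accuracies for two methods (illustration only).
ours = [0.80, 0.82, 0.79, 0.81, 0.83]
baseline = [0.75, 0.78, 0.77, 0.76, 0.79]
p_value = paired_sign_flip_pvalue(ours, baseline)
```

With only five folds this test cannot return a p-value below 1/32, so a per-fold comparison of this kind is a coarse check; tests over individual test subjects have far more resolution.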

R2: invertible networks are not novel. Our contribution is a method to perform counterfactual queries for harmonization, not to build an invertible model. We use existing methods to build a conditional normalizing flow. We will cite the paper on invertible networks that the reviewer mentions.

R2: Hippocampus volume figure shows minimal impact. The reviewer misunderstands the figure. Fig 2 shows that the distribution of hippocampus volume is unchanged (compared to raw) across sites using our method. We preserve the unknown confounders (subject-specific information due to biological variability, such as race, genetics, and pathology AD/CN) by formulating them as exogenous noise in the SCM. In contrast, ComBat-Linear removes these useful confounders, which is detrimental to accuracy (Tables 1 and 2).

R4: accuracy in source domain is smaller. No, the age MAE of Q-Spline is better than source in all cases; the same holds for AD except for ADNI-1, where it is marginally smaller but has a much higher target accuracy.

Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The authors have addressed the concerns of R2, including two apparent misunderstandings of the reviewer (objective of the work and contents of Fig 2). A statement made by R2 about the accuracy of the method in the source domain has mostly been refuted. Other reviewers had less significant concerns. However, the work concerns harmonising features extracted from images, rather than harmonising the images themselves (R3 did not realise this), so may not fit squarely within the MICCAI scope.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Reject
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

16

Meta-review #2

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

As summarised in the meta-review, the topic addressed by this paper was found relevant, and the contribution interesting and well motivated. Major remarks were on the statistical significance of the results, and the use of appropriate baselines for the benchmark. The paper was also found to lack clarity on the implementation details. The rebuttal focused on the justification of the chosen baselines and included two additional methods. More details on the architecture and parameters were given, and a more detailed discussion of the statistical results and figures was provided.

While it is true that the statistical improvement for the age prediction task is not striking, the proposed formulation and methodology appear interesting and well motivated. There was some misunderstanding in the reviews that was clearly addressed in the rebuttal. Overall, provided that the clarifications are included in the final version of the manuscript, the paper could make a good contribution to the conference.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

6

Meta-review #3

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

Although the initial scores are not good enough, I believe the authors have addressed the main concerns from the reviewers in the rebuttal; therefore, I would like to recommend acceptance.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

5


Harmonization with Flow-based Causal Inference