
Authors

Esther Puyol-Antón, Bram Ruijsink, Stefan K. Piechnik, Stefan Neubauer, Steffen E. Petersen, Reza Razavi, Andrew P. King

Abstract

The subject of ‘fairness’ in artificial intelligence (AI) refers to assessing AI algorithms for potential bias based on demographic characteristics such as race and gender, and the development of algorithms to address this bias. Most applications to date have been in computer vision, although some work in healthcare has started to emerge. The use of deep learning (DL) in cardiac MR segmentation has led to impressive results in recent years, and such techniques are starting to be translated into clinical practice. However, no work has yet investigated the fairness of such models. In this work, we perform such an analysis for racial/gender groups, focusing on the problem of training data imbalance, using a nnU-Net model trained and evaluated on cine short axis cardiac MR data from the UK Biobank dataset, consisting of 5,903 subjects from 6 different racial groups. We find statistically significant differences in Dice performance between different racial groups. To reduce the racial bias, we investigated three strategies: (1) stratified batch sampling, in which batch sampling is stratified to ensure balance between racial groups; (2) fair meta-learning for segmentation, in which a DL classifier is trained to classify race and jointly optimized with the segmentation model; and (3) protected group models, in which a different segmentation model is trained for each racial group. We also compared the results to the scenario where we have a perfectly balanced database. To assess fairness we used the standard deviation (SD) and skewed error ratio (SER) of the average Dice values. Our results demonstrate that the racial bias results from the use of imbalanced training data, and that all proposed bias mitigation strategies improved fairness, with the best SD and SER resulting from the use of protected group models.
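
For concreteness, here is a minimal sketch (not the authors' code) of how the two fairness metrics named above could be computed from per-subject Dice scores. It assumes that SD is the standard deviation of the group-wise mean Dice values and that SER is the ratio of the largest to the smallest group-wise mean error (1 - Dice); the function name and toy data are illustrative only.

```python
import numpy as np

def fairness_metrics(dice_scores, group_labels):
    """Compute SD and SER of group-wise mean Dice values.

    dice_scores  : per-subject Dice values
    group_labels : protected attribute (e.g. racial group) per subject
    """
    dice_scores = np.asarray(dice_scores, dtype=float)
    group_labels = np.asarray(group_labels)

    # Mean Dice for each protected group.
    group_means = np.array([
        dice_scores[group_labels == g].mean()
        for g in np.unique(group_labels)
    ])

    sd = group_means.std()              # spread of per-group performance
    errors = 1.0 - group_means          # group-wise mean error
    ser = errors.max() / errors.min()   # 1.0 = errors evenly distributed

    return sd, ser

# Toy usage: two groups with slightly different segmentation quality.
sd, ser = fairness_metrics(
    dice_scores=[0.93, 0.94, 0.92, 0.88, 0.90, 0.89],
    group_labels=["A", "A", "A", "B", "B", "B"],
)
print(f"SD = {sd:.3f}, SER = {ser:.2f}")
```

Under this assumed definition, a perfectly fair model would have SD close to 0 and SER close to 1; larger values indicate that some groups are segmented systematically worse than others.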

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87199-4_39

SharedIt: https://rdcu.be/cyl4m

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper analyses the fairness of DL-based cardiac MR segmentation techniques and applies methods to mitigate bias. Four approaches are used to study fairness. The baseline is “fairness through unawareness”. The remaining three are “fairness through awareness” - 2 pre-processing methods (equal representation of protected groups in each training batch, training separate models for each group) and 1 in-processing method (a classifier to predict the protected attribute). The nnU-Net is used to segment the LV+RV blood pool and myocardium in ED and ES slices. Segmentation is evaluated using DSC, and fairness is evaluated using the SD and SER of the average DSC values. An imbalanced dataset from the UK Biobank is used - 5,903 subjects (training/validation/test sets of 4,723/590/590). Results: the bias mitigation strategies improve on the baseline. Stratified batch sampling and protected group models are better at ensuring fairness across groups.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    - The paper is well written and clear.
    - Use of 5,903 cases from a publicly available dataset.
    - Results show a reduction in the SD and SER for the bias mitigation strategies against the baseline when the data is imbalanced.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    - Absence of images showing the bias in the results across racial groups. It would have been interesting to see how the models predict across groups.
    - The paper studies bias due to data imbalance, and not due to any other potential factors.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper provides training (nnU-Net, DenseNet) and hyperparameter (epochs, optimizer, learning rate) details that can be helpful in reproducing the results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    The experiments show bias mitigation strategies in the presence of data imbalance. How do the same strategies work when the population sizes are almost the same? Such additional experiments would have provided insights into potential biases due to other factors (algorithmic perhaps).

  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well written and has a good set of experiments to demonstrate bias due to data imbalance. More experiments are needed to cover all possible sources of bias in DL-based methods.

  • What is the ranking of this paper in your review stack?

    3

  • Number of papers in your stack

    4

  • Reviewer confidence

    Confident but not absolutely certain



Review #2

  • Please describe the contribution of the paper

    This paper presents a study of group imbalance in the data used to train deep learning methods, which affects performance on specific groups. The authors evaluate different approaches to mitigate this imbalance problem, using cardiac MRI segmentation as an example.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Imbalance in training data is an important issue in machine learning in general, particularly in classification tasks. It is surprising, and also interesting, to see the ethnicity factor affect segmentation performance.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I understand the motivation of this study, but the experimental design is poor (see my detailed comments below), and the contribution to the medical imaging community is very limited.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors did not develop a new method, so anyone could perform a similar study.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    The paper starts from the hypothesis that bias in the training data leads to unfairness in a deep learning segmentation network. The authors then designed an experiment on a cardiac MRI segmentation problem and concluded that there is no bias with respect to gender, but there is with respect to ethnicity.

    This raises questions that are not answered or analysed by the authors. In particular:

    1. Is it only specific to cardiac MRI segmentation?
    2. Does the unfairness actually come from the way the nnU-Net learns? How about other segmentation methods?
    3. Did the authors check whether there were differences in the way the experts drew contours between ethnic groups? Were there differences in the ground truth between ethnic groups?
    4. Does the bias affect the whole myocardium or only particular areas?
  • Please state your overall opinion of the paper

    probably reject (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper does not contribute significant knowledge to the medical imaging community.

  • What is the ranking of this paper in your review stack?

    4

  • Number of papers in your stack

    4

  • Reviewer confidence

    Very confident



Review #3

  • Please describe the contribution of the paper

    The paper dives into the concept of ‘fairness’ in medical imaging AI systems. Using MRI cines from the UKB (n=5,903) the authors propose three bias mitigation strategies, and compare these to a baseline ‘blind’ scenario (a posteriori). The conclusion is that bias exists but can be reduced with these approaches, depending on the availability of the clinical data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The topic is important, has merit and the MICCAI community would definitely benefit from this work.

    The data, the methods, the design of the research questions, the approaches, and the analysis and discussion, as well as the statistical measurements are all excellent.

    Could the authors comment on whether this framework might benefit the correction of bias in imaging protocols?

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Without a fully comprehensive literature search it is hard to assess the claims around novelty, but I have not seen this topic covered before in the context of cardiac imaging. I would have liked to see supplementary material on the search terms and strategy, but I trust the authors on their claims.

    Perhaps some of the results presented in tables would have been better visualized as some sort of line plot with the different groups.

    The authors may consider using 0.01 as the cut-off p-value, as has been widely discussed. Also consider the Scheffé correction instead of Bonferroni’s (it is more conservative).

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    UKB data can be applied for. The authors have not released their code but their methods and formulae are reported in enough detail that the fairness framework itself would be reproducible, I believe.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Table 1 in [https://doi.org/10.1186/s12968-014-0056-2], also a large, older study, might add more ‘fuel’ to your rationale.

    How could this work help with explainability (XAI) frameworks?

  • Please state your overall opinion of the paper

    strong accept (9)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a very strong paper, the type of which will set the tone and frame many future papers in the area, I suspect. I have no major concerns.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    4

  • Reviewer confidence

    Very confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper explores an important topic but based on the reviewers’ and AC’s assessment it is done poorly. The authors are invited to provide a rebuttal. Different aspects of the paper are questioned in the reviews.

    • The clinical context is not clear.
    • There are no qualitative results showing bias.
    • Reading the paper, it seems like the authors are exploring the data imbalance problem with respect to the demographic variables and not bias, per se.
    • The authors use only two fairness metrics, standard deviation (SD) and skewed error ratio (SER), while ignoring the metrics in the fairness literature that explore parity and equality of opportunity.
    • There is no analysis of the effects of bias on the images or on the measurements derived from them.
  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    9




Author Feedback

We thank the reviewers for their comments and are pleased that 2 of the 3 reviewers recommended acceptance. We believe that, as the first work investigating fairness in any form of image segmentation (not just medical), this is an important paper, “which will set the tone and frame many future papers in the area” (R4). R3 expressed concern about aspects of the experimental design, specifically:

  • “Is it specific to CMR segmentation?”: We clearly stated (title/abstract/introduction/discussion) that we were assessing bias in CMR segmentation, and to generalise to other modalities or regions of the body a similar analysis will be required.
  • “How about other segmentation methods?”: As this is the first analysis of fairness in segmentation, we chose to use the highest-performing model, the nnU-Net. This is based upon the U-Net, which is widely used in medical imaging. We believe it is highly unlikely that the results are specific to the nnU-Net, and note that in the fairness literature it is common practice to employ such leading models and not to perform extensive tests across different architectures.
  • “Were there differences in ground truth between ethnic groups?”: We will add brief text to the paper to clarify that the ground truths were each generated by 1 of 7 experts who followed the same guidelines and were blinded to ethnicity. Each expert contoured a random sample of images containing different ethnic groups.
  • “Does the bias affect the whole myocardium or different areas?”: R2 and R4 also requested some qualitative results and we accept this point. We will include a figure showing sample cases for selected ethnic groups with high/low Dice to illustrate the differences.

R3 also stated that “the contribution to the medical imaging community is very limited”. We strongly disagree with this and note that it conflicts with the other reviewers’ assessments. Clinicians have known for years that modern medicine is biased because most research studies and clinical trials are performed predominantly on Caucasian men (doi.org/10.1016/S0140-6736(19)30315-0, doi.org/10.1186/s12910-020-0457-8). AI offers the chance to address this problem, but unless fairness is explicitly addressed this will not happen. This is a very important area that has been mostly neglected to date.

R2 asked about the influence of factors other than imbalance on bias, and what the results would look like if the training database were balanced. We agree with this suggestion and have now evaluated the baseline approach using a dataset with balanced ethnicity. Due to the limited number of subjects for some ethnicities this resulted in a much smaller dataset (6 ethnicities x 58 = 348 subjects). As expected, the overall Dice was much lower (80.33%), but the bias metrics were also very low (SD=0.65, SER=1.09). This suggests that imbalance is the main factor causing the bias and justifies our focus on this issue (see the stratified batch sampling sketch after this feedback). We will include this result in the final paper (1 extra row in Table 2 and brief explanatory text).

The AC raised two further points that we would like to answer:
  • Only 2 fairness metrics used: As no prior work has investigated segmentation fairness we had to start from scratch here and propose our own metrics. We note that most common fairness metrics (such as those related to parity/equality of opportunity) are designed for classification and so are not obviously applicable to segmentation. The two metrics we chose were the ones that could be adapted to segmentation but we are open to other specific suggestions.
  • No analysis of bias on image-derived measurements: We agree that this is valuable, but in this paper we limit the scope to segmentation performance. We plan a clinical journal paper where such results will be reported.

R4 raised minor concerns about the use of p=0.05 and Bonferroni’s correction. We thank the reviewer, and for the final paper we will use a cut-off of p=0.01 and Scheffé’s correction. We will also add the suggested reference to the MESA paper.
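
Strategy (1) from the abstract, stratified batch sampling, targets exactly the training-data imbalance that the rebuttal identifies as the main cause of the bias. The sketch below is an illustrative reconstruction of that idea, not the authors' implementation: each batch draws an equal number of subjects from every protected group, oversampling smaller groups with replacement. All function and variable names are assumptions.

```python
import random
from collections import defaultdict

def stratified_batches(indices, group_labels, batch_size, n_batches, seed=0):
    """Yield batches of subject indices balanced across protected groups.

    Each batch contains batch_size // n_groups subjects per group;
    smaller groups are sampled with replacement so that every group
    contributes equally to every batch.
    """
    rng = random.Random(seed)

    # Index the subjects by their protected-group label.
    by_group = defaultdict(list)
    for idx, g in zip(indices, group_labels):
        by_group[g].append(idx)

    groups = sorted(by_group)
    per_group = batch_size // len(groups)

    for _ in range(n_batches):
        batch = []
        for g in groups:
            # choices() samples with replacement, which handles
            # minority groups smaller than per_group.
            batch.extend(rng.choices(by_group[g], k=per_group))
        rng.shuffle(batch)
        yield batch

# Toy usage with a deliberately imbalanced set of 6 groups.
labels = ["g0"] * 100 + ["g1"] * 10 + ["g2"] * 8 + ["g3"] * 4 + ["g4"] * 4 + ["g5"] * 4
for batch in stratified_batches(range(len(labels)), labels, batch_size=12, n_batches=2):
    print(batch)  # 2 subjects from each of the 6 groups per batch
```

In a typical training loop, the yielded index batches would simply replace the default random batch sampler.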




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Although the authors have provided a comprehensive rebuttal, it did not address the concerns in the initial meta-review! Unclear definitions of fairness, conflating sampling bias (imbalanced data with respect to demographics) with the fairness literature, and not including proper evaluation metrics are the major reasons for the strong rejection recommendation. The paper requires comparison with baselines that handle data imbalance with respect to protected variables (e.g., stratification, matching, or even more advanced adversarial methods). It also needs more relevant fairness metrics (e.g., equality of opportunity, parity, mutual information, or some form of correlation).

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    25



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper addresses an important but rarely addressed problem in medical image analysis. The technical novelty is somewhat limited to the extension of existing fairness techniques from classification to segmentation. The best results are obtained by training separate models per protected attribute, e.g. race, which has problematic aspects, as it would require exponentially increasing dataset sizes if multiple protected attributes were considered. The rebuttal clarifies that the largest source of unfairness is data imbalance. This is a borderline paper, but I am leaning towards accept, if only to bring more awareness to the MICCAI community about this important problem.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    10



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I agree with the meta-reviewer that the paper explores the data imbalance problem with respect to demographic variables, and not bias per se. The argument is not valid and can be misleading, given that the bias is caused by data imbalance rather than by the ethnicity of the subject.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    10



Meta-review #4

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    After submission of the reviews and meta-reviews there was a further discussion on bias and fairness versus data imbalance. The PCs assessed the reviews, including the meta-reviews, the rebuttal and the submission in more detail. While the ACs and reviewers agreed on the importance of the topic, they disagreed on whether the quality of this paper is sufficient for presentation at MICCAI. The main concern seems to be that the paper makes an unnecessarily strong claim of addressing “fairness”, which raises the expectation that fairness and the causes of bias in general are addressed, rather than the narrower focus on bias due to data imbalance. The focus of the paper should be made clearer in the title/abstract and in the discussion of related literature. The PCs recommend conditionally accepting the paper provided these points are addressed.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    -


