# Authors

Teodora Popordanoska, Jeroen Bertels, Dirk Vandermeulen, Frederik Maes, Matthew B. Blaschko

# Abstract

Machine learning driven medical image segmentation has become standard in medical image analysis. However, deep learning models are prone to overconfident predictions. This has led to a renewed focus on calibrated predictions in the medical imaging and broader machine learning communities. Calibrated predictions are estimates of the probability of a label that correspond to the true expected value of the label conditioned on the confidence. Such calibrated predictions have utility in a range of medical imaging applications, including surgical planning under uncertainty and active learning systems. At the same time, it is often an accurate volume measurement that is of real importance for many medical applications. This work investigates the relationship between model calibration and volume estimation. We demonstrate both mathematically and empirically that if the predictor is calibrated per image, we can obtain the correct volume by taking an expectation of the probability scores per pixel/voxel of the image. Furthermore, we show that linear combinations of calibrated classifiers preserve volume estimation, but do not preserve calibration. Therefore, we conclude that having a calibrated predictor is a sufficient, but not necessary condition for obtaining an unbiased estimate of the volume. We validate our theoretical findings empirically on a collection of 18 different calibrated training strategies on the tasks of glioma volume estimation on BraTS 2018, and ischemic stroke lesion volume estimation on ISLES 2018 datasets.

# Link to paper

DOI: https://doi.org/10.1007/978-3-030-87193-2_64

SharedIt: https://rdcu.be/cyhMH

# Link to the code repository

https://github.com/tpopordanoska/calibration_and_bias

# Link to the dataset(s)

https://www.med.upenn.edu/sbia/brats2018/data.html

http://www.isles-challenge.org/

# Reviews

### Review #1

**Please describe the contribution of the paper**This paper investigates the relationship between calibration error of a probabilistic image segmentation model and the error of volumetric estimates. Calibration error is shown to be an upper bound for volumetric bias. Moreover, the findings suggest that optimization of calibration error is to be preferred over optimization of volume bias. The theoretical findings were experimentally validated using the BRATS 2018 and ISLES 2018 dataset focusing on the task of binary lesion segmentation.

**Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.**

- Clarity: the theoretical results of the manuscript are presented in a clear fashion using a good mix of mathematical derivation and explanation in plain English.
- Novelty: volumetric estimates based on lesion segmentation play a fundamental role in the establishment of imaging biomarkers (e.g. for treatment response assessment in oncology). The authors provide important novel theoretical findings that refine our current understanding of the relationship between a probabilistic segmentation and error of such volumetric estimates.
- Extensive validation: the authors performed extensive validation studies on the BRATS 2018 and ISLES 2018 datasets to confirm their theoretical findings.

**Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.**Binary segmentation & class imbalance: The presented methodology was applied to binary segmentation problems only. Furthermore, I wonder how the findings presented in this study generalize to situations with more severe class imbalance (e.g. in case of multiple sclerosis segmentation or in case of multi-class segmentation problems). (See detailed comments.)

**Please rate the clarity and organization of this paper**Good

**Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance**Code will be provided. Data is publicly available. Paper is well written. Hence, I am confident that the results of the paper can be reproduced with relative ease.

**Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html**

- It would be great to add a small reflection on multi-class problems & class imbalance. In particular, when having multiple classes with large imbalances, I expect the majority class to dominate the calibration error. How should the calibration error be best assessed in such a situation? And more importantly, how does Proposition 1 hold up in such a situation? Do we need to look at calibration error in a stratified manner? A short explanation by the authors would be helpful.
- It might be worthwhile to look at results using the BRATS 2018 dataset separated with respect to high-grade glioma and low-grade glioma. While it is true that inter-rater variability for whole tumor segmentation in glioma is relatively low (as stated by the authors), this is less so for low-grade glioma than for high-grade glioma. Definition of tumor boundaries in case of low-grade glioma is often very difficult and I would therefore expect a wide distribution of voxel-wise confidence values.
- I wonder if the Pearson’s correlation coefficient in Table 1 is the right correlation measure. Isn’t the presence of a monotonic relationship (rather than a linear one) between per-volume bias and volume size of primary interest?

**Please state your overall opinion of the paper**accept (8)

**Please justify your recommendation. What were the major factors that led you to your overall score for this paper?**Novel, well-thought through and implemented study.

**What is the ranking of this paper in your review stack?**1

**Number of papers in your stack**5

**Reviewer confidence**Somewhat confident

### Review #2

**Please describe the contribution of the paper**The focus of the paper is to investigate, both theoretically and empirically, the relationship between calibrated predictions and volume estimation in segmentation task.

**Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.**The highlights of the paper are the following results:

- Calibration error upper bounds the absolute value of volume bias.
- As calibration error goes to zero, we get unbiased volume estimates.
- Unbiased volume estimates do not imply that the classifier is calibrated.
- Linear combinations of calibrated classifiers preserve volume estimation.
- Linear combinations of calibrated classifiers do not preserve calibration.
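
The results listed above can be checked numerically on a toy image. The following is a minimal Python sketch, not the authors' code: it assumes a binned estimator of calibration error with 10 equal-width bins, and it measures bias as a fraction of the voxels in a single image. The function names `volume_bias` and `binned_ece` and all data are illustrative assumptions.

```python
import random


def volume_bias(probs, labels):
    """Signed bias of the soft volume estimate, as a fraction of the voxels."""
    return (sum(probs) - sum(labels)) / len(probs)


def binned_ece(probs, labels, n_bins=10):
    """Binned calibration error: sum over bins of (n_b / n) * |mean(p) - mean(y)|."""
    sum_p = [0.0] * n_bins
    sum_y = [0.0] * n_bins
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sum_p[b] += p
        sum_y[b] += y
    # (n_b / n) * |mean_p - mean_y| simplifies to |sum_p - sum_y| / n per bin.
    return sum(abs(sp - sy) for sp, sy in zip(sum_p, sum_y)) / len(probs)


# Toy image: 25 foreground voxels out of 100 (assumed data).
labels = [1] * 25 + [0] * 75

# 1) A noisy, miscalibrated predictor: |bias| does not exceed the binned error.
rng = random.Random(0)
noisy = [min(1.0, max(0.0, 0.8 * y + 0.1 + rng.uniform(-0.1, 0.3))) for y in labels]
assert abs(volume_bias(noisy, labels)) <= binned_ece(noisy, labels) + 1e-12

# 2) Two predictors that are calibrated on this image.
perfect = [float(y) for y in labels]                   # predicts the labels exactly
constant = [sum(labels) / len(labels)] * len(labels)   # predicts the foreground fraction

# 3) Their average keeps the volume unbiased but is no longer calibrated.
mixed = [0.5 * a + 0.5 * b for a, b in zip(perfect, constant)]

for name, probs in [("perfect", perfect), ("constant", constant), ("mixed", mixed)]:
    print(name, volume_bias(probs, labels), binned_ece(probs, labels))
```

For the binned estimator, the bound on the bias follows from the triangle inequality applied to the per-bin differences. In this toy setup the averaged predictor outputs 0.625 on foreground voxels and 0.125 on background voxels, so its soft volume still matches the true fraction of 0.25 while its binned calibration error is 0.1875.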

The theoretical findings are empirically validated on BraTS 2018 and ISLES 2018 datasets.

**Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.**I don’t have any major weaknesses to point out. However, I still think that the results section could be simplified for better understanding of the reader. Terminologies, which are indeed essential, are dominating the reading of the message/concept embedded in the sentences.

**Please rate the clarity and organization of this paper**Very Good

**Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance**I would say that the description is moderate. The authors may provide a tabulated list of parameters used in various calibration strategies in the supplementary file.

**Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html**I don’t have any major weaknesses to point out. However, I still think that the results section could be simplified for better understanding of the reader. Terminologies, which are indeed essential, are dominating the reading of the message/concept embedded in the sentences.

**Please state your overall opinion of the paper**accept (8)

**Please justify your recommendation. What were the major factors that led you to your overall score for this paper?**The paper provides a compact study of relationship between calibrated predictions and volume estimation in segmentation task. It provides a theoretical understanding which is supported empirically on two segmentation tasks. The results are important, novel, and required for progress in deep learning guided segmentation of medical images.

**What is the ranking of this paper in your review stack?**2

**Number of papers in your stack**5

**Reviewer confidence**Confident but not absolutely certain

### Review #3

**Please describe the contribution of the paper**The paper investigates the relationship between neural network calibration and biases in volume estimation. The paper shows a relationship between the volume estimator bias and overall model calibration. The paper derives this relationship and empirically verifies it on two different datasets, three training strategies, and two post-hoc calibration techniques.

**Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.**The observation of the relationship between model bias and calibration is novel, this is a step towards understanding calibrated neural networks, and the properties of their outputs over a data distribution. There is a theoretical novelty in the paper regarding the relationship between volume estimation bias and CE. The paper is theoretically sound, with clear notation, and easy to read as well.

**Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.**The main weakness of the paper is that it solely focuses on reducing volume estimation bias by reducing calibration errors. Well-calibrated models do not equate to “good” models (more in detailed comments). There is no analysis that shows if neural networks are systematically biased and their clinical implications. An overall non-zero bias in the estimator may be due to an incorrect prediction for one of its datapoints, but that is more useful than a model which outputs a constant value for any point in the dataset (the constant is chosen to make the estimator with zero CE and therefore zero bias). Therefore, the actual contributions of the paper do not seem very significant, unless some implications are shown where existing models show systematic volume estimation bias which can therefore be mitigated using model calibration.

**Please rate the clarity and organization of this paper**Very Good

**Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance**The model uses existing methods and has straightforward criteria to evaluate the relationship between bias and calibration error. The datasets and configurations are described with a good amount of detail. Therefore, the paper should be easily reproducible.

**Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html**

- My main concern for the paper is that having a well-calibrated and unbiased model is easy. For example, a model which outputs a constant value (the mean of all probabilities over the dataset) is perfectly calibrated and therefore has zero bias, but is uninformative. Consider an example with 100 positive and 100 negative examples: an estimator f(x) = 0.5 will have zero calibration error and zero estimator bias. A model needs to be both “accurate” and calibrated to be useful. Simply having an unbiased or calibrated model is not enough, which seems to be the main contribution of the paper: to empirically verify the relationship between bias and CE.
- In Figure 2, the aspect ratio of the axes is not 1:1 (bias and CE are calculated in the same units in the Methods section). This makes the visualization misleading (ideally one would like to see how much the points deviate from the line y=x). The line y=x can be plotted as well. Please change the figure accordingly.
- The paper also mentions “there is no consistent finding that holds for both datasets”. This means that CE is simply a loose upper bound of the bias, and the amount by which they differ cannot be easily controlled. This is also evident from the PCC values in Table 1. This makes the paper even more unclear.

**Please state your overall opinion of the paper**borderline reject (5)

**Please justify your recommendation. What were the major factors that led you to your overall score for this paper?**The paper introduces a good theoretical tool for using calibration error as an upper bound of the model estimation bias for volume estimation. However, the contribution is limited to empirically verifying it. The model doesn’t highlight if existing neural networks have systematic estimation biases and their clinical significance. My recommendation is therefore a borderline rejection.

**What is the ranking of this paper in your review stack?**2

**Number of papers in your stack**3

**Reviewer confidence**Somewhat confident

# Primary Meta-Review

**Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.**Summary: Investigates the relationship between calibration error of a probabilistic image segmentation model and the error of volumetric estimates. Calibration error is shown to be an upper bound for volumetric bias. Moreover, the findings suggest that optimization of calibration error is to be preferred over optimization of volume bias.

Positives:

- Important novel theoretical findings that refine our current understanding of the relationship between a probabilistic segmentation and error of such volumetric estimates.
- Extensive validation on the BRATS 2018 and ISLES 2018 datasets that confirm theoretical findings.
- Clearly presented and theoretically sound.

Negatives:

- Methodology applied to binary segmentation problems only.
- Well-calibrated models may not equate to “good” models.
- Does not highlight whether existing neural networks have systematic estimation biases and their clinical significance.

**What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).**2

# Author Feedback

We thank all the reviewers for their thoughtful and overall positive feedback. We are very pleased that they all agreed that the work is novel, theoretically sound and well written. Moreover, they said that the results are “required for progress in DL-guided segmentation of medical images” (R2), that the findings are “extensively validated” (R1) and represent a “step towards understanding calibrated NNs” (R3). The reviewers also had several constructive suggestions and questions that we will address below.

R1: multi-class problems & class imbalance: The binary setting can be extended to multi-class problems. The extension of calibration methods is usually by “treating the problem as K one-vs-all problems”, as explained in [13] (Sect. 4.2). The bias is measured per-class, so for each individual class we can get a bound. In the case of class imbalance, weighted/scale-invariant versions of CE are possible. Prop. 1 holds regardless of the ratio of the pixels.
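
As a rough illustration of the one-vs-all reading described here, the following Python sketch computes a separate volume bias for each class from per-voxel probability vectors. It is an assumption-laden toy, not the authors' implementation: the function name `per_class_volume_bias` and the four-voxel example are made up, and the bias is expressed as a fraction of the voxels.

```python
def per_class_volume_bias(probs, labels, n_classes):
    """One-vs-all volume bias per class, as a fraction of the voxels.

    probs:  list of per-voxel probability vectors (length n_classes each)
    labels: list of integer class labels, one per voxel
    """
    n = len(labels)
    biases = []
    for k in range(n_classes):
        soft_volume = sum(p[k] for p in probs) / n           # expected fraction of class k
        true_volume = sum(1 for y in labels if y == k) / n   # observed fraction of class k
        biases.append(soft_volume - true_volume)
    return biases


# Four voxels, three classes (made-up numbers for illustration).
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5], [0.6, 0.3, 0.1]]
labels = [0, 1, 2, 0]
biases = per_class_volume_bias(probs, labels, n_classes=3)
print(biases)
```

Because each probability vector sums to one, the per-class biases sum to zero, so an overestimate of one class is always matched by underestimates elsewhere.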

We thank R1 for the other two suggestions, for which we conducted additional experiments. We will add a note in the camera-ready version and we will include the full analysis in an extended version of this work (to be submitted to a journal).

R1: HGG vs LGG: We calculated correlations between the (mean absolute per-volume) Bias and (mean per-volume) ECE for the 18 settings on the BR18 dataset, separated into HGG and LGG. The Pearson correlation is 0.15 ± 0.23 for HGG and 0.39 ± 0.20 for LGG. Compared to the same analysis for BR18 vs IS18 in the caption of Figure 2 (IS18 has higher inter-rater variability, analogously to LGG), we again observe that the correlation is higher for the data with higher uncertainty (LGG). However, in the case of HGG vs LGG, the difference is less pronounced.

R1: Pearson correlation: We calculated the Spearman ρ and Kendall τ coefficients for all settings. In all cases there are significant non-zero correlations; the median Spearman and Kendall values are, respectively, -0.43 and -0.29 for BR18, and -0.43 and -0.30 for IS18.
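
The distinction between a linear and a monotonic relationship can be illustrated with a short, self-contained Python sketch. The data below are hypothetical (they are not the BR18 or IS18 results), and the helper names `pearson`, `average_ranks` and `spearman` are illustrative; Spearman's ρ is computed here as the Pearson correlation of the ranks.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical volumes and per-volume biases with a monotone, non-linear trend.
volumes = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
biases = [-math.log(v) for v in volumes]

r = pearson(volumes, biases)
rho = spearman(volumes, biases)
print(r, rho)
```

In this made-up example the bias decreases strictly with volume, so the rank correlation reaches its extreme value while the Pearson coefficient is weaker, because the trend is not linear.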

We appreciate the two suggestions by R2. We will do our best to simplify the results section and incorporate more details about the training and calibration process of the pre-trained networks that we used. The full details are given in [24].

R3: Well-calibrated models may not equate to “good” models: True, but “good” is measured across multiple axes depending on the use-case. If we are only interested in volume estimation, it is sufficient to have a well-calibrated model on a subject level. In that case, it would be perfectly valid to have a constant output for all pixels per subject. We would like to emphasize again that the premise in the paper is that the calibration is done per image. Therefore, the example in the review is incorrect because it would mean the algorithm predicts for all pixels and for all subjects identical tumor volumes. We would like to thank the reviewer for the opportunity to make this important clarification.
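
One way to read this exchange is sketched below in Python. The two subjects and their lesion sizes are hypothetical, chosen so that pooling them reproduces the reviewer's 100 positive and 100 negative examples; the helper `volume_bias` is an illustrative name, and bias is expressed as a fraction of the voxels.

```python
def mean(values):
    return sum(values) / len(values)


def volume_bias(probs, labels):
    """Signed bias of the soft volume estimate, as a fraction of the voxels."""
    return mean(probs) - mean(labels)


# Two hypothetical subjects of 100 voxels each; pooled, they give the
# reviewer's 100 positive and 100 negative examples.
subject_a = [1] * 80 + [0] * 20   # large lesion
subject_b = [1] * 20 + [0] * 80   # small lesion
pooled = subject_a + subject_b

# One constant for the whole dataset: unbiased when pooled, biased per subject.
dataset_constant = 0.5
pooled_bias = volume_bias([dataset_constant] * len(pooled), pooled)
bias_a = volume_bias([dataset_constant] * 100, subject_a)
bias_b = volume_bias([dataset_constant] * 100, subject_b)

# One constant per subject (its own foreground fraction): unbiased per subject.
per_subject_bias_a = volume_bias([mean(subject_a)] * 100, subject_a)
per_subject_bias_b = volume_bias([mean(subject_b)] * 100, subject_b)

print(pooled_bias, bias_a, bias_b, per_subject_bias_a, per_subject_bias_b)
```

Under these assumptions, the dataset-wide constant predicts the same volume for both subjects and is off by 30% of the voxels in each, with opposite signs, even though the pooled bias vanishes. A constant chosen per subject removes the per-subject bias.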

R3: Does not highlight whether existing NNs have systematic estimation biases and their clinical significance: We disagree with this remark because our analysis with SOTA models showed that small tumors have positive bias (are overestimated) and large tumors have negative bias (are underestimated). This is a directly clinically relevant finding, which we explicitly discussed in Sect. 4.

R3: There is no consistent finding that holds for both datasets: It is true that the correlation between volume size and Bias (Table 1) is not influenced by the loss function or post-calibration method in a consistent way across both datasets. However, this by itself is an important finding. We chose these two datasets precisely because of their different levels of inherent uncertainty. On the other hand, from Figure 2 we see that there is a very strong correlation between ECE and |Bias| for IS18, but also for BR18, conditioned on the type of post-calibration method (i.e. Platt, auxiliary and finetune vs. MC methods).