
# Authors

Anthony Sicilia, Xingchen Zhao, Anastasia Sosnovskikh, Seong Jae Hwang

# Abstract

Application of deep neural networks to medical imaging tasks has in some sense become commonplace. Still, a “thorn in the side” of the deep learning movement is the argument that deep networks are prone to overfitting and are thus unable to generalize well when datasets are small (as is common in medical imaging tasks). One way to bolster confidence is to provide mathematical guarantees, or bounds, on network performance after training which explicitly quantify the possibility of overfitting. In this work, we explore recent advances using the PAC-Bayesian framework to provide bounds on generalization error for large (stochastic) networks. While previous efforts focus on classification in larger natural image datasets (e.g., MNIST and CIFAR-10), we apply these techniques to both classification and segmentation in a smaller medical imaging dataset: the ISIC 2018 challenge set. We observe the resultant bounds are competitive compared to a simpler baseline, while also being more explainable and alleviating the need for holdout sets.

# Link to paper

SharedIt: https://rdcu.be/cyl4N


# Reviews

### Review #1

• Please describe the contribution of the paper

This paper points out that many PAC-Bayesian frameworks focus on vacuous bounds that could be meaningless in medical imaging applications. The authors discuss non-vacuous bounds and show segmentation results on smaller medical imaging datasets.

I believe the authors pose an interesting problem, but I am not sure if the current analysis is adequate enough. Please see item 4 for more details.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The authors study non-vacuous PAC-Bayesian bounds to obtain sensible bounds on generalization error, a topic which has not received much attention in medical image analysis.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
• In terms of novelty, although the authors employed a non-vacuous bound which has not been considered much in medical imaging analysis, the bound itself is usual in ML as the authors quote Theorem 1 from the previous work. I am less sure if the novelty is strong enough.

• The relationship between this non-vacuous PAC-Bayesian and the flat minima is not clear. For example, once we have a ‘good’ posterior, how does this relate to the flat minima? Also, the discussion in Section 3 is a bit confusing. When a large variance is allowed, isn’t it simply a failure of training due to the too big regularization? I may be missing something at this point. If there are particular reasons to consider the flat minima in the context of PAC-Bayes, please let me know.

• Figure 1 (a) shows that the non-Bayesian methods achieve higher DSC. Does this mean the non-Bayesian method is preferable for better performance? If so, why should we consider the Bayesian method? Is it really important to show that the considered method achieves a lower bound?

• Please rate the clarity and organization of this paper

Satisfactory

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors provide implementation code and it looks fine, although the reviewer hasn’t executed the code.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

I believe the authors clearly explain why non-vacuous bounds are necessary for medical imaging analysis. But based on the numerical experiments, it is not fully convincing why the PAC-Bayes method is important compared to unregularized methods (nor were other vacuous PAC-Bayes methods compared). Also, the discussion about flat minima is confusing. Please see items 3 and 4.

• Please state your overall opinion of the paper

probably reject (4)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Please see item 4.

• What is the ranking of this paper in your review stack?

1

• Number of papers in your stack

3

• Reviewer confidence

Somewhat confident

### Review #2

• Please describe the contribution of the paper

The paper experimentally shows that PAC-Bayes generalization bounds are non-vacuous and useful for understanding the generalization of deep networks in medical image classification/segmentation tasks, especially in the context of low amounts of data.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

I found the result of the paper to be quite interesting. The possibility of obtaining a non-vacuous bound on the performance of a deep learning model with a small amount of data could be quite useful in medical applications, especially since the amount of data is low and we don’t usually have confidence in the generalization ability of deep networks. The paper shows that, at least in two cases, the bound matches the actual performance quite well.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

There are a few limitations of the paper as described below:

1. The paper talks a lot about previous works on PAC bounds and their interpretations, such as flat minima and stability arguments. But since the paper is about adapting the bound to medical images, it should focus more on the results presented in this paper.
2. The results are quite limited. The main paper only contains results on the segmentation task using U-Net and its lightweight version. I suggest the authors move the classification results from the appendix to the main text and add a couple of other datasets to verify that the conclusion generalizes.
3. The writing could be improved.
• Please rate the clarity and organization of this paper

Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

There are enough details on architectures and data. The scripts are also made available. Overall, the paper is reproducible, although I had some confusion about the Hoeffding bound and the lower bound.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

To improve this paper, the authors should focus on showing that the idea applies to a couple of other datasets and different architectures in medical imaging. This is especially important since the contribution is in showing the usefulness of already-known ideas.

Another comment is on the writing. I believe the writing could be more organized. Right now, a lot of the focus is on describing what is known in the literature about PAC-Bayes bounds. I suggest the authors focus more on their own contribution in this paper.

There were a couple of confusions; perhaps the authors can improve their presentation by considering them. First, I am not clear why a 50% split is used in the training of both the prior and the posterior. I thought the prior should not be influenced by the data. By training both the posterior and the prior on the same set, aren’t you violating this assumption? This is confusing to me. Instead of pointing to Dziugaite et al., I suggest the authors spell out these details because they influence the interpretation of the results in this paper.

A minor comment: I was really confused by what the lower bound meant. My guess is that it is the negative of the upper bound in eq. (2). Please clarify.

In Fig. 1(a), for the results corresponding to U-Net and LW without PBB, are the lower bounds obtained via the Hoeffding bound? And how exactly is that computed? I am very confused here. Please clarify.

It is very unclear how Fig. 1(c) relates to flat minima. Fig. 1(b, c) shows that as you increase the prior variance, both the performance and the lower bound decrease. But the flat-minima hypothesis is that generalization is better when you are at a flat minimum. It is not very clear how these are related. Perhaps this could be clarified.

• Please state your overall opinion of the paper

borderline reject (5)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The main idea is quite interesting. But, the paper could be improved in writing and strengthened by adding more experiments to justify the conclusion.

• What is the ranking of this paper in your review stack?

2

• Number of papers in your stack

4

• Reviewer confidence

Confident but not absolutely certain

### Review #3

• Please describe the contribution of the paper

The manuscript applies the PAC-Bayesian framework to a relevant medical image segmentation dataset and a full-grown U-Net architecture in order to provide guarantees on the generalisation error via meaningful bounds.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

- The notation of the paper is very concise and compact.
- The idea is interesting and the topic of the paper is very relevant.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

- Very few details on the actual computation of the KL(Q||P) term are given. There is no formula for the "tighter formulation of Eqn. (2)" and there is no algorithm box.
- Essentially, the "middle part" of the manuscript is missing, i.e., the connection between the (well-described) PAC-Bayesian theory and the assumptions and resulting computations in the empirical validation. Those need to be described and illustrated; in particular, the parameters, the sampling strategy, the concrete form of the bound, etc. A flow chart making clear in which order things need to be computed, and which ideas and assumptions enter the stage, could be helpful.
- The experiments are a first step but could be extended to cover more aspects.

• Please rate the clarity and organization of this paper

Very Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors add code to run their method. This should, in principle, allow replication of the results of the paper.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

- It is not fully clear from the text which values of delta are used in the experiments, i.e., you should explicitly say that this is the confidence. The numbers for m also need to be added.
- You need to make clear how the bound depends on the segmentation network and how it depends on the training process. This information is somewhat contained in prose form but not condensed into a flow chart or formula.
- The manuscript should provide an outline of how the empirical results can help analyze/understand/improve segmentation architectures, at least in principle.
- For the average reader, the paper would benefit from an explanation of in which aspects, and why, the PAC-Bayesian framework goes beyond training a model and, once converged, averaging predictive accuracy on a validation set for different degrees of jitter on the found parameters.
- The

• Please state your overall opinion of the paper

borderline accept (6)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Providing (at least partial) theoretical backup for relevant medical image processing networks is of scientific and practical interest. The manuscript makes a step in this direction with a clear and concise notation for the basic formalism, but far fewer details on the actual model.

• What is the ranking of this paper in your review stack?

1

• Number of papers in your stack

3

• Reviewer confidence

Somewhat confident

### Review #4

• Please describe the contribution of the paper

The paper provides an excellent overview of recent advances in the PAC-Bayesian framework. It demonstrates how these recent advances are applicable to the unique but challenging domain of medical imaging. The paper provides subsequent analysis of the tasks of segmentation and classification on a popular medical image dataset.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
• Generalization is an important topic for the medical imaging community, and this work provides experimental results supporting a recent but well-received generalization bound.

• The paper explores non-vacuous bounds for the segmentation task for the first time, and on medical imaging, which is often challenged by limited labeled data.

• The paper is well-written and outlines how the knowledge of these theoretical bounds helps to advance practical applications.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
• One weakness of this work concerns the use of the dataset. In medical imaging, we often see variations in performance as we go from one dataset to another. It would have been better if the paper had focused on another segmentation dataset as well; a suggestion could be an MRI dataset.

• Although the paper refers to the “own challenges” of medical imaging compared to other domains where these theoretical bounds are justified, the discussion of such challenges is minimal.

• Please rate the clarity and organization of this paper

Excellent

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Laudable.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
• The paper is well-written and picks an interesting and much-needed discussion around generalization.

• Flat minima and their size: U-Net is a very specialized architecture, and that’s why we often see really good generalization even with small training sets. In the text, the authors attribute this to a constant grade throughout rather than to flat minima. Can such a constant grade lead to a lack of robustness in comparison to flat minima?

• Please state your overall opinion of the paper

Probably accept (7)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
• Clarity.
• Good experiments.
• Limited methodological novelty.
• What is the ranking of this paper in your review stack?

1

• Number of papers in your stack

6

• Reviewer confidence

Confident but not absolutely certain

# Primary Meta-Review

• Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

Summary: Points out that many PAC-Bayesian frameworks focus on vacuous bounds that could be meaningless in medical imaging applications. The authors discussed non-vacuous bounds and assessed generalisation of U-Net segmentation results on a small medical imaging dataset.

Positives:

• Non-vacuous PAC-Bayesian bounds on generalization error have not received much attention in medical imaging analysis.
• Very concise and compact notation.
• For the two presented cases, the bound matches the actual performance quite well.
• Information provided to enable reproducibility.

Negatives:

• The bound itself is standard in ML, so the work may not be sufficiently novel. A rebuttal should justify whether the application really is a novel contribution to medical imaging, and its importance relative to unregularised methods, which appeared to perform better.
• A lot of focus on discussing previous works and not enough about the experimental work undertaken. This could be addressed at the rebuttal stage.
• Writing is confusing in places, with some useful details missing. Please try to clarify things.
• What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

10

# Author Feedback

We thank all reviewers for the detailed feedback. We focus on major concerns due to space and have designed the Q&A pairs below for maximum coverage:

Q: Why is this novel to medical imaging?

A: PAC-Bayes (PB) methods for deep networks have never been applied to medical imaging datasets. In fact, bounds on performance, in general, have lacked discussion in medical imaging. Our paper stands to fill this gap. Even our baseline proposal – reporting a Hoeffding bound computed on a holdout set – is not commonly used in evaluation, although we think it (or a PB bound) should be used in high-stakes, low-data medical contexts. Further, as far as we are aware, non-vacuous PB bounds for deep networks have only been computed on MNIST, CIFAR-10, and ImageNet for classification in vision. Achieving positive results on a new domain (medical imaging) and a new task (segmentation) is not a given; thus, the results themselves are novel. Some experimental techniques are also novel: e.g., varying the prior variance to probe the loss landscape has not been done before, as far as we are aware. Methodologically, we extend computation of PB bounds to networks with batch norm (UNet & ResNet); our approach avoids sampling any parameters for normalization. We do not yet provide details on the mathematics for this, but plan to include them in the updated document.

Q: Advantage of PAC-Bayes (PB) against unregularized methods?

A: First, PB bounds are important because they help us understand what properties of SGD trained networks lead to good generalization. Unregularized methods are not as interpretable. Second, PB bounds allow us to train the network on all of the data and report a bound. In contrast, the unregularized methods cannot train on all the data and still report a bound. The methods are comparable for macro estimates of error in the results, but having access to all the data can be especially useful for datasets with rare presence of a particular disease.

Q: Relation between PAC-Bayes (PB), prior variance, and flat-minima?

A: In PB, the model is stochastic: each time we do inference, we sample from a normal distribution. Because the normal distribution has some variance, we are sampling network weights in a region around the mean. Thus, when performance of the stochastic model is good, there must be a non-negligible area around the mean where most networks perform well, i.e., a flat minimum around the mean. We know the variance is non-negligible in the posterior network because the tight bounds imply the posterior has small KL divergence from the prior, and the prior itself has non-negligible variance (we pick and modulate this value). We will clarify this further in Section 2.3.
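The argument above can be illustrated with a toy one-dimensional example (a hypothetical sketch, not the paper's model or code; both loss functions are made up for illustration):

```python
import numpy as np

# A stochastic model samples weights w ~ N(mu, sigma^2). If the average
# loss stays low despite a non-negligible sigma, the minimum at mu must
# sit in a flat region of the loss landscape.

def sharp_loss(w):
    # narrow valley around w = 0 (width ~0.05)
    return 1.0 - np.exp(-(w / 0.05) ** 2)

def flat_loss(w):
    # wide valley around w = 0 (width ~1.0)
    return 1.0 - np.exp(-(w / 1.0) ** 2)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.3, size=10_000)  # posterior samples around the minimum

avg_sharp = sharp_loss(w).mean()  # large: the sharp minimum fails under sampling
avg_flat = flat_loss(w).mean()    # small: the flat minimum tolerates sampling
```

Low average loss of the stochastic model at non-negligible variance is thus itself evidence of flatness, which is the link between tight PB bounds and flat minima.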

Q: Do data-dependent priors violate assumptions?

A: The prior must be selected independently of the data used to compute the bound. The bound holds for all posteriors independent of “how” the posterior is selected. Thus, it is correct to use 50% of the data to learn a prior, 100% of the data to learn the posterior, and the other 50% of the data (not used to learn the prior) to compute the bound.
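The protocol can be sketched as index bookkeeping (a hypothetical illustration; the learning steps are placeholders, not the authors' code). The key invariant is that the set used to compute the bound is disjoint from the set used to learn the prior:

```python
import numpy as np

# Hypothetical sketch of the data-dependent prior protocol described above.
# Only the index bookkeeping is concrete; learning steps are placeholders.

rng = np.random.default_rng(0)
n = 1000                               # total number of labeled examples
perm = rng.permutation(n)
prior_idx = perm[: n // 2]             # 50%: used to learn the prior P
bound_idx = perm[n // 2 :]             # other 50%: held out from the prior
posterior_idx = perm                   # 100%: the posterior Q may use all data

# learn_prior(data[prior_idx])          -> P   (placeholder)
# learn_posterior(data[posterior_idx])  -> Q   (placeholder)
# compute_bound(Q, P, data[bound_idx])  -> valid PAC-Bayes bound (placeholder)

# The bound is valid because the prior never saw bound_idx, while the
# posterior is unrestricted: the bound holds for all Q.
disjoint = len(set(prior_idx) & set(bound_idx)) == 0
```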

Q: Details on lower bounds, Hoeffding bound, bound parameters, and the method overall? Can this replace the focus on related works?

A: Lower bounds (e.g., for DSC) follow from the fact that 1 - DSC <= x implies DSC >= 1 - x. The Hoeffding bound follows from Hoeffding’s inequality; we will include a statement of this. Bound parameters are contained in the bolded paragraph “Bound Details.” We will provide relevant flow charts & more method details as suggested, and will make the discussion of related works more compact to make room for these details. The present focus is meant to give context for this novel area, which may be unfamiliar. As noted, it has not been used in medical imaging contexts.
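The baseline computation described here can be sketched in a few lines (a hedged illustration: the numbers are made up, and the one-sided Hoeffding form below is a standard choice, not necessarily the exact variant used in the paper):

```python
import math

def hoeffding_upper_bound(mean_loss, m, delta):
    """One-sided Hoeffding bound: with probability >= 1 - delta over an
    i.i.d. holdout set of size m, the expected loss in [0, 1] is at most
    the empirical mean plus sqrt(ln(1/delta) / (2m))."""
    return mean_loss + math.sqrt(math.log(1.0 / delta) / (2.0 * m))

# Made-up example values, for illustration only.
mean_loss = 0.15       # empirical mean of the loss 1 - DSC on the holdout set
m, delta = 500, 0.05   # holdout size and confidence parameter

upper = hoeffding_upper_bound(mean_loss, m, delta)  # upper bound on 1 - DSC
dsc_lower = 1.0 - upper    # since 1 - DSC <= x implies DSC >= 1 - x
```

The same conversion turns any high-probability upper bound on the loss 1 - DSC (Hoeffding or PAC-Bayes) into a lower bound on DSC.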

Q: More datasets and experiments?

A: We can move the classification experiments from the appendix to the main text and are happy to try other datasets as future work if there are suggestions.

# Post-rebuttal Meta-Reviews

## Meta-review # 1 (Primary)

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

Novelty concerns are addressed because this would be the first medical imaging paper to evaluate PAC-Bayes on deep networks. Other reviewer concerns also seem to be addressed, including what seems to be a misconception regarding data-dependent priors violating assumptions. One criticism was that more experiments might be needed, so the authors plan to move some work from supplementary material into the main body of the paper.

I’m not sure so many at MICCAI are likely to understand the motivation for the work. For the MICCAI community, the “thorn in the side” comes from the empirical observation that small training sets lead to poor results on test sets. Few would be at all concerned about bounds on generalization error usually being so large for deep networks that they are vacuous.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

13

## Meta-review #2

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

Comments on this work concern the novelty and contribution with respect to the medical imaging domain, as well as the discussion of the results obtained with respect to unregularized approaches. Reviewers generally agree on the lack of clarity, especially concerning the experimental part. The manuscript focuses excessively on the general description of PB, at the detriment of clarity on the actual contribution. The rebuttal emphasises that the reported PBB for the segmentation task and for the proposed models is novel and non-trivial. Moreover, it is stressed that the use of PBB can lead to better interpretability and generalisation with respect to standard approaches. Further details are finally provided on the role of variance, and on the overall framework.

The merit of this work is in introducing and exploring the use of the PAC-Bayesian framework in medical imaging applications. This is a meaningful contribution, as the potential for PBB is large. The paper appears dense and the current structure does not help the reader grasp important details. Still, the merit and novelty of this contribution seem to outweigh the downsides, and it has the potential to stimulate meaningful discussions during the conference.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

2

## Meta-review #3

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The paper applies existing theoretical tools to medical image segmentation for estimating performance lower bounds. The novelty is not high, as also pointed out by the reviewers and the meta-reviewer. However, I agree with the authors that the results for medical image segmentation are novel. Furthermore, I do not think providing a lengthy background for this paper is a bad idea. This is not the usual MICCAI paper, but I think it should be. I agree with the authors’ comments in the rebuttal that performance guarantees should be considered a lot more in our community.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

6