
Authors

Agostina J. Larrazabal, César Martínez, Jose Dolz, Enzo Ferrante

Abstract

Despite the astonishing performance of deep-learning based approaches for visual tasks such as semantic segmentation, they are known to produce miscalibrated predictions, which could be harmful for critical decision-making processes. Ensemble learning has been shown to not only boost the performance of individual models but also reduce their miscalibration by averaging the independent predictions. In this scenario, model diversity has become a key factor, helping individual models converge to different functional solutions. In this work, we introduce Orthogonal Ensemble Networks (OEN), a novel framework to explicitly enforce model diversity by means of orthogonal constraints. The proposed method is based on the hypothesis that inducing orthogonality among the constituents of the ensemble will increase the overall model diversity. We resort to a new pairwise orthogonality constraint which can be used to regularize a sequential ensemble training process, resulting in improved predictive performance and better calibrated model outputs. We benchmark the proposed framework on two challenging brain lesion segmentation tasks: brain tumor and white matter hyper-intensity segmentation in MR images. The experimental results show that our approach produces more robust and well-calibrated ensemble models and can deal with challenging tasks in the context of biomedical image segmentation.

SharedIt: https://rdcu.be/cyl4Q

https://www.med.upenn.edu/cbica/brats2020/data.html

Reviews

Review #1

• Please describe the contribution of the paper

The paper introduces orthogonality constraints and a corresponding learning strategy to improve the calibration and segmentation performance of an ensemble of deep neural networks. Orthogonality constraints are used to decorrelate layers within a given model as well as among different base models in an ensemble. Such constraints enable the sequential learning of more diverse (i.e. decorrelated) base models. Experimental evaluation on the BRATS 2020 dataset and a WMH dataset demonstrates a clear improvement in calibration and segmentation performance.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

• Relevance: the authors propose a method for improving the calibration of deep neural networks. Clinical decisions inevitably depend on well-calibrated probabilities. Calibration of deep neural networks is, in my opinion, still an understudied topic, and better methods are desperately needed.

• Novelty: the introduction of orthogonality constraints to enforce diversity among base models in an ensemble appears to be sensible and novel.

• Breadth: in addition to introducing orthogonality constraints, the authors provide a strategy to train the base models of the ensemble, evaluate their approach on two relevant medical imaging segmentation tasks, and choose meaningful baseline methods to experimentally prove their point.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

• Impact on model weights: While the authors provide results on calibration and segmentation performance, an investigation of the impact of the decorrelation constraints on the actual model weights is missing. I would have liked to see a visualization (e.g. in the Supplementary Materials) which confirms that the constraints are inducing the desired outcome (decorrelation). (See also the detailed comments.)

• Please rate the clarity and organization of this paper

Very Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

No code is explicitly provided. However, the proposed training algorithm is very well explained. Therefore, I think it is reasonable to say that, based on the paper alone (and perhaps some correspondence with the authors), one could easily reproduce the method.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

• From my point of view, the hypothesis presented in this paper consists of two parts: i) orthogonality constraints yield more diverse (i.e. decorrelated) models; ii) diverse models yield better calibration and segmentation performance. The authors clearly succeeded in proving part ii). While orthogonality constraints have been studied previously and their impact on filters has been demonstrated, I would say that a verification of part i) is still necessary (at least provided as Supplementary Material).

• Regarding the inter-orthogonality term, I am wondering whether restricting the computation to a given layer l is sensible (apart from practical reasons). It is not obvious to me that the same "feature" is encoded at a given layer l across multiple models. Therefore, I am curious whether it would make sense to compute orthogonality for the models as a whole rather than in a per-layer manner.

• The sequential learning of "informed" models reminds me of traditional boosting methods (e.g. AdaBoost). In boosting, the "informing" is conducted via sample weighting; in the authors' methodology, it is conducted via the inter-orthogonality terms. From a theoretical point of view, it would be very interesting to work out whether a connection between the authors' methodology and boosting exists.

• On the BRATS dataset, the 5-net ensemble with self-orthogonality performed worse than the ensemble with random initialization. I wonder why?

Probably accept (7)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Novelty. Breadth of the paper including a very convincing experimental evaluation.

• What is the ranking of this paper in your review stack?

2

• Number of papers in your stack

5

• Reviewer confidence

Very confident

Review #2

• Please describe the contribution of the paper

Deep learning models are well known for their overconfident predictions, which makes them unsuitable for applications that require reliable probability estimates. A simple solution is to use an ensemble of deep models and average their predictions. However, for an ensemble to work properly, its components need to be decorrelated. The paper proposes a method to enforce diversity in ensembles of deep neural networks by training the networks sequentially and encouraging the filter weights of each network to be orthogonal to the filter weights at the same layer of all previously trained networks. While it is unclear whether this orthogonality leads to actual model diversity in the ensembles, it is shown experimentally that the performance of an ensemble improves, sometimes by a large margin, when trained with the proposed orthogonality loss.
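The mechanism described above (penalizing alignment between a network's filters and the same-layer filters of previously trained networks) could be sketched as follows. This is a minimal NumPy sketch assuming a squared-cosine-similarity formulation of the pairwise constraint; the function name, weight shapes, and exact penalty form are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def inter_orthogonality_penalty(w_new, w_prev_list):
    """Hypothetical pairwise inter-orthogonality term.

    w_new: filters of the layer being trained, flattened to
           shape (n_filters, fan_in).
    w_prev_list: same-layer filter matrices from previously
           trained (frozen) networks in the ensemble.
    Returns the mean squared cosine similarity between every filter
    of the new network and every filter of each frozen network;
    0 means perfectly orthogonal, 1 means perfectly aligned.
    """
    def unit_rows(w):
        # Normalize each filter to unit length so the dot product
        # below is a cosine similarity.
        return w / np.linalg.norm(w, axis=1, keepdims=True)

    a = unit_rows(np.asarray(w_new, dtype=float))
    penalty = 0.0
    for w_prev in w_prev_list:
        b = unit_rows(np.asarray(w_prev, dtype=float))
        cos = a @ b.T                 # (n_new, n_prev) cosine similarities
        penalty += np.mean(cos ** 2)  # squared so sign does not matter
    return penalty / len(w_prev_list)
```

In a sequential training loop, a term like this (scaled by the lambda hyperparameter the reviewers discuss) would be added to the segmentation loss of the network currently being trained, with all previously trained networks held fixed.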

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

S1. The method is simple, clean and reasonable. The related work section seems relevant. The paper is well written, concise and very clear.

S2. While orthogonality losses have been used before, I am not aware of any previous method that applies it to encourage model diversity in ensembles. To the extent of my knowledge, this is a novel contribution.

S3. The experimental section is convincing and shows clear advantages of the proposed method.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

I found a few minor issues:

W1. The paper does not analyze the complexity of computing the orthogonality loss, O(Nn^2), which seems to be quite high. How long does the training take compared to the training of the random ensemble?

W2. The hypothesis of the paper is that the orthogonal constraints will bring diversity to the ensemble. This is not shown or proved. Is the increase in performance a consequence of a more diverse ensemble, or a consequence of better individual models obtained thanks to the orthogonality term?

W3. How sensitive is the method to the hyperparameter lambda? A sensitivity analysis would be very helpful, as methods with high sensitivity to the hyperparameters are harder to use in practice.

W4. How does the proposed method compare to other methods for producing ensembles of networks, such as Monte Carlo dropout or boosting? I miss a comparison with alternative previous methods to clarify where the proposed method stands.

W5. No qualitative results are shown.

W6. It is not completely clear to me how the inter-orthogonality loss is applied in the 1-Network case of Figures 1 and 2. If I understood correctly, an ensemble of 10 models is trained and then one of them is randomly selected for evaluation?

• Please rate the clarity and organization of this paper

Excellent

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

I have no concerns regarding the reproducibility.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

I suggest that the authors address the weaknesses above to make the paper stronger. For W1, a training-time comparison would suffice. For W2, the authors could show the variance in the predictions of the components of the ensemble with and without the orthogonality losses; if the hypothesis is correct, the variance of the ensemble should increase when the new losses are used. For W3, a plot of performance vs. lambda would help to understand its influence. For W4, the authors should add at least one additional ensemble-of-networks baseline. Also, the paper would benefit from some qualitative results (W5) showing examples where the proposed method performs better than the baselines, along with a few failure cases.
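The diversity check suggested for W2 amounts to measuring how much the ensemble members disagree per voxel. A minimal sketch of such a diagnostic, assuming per-voxel foreground probabilities stacked along a leading model axis (the function name and array layout are assumptions for illustration):

```python
import numpy as np

def ensemble_prediction_variance(probs):
    """Simple ensemble-diversity proxy.

    probs: array of shape (n_models, *volume_shape) holding each
    member's per-voxel foreground probability. Returns the variance
    across models, averaged over all voxels: 0 when all members
    predict identically, larger when they disagree.
    """
    probs = np.asarray(probs, dtype=float)
    return float(np.var(probs, axis=0).mean())
```

Comparing this score for the plain ensemble vs. the orthogonality-trained ensemble would directly test whether the constraint increases diversity, as the review proposes.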

Two minor typos:

1. Page 3, end of first paragraph: “…in the performance on the benchmark , “ Remove the space before the comma.

2. Page 8, second paragraph: “less than $1e^{-3}$”. This should be $10^{-3}$ (or $1e-3$, but the former is more elegant).

accept (8)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

I think this is a good paper, and I consider the weaknesses listed above to be minor. Even if a comparison with another baseline (W4) were to show the proposed method as inferior, the paper would still introduce an elegant and theoretically sound method and present convincing experiments that will be interesting for the community.

• What is the ranking of this paper in your review stack?

1

• Number of papers in your stack

4

• Reviewer confidence

Confident but not absolutely certain

Primary Meta-Review

• Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

A solid and well-demonstrated work that provides an interesting solution to an important problem. The reviewers' suggestions for additional experiments and improved clarity will certainly benefit the future development of the paper.

• What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

3
