
# Authors

Lucas Fidon, Michael Aertsen, Doaa Emam, Nada Mufti, Frédéric Guffens, Thomas Deprest, Philippe Demaerel, Anna L. David, Andrew Melbourne, Sébastien Ourselin, Jan Deprest, Tom Vercauteren

# Abstract

Deep neural networks have increased the accuracy of automatic segmentation, however their accuracy depends on the availability of a large number of fully segmented images. Methods to train deep neural networks using images for which some, but not all, regions of interest are segmented are necessary to make better use of partially annotated datasets. In this paper, we propose the first axiomatic definition of label-set loss functions that are the loss functions that can handle partially segmented images. We prove that there is one and only one method to convert a classical loss function for fully segmented images into a proper label-set loss function. Our theory also allows us to define the leaf-Dice loss, a label-set generalisation of the Dice loss particularly suited for partial supervision with only missing labels. Using the leaf-Dice loss, we set a new state of the art in partially supervised learning for fetal brain 3D MRI segmentation. We achieve a deep neural network able to segment white matter, ventricles, cerebellum, extra-ventricular CSF, cortical gray matter, deep gray matter, brainstem, and corpus callosum based on fetal brain 3D MRI of anatomically normal fetuses or with open spina bifida. Our implementation of the proposed label-set loss functions is available at https://github.com/LucasFidon/label-set-loss-functions

SharedIt: https://rdcu.be/cyl3b

# Link to the code repository

https://github.com/LucasFidon/fetal-brain-segmentation-partial-supervision-miccai21

https://zenodo.org/record/4541606#.YObzzHWYVhE

http://crl.med.harvard.edu/research/fetal_brain_atlas/

# Reviews

### Review #1

• Please describe the contribution of the paper

This paper addresses the problem of supervised deep learning for multi-label image segmentation from partially labelled data. It makes three main technical contributions: 1) introduces label-set loss functions and an axiom that they must satisfy to guarantee compatibility across label-sets and leaf-labels; 2) introduces leaf-Dice that is a generalization of Dice loss for missing labels; and 3) demonstrates that there is one and only one way to convert a classical segmentation loss for fully supervised learning into a loss function for partially supervised learning that complies with the introduced axiom. Leaf-Dice was used to segment 8 tissue types in fetal MRI scans, where it was shown to outperform three other methods.
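As a rough illustration of contribution 2, the sketch below computes a leaf-Dice-style loss in plain Python. It assumes the form 1 - (1/|L|) Σ_c [2 Σ_i 1(g_i = {c}) p_{i,c}] / [Σ_i 1(g_i = {c})^α + Σ_i p_{i,c}^α + ε], which is one reading of the loss and should be checked against the paper. The function name `leaf_dice_loss`, the list-based data layout, and the default values of `alpha` and `eps` are assumptions made for this example, not the authors' implementation.

```python
def leaf_dice_loss(probs, label_sets, num_classes, alpha=1, eps=1e-8):
    """Leaf-Dice-style loss (illustrative sketch, assumed formula).

    probs: list of per-voxel probability lists, each of length num_classes.
    label_sets: list of per-voxel sets of admissible leaf labels; a singleton
        set means the voxel is annotated with exactly that leaf label.
    """
    class_terms = []
    for c in range(num_classes):
        # 1.0 where the voxel is annotated with exactly the leaf label c.
        is_leaf_c = [1.0 if g == {c} else 0.0 for g in label_sets]
        intersection = sum(m * p[c] for m, p in zip(is_leaf_c, probs))
        denominator = (
            sum(m ** alpha for m in is_leaf_c)
            + sum(p[c] ** alpha for p in probs)
            + eps
        )
        class_terms.append(2.0 * intersection / denominator)
    return 1.0 - sum(class_terms) / num_classes


# Toy example (hypothetical data): 4 voxels, 3 leaf labels.
# Voxels 0-1 are annotated as label 0; voxels 2-3 only as "label 1 or 2".
label_sets = [{0}, {0}, {1, 2}, {1, 2}]
pred_inside_a = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0]]
pred_inside_b = [[1, 0, 0], [1, 0, 0], [0, 0, 1], [0, 0, 1]]
pred_outside = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0]]

loss_a = leaf_dice_loss(pred_inside_a, label_sets, num_classes=3)
loss_b = leaf_dice_loss(pred_inside_b, label_sets, num_classes=3)
loss_c = leaf_dice_loss(pred_outside, label_sets, num_classes=3)
```

Under this assumed form, the two predictions that stay inside the annotated set {1, 2} receive the same loss, while predicting label 0 on voxels whose set excludes it increases the loss.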

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

This is a solid work with sound theoretical development and proof for a generalization of promising loss functions to train deep learning models with partially (missing) labeled data.

The application in fetal MRI tissue segmentation showed encouraging results compared to three other methods. Good performance on an abnormal brain was interesting; although hydrocephalus is considered relatively easy. It could be much more interesting to test a case with agenesis of corpus callosum and see if the method would find a corpus callosum or not.

The paper was clear, well-motivated, and balanced very well between theory and application.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Some structures such as the corpus callosum might be hard to segment on fetal MRI, but the very low Dice values for structures such as the brainstem and cortical gray matter, especially with the baseline methods, are concerning (Table 2). Recent literature on automatic fetal brain MRI segmentation (with deep learning or atlas-based segmentation) suggests higher Dice values for those structures.

This is not a weakness per se, but it seems the technique can be used for any segmentation task. The focus on fetal MRI in the title and the opening paragraph of the introduction may thus reduce the broader impact of the work; however, I suppose a journal extension with more applications will mitigate this potential problem.

• Please rate the clarity and organization of this paper

Very Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

If the programs are not released, it may be hard to reproduce the methods.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

I am not sure if it is best to refer to the multi-label Dice generalization in [10] as mean Dice or generalized Dice. It seems generalized Dice was used in the title of [10] and in literature that used it.

The paper may benefit from a discussion of, and comparison to, the state of the art in fetal MRI tissue segmentation, for example on fetal cortical plate segmentation. Without such a discussion, it may misrepresent the state-of-the-art performance of tissue segmentation in fetal MRI.

The abnormality chosen was interesting. The test could potentially be extended to more challenging abnormalities such as callosal agenesis or cortical plate malformations.

Probably accept (7)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Overall, this is a well written paper that includes solid theoretical work on multi-label segmentation from partially labelled training data applied to an interesting application. The paper may be of general interest to MICCAI audience and attendees.

• What is the ranking of this paper in your review stack?

1

• Number of papers in your stack

6

• Reviewer confidence

Very confident

### Review #2

• Please describe the contribution of the paper

This paper presents a loss function to address learning from training data in which only a subset of labels are available. Rather than assume labels are independent, this allows the consideration of more realistic models of anatomy. Loss functions are experimentally evaluated in the setting of learning a classifier from partially or fully labeled data, and are assessed based on final image segmentation quality.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper presents an interesting formulation of a loss function, and a careful assessment of its utility in comparison to alternative strategies. The results demonstrate the method is effective in increasing the accuracy of segmentation of fetal brain MRI from partially labeled training data.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Some aspects of the description of the loss function remain unclear. Additional description of the notation would help to clarify the description.

The evaluation requires that aspects of the training that are held constant, such as a batch size of 3, be optimal for all loss functions. This is assumed but not demonstrated. It may be that hyper parameters related to training interact with the loss function to change performance.

• Please rate the clarity and organization of this paper

Very Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The only mention of reproducibility is a footnote that references the MICCAI checklist in the context of describing a private dataset. It seems these data are not available for replication experiments.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

The primary weaknesses are the description of the rationale for the loss function, and the experimental nature of the evidence in favor of the loss function. The rationale for the loss function, and related loss functions, is described in Sections 2.1, 2.2 and 2.3. It is described that the function \Phi is not unique, and the particular form of \Phi presented between Equations 1 and 2 and Equations 5 and 6 is motivated by “popularity”. The circumstances under which this is the optimal choice or a favorable choice are not described.

The paper describes that partial ground truth segmentations may be considered as being drawn from the set of all partial label sets, that is, the power set of the label set L. In turn some partial labelling is observed at each of N voxels. It is less clear if this model is intended to account for labels that are correctly not observed at some voxels because they would be erroneous labelings, or if the possibility of wrong or “noisy” labelings is explicitly accounted for. The paper would be strengthened by further commentary on the reason for the incomplete labelings, and whether every labelling is regarded as equally informative.
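The label-set model discussed in this comment can be made concrete with a small sketch. Each voxel carries a non-empty subset of the label set L, and a fully annotated voxel carries a singleton. The conversion `phi_uniform` below, which spreads the target mass uniformly over the labels in each set, is an assumed choice for illustration and may not match the \Phi used in the paper. The wrapper `as_label_set_loss` and the cross-entropy stand-in for a classical loss are likewise hypothetical names.

```python
import math


def phi_uniform(label_sets, num_classes):
    """Map each voxel's label-set to a uniform distribution over its labels.

    Assumes every label-set is non-empty (an assumption of this sketch).
    """
    return [
        [1.0 / len(g) if c in g else 0.0 for c in range(num_classes)]
        for g in label_sets
    ]


def cross_entropy(probs, targets, eps=1e-12):
    """Classical loss for fully segmented images: mean soft-target cross-entropy."""
    total = 0.0
    for p, t in zip(probs, targets):
        total -= sum(t_c * math.log(p_c + eps) for p_c, t_c in zip(p, t))
    return total / len(probs)


def as_label_set_loss(classical_loss, num_classes):
    """Wrap a classical loss so that it accepts per-voxel label-sets."""
    def loss(probs, label_sets):
        return classical_loss(probs, phi_uniform(label_sets, num_classes))
    return loss


# Toy example (hypothetical data): voxel 0 is fully annotated as label 0,
# voxel 1 is only known to be label 1 or label 2.
label_sets = [{0}, {1, 2}]
soft_targets = phi_uniform(label_sets, num_classes=3)
label_set_ce = as_label_set_loss(cross_entropy, num_classes=3)
value = label_set_ce([[0.8, 0.1, 0.1], [0.2, 0.4, 0.4]], label_sets)
```

On singleton label-sets the wrapped loss reduces to the classical loss, which is the compatibility behaviour one would expect from any such conversion.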

The utility of the proposed loss function is examined by experimental comparison to the accuracy of segmentations predicted from training the same network with different loss functions. These experiments suggest the loss function is effective at increasing the quality of the predicted segmentations. However, there are many experimental parameters associated with the optimization of the network parameters, and it is not clear if the same utility of the loss function would appear for different selections of hyper-parameters.

These considerations are minor and don’t detract from the demonstration that in practice increased segmentation quality is possible in this way. Future work may explore some of these open questions or further generalize the approach to more scenarios for segmentation labelling.

accept (8)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper presents an interesting concept and a reasonable validation study to suggest there is substantial practical merit in the approach. The scenario of supervised learning from partially labelled data is a common one, and the technique may be widely applicable to this type of problem.

• What is the ranking of this paper in your review stack?

1

• Number of papers in your stack

1

• Reviewer confidence

Very confident

### Review #3

• Please describe the contribution of the paper

In this paper, the authors state the first axiomatic definition of loss functions that can handle partially segmented images. Moreover, they introduce the leaf-Dice, a generalization of the Dice loss suitable for partial supervision with missing labels. Finally, they propose a semi-supervised approach for fetal brain 3D MRI segmentation showing competitive performance in a multi-center dataset.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

- The first axiomatic definition of label-set loss functions is proposed.
- The leaf-Dice is introduced as a loss function for cases with partial labels. This loss function could also be easily applied to other image segmentation tasks.
- The first application of a partially supervised framework to fetal brain 3D MRI segmentation is proposed. For this purpose, a large number of partially labeled cases is used, and the results obtained significantly outperform a fully-supervised baseline.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

- Several claims are not justified or proven in the manuscript.
- The proposed evaluation does not seem strong enough, for several reasons. First, a 5-level 3D U-Net is trained as the supervised baseline with 18 cases. This is a very limited number of labelled images, and the network very likely overfits the training dataset. Second, no state-of-the-art supervised method is considered for comparison. Third, the authors consider the publicly available FeTA dataset for testing. However, from this dataset they select 30 out of 40 cases based on a quality control. Given that the manual segmentations are provided, there is no reason why these cases should be excluded.
- The statistical analysis provided is not convincing and is missing comparisons between the semi-supervised methods.

• Please rate the clarity and organization of this paper

Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The code will be released publicly; however, the private dataset will not be made available.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

- The space of the manuscript should be better organised. In the first half of the paper the authors constantly refer to the Supplementary material for the proofs of their claims. As a reminder, the paper should include all the important information by itself, and therefore at least some of these proofs should be included in the main manuscript.
- Excluding 10 cases from a publicly available dataset that provides the segmentations does not seem fair. Why not split the results based on the quality control performed on this dataset? Alternatively, only the healthy subjects from this dataset could be included.
- The evaluation of at least one state-of-the-art supervised method should be included.
- The comparison between the supervised baseline and the proposed semi-supervised methods does not seem fair. The semi-supervised methods have 9 times more training data than the supervised one. This should at least be acknowledged and discussed.
- Given the limited amount of training data for the supervised 3D U-Net, a shallower network should be tested. The proposed 5-level 3D U-Net usually needs more labeled cases to achieve competitive results.
- The statistical analysis should also include comparisons between the semi-supervised methods, rather than only with the baseline (Table 2).
- Page 7: “The qualitative results of Fig. 2 also suggest that Leaf-Dice performs significantly better than the other approaches”. This should be rephrased, as we cannot infer statistically significant results from the visual analysis of a single image.
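To make the request for pairwise comparisons between methods concrete, the sketch below runs an exact paired sign-flip permutation test on per-case scores of two methods evaluated on the same test cases. This is a generic technique chosen for illustration, not the statistical analysis used in the paper. The function name `paired_sign_flip_test` and the scores in the usage example are hypothetical.

```python
from itertools import product


def paired_sign_flip_test(scores_a, scores_b):
    """Two-sided exact paired permutation test on the sum of differences.

    Enumerates all 2**n sign assignments of the per-case differences, so it
    is only practical for a small number of cases n.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        statistic = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point round-off.
        if statistic >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / total


# Hypothetical per-case scores for two methods on the same five test cases.
method_a = [5, 5, 5, 5, 5]
method_b = [4, 4, 4, 4, 4]
p_value = paired_sign_flip_test(method_a, method_b)
```

With five cases and a consistent difference in one direction, only the all-positive and all-negative sign assignments are as extreme as the observed one, so the smallest attainable two-sided p-value is 2/32. This also shows why a small test set limits what such comparisons can establish.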

borderline reject (5)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The authors propose several novel ideas and the first semi-supervised method for fetal brain MRI segmentation is introduced. However, the manuscript should be better organised as several proofs are missing. Moreover, the evaluation performed has some weak points and it lacks a comparison with a state-of-the-art method. This should be improved for accepting the manuscript.

• What is the ranking of this paper in your review stack?

3

• Number of papers in your stack

5

• Reviewer confidence

Confident but not absolutely certain

# Primary Meta-Review

• Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

This paper introduces a label-set loss function, a generalization of Dice for missing labels, and then applies this framework (for the first time) to brain tissue segmentation of fetal brain MRI with missing labels using a semi-supervised approach, on a combination of publicly available and in-house datasets. The reviewers and this meta-reviewer acknowledge the novelty and value of this work, which can be of great interest beyond the fetal MRI application. However, I would like the authors to clarify/justify some points.

As pointed out by R1, the Dice values reported for the baseline are low (brainstem and cortical GM, for instance), and the values reported for SOTA methods in fetal brain segmentation (DL or atlas-based techniques, see works published by Gholipour et al. or by Payette et al.) suggest higher performance. Could the authors discuss these differences, particularly when the FeTA dataset is used? This work would have benefited from including a SOTA method for baseline comparison.

Somewhat related are R3's concerns that the comparison is confounded by the fact that the semi-supervised approaches have 9x more training data. Could the authors discuss/justify this?

Also, the concerns raised by R3 about training a 3D U-Net with so few cases need to be clarified; a shallower architecture might indeed be better suited. In addition, why exclude 10 cases from the FeTA dataset? Could the authors justify this? These cases could all have been useful to include in training, as in the reference FeTA paper.

R2 also raises an important point: possible interactions between the loss function and the hyperparameters might lead to different results. Could the authors discuss this point?

I would also recommend that the authors discuss the existing techniques in more detail (e.g. also those related to the cortical plate), as suggested by R1.

Please remember that the main purpose of this rebuttal is to provide clarification or to point out misunderstandings, and to include new details that can better highlight the value of this work. I will not consider any promise of adding future experiments and results.

• What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

1

# Author Feedback

We thank the reviewers and the meta-reviewer for their time and their feedback.

COMPARISON TO SOA ON FETAL BRAIN SEGMENTATION

For cortical gray matter, deep gray matter, and brainstem, the only testing MRIs that are annotated are the ones from the FeTA dataset. Those MRIs are out of distribution for all the deep learning methods since they were acquired at a different center than the MRIs used for training. This explains why the Dice scores are low compared to the literature, in which the evaluation is done on MRIs from the same center as the MRIs used for training or to build the atlas.

We are currently working on a journal paper with more comparisons between deep learning and atlas-based segmentation methods. Preliminary results suggest that atlas-based methods tend to suffer less from out-of-distribution gaps.

COMPARISON TO FULLY SUPERVISED LEARNING

3D U-Net with 5 levels or more has been shown to be state-of-the-art for 23 medical image segmentation datasets in [1]. This includes tasks with as few as 20 training volumes. The common way to tackle this issue, as in [1], is to use data augmentation as we did (see section 4).

The goal of this work is not to set a new state of the art in fully supervised fetal brain parcellation, but to show that partial supervision can be used successfully. The fully supervised learning approach is expected to provide an upper bound for segmentation accuracy if all the tissue types are segmented for all the training volumes.

The 18 volumes of the Harvard atlas, used for training the fully supervised baseline, are the only MRIs that are fully segmented in our training set (see Table 1).

The 18 volumes of the Harvard atlas represent average fetal brains. This might explain the limited generalization to real fetal MRIs. Given this domain gap, we think that the fully supervised baseline achieves decent segmentation performance (Table 2).

EXCLUDED VOLUMES

The 40 MRIs with manual segmentations of the FeTA dataset [2] (first release) were inspected by two pediatric radiologists with more than 8 years of experience in segmenting fetal brains.

They found and corrected some inconsistencies in the manual segmentations with respect to the segmentation guidelines used in the FeTA dataset paper [2]. They also excluded 10 MRIs, as the quality of the 3D MRI was rated as insufficient to produce a reliable manual segmentation (see section 3).

We have now evaluated the Dice scores and Hausdorff distances (with no label refinement) for those 10 volumes. The rank of the methods remains the same.
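For readers unfamiliar with the two metrics mentioned here, the sketch below computes a Dice score and a symmetric Hausdorff distance for one structure represented as a set of voxel coordinates. It is a minimal illustration: it uses the plain maximum Hausdorff distance over all voxels in voxel units, whereas the evaluation in the paper may use a surface-based or percentile variant and physical spacing. The function names and the toy coordinates are hypothetical.

```python
import math


def dice_score(pred, ref):
    """Dice overlap between two sets of voxel coordinates."""
    if not pred and not ref:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    return 2.0 * len(pred & ref) / (len(pred) + len(ref))


def hausdorff_distance(pred, ref):
    """Symmetric Hausdorff distance between two non-empty sets of voxels."""
    def directed(source, target):
        # Largest distance from a source voxel to its nearest target voxel.
        return max(min(math.dist(p, q) for q in target) for p in source)
    return max(directed(pred, ref), directed(ref, pred))


# Toy example (hypothetical coordinates).
pred = {(0, 0, 0), (1, 0, 0)}
ref = {(0, 0, 0), (3, 0, 0)}
dice = dice_score(pred, ref)          # 2 * 1 / (2 + 2) = 0.5
hd = hausdorff_distance(pred, ref)    # voxel (3, 0, 0) is 2 away from pred
```

The brute-force nearest-neighbour search is quadratic in the number of voxels, so it is only suitable for small examples like this one.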

Including those 10 MRIs for training (rather than excluding them from the test set), as in [2], may help. We chose not to use those 10 volumes for training so as to be able to evaluate segmentation accuracy on 3D MRIs from a center not seen by the networks during training.

HYPERPARAMETERS

To avoid a bias towards our method, the learning rate was tuned for the fully supervised baseline trained with the Dice loss and Adam (see section 4). The batch size was chosen to be as high as possible. We did not try other values and we acknowledge that it might be suboptimal for the other loss functions, including our proposed Leaf-Dice.

Adam has been shown to be robust to the exact choice of hyperparameters as compared to other optimizers [3]. In addition, all the loss functions that we compared have the same range of values since they are all based on the Dice loss. This suggests that the optimal learning rate should be similar for all the loss functions that we compared.

MATHS (R3)

All our mathematical claims are proved in the supplementary material. This follows the author guidelines: “Authors will be able to submit supplementary materials in the form of […] proof of equations at the time of paper submission”.

Statistical comparison between partially supervised methods is discussed in the text (please see section 4).

[1] doi.org/10.1038/s41592-020-01008-z

[2] arXiv:2010.15526

[3] arXiv:2007.01547

# Post-rebuttal Meta-Reviews

## Meta-review #1 (Primary)

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

6

## Meta-review #2

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This paper proposes a novel approach to a very practical problem in medical image segmentation. All reviewers seemed to appreciate the premise of the paper, but raised some concerns about the evaluation. The rebuttal responded quite reasonably to the concerns.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

2

## Meta-review #3

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

All the main issues have been well addressed in the rebuttal.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

4