# Authors

Holger R. Roth, Dong Yang, Wenqi Li, Andriy Myronenko, Wentao Zhu, Ziyue Xu, Xiaosong Wang, Daguang Xu

# Abstract

Building robust deep learning-based models requires diverse training data, ideally from several sources. However, these datasets cannot be combined easily because of patient privacy concerns or regulatory hurdles, especially if medical data is involved. Federated learning (FL) is a way to train machine learning models without the need for centralized datasets. Each FL client trains on their local data while only sharing model parameters with a global server that aggregates the parameters from all clients. At the same time, each client’s data can exhibit differences and inconsistencies due to the local variation in the patient population, imaging equipment, and acquisition protocols. Hence, the federated learned models should be able to adapt to the local particularities of a client’s data. In this work, we combine FL with an AutoML technique based on local neural architecture search by training a “supernet”. Furthermore, we propose an adaptation scheme to allow for personalized model architectures at each FL client’s site. The proposed method is evaluated on four different datasets from 3D prostate MRI and shown to improve the local models’ performance after adaptation through selecting an optimal path through the AutoML supernet.
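
The parameter-sharing loop described in the abstract can be illustrated with a minimal sketch of server-side aggregation. The abstract only says that the server "aggregates the parameters from all clients"; weighting each client by its local sample count (FedAvg-style averaging) is an assumption here, and the names `federated_average` and `Params` are hypothetical.

```python
from typing import Dict, List

# Hypothetical representation: a model is a dict mapping parameter names
# to flat lists of floats.
Params = Dict[str, List[float]]


def federated_average(client_params: List[Params], num_samples: List[int]) -> Params:
    """Aggregate client parameters by a sample-weighted average (FedAvg-style).

    Only parameters reach the server; the clients' images never leave their sites.
    """
    if not client_params or len(client_params) != len(num_samples):
        raise ValueError("need one sample count per client")
    total = sum(num_samples)
    global_params: Params = {}
    for name in client_params[0]:
        averaged = [0.0] * len(client_params[0][name])
        for params, n in zip(client_params, num_samples):
            weight = n / total  # clients with more data contribute more
            for i, value in enumerate(params[name]):
                averaged[i] += weight * value
        global_params[name] = averaged
    return global_params


# Usage: two clients with different amounts of local data.
client_a = {"conv.weight": [1.0, 2.0]}
client_b = {"conv.weight": [3.0, 6.0]}
print(federated_average([client_a, client_b], [1, 3]))  # → {'conv.weight': [2.5, 5.0]}
```

In a real FL round, each client would first run local training on its own data and then send the updated parameters; the server would broadcast the averaged result back for the next round.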

SharedIt: https://rdcu.be/cyl4h

# Link to the code repository

https://monai.io

http://medicaldecathlon.com

https://promise12.grand-challenge.org

http://doi.org/10.7937/K9/TCIA.2015.zF0vlOPv

# Reviews

### Review #1

• Please describe the contribution of the paper

This paper combines FL with an AutoML technique based on local neural architecture search by training a “supernet”.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

Clear explanation; clear structure; comprehensive experiments.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Missing important references; limited novelty.

• Please rate the clarity and organization of this paper

Very Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Easy to reproduce

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

The main problem is that two related works are missing: Real-time Federated Evolutionary Neural Architecture Search (https://arxiv.org/pdf/2003.02793.pdf) and Towards Non-I.I.D. and Invisible Data with FedNAS: Federated Deep Learning via Neural Architecture Search (https://arxiv.org/pdf/2004.08546.pdf). Note that both works first appeared online in April 2020.

Though the task of this paper is segmentation, which is different from the classification task in the above two works, the main idea is the same.

I still like this paper, as it is the first work that combines FL and NAS for segmentation, which is an important task in the medical domain.

I suggest that the authors include the two works mentioned above and discuss in sufficient detail how their approach differs.

Probably accept (7)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Novelty, Experiment

• What is the ranking of this paper in your review stack?

2

• Number of papers in your stack

4

• Reviewer confidence

Confident but not absolutely certain

### Review #2

• Please describe the contribution of the paper

The work proposes an autoML approach for multiple datasets in a federated learning manner. Once the server-side supernet is trained via a federated learning framework, the subnetworks for the local datasets are further adapted by finding optimal structures.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

1. The results show some improvements.
2. Trying to find an optimal architecture for each dataset based on the server-side model is an interesting idea.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

1. Are the different path combinations explicitly explored? How do you ensure that the paths are properly considered throughout the training?
2. Some of the domain adaptation examples are unsupervised domain adaptation methods, where the sample labels are not provided during training. The proposed work is closer to transfer learning, which performs a training step on a target dataset, possibly without any constraints.
3. It is very unclear how the paths are chosen. It appears that this is a combinatorial process. I am also not sure what the path weights are.
4. It would have been interesting to see the different paths taken by the datasets.
5. If the subnetworks are allowed to be updated via another optimal path search, could they instead be updated directly through their weights? This seems to be a crucial missing baseline. In other words, U-Net + adapt (via weight updates) and even SN + adapt (via weight updates) seem to be fair game.

• Please rate the clarity and organization of this paper

Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

No code is provided, and several algorithmic details are missing.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

My comments are in the weaknesses section. I would greatly appreciate it if the authors could answer those questions.

Borderline reject (5)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Although the work tackles an important problem, none of the contributions is well motivated. Federated learning mostly serves to formulate a hypothetical situation, which I agree is an important problem. AutoML is simply used as a way to locally adapt the subnetworks, but the proposed path-finding method is weakly motivated and lacks the details needed to understand the process. The “domain adaptation” is a natural byproduct of the federated learning framework with a local update and does not seem to be specific to the proposed optimal path-finding AutoML.

• What is the ranking of this paper in your review stack?

2

• Number of papers in your stack

5

• Reviewer confidence

Confident but not absolutely certain

### Review #3

• Please describe the contribution of the paper

This paper presents federated learning as a solution to GDPR concerns regarding medical datasets. The deep learning model is trained locally on the datasets of the particular hospital or research unit, and only the model parameters are shared. Thus, specific aspects of the local datasets, such as the image acquisition protocol, are implicitly preserved. The focus is on AutoML for setting up the network architecture in specific diagnostic domains.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The introductions to AutoML, federated learning, and domain adaptation are nicely prepared.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Many, many basics, such as the average loss or cross-entropy, are explained. Here, the paper could be more compact, referencing “well known facts” and focusing on the specific contribution with more extensive testing and evaluation. It is unclear how the uniqueness of the datasets across hospital borders is guaranteed. Local test images are often enriched with publicly available datasets, which could lead to a massive bias, e.g., if several sites also use MedDecathlon. Splitting a dataset to simulate local learning is rather a lab setup; in reality, heterogeneity will be higher. Thus, the research needs to be proven in real-world scenarios too. The general approach, with a hospital-centric supermodel, AutoML, and federated learning for locally optimized models, is hard to deploy as a product and much harder to use in practice than a locally trained and then shipped model.

• Please rate the clarity and organization of this paper

Very Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Well-known public datasets are used: MSD-Prostate, PROMISE, NCI-ISBI.

However, the source code is not available, and the paper lacks implementation details. Some of the optimizer configurations are stated, but statements such as “Augmentation techniques like random intensity shifts, contrast adjustments, and adding Gaussian noise are applied during training to avoid overfitting to the training set.” do not allow anything to be reproduced.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

Easy to follow; perhaps too many basics are presented.

Good tests highlighting the potential for generalization.

Scientific novelty: some level of innovation. Well-known strategies such as AutoML and federated learning are applied.

Probably accept (7)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Solid work, good structure, lacking some details for reproducibility, too many basics delineated.

• What is the ranking of this paper in your review stack?

3

• Number of papers in your stack

4

• Reviewer confidence

Confident but not absolutely certain

# Primary Meta-Review

• Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The reviewers agree on the interest of this work in combining NAS and domain adaptation for FL applications. Concerns are expressed with respect to the lack of clarity. While basic notions are generally provided in great detail, the motivation and several crucial aspects of the proposed methodology are not clearly illustrated (R2 and R3). In particular, the combinatorial aspect of the optimization scheme is not clear, and a comparison with baseline approaches seems to be lacking (R2).

• What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

4

# Author Feedback

We thank the reviewers for their valuable comments. All reviewers agree that the work is of interest and comment on its relevance for medical applications with regard to regulatory restrictions such as GDPR (R3). R1 and R3 see the work positively, commenting on it being “novel” and a “solid work”, while R2 is mainly asking for some clarifications regarding the method and implementation.

We are happy to address these comments in the rebuttal and integrate the clarifications into the camera-ready version:

“1. Are the different path combinations explicitly explored?…” During training, we randomly select a path through the supernet at each local mini-batch iteration at a client. We monitor convergence on randomly chosen paths sampled from a uniform distribution during each validation to determine when the supernet is sufficiently trained across clients. As stated in the paper, our supernet has (3^3)*(4^6) = 110,592 possible paths. We selected the number of training iterations such that each path is expected to be selected at least once during the entire training. However, the length of training was mainly driven by the convergence behavior mentioned above.
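
The path count and uniform per-iteration sampling described in this answer can be sketched as follows. The layout of three blocks with 3 candidate operations and six blocks with 4 candidate operations is a hypothetical reading of the (3^3)*(4^6) count, and the iteration criterion is interpreted as an expected count of at least one selection per path, assuming iterations are counted over all clients together. Names such as `NUM_CHOICES` and `min_iterations` are illustrative only.

```python
import math
import random
from typing import List

# Hypothetical supernet layout matching (3^3)*(4^6): three blocks with
# 3 candidate operations and six blocks with 4 candidate operations.
NUM_CHOICES: List[int] = [3] * 3 + [4] * 6


def num_paths(num_choices: List[int]) -> int:
    """Number of distinct paths = product of the choices per block."""
    return math.prod(num_choices)


def sample_path(num_choices: List[int], rng: random.Random) -> List[int]:
    """Pick one candidate operation per block independently and uniformly,
    which makes every complete path equally likely."""
    return [rng.randrange(c) for c in num_choices]


def min_iterations(num_choices: List[int], expected_visits: float = 1.0) -> int:
    """Iterations needed so that each path is sampled `expected_visits` times
    on average (uniform sampling: expected visits = iterations / num_paths)."""
    return math.ceil(expected_visits * num_paths(num_choices))


rng = random.Random(0)
print(num_paths(NUM_CHOICES))         # → 110592
print(sample_path(NUM_CHOICES, rng))  # one operation index per block
print(min_iterations(NUM_CHOICES))    # → 110592
```

Sampling each block independently is the simplest way to obtain a uniform distribution over all paths without enumerating the 110,592 combinations explicitly.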

“2. … unsupervised domain adaptations (…). The proposed work is closer to transfer learning…” We agree with this comment in the sense that our domain adaptation step is supervised. However, transfer learning would typically imply that a new dataset is being used as the transfer target. In our setting, the “domain adaptation” step does not require any new data. We use the existing training and validation data on a client to find the locally best path through the supernet, thereby finding a locally optimal network architecture that is personalized to that particular client. Note that all results reported in the paper are based on independent test splits not used for training, validation, or the adaptation step.

“3. It is very unclear how the paths are chosen…” Please see our comment above. Paths are uniformly sampled during training (there are many different possible combinations of paths). The final “personalized” pathways are chosen by the described domain adaptation step.

“4. … see the different paths taken by the datasets.” Thank you for the comment. We can visualize or list the pathways for each client and provide them as supplementary material. In descending order, the most commonly chosen operations were 3D convolution, 3D residual block, and 2D convolution, followed by identity.

“5. … [update networks] directly on their weights instead? …” Thank you for the suggestion. Our current work focuses on the model architecture personalization aspect. Hence, during the domain adaptation step, the supernet layer weights stay fixed (are not optimized) and only the path weights are optimized. Further fine-tuning of the network weights (not the path weights) is likely to give a performance boost on a local client but is also expected to reduce the generalizability of the model. Fine-tuning methods that do not reduce the robustness to other data sources (i.e., generalizability) gained through FL (e.g., learning without forgetting) are still an open research question and were deemed out of scope for this work.
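
The adaptation step described in this answer (supernet layer weights frozen, only path weights optimized) can be illustrated with a toy sketch. The rebuttal does not say how the path weights are parameterized, so a softmax mixture over candidate operations per block (a DARTS-style continuous relaxation) is an assumption, as are the scalar "operations" standing in for the real 3D/2D convolutions and the finite-difference gradients used to keep the sketch dependency-free. All names (`FROZEN_GAINS`, `adapt_path_weights`) are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical toy supernet: two blocks whose candidate operations are frozen
# scalar gains standing in for the fixed supernet layer weights.
FROZEN_GAINS: List[List[float]] = [[1.0, 2.0, 0.5], [1.0, 0.3]]


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x: float, alphas: List[List[float]]) -> float:
    """Mix each block's candidate ops with softmax(path weights)."""
    h = x
    for gains, logits in zip(FROZEN_GAINS, alphas):
        probs = softmax(logits)
        h = sum(p * g * h for p, g in zip(probs, gains))
    return h


def local_loss(alphas: List[List[float]], data: Sequence[Tuple[float, float]]) -> float:
    """Mean squared error on the client's local (training/validation) data."""
    return sum((forward(x, alphas) - y) ** 2 for x, y in data) / len(data)


def adapt_path_weights(data, steps: int = 200, lr: float = 0.5, eps: float = 1e-5):
    """Gradient descent on the path weights only; FROZEN_GAINS never change.

    Central finite differences approximate the gradient of the local loss.
    """
    alphas = [[0.0] * len(gains) for gains in FROZEN_GAINS]
    for _ in range(steps):
        grads = []
        for block in alphas:
            row = []
            for j in range(len(block)):
                block[j] += eps
                up = local_loss(alphas, data)
                block[j] -= 2 * eps
                down = local_loss(alphas, data)
                block[j] += eps  # restore the original value
                row.append((up - down) / (2 * eps))
            grads.append(row)
        for block, row in zip(alphas, grads):
            for j, g in enumerate(row):
                block[j] -= lr * g
    # Discretize: keep the highest-weighted operation in every block.
    path = [max(range(len(block)), key=block.__getitem__) for block in alphas]
    return path, alphas


# Hypothetical local client data with target y = 2x.
client_data = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5)]
path, _ = adapt_path_weights(client_data)
print(path)  # → [1, 0]
```

Because only `alphas` is updated, the shared supernet weights learned through FL are left untouched, which mirrors the stated motivation of preserving robustness to other sites; the weight fine-tuning baseline suggested by R2 would instead update `FROZEN_GAINS`.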

We hope these answers clarify the unclear points and further motivate our choice of client-specific, personalized model architectures. As pointed out by R2, further local fine-tuning of the locally optimal architectures would likely benefit a specific client site but might come with reduced robustness to other sites. We would also like to note that FL has been used in several real-world healthcare and medical imaging studies, and we disagree with it being a “hypothetical situation”.

We will work on integrating these clarifications and other constructive feedback (such as the missing references mentioned by R1 and reducing on basic formulations as pointed out by R3 and AC) into the final manuscript.

# Post-rebuttal Meta-Reviews

## Meta-review # 1 (Primary)

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

Reviewers and AC found the proposed application novel and relevant, and the experimental assessment comprehensive. The main criticisms concern novelty and the missing comparison with baseline works, the lack of clarity on the model, and reproducibility. The rebuttal mainly focuses on clarifying the optimization scheme for the proposed architecture, especially the AutoML part. The most critical aspect seems to be the combinatorial nature of the optimization procedure. Further explorations seem necessary to justify the relevance of the method, for example concerning training stability and sensitivity analysis. In this sense, a comparison with simpler adaptation frameworks also seems important. Moreover, positioning the work with respect to previous approaches combining FL and NAS is necessary. Nevertheless, the idea proposed in this work is original and the application relevant enough to deserve discussion during the conference.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

12

## Meta-review #2

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This paper proposes to add an adaptation scheme to federated learning across multiple client sites. Evaluation is on 3D prostate MRI on four datasets.

One reviewer finds the paper and evaluation clear but questions its novelty with respect to prior classification papers.

A second reviewer questions the motivation and novelty of each proposed component (FL+AutoML), as well as technical details on the branch setups.

A third reviewer raises questions about real-world scenarios and the manuscript structure.

The general consensus is an appreciation of combining federated learning with neural architecture search by training a supernet. The manuscript needs rebalancing to provide more detail on the true contribution rather than on common-knowledge terms. I believe this is realistic for a final minor revision to improve the work, as promised in the rebuttal.

For these reasons, the recommendation is toward acceptance.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

9

## Meta-review #3

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This paper brings together several existing ideas to propose a solution to federated learning for medical image segmentation. The reviewers were generally positive about the paper but pointed out limited novelty and other questions, including the choice of methods used as baselines. Even after reading the rebuttal, I am not entirely clear about the argument for why unsupervised domain adaptation rather than transfer learning should be used as the baseline. However, considering the contribution of exploring federated learning in the context of segmentation, I agree with the overall reviewer sentiment that this paper could be valuable to the MICCAI community.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

10