Authors

Zhen Chen, Meilu Zhu, Chen Yang, Yixuan Yuan

Abstract

Nowadays, deep learning methods with large-scale datasets can produce clinically useful models for computer-aided diagnosis. However, the privacy and ethical concerns are increasingly critical, which make it difficult to collect large quantities of data from multiple institutions. Federated Learning (FL) provides a promising decentralized solution to train model collaboratively by exchanging client models instead of private data. However, the server aggregation of existing FL methods is observed to degrade the model performance in real-world medical FL setting, which is termed as retrogress. To address this problem, we propose a personalized retrogress-resilient framework to produce a superior personalized model for each client. Specifically, we devise a Progressive Fourier Aggregation (PFA) at the server to achieve more stable and effective global knowledge gathering by integrating client models from low-frequency to high-frequency gradually. Moreover, with an introduced deputy model to receive the aggregated server model, we design a Deputy-Enhanced Transfer (DET) strategy at the client and conduct three steps of Recover-Exchange-Sublimate to ameliorate the personalized local model by transferring the global knowledge smoothly. Extensive experiments on real-world dermoscopic FL dataset prove that our personalized retrogress-resilient framework outperforms state-of-the-art FL methods, as well as the generalization on an out-of-distribution cohort. The code and dataset are available at https://github.com/CityU-AIM-Group/PRR-FL.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87199-4_33

SharedIt: https://rdcu.be/cyl4g

Link to the code repository

https://github.com/CityU-AIM-Group/PRR-FL

Link to the dataset(s)

https://github.com/CityU-AIM-Group/PRR-FL

Reviews

Review #1

Please describe the contribution of the paper

The authors investigate the problem of federated learning on datasets with real-world distribution shifts (due to imaging devices etc.). They aim to improve upon other FL algorithms through model personalization, based on the observation that local model performance typically drops directly after each aggregation step (termed “retrogress” here). As the main contribution, a method is proposed that combines central model aggregation, performed as a low-pass aggregation in the Fourier space of the parameters (conv kernels), with model distillation techniques during local training. The method is evaluated in simulated FL on a multi-centric dermoscopic dataset and compares favorably to several baselines. OOD generalization on an unseen institution is also improved by the method.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- As far as I know, a novel aggregation method in Fourier-transformed parameter space
- Well written and good to follow
- Comparison to several baselines performed
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- Some details of the method are not stated clearly (duration of DET? FedBN?)
- Hyperparameter tuning method not described
- The out-of-distribution generalization part (Sec. 3.4) is not well explained: it is unclear which client model was used and how the datasets differ.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
Positive:
- Hyperparameters seem complete
- Public datasets used
- Description of the study cohort (class statistics) included, but not complete (OOD experiments)
Negative:
- Not described how hparams were chosen and how sensitive to changes. How were baselines tuned?

Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

General:

Well written, mostly easy to follow (except some methods parts)

Clear structure and ablation studies; supplementary material with additional value

Introduction:

Given that the retrogress problem is given as the main motivation for the method, it should be backed up with additional references to support the generalizability of the method. Is retrogress commonly considered a problem?

Methods:

Overview: Although all parts are mentioned, the overall process is not fully clear: Is DET followed by E epochs of training or is the whole local training E epochs long?

Fig. 2: Need to explain all symbols. In particular, d, F^A, F^P, M have not been introduced when the figure is first mentioned. The DET part is in my opinion not very helpful and the gray arrow confusing.

PFA: Explanation of the way the parameters are transformed into the Fourier representation is unclear: Why does it make sense to reshape the parameter matrix the way done here? Why is it different than in ref [9]?

In Eq (3), there is an iteration over K in the second term (while w_k is used beforehand). This can be very confusing, and it might help to use another counter instead of ‘k’ for the second term.

DET: Since steps R and S are similar to model distillation, a reference would be adequate in my opinion. Equation 4: p(x) should rather be p(y|x).

Is the full dataset used for the KL-div. in each step, as indicated by eq.4? In the “sublimate” step, it would help if you state explicitly that d minimizes L_CE1 and p minimizes L_p.
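The distillation-style term this comment refers to can be illustrated with a minimal NumPy sketch. It is not the authors' Eq. 4: the function names `softmax` and `kl_divergence` are hypothetical, and the sketch assumes the loss is a KL divergence between class distributions p(y|x) predicted by two models, averaged over whatever mini-batch is passed in (which is exactly the point the reviewer asks the authors to clarify).

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax turning logits into class probabilities p(y|x)."""
    z = logits - logits.max(axis=1, keepdims=True)  # stabilize the exponent
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def kl_divergence(teacher_logits, student_logits, eps=1e-12):
    """Mean KL(teacher || student) over the rows (samples) of a batch.

    Assumption: both inputs have shape (batch, num_classes).
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    per_sample = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)
    return float(per_sample.mean())
```

Whether the average runs over a mini-batch or the full client dataset only changes which rows are passed to `kl_divergence`.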

Experiments:

Table 1: Another column with information about the data source would help. If I see it correctly, A-C are HAM10K and D is MSK. For HAM10K, indication of the institution/imaging device would be good. As another “client” is introduced in 3.4, this split should also be included here.

Hyperparameters: how were they selected? How sensitive is the method to r’s and lambdas? How were the baseline hyperparameters (if any) tuned?

How long does DET take (if it is done prior to E local epochs)?

Imbalances: Since the clients (esp. B) show quite some heterogeneities with respect to the class statistics, a short discussion would be in order. Does especially AUC eliminate impacts of class imbalances and does it make sense to take the mean over different AUCs (with different class statistics) ?

Table 2: The results for FML and “ours w/o PFA” appear to be very close (also in the Supplements). Thus, it would be interesting to investigate the effects/ benefits of DET vs FML in combination with PFA.

It would be good to give variances as given in the suppl. if multiple runs are done.

Table 2: For the AVG, the authors probably use the mean of the per-institution metrics and do not recompute the metrics over the whole dataset (i.e. the presented AVG is not weighted by the number of samples). It would be great if the authors could state that and explain their reasoning behind that.

Ablation study: You mention that your method is identical with FedBN except PFA and DET. So do you also have personalized BN statistics? This should be stated in the method description.

OOD generalization: Where is the analysis of statistical significance? Maybe add an upper baseline trained on the unseen client. Which client model was used?

Typos:

Conclusion last sentence: insert “a” before “real-world dermoscopic” and a out-of-distribution -> an out-of-distribution

Suppl:

The performance of the methods seems to get worse with more frequent updates (Suppl. Fig. 1). This appears to be a bit counter-intuitive, as more information can be exchanged (and an update after each batch should equate to normal gradient descent with a larger batch size…)
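The equivalence mentioned in this comment can be checked on a toy problem. The sketch below is only an illustration under stated assumptions: plain SGD without momentum, two clients with equally sized batches, a shared starting point, and simple parameter averaging at the server. The linear least-squares model and the name `one_round_vs_central` are hypothetical and unrelated to the paper's setup.

```python
import numpy as np

def grad_mse(w, X, y):
    """Gradient of the mean squared error of a linear model X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def one_round_vs_central(seed=0, lr=0.1):
    rng = np.random.default_rng(seed)
    w0 = rng.normal(size=3)                       # shared initial model
    XA, yA = rng.normal(size=(8, 3)), rng.normal(size=8)
    XB, yB = rng.normal(size=(8, 3)), rng.normal(size=8)

    # Each client takes one SGD step on its own batch, then the server averages.
    wA = w0 - lr * grad_mse(w0, XA, yA)
    wB = w0 - lr * grad_mse(w0, XB, yB)
    averaged = (wA + wB) / 2.0

    # One centralized SGD step on the concatenated (twice as large) batch.
    X, y = np.vstack([XA, XB]), np.concatenate([yA, yB])
    central = w0 - lr * grad_mse(w0, X, y)
    return averaged, central
```

Under these assumptions the two results coincide. The equivalence breaks as soon as clients take several local steps or keep personalized parameters.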

Please state your overall opinion of the paper

accept (8)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Overall, a well written paper that shows a novel approach and also shows clear improvements to the compared approaches. Some inaccuracies and minor issues could be discussed and better addressed.
What is the ranking of this paper in your review stack?

2
Number of papers in your stack

5
Reviewer confidence

Confident but not absolutely certain

Review #2

Please describe the contribution of the paper

This paper proposes a novel model aggregation technique based on transforming the model weights to the frequency domain, which could alleviate issues stemming from unique weights in individual model parameters because of data heterogeneity.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The method could adequately address the “retrogress” problem in FL, which hinders local model training and good global model aggregation.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
Although I agree with the authors that retrogress is a valid concern in FL, in practical terms, if appropriate data harmonization is done correctly and a model trained on publicly available data (for the problem at hand) is used for initialization, retrogress is significantly less pronounced. This fact is not mentioned anywhere in the paper.

Furthermore, the authors seem to have missed some immediately related literature. Here are a few examples that the authors should consider:
- https://doi.org/10.1038/s41746-020-00323-1
- https://doi.org/10.1038/s41598-020-69250-1
- https://doi.org/10.1038/s41746-021-00431-6
Some grammar consideration:
- Abstract
  - “Nowadays” > “These days”
- Introduction
  - “intelligent healthcare system” > “intelligent healthcare systems”
  - “parameters in element-wise” > “parameters in an element-wise manner”
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The method and datasets being used are described in detail, but without the source code reproducibility is a concern.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

Some interesting points are raised, but the authors have skipped related literature and made grammatical errors. They should also consider releasing their associated source code, as is common in this domain.
Please state your overall opinion of the paper

Probably accept (7)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

If the authors had not missed the associated literature, had been more cautious in their grammar and claims, and had shared their code, I would have recommended a direct “accept”.
What is the ranking of this paper in your review stack?

3
Number of papers in your stack

5
Reviewer confidence

Very confident

Review #3

Please describe the contribution of the paper

To combat retrogression, the phenomenon where federated models lose the ability to specialize to local datasets, the authors propose progressive Fourier aggregation (PFA) coupled with deputy-enhanced transfer (DET) to fuse multiple sites’ weights. This work assumes that low-frequency information in weights is safe to average, while high-frequency information is site-specific. Further, to discourage catastrophic forgetting, the DET component smoothly incorporates shared information. The authors compare to many competitive FL algorithms and also ablate their own method.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

This paper presents three main strengths: novelty in method (frequency-space aggregation), thorough comparison to other methods, and ablation studies with discussion on proposed novelties. Many references throughout to support the proposed method.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The paper would benefit from multiple runs to provide confidence intervals or statistical significance testing. Some visualization of correct/incorrect images would be nice too, although space restrictions make this difficult. These weaknesses are minor and will likely be covered in the journal follow-up.
Please rate the clarity and organization of this paper

Excellent
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

This work is fairly reproducible. The methods are clear and the datasets are public. There is some ambiguity on exact preprocessing and split of the data, but this is minor.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

Overall excellent paper with a clever approach and great results. Looking forward to a more comprehensive follow-up which investigates qualitative results further, as well as a more developed theory explaining why aggregation in the Fourier domain works.
Please state your overall opinion of the paper

ground-breaking (10)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Aside from the novelty, explanation, and numerical results, performing operations on neural network weights in the frequency domain is ground-breaking. The current scheme progressively increases the bandwidth of the LPF during aggregation. I anticipate future works to improve on this method. Overall outstanding.
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

7
Reviewer confidence

Very confident

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The reviews for this paper, which proposes a novel aggregation method in Fourier-transformed parameter space to address the retrogress problem in an FL setup, are positive throughout. I would like to congratulate the authors on this outstanding paper. It addresses an important problem of medical FL with a novel approach that is evaluated convincingly. Some details are missing or unclear, as highlighted by one of the reviewers, and should be added or clarified in the manuscript.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

2

Author Feedback

We thank the Meta-Reviewer and Reviewers for their valuable comments. We appreciate it very much that the reviewers gave very positive feedback.

R1 Q1: Is the retrogress commonly considered a problem? A: To the best of our knowledge, we are the first to explicitly point out the abrupt performance drop caused by the server aggregation of existing FL methods, which we term retrogress, and we propose PFA and DET to resolve it on the server and client side, respectively. Somewhat similar to the retrogress problem, ref [20] considered the permutation matching problem of network parameters in FL, which may be one of the causes of retrogress.

Q2: Is DET followed by E epochs of training or is the whole local training E epochs long? A: After each communication iteration, all FL methods conduct a total of E epochs of local training, which is a fair comparison setting. Specifically, the baselines with a single model on each client [12, 7, 1, 22, 8] conduct E epochs of training with cross-entropy loss, while FML [16] and our method conduct E epochs of training under mutual learning and DET, respectively.

Q3: Does it make sense to reshape the parameter matrix? Why is it different than in ref [9]? A: (1) Reshaping the parameter tensor of conv layer is a widely-used trick for further analysis, which follows the operations in [9] and [#1]. (2) The tasks of these two works are different. Ref [9] focused on the inference efficiency of CNN, and further replaced the spatial convolution with DFT-based matrix multiplication. In this work, we analyze the parameters in frequency domain to improve the server aggregation. [#1] Convolutional neural networks with low-rank regularization, ICLR, 2016.
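The idea of aggregating only a low-frequency band of the parameters, and widening that band over communication rounds, can be sketched as follows. This is a minimal illustration and not the authors' implementation (their code is in the linked repository). It assumes each layer has already been reshaped to a 2-D matrix, that the low-pass mask is a centered rectangle in the shifted spectrum, and that the band grows linearly. The names `low_pass_mask`, `fourier_aggregate`, and `band_ratio`, as well as the `start` and `end` values, are hypothetical.

```python
import numpy as np

def low_pass_mask(shape, ratio):
    """Centered rectangular mask over a shifted 2-D spectrum.

    `ratio` in (0, 1] is the fraction of each axis kept as "low frequency";
    the DC component is always kept.
    """
    h, w = shape
    ch, cw = h // 2, w // 2                      # DC position after fftshift
    hh, hw = int(h * ratio / 2), int(w * ratio / 2)
    mask = np.zeros(shape)
    mask[max(0, ch - hh):ch + hh + 1, max(0, cw - hw):cw + hw + 1] = 1.0
    return mask

def fourier_aggregate(client_weights, ratio):
    """Average low frequencies across clients, keep each client's high ones."""
    spectra = [np.fft.fftshift(np.fft.fft2(w)) for w in client_weights]
    mask = low_pass_mask(spectra[0].shape, ratio)
    shared_low = sum(mask * s for s in spectra) / len(spectra)
    merged = [(1.0 - mask) * s + shared_low for s in spectra]
    return [np.real(np.fft.ifft2(np.fft.ifftshift(m))) for m in merged]

def band_ratio(round_idx, total_rounds, start=0.2, end=1.0):
    """Hypothetical linear schedule widening the shared band per round."""
    return start + (end - start) * round_idx / max(1, total_rounds - 1)
```

With `ratio=1.0` the mask covers the whole spectrum and the result reduces to plain parameter averaging. Smaller ratios leave more of each client's own parameters untouched.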

Q4: Hyperparameters: how were they selected? A: The hyper-parameters were selected according to the performance on the validation set.

Q5: Imbalances: Does it make sense to take the mean over different AUCs? A: The category imbalance is a challenge for dermoscopic diagnosis, and the averaged AUC is commonly utilized to evaluate the performance on skin lesion tasks [#2]. In addition, we calculated the F1 score for a more comprehensive evaluation. [#2] A Mutual Bootstrapping Model for Automated Skin Lesion Segmentation and Classification, TMI, 2020.

Q6: Table 2, the authors probably use the mean of the per-institution metrics, rather than the average weighted by the number of samples. A: Yes, we adopted the macro average, rather than the micro average. The micro average gives each client the importance of sample size, which is likely to overlook some clients with fewer samples (e.g., client C). Therefore, the macro average is relatively more reasonable.
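The difference between the two averaging schemes discussed here can be made concrete with a small sketch. The per-client scores and sample counts below are made-up illustration values, not results from the paper, and the function names are hypothetical. "Sample-weighted" is used for what the answer calls the micro average.

```python
def macro_average(scores):
    """Unweighted mean of per-client metrics: every client counts equally."""
    return sum(scores) / len(scores)

def sample_weighted_average(scores, sizes):
    """Mean of per-client metrics weighted by each client's sample count."""
    return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)

# Hypothetical per-client scores and test-set sizes (illustration only).
scores = [0.90, 0.85, 0.60]
sizes = [1000, 800, 50]

macro = macro_average(scores)
weighted = sample_weighted_average(scores, sizes)
```

In this made-up example the small third client pulls the macro average down noticeably, while the sample-weighted average is dominated by the two large clients.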

Q7: Do you have personalized BN statistics? A: (1) The personalized model p on each client does not communicate with the server directly, which always holds its own BN statistics. (2) During the Recover step in DET, the deputy model d learns knowledge from the personalized model using client data. We empirically find that the deputy model using either the server BN statistics or personalized BN statistics does not affect the performance of the personalized model p. Therefore, the BN statistics are not the focus in our FL framework.

R2 Q1: Although retrogress is a valid concern in FL, if appropriate data harmonization is done correctly, retrogress is significantly less pronounced. A: Data harmonization may require additional information, e.g., data distribution of clients, which would violate the privacy setting of FL. In this work, we formulate the real-world FL task from a rigorous perspective. The proposed framework improves the server aggregation and client training to handle the retrogress, providing a fair setting for comparison with existing FL methods.

R3 Q1: Overall excellent paper with a clever approach and great results. Looking forward to a more comprehensive follow-up…works. A: We sincerely thank the reviewer for recognizing the merits of this work, and we will address the limitations.

Personalized Retrogress-Resilient Framework for Real-World Medical Federated Learning