
# Authors

Nourhan Bayasi, Ghassan Hamarneh, Rafeef Garbi

# Abstract

Despite recent advances in deep learning based medical image computing, clinical implementations in patient-care settings have been limited with lack of sufficiently diverse data during training remaining a pivotal impediment to robust real-life model performance. Continual learning (CL) offers a desirable property of deep neural network models (DNNs), namely the ability to continually learn from new data to accumulate knowledge whilst retaining what has been previously learned. In this work we present a simple and effective CL approach for sequential multi-domain learning (MDL) and showcase its utility in the skin lesion image classification task. Specifically, we propose a new pruning criterion that allows for a fixed network to learn new data domains sequentially over time. Our MDL approach incrementally builds on knowledge gained from previously learned domains, without requiring access to their training data, while simultaneously avoiding catastrophic forgetting and maintaining accurate performance on all domain data learned. Our new pruning criterion detects *culprit units* associated with wrong classification in each domain and releases these units so they are dedicated for subsequent learning on new domains. To reduce the computational cost associated with retraining the network post pruning, we implement MergePrune, which efficiently merges the pruning and training stages into one step. Furthermore, at inference time, instead of using a test-time oracle, we design a smart gate using Siamese networks to assign a test image to the most appropriate domain and its corresponding learned model. We present extensive experiments on 6 skin lesion image databases, representing different domains with varying levels of data bias and class imbalance, including quantitative comparisons against multiple baselines and state-of-the-art methods, which demonstrate superior performance and efficient computations of our proposed method.

SharedIt: https://rdcu.be/cyl8b


# Reviews

### Review #1

• Please describe the contribution of the paper

The paper presents a continual learning method for sequential multi-domain learning (MDL). The authors propose a unit pruning scheme that identifies and prunes ‘culprit units’ that are likely to be less important for the classification prediction. A ‘MergePrune’ technique is then proposed to retrain the removed units when training on data from a new domain. The proposed method was applied to and tested on skin lesion classification, and the results show that it outperformed other existing pruning methods.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. Continual Learning and multi-domain learning are important techniques that can benefit many medical image analysis tasks.
2. The paper is generally well written, and the proposed use of ‘MergePrune’ and the Siamese network reduces computational cost and achieves higher accuracy than other existing pruning methods.
• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. The performance of the proposed method depends on key parameters such as the pruning ratio and the pruning interval. The authors chose these parameters empirically to obtain the best performance.
2. It would have been good to include some comparative analysis of the computational cost of the proposed method.
• Please rate the clarity and organization of this paper

Very Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The experimental results can be reproduced since the key hyperparameters the authors used are described in the paper.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

The contribution of the paper could have been stronger with more experiments on the computational efficiency of the proposed method.

It is also important to investigate the impact of using different sequences of data domains. The overall learning outcome may be completely different depending on what training sequences are used.

It would be nice if the paper discussed why removing all incoming and outgoing connections may make the pruning scheme more robust.

In Table 2, please label the 5 different fine-tuning methods (Experiment C). How are they different?

Probably accept (7)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The idea of using a pruning scheme for multi-domain learning was interesting. The paper is generally well written.

• What is the ranking of this paper in your review stack?

1

• Number of papers in your stack

5

• Reviewer confidence

Very confident

### Review #2

• Please describe the contribution of the paper

The paper describes a method based on continual learning and pruning for multi-domain learning and shows a use case on the skin lesion classification task. The paper’s claimed novelty lies in the node importance score function that is used to select which nodes will be removed while training on each domain.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
• The issue of accurate multi-domain classification that the paper addresses is essential for the community.
• The proposed pipeline is technically sound. CL is used for multi-domain learning, and pruning helps to overcome the forgetting problem. The proposed pipeline is tested in different settings against baselines.
• The presentation of the work is well structured (I mention some minor comments below in the detailed section). The problem is well motivated, the methodology section is easy to follow, and the experiments are explained in detail.
• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- One of the claims of the paper is "computationally efficient and scalable" (title, abstract, and Introduction), but no runtimes of the experiments or complexity time/memory analyses are provided.
- The paper presents a new pruning criterion for CL. Nonetheless, the approach seems quite similar to [0]. In [0] ("Freeze important nodes" and "Nullify transfer from unimportant nodes"), once a node has been identified as unimportant, its outgoing weights should be pruned. If the novelty resides in the pruning approach, I think mentioning the latest and most common methods to score nodes during pruning (in a Related Work section) will help readers understand the proposed method's novelty [0, 1, 2].


References:

[0] Jung, S., Ahn, H., Cha, S. and Moon, T., 2020. Continual Learning with Node-Importance based Adaptive Group Sparse Regularization. arXiv preprint arXiv:2003.13726.

[1] Golkar, S., Kagan, M. and Cho, K., 2019. Continual learning via neural pruning. arXiv preprint arXiv:1903.04476.

[2] Yu, R., Li, A., Chen, C.F., Lai, J.H., Morariu, V.I., Han, X., Gao, M., Lin, C.Y. and Davis, L.S., 2018. NISP: Pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 9194-9203).

• Please rate the clarity and organization of this paper

Very Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Highly reproducible. Detailed Methodology and Experimental Setup. Public datasets, and code will be available after publication. My only concern is that one of the claims of the paper is “computationally efficient and scalable”, but since no runtimes or time/memory complexity analyses are provided, we can’t assess that aspect of the paper.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

Major comments:

• The proposed approach has three main components: MDL approach + pruning + smart gates for domain mapping. I think ablation studies regarding the improvements that the smart gates provide to the overall solution would make a stronger paper.
• One of the claims of the paper is “computationally efficient and scalable” (abstract and Introduction), but no runtimes or complexity time/memory analyses are provided.

• If the novelty resides in the pruning approach, I think mentioning the latest and most common methods to score nodes during pruning will help readers understand and differentiate the novelty of the proposed method [0, 1, 2].
• The Introduction motivates variability across sources (patient demographics, unbalanced disease class, etc.), which I think makes the problem to solve interesting. In the future, I would like to see stratified results in any of these cases.

• Abstract: there is an uppercase “THE”
• Abstract: for/to select one
• Introduction: available/accessible, select one for clarity.
• Introduction: locked/frozen, I think it improves the clarity of the paper if one term is defined and used along the rest of the manuscript.
• Introduction: filters/neurons. Does this mean that pruning is applied at different levels of abstraction, i.e., to filters (sets of neurons) as well as to individual neurons, or are the two terms used as synonyms? Later, in the Methods section, the text refers to activations. I think this needs clarification to understand how the “culpability scores” are calculated (pre- or post-activation, on a group of nodes or on individual nodes?).
• Table 1 would be more informative if class imbalance proportions, population information, etc. were added. The Introduction was so well motivated regarding the different types of source variability that I think this should carry through to the Experimental Setup.

Probably accept (7)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

I think the proposed pipeline is technically sound. The novelty of the method needs to be clarified, as it currently seems similar to the node importance score from [0]. One of the claims of the paper is “computationally efficient and scalable”, but no runtimes or complexity time/memory analyses are provided.
Also, the paper could use more information from its datasets to stratify the analyses. How does the model perform on disease classes that are rare in each domain (take the disease with the fewest samples in each dataset and test)? How does the model perform on datasets that are biased with respect to demographics? Those types of analysis would make a strong and novel way to look at the evaluation of multi-domain approaches, aside from reporting average accuracy.

[0] Jung, S., Ahn, H., Cha, S. and Moon, T., 2020. Continual Learning with Node-Importance based Adaptive Group Sparse Regularization. arXiv e-prints, pp.arXiv-2003.

• What is the ranking of this paper in your review stack?

1

• Number of papers in your stack

3

• Reviewer confidence

Confident but not absolutely certain

### Review #3

• Please describe the contribution of the paper

This paper proposes a new strategy for neural network pruning based on penalizing “culprit units” in a CNN model. They developed a “pruning-while-training” method that allows the network to continuously learn from data in different domains while avoiding catastrophic forgetting and accuracy drops in performance. The authors test their strategy on six well-known datasets on the task of skin lesion classification.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

Despite its simplicity, the culprit-prune approach presented by the authors is effective and well formulated. Its simplicity also makes it cheap to compute, which suits its use in the “pruning-while-training” strategy.

In my opinion, one of the strengths of the paper is the extensive and well-designed experimental setup. The authors report their results as the average over all permutations of the six domains employed in the study. This choice helps to trust the results when judging the efficacy of the proposed method. Additionally, Table 2 is well organized given that most of the paper’s results are summarized there, including state-of-the-art comparison and a sort of ablation study in experiments J-M.

The paper is well organized and easy to read for the most part. I enjoyed the introduction and the way the problem and proposed method are introduced.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

My main concern is that all six datasets employed in the study are very similar. What the authors claim is true when they say: “each [dataset] comprising skin lesion images collected using different equipment at different clinical sites”; however, in the image space, all these “domains” share a lot of visual properties.

Another concern related to the structure of the paper is that in section 2, the notation gets slightly confusing and inconsistent, reducing the clarity of the proposed method. That being said, I did not find any mistake in the method’s formulation.

• Please rate the clarity and organization of this paper

Very Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

All the datasets used in the study are publicly available and well-known in Skin Lesion segmentation/classification. The authors also mention the training/testing ratios, the training recipe: learning rate, batch size, and the number of epochs, and the baseline model (ResNet-50). However, I did not see any mention of the model used in the Siamese network. I consider that the results from the paper can be reproduced with the information provided by the authors. However, I suggest clarifying some parts from the second paragraph of section 3 that make the training pipeline slightly confusing.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

In section 2, the notation used to describe the Culpability matrix changes. It sometimes appears as C_{m,n,d} and sometimes as C_{d}. Since the notation was defined in the section’s first paragraph, I would suggest keeping it consistent.

In section 2, the description of the MergePrune strategy could be clearer. After careful reading, one can understand what the pipeline is, but it is slightly confusing. Additionally, I suggest expanding and clarifying the sentence “[…] which enables successive epochs to compensate for possible accuracy drops associated with the pruning”. Does it mean that the units that were not pruned will get adjusted to compensate for the drop?

I wonder whether the Siamese network sent a test image to the wrong domain. Did this happen in some cases? If yes, how often?

Since a critical concept in the paper is cross-domain learning, I consider that at least a visual comparison of the domains (Skin lesion datasets) would have been useful to understand the challenges the network faced when “accumulating knowledge.”

Because the proposed method’s results are close to the TE pruning method, I would suggest expanding a bit more on the discussion and comparison with this baseline.

accept (8)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Despite the simplicity of their weight-pruning strategy, this paper’s approach exhibits good properties for 1) learning to classify images from different datasets/domains (but please note my main concern in the weaknesses section), and 2) minimizing model size and catastrophic forgetting of previous domains. The authors also performed careful and extensive experimentation to compare their results with other state-of-the-art pruning methods.

• What is the ranking of this paper in your review stack?

1

• Number of papers in your stack

2

• Reviewer confidence

Confident but not absolutely certain

# Primary Meta-Review

• Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

This paper presents a method based on continual learning and a pruning scheme for multi-domain learning, which has been validated on the skin lesion classification task. Overall, this paper is well written and easy to follow. MergePrune and the Siamese network can be used to achieve higher accuracies compared to other existing pruning methods while also reducing the computational costs. It would be better to clearly discuss the computational complexity and memory cost of the proposed method with reference to other existing state-of-the-art methods, which will further convince readers about the novelty of the proposed method.

• What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

3

# Author Feedback

We would like to thank the reviewers for their feedback. We are grateful to have received a provisional acceptance and will revise our paper to address the reviewers’ main comments (see responses below) as well as minor comments such as typos (not included here).

• R1&R2: Computational efficiency claims. In contrast to most existing approaches to incremental learning, which progressively increase network size to accommodate learning of multiple tasks/domains, our network has a fixed-size architecture, which is advantageous in terms of scalability and computational efficiency. Also, in our MergePrune, instead of requiring a computationally expensive retraining of the network to compensate for accuracy drops associated with pruning, our method gradually prunes the culprit units while simultaneously training the network.

• R1: Testing different sequences of training domains. We already tested all possible sequences/orderings; page 6: “We note that our reported accuracy results in experiments labelled with (*) are averaged over all possible runs (5!=120 permutations of domain ordering).”
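As a rough illustration of what averaging over all domain orderings involves, the sketch below enumerates every permutation of five domains and averages a score over them. It is not the authors' code: the domain labels, `run_sequence`, and the dummy score it returns are hypothetical placeholders for a real train-and-evaluate loop.

```python
from itertools import permutations
from statistics import mean

# Hypothetical domain labels; the text only states that 5! = 120 orderings were run.
DOMAINS = ["A", "B", "C", "D", "E"]

def run_sequence(order):
    """Placeholder for training on `order` sequentially and returning an accuracy.

    Returns a dummy, order-dependent value so the sketch runs end to end.
    """
    return sum((i + 1) * (ord(name) - ord("A")) for i, name in enumerate(order)) / 100.0

def average_over_orderings(domains, evaluate):
    """Average `evaluate(order)` over every permutation of `domains`."""
    orderings = list(permutations(domains))
    return mean(evaluate(order) for order in orderings), len(orderings)

avg_acc, n_runs = average_over_orderings(DOMAINS, run_sequence)
print(n_runs)  # 120
```

Reporting the mean over all orderings removes the dependence of the result on any single training sequence, which is the concern raised by R1.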

• R1: Clarify cross-domain fine-tuning. In Table 2, rows of experiment C, our ImageNet-pretrained model is fine-tuned on each domain and tested on all domains, i.e.:

Exp.C-Row.1: Model fine-tuned on HAM; model tested on in-distribution HAM and unseen domains DMF, d7p, MSK, etc.

Exp.C-Row.2: Model fine-tuned on DMF; model tested on in-distribution DMF, previously learned domain HAM, and unseen domains d7p, MSK, etc.

and so on.

We observed an increasing drop in performance on previously trained datasets as we fine-tuned on newer domains, due to ‘catastrophic forgetting’. We also saw poor performance across datasets that were not part of the training process. Better performance was observed with our method (Table 2, experiment D).

• R1: Robustness of removing all incoming and outgoing connections (aka unit-pruning). We found this to help in reducing learning interference with previous tasks and thus overcome the forgetting problem, as demonstrated in Table 2, experiments E to I, where the performance of our incremental learning improves when unit-pruning is done instead of weight-based pruning.
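The difference between the two pruning styles mentioned above can be sketched on a toy two-layer network stored as nested lists. This is an illustrative sketch only: the function names, the toy weights, and the magnitude threshold used for weight pruning are assumptions, not details taken from the paper.

```python
def prune_unit(w_in, w_out, unit):
    """Unit pruning: zero every incoming and outgoing connection of one hidden unit.

    w_in[j][i]  : weight from input i to hidden unit j
    w_out[k][j] : weight from hidden unit j to output k
    Returns pruned copies; the inputs are left unchanged.
    """
    w_in = [row[:] for row in w_in]
    w_out = [row[:] for row in w_out]
    w_in[unit] = [0.0] * len(w_in[unit])   # remove all incoming connections
    for row in w_out:
        row[unit] = 0.0                    # remove all outgoing connections
    return w_in, w_out

def prune_weights(w, threshold):
    """Weight pruning: zero individual weights whose magnitude is below `threshold`."""
    return [[0.0 if abs(x) < threshold else x for x in row] for row in w]

w_in = [[0.5, -0.2], [0.1, 0.9], [-0.7, 0.05]]   # 3 hidden units, 2 inputs
w_out = [[0.3, -0.6, 0.2]]                       # 1 output, 3 hidden units

unit_in, unit_out = prune_unit(w_in, w_out, unit=1)
weight_in = prune_weights(w_in, threshold=0.15)
```

After `prune_unit`, hidden unit 1 is fully disconnected, so retraining it on a new domain cannot affect earlier outputs through that unit. After `prune_weights`, unit 1 still keeps its 0.9 connection and remains entangled with the earlier computation.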

• R3: Details of the Siamese Network. Its backbone is a regular convnet with architecture conv-pool-conv-pool-conv-pool-conv-pool-fc-fc-softmax. We train it using cross-entropy loss for 5 epochs with a constant learning rate of 1e-5 and a batch size of 32. The Siamese network routed a test image to the wrong domain only rarely, in 2.18% of cases.
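One way a gate of this kind could route a test image, and how a misrouting rate could be measured, is sketched below. The nearest-reference rule, the 2-D toy embeddings, and the function names are assumptions made for illustration; the text above does not specify how the gate makes its decision.

```python
import math

def route_to_domain(embedding, domain_refs):
    """Return the domain whose reference embedding is closest to `embedding`."""
    return min(domain_refs, key=lambda d: math.dist(embedding, domain_refs[d]))

def misrouting_rate(samples, domain_refs):
    """Fraction of (embedding, true_domain) pairs routed to the wrong domain."""
    wrong = sum(route_to_domain(e, domain_refs) != d for e, d in samples)
    return wrong / len(samples)

# Toy 2-D embeddings standing in for the output of a trained embedding branch.
domain_refs = {"HAM": (0.0, 0.0), "DMF": (1.0, 1.0), "MSK": (0.0, 2.0)}
samples = [
    ((0.1, 0.1), "HAM"),
    ((0.9, 1.2), "DMF"),
    ((0.2, 1.8), "MSK"),
    ((0.6, 0.6), "HAM"),  # closer to the DMF reference, so it is misrouted
]
rate = misrouting_rate(samples, domain_refs)
```

In this toy setup one of the four samples is sent to the wrong domain; the same bookkeeping applied to a real test set would yield a figure comparable to the 2.18% quoted above.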

• R3: Clarity in the MergePrune process. MergePrune merges the pruning and training stages into one. We prune a specific number of units after each pruning interval (15 epochs in our experiments). This enables successive training epochs to compensate for expected accuracy drops associated with the pruning, i.e., un-pruned units will get adjusted to compensate for any possible accuracy drop.
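The interleaving described above can be sketched as a schedule that prunes a fixed number of units every `prune_interval` epochs while training continues in between. This is a minimal sketch under stated assumptions: the static `scores` dictionary, the "higher score means more culpable" convention, the number of units pruned per step, and the 30-epoch budget are all illustrative, and the training step is left as a comment.

```python
def merge_prune_schedule(total_epochs, prune_interval, units_per_step, scores):
    """Interleave training epochs with pruning steps.

    `scores` maps unit id -> culpability score (assumed: higher = more culpable).
    In a real run the scores would be recomputed from activations before each
    pruning step. Returns (log of (epoch, pruned_unit_ids), remaining unit ids).
    """
    active = dict(scores)
    log = []
    for epoch in range(1, total_epochs + 1):
        # train_one_epoch(model) would go here; surviving units keep adapting.
        if epoch % prune_interval == 0 and active:
            worst = sorted(active, key=active.get, reverse=True)[:units_per_step]
            for unit in worst:
                del active[unit]
            log.append((epoch, worst))
    return log, sorted(active)

scores = {0: 0.1, 1: 0.9, 2: 0.4, 3: 0.7, 4: 0.2, 5: 0.6}
log, remaining = merge_prune_schedule(
    total_epochs=30, prune_interval=15, units_per_step=2, scores=scores
)
print(log)        # [(15, [1, 3]), (30, [5, 2])]
print(remaining)  # [0, 4]
```

Because pruning happens inside the training loop, the epochs that follow each pruning step serve as the recovery phase, so no separate retraining pass is scheduled.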

• R3: Comparison with Taylor Expansion pruning (TE). As shown in Table 2, our results are very close to TE. Nonetheless, the two methods are different in terms of functionality. TE defines the importance of a unit as the squared change in loss induced by removing a specific filter from the network. Since computing the exact importance is quite expensive for large networks, they approximated it with a Taylor expansion. In our proposed method, the importance of a unit in a network is defined based on a culpability score which is calculated directly from the unit activations.
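For reference, the TE criterion as described above can be written out as follows. The notation is an assumption introduced here for illustration: $\mathcal{L}$ is the loss on data $\mathcal{D}$, $W$ the network parameters, $S$ the set of parameters belonging to one filter, and $g_s = \partial \mathcal{L} / \partial w_s$ the gradient of the loss with respect to parameter $w_s$.

```latex
% Exact importance: squared change in loss when the filter's parameters are zeroed
\mathcal{I}_S = \Big( \mathcal{L}(\mathcal{D}, W) - \mathcal{L}(\mathcal{D}, W \mid w_s = 0,\ \forall s \in S) \Big)^2

% First-order Taylor approximation around the current parameters
\mathcal{I}_S \approx \Big( \sum_{s \in S} g_s \, w_s \Big)^2
```

The approximation needs gradients from a backward pass, whereas a score computed directly from unit activations, as in the proposed method, only needs forward-pass quantities.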

• R2: Comparison with existing pruning-based techniques ([0] Jung et al 2020 arXiv:2003.13726; [1] Golkar et al 2019 arXiv:1903.04476; [2] Yu et al 2018 CVPR). We will cite and discuss those references in context.