
Authors

Mahsa Ghorbani, Mojtaba Bahrami, Anees Kazi, Mahdieh Soleymani Baghshah, Hamid R. Rabiee, Nassir Navab

Abstract

The increased amount of multi-modal medical data has opened the opportunities to simultaneously process various modalities such as imaging and non-imaging data to gain a comprehensive insight into the disease prediction domain. Recent studies using Graph Convolutional Networks (GCNs) provide novel semi-supervised approaches for integrating heterogeneous modalities while investigating the patients’ associations for disease prediction. However, when the meta-data used for graph construction is not available at inference time (e.g., coming from a distinct population), the conventional methods exhibit poor performance. To address this issue, we propose a novel semi-supervised approach named GKD based on the knowledge distillation. We train a teacher component that employs the label-propagation algorithm besides a deep neural network to benefit from the graph and non-graph modalities only in the training phase. The teacher component embeds all the available information into the soft pseudo-labels. The soft pseudo-labels are then used to train a deep student network for disease prediction of unseen test data for which the graph modality is unavailable. We perform our experiments on two public datasets for diagnosing Autism spectrum disorder, and Alzheimer’s disease, along with a thorough analysis on synthetic multi-modal datasets. According to these experiments, GKD outperforms the previous graph-based deep learning methods in terms of accuracy, AUC, and Macro F1.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87240-3_68

SharedIt: https://rdcu.be/cyl6L

Link to the code repository

N/A

Link to the dataset(s)

http://preprocessed-connectomes-project.org/abide/

https://tadpole.grand-challenge.org/Data/

Reviews

Review #1

Please describe the contribution of the paper

The paper proposes a teacher-student graph network to predict disease class on a population dataset when the patient relationship information is missing during the testing phase. A teacher network learns the graph relations across subjects by label propagation after predicting soft labels with a DNN. A student network tries to predict the actual labels for individual subjects without considering any association information. Experiments on the ABIDE and TADPOLE datasets are shown, along with a few ablation studies on synthetic datasets.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The application of the teacher-student network on population graphs for disease analysis is a novel formulation.
- In practice, the method is formulated to overcome the absence of graph data during testing.
- The method smartly leverages the semi-supervised setting to predict the soft labels for the teacher network without adjacency information and then uses LPA to fine-tune the labels.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- Since the method relies on the absence of graph data during testing, have the authors tried cross-dataset testing, for instance for Alzheimer’s prediction? In what respects would the other datasets need to be similar to the training data?
- What is the significance of the remembrance term in terms of the performance of the proposed method?
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
- The authors have experimented on different hyper-parameters and have mentioned it in the checklist.
- The code for the work is currently unavailable in the paper. The authors have reported its availability in the checklist.
- The construction of the graph could affect the performance. However, the performance would correlate across the baseline and proposed method.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
- The baseline graph-based methods are dependent on the graph construction. If the same threshold as in the paper is used for graph construction, is the same graph connectivity used for the graph teacher network?
- The performance of the baseline graph methods on the TADPOLE dataset is close to that of the proposed method. Is this due more to the class imbalance present in the dataset or to the graph construction?
- The legends of the plots are too small to read. Increasing the font size in the camera-ready version would help readers.
Please state your overall opinion of the paper

Probably accept (7)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper proposes an interesting teacher-student network for the case where the graph relations are missing during testing. The method has potential impact in real practical settings.
What is the ranking of this paper in your review stack?

2
Number of papers in your stack

5
Reviewer confidence

Confident but not absolutely certain

Review #2

Please describe the contribution of the paper

GCNs have been proposed by Parisot et al. for the goal of disease classification. In such works, multimodal datasets are used: imaging features are often used as node vectors, while the weights stored in graph edges are computed mainly from non-imaging data.

This paper raised the concern that some of these data modalities might not always be available, especially when using other datasets during inference. Thus, it is proposed that knowledge distillation is used: pseudo-labels are created by a teacher model that was trained on all available modalities. The training is performed in a semi-supervised manner and the key novelty is a modification of the Label Propagation Algorithm - specifically, a remembrance term to avoid forgetting the initial labels - to propagate labels from the labelled portions of the dataset to the unlabelled parts. The pseudo-labels are then used to train the student model without the need to use information from the graph’s edges.
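The label propagation step with a remembrance term summarized above can be illustrated with a minimal sketch. The review does not give the paper's exact update rule, so the rule used here is an assumption: each unlabelled node takes a weighted mix of its neighbours' average soft label and its own initial soft label, and labelled nodes are kept fixed. The function name `propagate_labels` and the coefficient name `beta` are hypothetical.

```python
from typing import List, Set

Matrix = List[List[float]]


def propagate_labels(
    adjacency: Matrix,
    initial: Matrix,
    labeled: Set[int],
    beta: float = 0.5,  # hypothetical remembrance coefficient (assumption)
    steps: int = 20,
) -> Matrix:
    """Label propagation with an assumed remembrance term.

    Each step mixes the neighbour average of the current soft labels with the
    initial soft labels, while nodes with a known label stay fixed.
    """
    n = len(adjacency)
    k = len(initial[0])
    current = [row[:] for row in initial]
    for _ in range(steps):
        updated = []
        for i in range(n):
            degree = sum(adjacency[i])
            if i in labeled or degree == 0:
                # Known labels are clamped; isolated nodes keep their initial guess.
                updated.append(initial[i][:])
                continue
            neighbour_avg = [
                sum(adjacency[i][j] * current[j][c] for j in range(n)) / degree
                for c in range(k)
            ]
            # Remembrance: pull the propagated label back toward the initial one.
            updated.append(
                [(1 - beta) * neighbour_avg[c] + beta * initial[i][c] for c in range(k)]
            )
        current = updated
    return current
```

In this sketch, setting `beta` to 1 returns the initial soft labels unchanged, while setting it to 0 reduces to plain neighbour averaging with clamped labelled nodes, which is the comparison the requested ablation is about.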
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Incomplete modalities are indeed a huge problem that impedes progress in multi-modal modelling of neuroimaging datasets and the use of GKD to obviate the need for all modalities during inference seems to be a very promising idea.
- The proposed model was compared with several other state-of-the-art models and evaluated on 3 very different datasets: ADNI (elderly) for AD classification, ABIDE (young children) for ASD classification and a synthetic dataset.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- Novelty seems limited and incremental. The key methodological novelty is the proposal to add a remembrance term to LPA. LPA is a well-established algorithm (since 2002 by Zhu et al.) in the semi-supervised learning literature. The use of Teacher-Student modelling techniques / knowledge distillation framework was applied from existing literature, seemingly without significant novelty added.
- If the key novelty is indeed the remembrance term, then it would be expected that an ablation study is included as part of the experiments conducted (GKD with remembrance term vs GKD without the remembrance term). However, this wasn’t seen in the manuscript.
- The problem of missing modalities might not be valid for the specific example in Parisot et al.’s framework. In that instance, demographic factors such as age and gender were used to build the graph. It is very unlikely that such metadata would be unavailable even when using other datasets. However, if we look beyond Parisot’s model and think about what the proposed framework could achieve in a more general multi-modal modelling setting (e.g. if imaging features are used and unavailable during inference because another dataset was used), there remains great promise but it should be robustly backed with relevant experiments.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

No major issues. Code not available at time of review, but that’s reasonable.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
Major
- Seems like LPA was mentioned multiple times in the paper but not cited (http://pages.cs.wisc.edu/~jerryzhu/pub/CMU-CALD-02-107.pdf).
- The data processing steps in Section 4.1 could have been clearer. For example, “we discard the phenotypic features that are only available for ASD patients” - exactly which features were discarded?
- Same goes for Section 4.2: “we connect every pair of nodes with the absolute distance less than a threshold” - what is the threshold?
Minor
- Although it didn’t significantly impede understanding (and was not taken into consideration for the scoring), the writing could be slightly improved. e.g. in the introduction, “Considering the relationships between the patients based on one or multiple important modalities is beneficial as it helps to analyze and study the similar cohort of patients together.” - do you mean that if studies are limited to unimodal analyses, then we won’t be able to analyse the entire population? But the subset of data with complete modalities is very small (thus covering a smaller population), usually to the extent that modelling isn’t really possible. Perhaps that’s where GKD comes in handy, but it doesn’t change the fact that the ground truth available (number of subjects with all modalities available) is usually very limited. Hence the point seemed rather confusing, but it didn’t affect the overall understanding so it’s okay.
- Supplementary Table 3 - in the caption, the dataset used should be ABIDE instead of TADPOLE?
Please state your overall opinion of the paper

Borderline reject (5)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Although the experiments are thorough, a key experiment seems to be missing (GKD with remembrance term vs GKD without remembrance term). If that can be added during the rebuttal phase and the results show that adding the remembrance term is key to making the knowledge distillation framework work, I’m of the opinion that this work should be strongly accepted and it will be of great interest for researchers working on multi-modal neuroimaging datasets.
What is the ranking of this paper in your review stack?

4
Number of papers in your stack

5
Reviewer confidence

Very confident

Review #3

Please describe the contribution of the paper

This paper presents a new method for knowledge distillation using graph information such that the effective predictions can be made at inference time without the graph structure. Using the teacher-student framework, the teacher component first learns a neural network that does not depend on the graph structure, but then uses the label propagation algorithm to assign pseudolabels to unlabeled samples via the graph structure. The student network then learns to predict the pseudolabels so that it can perform predictions from node features alone (i.e., no graph structure available). The method is tested on 2 open datasets and compared against several other methods using multiple performance measures.
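The student step described above, learning to predict the teacher's soft pseudo-labels from node features alone, can be sketched as follows. This is only an illustration: a linear softmax classifier stands in for the deep student network, and the names `soft_cross_entropy`, `train_student`, and `predict`, as well as the learning rate and epoch count, are hypothetical choices rather than details from the paper.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(logits: List[float], soft_target: List[float]) -> float:
    """Cross-entropy between a soft target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(soft_target, probs))


def train_student(
    features: List[List[float]],
    soft_labels: List[List[float]],
    lr: float = 0.5,      # hypothetical learning rate
    epochs: int = 200,    # hypothetical number of passes over the data
) -> List[List[float]]:
    """Fit a linear softmax student (a stand-in for the deep student network)."""
    d, k = len(features[0]), len(soft_labels[0])
    weights = [[0.0] * d for _ in range(k)]
    for _ in range(epochs):
        for x, target in zip(features, soft_labels):
            logits = [sum(w * xi for w, xi in zip(weights[c], x)) for c in range(k)]
            probs = softmax(logits)
            for c in range(k):
                grad = probs[c] - target[c]  # gradient of soft CE w.r.t. logit c
                for j in range(d):
                    weights[c][j] -= lr * grad * x[j]
    return weights


def predict(weights: List[List[float]], x: List[float]) -> List[float]:
    """Class probabilities for one sample; no graph input is needed."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])
```

Because `predict` takes only a feature vector, the trained student can be applied to unseen subjects without any adjacency information, which is the graph-independent inference setting the review describes.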
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The paper presents a novel approach to knowledge distillation of graph knowledge in an effective semi-supervised setting.
2. The results compared to other approaches appear convincingly better, with proper statistics reported.
3. The authors include synthetic experiments to demonstrate the effect of the proportion of missing graph features on the different models.
4. The paper is well-written making it easy to follow, and the figures/plots are effective.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. While multiple models are compared to the proposed approach, including FCN- and GCN-based ones, none of the compared approaches uses knowledge distillation.
2. It is not really clear to me how the inference for the GCN methods is performed with “graph modality not available”.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The methods are tested using 2 public datasets.

The authors indicate that they will be sharing the code.

I find that the authors are very thorough in terms of detailing the methodology used for the experiments and the reporting and analysis of the results.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
1. While the experiments include comparisons to many different methods that combine multi-modal data, there is no comparison to other distillation approaches that also make use of graph data, for example:
Yang, Yiding, et al. “Distilling knowledge from graph convolutional networks.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

At a minimum there should be some discussion of how the authors’ approach differs from other graph distillation methods.
2. For the GCN methods, the graph is constructed based on the phenotypes of the population. It is unclear to me how the authors perform inference for these methods without the “graph modality”. For example, the GCN method by Parisot et al. trains the model with the test data as part of the graph and is used in a semi-supervised fashion. This point could use some clarification.
Please state your overall opinion of the paper

Probably accept (7)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The authors present a simple and effective approach for performing knowledge distillation with graph-based multimodal data. While there are a few details I found confusing and some other comparisons would have been ideal, overall the experiments are thorough and demonstrate the promise of the proposed method for incorporating graph knowledge at training without requiring it for inference on new data. Thus, my recommendation trends toward acceptance.
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

6
Reviewer confidence

Very confident

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

This paper presents a graph-independent inference model for disease classification based on a teacher-student network architecture for dealing with different multimodal information at the training and testing stages. The proposed work targets the frequently encountered problem of largely missing data samples in real datasets. The approach has novelty and demonstrates promising results. In the rebuttal, the authors should clarify the effect of the remembrance term. Moreover, it is recommended to include discussions of similar or related works (LPA and graph distillation methods) and the corresponding experimental results.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

1

Author Feedback

We thank all the reviewers for their very positive and constructive comments and for acknowledging our method’s “novel formulation” to “overcome the absence of graph data during testing” in the disease prediction domain. The reviewers agree on the significance of handling the missing modality “as a huge problem” through multi-modal modeling of neuroimaging datasets. They also acknowledge that “the use of GKD to obviate the need for all modalities during inference seems to be a very promising idea”. The reviewers are satisfied with the experiments, in which the “proposed model was compared with several other state of the art models and evaluated on 3 very different datasets”, demonstrating the “convincingly better” results of GKD. We are glad that the reviewers find the paper “well-written” and “easy to follow”. We are happy to see that the meta-reviewer also appreciates our work and refers to it as an approach with “novelty” that demonstrates “promising results”.

Here we would like to clarify the main comments of the reviewers: Regarding the novelty (R3), we want to point out that as noted by MR, R2, and R4, novelty is one of the main strengths of our paper, as we are the first to bring the idea of knowledge distillation to graph data in the medical domain. However, the existing distillation methods are aimed at model compression, while our method takes advantage of the teacher-student idea to exclude the graph modality at test time, which is a novel and totally different direction compared to the literature on knowledge distillation. In the paper proposed by Yang et al. (2020), mentioned by R4, the student network is a compressed version of the teacher GCN. Therefore, the resulting student network has no superiority over its teacher network, and comparing to such graph distillation methods is the same as comparing to a regular GCN, which we did in all the experiments of the paper. Moreover, adding the remembrance term to the LPA algorithm is the third novelty that enables us “to avoid forgetting the initial labels”. We will clarify the distinction of our method compared to the literature on graph knowledge distillation in the final version.

As discussed in the paper, the coefficient of the remembrance term is a hyper-parameter, and we obtained the best model with a coefficient greater than zero, which shows the effectiveness of adding the remembrance term and is the reason why we included it in the method. We will add the performance of the method without the remembrance term to the manuscript in detail, as requested by MR, R2, and R3, who is the only reviewer that did not accept the paper but explicitly mentions that if this can be added during the rebuttal phase the reviewer is “of the opinion that this work should be strongly accepted and it will be of great interest for researchers working on multi-modal neuroimaging datasets”. We will therefore report our observation that adding this term results in 5% and 14% improvement of accuracy, and 6% and 7% of macro F1 on the ABIDE and TADPOLE datasets, respectively.

Regarding the cross-dataset testing (suggested by R2), it is an interesting idea to study; however, it is beyond the scope of this work and can be studied as an extension in the future.

Reviewers have some minor points that we will take into consideration to improve the final version. In this regard, we plan to:

Provide the names of the discarded features, which are mostly available for ASD patients (not normal ones) in the ABIDE dataset, to avoid feature leakage.

Clarify that the threshold is an empirically chosen value to connect each pair of nodes if their feature distance is less than the threshold value.

Make clear that when the graph modality is not available at inference time, we assume that nodes are isolated, and an identity matrix is used as the graph adjacency.
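The last two points, thresholded graph construction and the identity-matrix fallback, can be illustrated with a small sketch. It assumes a single scalar meta-feature per subject and a made-up threshold value in the usage; the rebuttal only states that the threshold was chosen empirically. The helper names `threshold_graph`, `identity_graph`, and `aggregate` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def threshold_graph(meta: List[float], threshold: float) -> Matrix:
    """Connect subjects i and j when |meta_i - meta_j| < threshold (no self-loops)."""
    n = len(meta)
    return [
        [1.0 if i != j and abs(meta[i] - meta[j]) < threshold else 0.0 for j in range(n)]
        for i in range(n)
    ]


def identity_graph(n: int) -> Matrix:
    """Adjacency for isolated nodes, used when the graph modality is missing."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def aggregate(adjacency: Matrix, features: Matrix) -> Matrix:
    """One neighbourhood aggregation step, i.e. the matrix product A @ X."""
    return [
        [
            sum(adjacency[i][j] * features[j][c] for j in range(len(features)))
            for c in range(len(features[0]))
        ]
        for i in range(len(adjacency))
    ]
```

With the identity adjacency, `aggregate` returns the features unchanged, so a graph convolution layer has no neighbour information to use and acts on each subject independently. This is one way to read how the GCN baselines were evaluated when the graph modality was unavailable.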

Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The proposed approach, a graph-independent inference model using a teacher-student network for multi-modal information based classification, has novelty, and results are reasonable. The rebuttal addressed review comments clearly, including the remembrance term and evaluations.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

1

Meta-review #2

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

Most of the major concerns (such as the performance without the remembrance term) have been well addressed.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

5

Meta-review #3

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This paper introduces an interesting and novel teacher-student network, in the sense of knowledge distillation, for the case where the graph relations are missing during testing. The newly added ablation result seems to strengthen the validity of the proposed method. As pointed out by Reviewer #3, it is recommended to discuss the clinical value of the work in terms of missing modalities. Note that demographic information is always available in a clinical environment.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

5


GKD: Semi-supervised Graph Knowledge Distillation for Graph-Independent Inference