
Authors

Yunan Wu, Arne Schmidt, Enrique Hernández-Sánchez, Rafael Molina, Aggelos K. Katsaggelos

Abstract

Intracranial hemorrhage (ICH) is a life-threatening emergency with high rates of mortality and morbidity. Rapid and accurate detection of ICH is crucial for patients to get a timely treatment. In order to achieve the automatic diagnosis of ICH, most deep learning models rely on huge amounts of slice labels for training. Unfortunately, the manual annotation of CT slices by radiologists is time-consuming and costly. To diagnose ICH, in this work, we propose to use an attention-based multiple instance learning (Att-MIL) approach implemented through the combination of an attention-based convolutional neural network (Att-CNN) and a variational Gaussian process for multiple instance learning (VGPMIL). Only labels at scan-level are necessary for training. Our method (a) trains the model using scan labels and assigns each slice with an attention weight, which can be used to provide slice-level predictions, and (b) uses the VGPMIL model based on low-dimensional features extracted by the Att-CNN to obtain improved predictions both at slice and scan levels. To analyze the performance of the proposed approach, our model has been trained on 1150 scans from an RSNA dataset and evaluated on 490 scans from an external CQ500 dataset. Our method outperforms other methods using the same scan-level training and is able to achieve comparable or even better results than other methods relying on slice-level annotations.
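The core aggregation step described above — assigning each slice an attention weight and pooling slice features into a scan-level representation — can be sketched as follows. This is a minimal illustrative example of attention-based MIL pooling in the style the Att-CNN builds on, not the authors' implementation; all variable names, dimensions, and the random features are assumptions for demonstration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_mil_pool(H, V, w):
    """Attention-based MIL pooling (illustrative sketch).

    H: (I, D) instance (slice) features, e.g. from a CNN backbone.
    V: (L, D) and w: (L,) play the role of learnable attention parameters.
    Returns the scan (bag) embedding and the per-slice attention weights,
    which can serve as slice-level predictions.
    """
    scores = w @ np.tanh(V @ H.T)  # (I,) unnormalized attention per slice
    a = softmax(scores)            # attention weights sum to 1 over slices
    z = a @ H                      # (D,) attention-weighted scan embedding
    return z, a

# Toy example: 30 slices per scan, 8-dim features, 16-dim attention space.
rng = np.random.default_rng(0)
I, D, L = 30, 8, 16
H = rng.standard_normal((I, D))
V = rng.standard_normal((L, D))
w = rng.standard_normal(L)
z, a = attention_mil_pool(H, V, w)
```

In the paper's pipeline, a scan-level classifier (here, VGPMIL on the low-dimensional features) would then operate on top of such representations, while the weights `a` indicate which slices drove the scan-level decision.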

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87196-3_54

SharedIt: https://rdcu.be/cyl25

Link to the code repository

https://github.com/YunanWu2168/Att-based-MIL-and-GPMIL

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    In this submission, the authors propose a novel learning model for automatically detecting hemorrhage in CT scans. Their method utilizes the feature-learning strength of an attention CNN and the feature-aggregation ability of a GP. The logic behind the method makes sense, the exposition is clear, and the experiments are well designed.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) The first highlight is the integration of CNN based method with Bayesian aggregation. (2) The second highlight is the introduction of the attention mechanism. (3) The paper is well-written.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) Since the Abstract and Introduction stress that rapid and early detection is important, the average prediction time of the proposed method should be provided. (2) The motivation for using a VGP to infer the final classification result should be further explained. Although Experiment 4.3 compares the CNN and VGP methods, theoretically the VGP could be substituted with 2-3 fully connected layers.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Yes, the details of the experiments are provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Some points have already been mentioned under Weaknesses. (1) Does the proposed method take layer-wise information into account? (2) In Fig. 1, what does the pre-processing do? The input is slice by slice, yet each slice seems to become multiple slices after pre-processing; please clarify this part in Fig. 1.

  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The general idea is solid and novel, which is the main factor in my judgement of this submission. Besides, the ablation study in the experiments makes the work more convincing.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    4

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    Creating labelled data for slice-level scans is a very laborious task. However, generating such labelled data is necessary before models can be trained to accurately perform tasks such as CT hemorrhage detection. This paper proposed using a combination of the attention mechanism (integrated with a CNN-based architecture) along with multiple instance learning (via variational Gaussian processes) to generate slice-level predictions even when only scan-level labels are provided, thus greatly reducing the effort required for manual data annotation.

    The proposed approach was tested on both the RSNA dataset and the CQ500 dataset. Besides showing that the proposed approach outperforms other approaches based on scan-level training, the authors also claim to “achieve comparable or even better results than other methods relying on slice-level annotations”, even when only scan-level annotations are used for their model.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The high cost of acquiring labels is a well-known problem, especially so for medical data where 3D volumes are involved. This paper proposes a way to obviate the need for slice-level annotations even for generating slice-level predictions. Considering that the authors manage to achieve AUC values very close to those of models trained on slice-level annotations, the proposed method should lead to significant time savings (by removing the need to annotate slice-level data).

    • Ablation studies are thoroughly conducted to demonstrate the contributions of each part of the pipeline

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Contrary to what was indicated in the reproducibility checklist, variation of results was not reported (i.e. standard deviations / error bars are not shown). Considering that the time taken for training and inference is low, it would be reasonable to expect the experiments to be repeated over multiple seeds (~5) so that the robustness of the results can be demonstrated. The results discussed (especially the comparisons made in Section 4.2) would have been much stronger if standard deviations were computed and shown.

    • Novelty might seem limited because two existing methods (CNN with attention, VGPMIL) are combined (and trained in separate phases) without much modification.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    “A description of results with central tendency (e.g. mean) & variation (e.g. error bars).” - marked ‘Yes’ but cannot be found in the manuscript. No other major issues besides this. Code was not provided at the time the review was written, but that is reasonable.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • t-SNE plot in Fig 2 seems not entirely convincing - there are still significant overlaps in AL-nAw and AL-Aw. But the quantitative comparisons clearly demonstrate the value of adding the attention layer.

    • Outperforming other works (which have a larger dataset) doesn’t necessarily mean that the proposed method would continue to outperform the rest if the number of scans used is the same. In general, smaller datasets typically contain limited variations of data (a reference from fMRI studies, but still relevant: https://www.sciencedirect.com/science/article/abs/pii/S1053811917305311?via%3Dihub) and more experiments should be conducted on larger datasets to prove that the proposed method will really work better on them.

    Minor comments

    • Fig. 1 makes it seem as if I_b models are trained, instead of just a single CNN model.

    • Very minor typo: In Section 3.2, “The batch size is 16 per steps” should be ‘step’.

  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Unsure if combining two established approaches is considered sufficient methodological novelty for MICCAI, hence not a direct accept. But I think it is a well-considered combination of existing methods with rather strong results (along with additional evaluation on a separate dataset), and model performance seems as good as models trained on slice-level annotations, hence the push for acceptance. Another concern is that standard deviations are not reported and the models do not appear to have been tested over multiple seeds.

  • What is the ranking of this paper in your review stack?

    3

  • Number of papers in your stack

    5

  • Reviewer confidence

    Somewhat confident



Review #3

  • Please describe the contribution of the paper

    The authors propose a multiple instance learning (MIL) method to derive slice-level labels from volume-level labels. To perform MIL, the variational Gaussian process method is used, and to achieve feasible dimensionality reduction for VGPMIL a CNN is needed to extract salient image features; these are trained in sequence. The novelty of this work is that it is the first to use a CNN and VGPMIL in conjunction.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The primary strengths of this work are the sophistication in methodology and performance across datasets. Slice-level labels are achieved at very high accuracy on the in-domain dataset RSNA and relatively high on the out-of-domain dataset CQ500.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    This work has a few major flaws. First, primarily, the data distribution is unclear. How many of the volumes have a positive label and how many have a negative label? Of these volumes, were slice-level labels provided? Were the training, validation, and testing datasets balanced class-wise? If so, were they balanced at the volume-level or the slice-level? These details are vital in interpreting the training and evaluation process, especially because ROC AUC is reported, which is biased towards positive classes if the classes are imbalanced.

    Second, while the results in Table 2 are good, it is unclear whether these AUC scores are volume-level by some unknown aggregation of slice-level predictions, or whether these are aggregations of slice-level scores. I.e., is the reported 0.97 AUC for the proposed method achieved by an average of slice-level AUCs between predicted and ground-truth slice-level scores, or is it an average of volume-level AUCs? This detail is also important for interpreting the result.

    Third, there is little-to-no discussion on the comparative methods in Table 2. It seems the comparisons are not even on the same datasets. Why were these competitive methods not run on the same RSNA and CQ500 datasets as the proposed method? It makes it difficult to truly evaluate the contribution of this work.

    Fourth, Ref 15 is mentioned as a comparable MIL work, and the proposed method here aims to improve on it by using a more sophisticated MIL pooling operation. However, the authors do not compare to Ref 15.

    Fifth, there is no cross-validation or experiment repetition and therefore no confidence intervals provided in the results. How consistent/reliable is the proposed method?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Several reproducibility factors are satisfactory (data preprocessing, model parameters, hardware, training duration), but some are missing for full reproducibility. Essential details include dataset class balance for each of the train/validation/test splits and whether the reported scores are volume-wise or slice-wise.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    The heart of this method is promising and looks strong. However, the paper organization is missing key details mentioned in the “weaknesses” section above. Too many paragraphs were spent on ablations of the proposed method and too few on comparisons to competitive methods. Also, it is essential to describe the dataset class distribution as well as what exactly the AUC scores represent. Further, without knowing the distribution of the testing data classes, the ROC AUC cannot be reliably interpreted due to inherent bias (that is, it is possible to get a high ROC AUC but low overall accuracy if there are few positive cases). It is important to communicate this in the paper for others to interpret results. Figure 2 is also difficult to interpret. The conclusion that the attention layer extracts “more expressive features” is unfounded. A better figure here could be slices from the validation set showcasing when the model succeeded and when it failed.

  • Please state your overall opinion of the paper

    probably reject (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method is interesting and promising, but the organization / focus of the paper is lacking. Additionally, considering the weaknesses mentioned above, there are too many missing details for acceptance.

  • What is the ranking of this paper in your review stack?

    5

  • Number of papers in your stack

    7

  • Reviewer confidence

    Very confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This work presents a multiple instance learning based strategy for intracranial hemorrhage detection, involving an attention mechanism for feature identification and a Gaussian process mechanism for feature aggregation. While having some concerns about the novelty of this combination, reviewers generally comment favourably on the contribution and on parts of the evaluation. However, reviewer 3 states a number of limitations in the evaluation protocol, which need further clarification. Authors are encouraged to identify and address the main concerns from the reviews in their rebuttal.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    5




Author Feedback

We appreciate all reviewers for their thoughtful and detailed comments. We are encouraged by the positive feedback on the good performance (R3) and significance of our weakly supervised method in clinical practice (R2), and the clear outline of our idea (R1, R2, R3). We answer the reviewers’ major concerns below and will incorporate all feedback in the final version.

[(R3) The data distribution is unclear in the train, validation, and test cohorts at both slice level and scan (volume) level, as is whether they are balanced.] We agree that some details regarding the data distribution were not given. For completeness, we provide all the information here. We have 1150 scans from the RSNA dataset with labels at both scan (volume) and slice level. The slice-level labels are used for testing only, not for training. The scans are split into 1000 for training (P:411, N:589), corresponding to a total of 34496 slices (P:4976, N:29520), and 150 for testing (P:72, N:78), corresponding to 5254 slices (P:806, N:4448). The training and testing datasets are well balanced at scan level, so our AUC-ROC score is unbiased. Although the two classes are imbalanced at slice level, this does not affect the main results of this paper, i.e., achieving high performance at scan level. We also evaluated our model on an external CQ500 testing dataset to show generalization. This dataset has 490 scans (P:205, N:285) without slice labels. Some of this information was covered in the paper, and additional details will be added to Section 3.1 of the final version.

[(R1, R2, R3) The results were not reported over multiple runs (no standard deviations / error bars).] We will report the results of five runs with different seeds for each experiment in Tables 1 and 2 in the final version. Overall, our model (Att-CNN+VGPMIL) used for the comparison in Table 2 achieves a scan-level AUC of 0.964±0.011 for RSNA and 0.905±0.011 for CQ500. The low standard deviation demonstrates the reliability and robustness of our model. Other mean and std values also support the consistency of our method.

[(R3) It was not reported whether the results in Table 2 were obtained at scan or slice level.] We are pleased the reviewers acknowledged our good results, but we understand the metrics need further clarification. The CQ500 dataset has no slice labels, so all results related to CQ500 in Table 2 (including our reported 0.96 AUC and those of other methods) are at scan level. The only difference is the type of labels in the training dataset: the type “Scan” means we only have labels at scan level in our training dataset, while the type “Slice” means we have training labels at slice level. Notice that training at slice level [6,12] is an easier, fully supervised task, as it involves more labels than scan-level methods. In Table 1, the slice-level and scan-level metrics are clearly separated by columns.

[(R2, R3) The comparison with existing state-of-the-art methods in Table 2 was not done on the same dataset.] The bottom part of Table 2 shows the results for the CQ500 dataset. Here, our method is directly comparable to the other methods, as the same data is used for evaluation. We want to stress that our model (AUC: 0.905±0.011) outperforms the competing scan-level method [11] while showing comparable performance to methods [12][6] that use the fully labeled dataset (at slice level). The upper part of Table 2 shows a comparison to other scan-based methods. Unfortunately, the existing scan-based methods share neither their data nor their code, which makes it hard to compare directly on the same test set. Nevertheless, by comparing our method with theirs on the same problem using the same metric, we are able to show that our result is competitive. We report our results on the public RSNA dataset so that future researchers can easily compare against our performance. In addition, the code will be published with the final version.

[(R1) The prediction time was not provided.] Our model takes on average 2.5 s to predict a full scan of one patient.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Reviewers overall comment favorably on the paper. Concerns raised by R3 regarding missing details on experimental setup that lead to a lack of interpretability of the results were sufficiently addressed in the rebuttal and I think these concerns can be clarified in a revised version of the paper, in case of acceptance.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    10



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Most of the concerns from reviewers have been well addressed in the rebuttal. However, the authors did not answer regarding a comparison to the method in Ref 15. I do not see this as crucial to the decision toward acceptance.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    10



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This submission proposes to combine multiple instant learning with Gaussian processes to detect CT hemorrhage. The authors addressed most of the concerns raised by the reviewers. Therefore, this manuscript is recommended for acceptance.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    9


