
Authors

Ting Xiao, Han Zheng, Xiaoning Wang, Xinghan Chen, Jianbo Chang, Jianhua Yao, Hong Shang, Peng Liu

Abstract

Intracerebral hemorrhage (ICH) is the deadliest type of stroke. Early prediction of stroke lesion growth is crucial in assisting physicians towards better stroke assessments. Existing stroke lesion prediction methods mainly target ischemic stroke. In ICH, most methods only focus on whether the hematoma will expand, not on how it will develop. This paper explores the new, previously unstudied topic of predicting ICH growth at the image level based on the baseline non-contrast computerized tomography (NCCT) image and its hematoma mask. We propose a novel end-to-end prediction framework based on displacement vector fields (DVF) with the following advantages. 1) It can simultaneously predict the CT image and hematoma mask at follow-up, providing more clinical assessment references and surgical indications. 2) The DVF regularization enforces a smooth spatial deformation, limiting the degree of stroke lesion change and lowering the requirement for large data. 3) A multi-modal fusion module learns high-level associations between global clinical features and spatial image features. Experiments on a multi-center dataset demonstrate improved performance compared to several strong baselines. Detailed ablation experiments are conducted to highlight the contributions of the various components.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87240-3_71

SharedIt: https://rdcu.be/cyl6O

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    In patients with intracerebral hemorrhage (ICH), most methods focus on binary predictions, such as whether the hematoma will expand or not, without knowing how it will develop. This study designed a new framework for predicting ICH growth at the image level, using baseline non-contrast computerized tomography (NCCT) images and the corresponding hematoma masks as input, along with clinical metadata of individual patients. The approach takes advantage of a technique typically used in image co-registration, displacement vector fields (DVF), to improve prediction reliability and accuracy. A similar approach has previously been used by others for predicting infarct growth. Collective findings showed that the proposed method had relatively higher accuracy and lower prediction error than a common approach based on 3D U-Net.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) Integrating DVF into the prediction framework for ICH growth is new. 2) Performed cross-validation using 4 of the 5 datasets from different imaging centres to improve the generalizability of the technique. 3) Presented a new way of combining imaging and clinical features in deep learning. 4) Developed strategies to identify the importance of input variables from clinical data, providing further understanding of the models, particularly for a general audience.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) The segmentation accuracy is about average, with Dice scores ranging from 0.69 to 0.73 in cross-validation using different datasets. 2) There is no true ‘external’ validation, although cross-validation provided reasonable results.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    There is no mention of data or code availability. The methodologies are mostly clear and understandable, except in a few areas. For example, it is not very clear how the importance of clinical features is calculated, which makes it difficult to repeat. In addition, the acquisition protocols of the imaging datasets are not presented, although it is said that the method performs well with thin-slice images.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    The study overall is interesting, which is strengthened by mostly clear writing and design. Further improvement is possible, especially regarding the following points: 1) In Section 2.2, ‘… the channel has only 3 feature maps…’. It would be clearer to specifically mention the ‘imaging’ channels. 2) In the same section, should it be ‘1x3’ in “… and obtain four 1x4 image feature vectors…”? 3) In Fig. 2, there is a ‘padding’ column, but that is not described in the text. 4) How were the 300 epochs chosen? 5) How was co-registration success determined; were there any specific criteria? 6) In Section 3, under ‘Results comparison with baseline’, it is unclear why ‘CT1’s hematoma mask [is used] as the predicted mask (benchmark)’. In the same section, what does this sentence mean: “The hematoma volume predicted by our method and the 3D U-Net is larger than the GT” – any explanations? 7) Under ‘Case study’ in Section 3, much of the content belongs in a figure caption more than the main text. 8) As mentioned above, the explanation of ‘the importance analysis of clinical features’ could be clarified further. Finally, proofreading the entire article to ensure it is free of grammar issues would be helpful.

  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well written and addresses an unsolved clinical problem. The technologies are relatively novel, including new implementations of clinical and imaging feature fusion, addition of DVF, cross-validation of framework, and detection of the importance of clinical input. Minor limitations are present as listed above.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    The authors present a novel end-to-end prediction network for modeling hemorrhage growth. Prior methods mostly assessed growth as a binary variable, which is of course very limited. This work aims to take account of shape changes by modeling displacement vector fields. The U-Net-based method predicts a displacement vector field which is applied to the input image and mask to output a CT scan and hematoma mask at follow-up. A fusion block is introduced to fuse clinical meta data into the U-Net's latent space. Experiments on multi-center datasets showed improved performance. The innovation is not so strong, but the validation is very strong.
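    For intuition, warping an image by a dense DVF (the spatial-transform step this review describes) can be sketched as follows. This is a minimal 2D illustration using SciPy's `map_coordinates`, not the authors' implementation; the function name `warp_with_dvf` is hypothetical.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_with_dvf(image, dvf, order=1):
    """Warp a 2D image by a dense displacement vector field.

    image: (H, W) array; dvf: (2, H, W) array of per-pixel
    displacements (dy, dx). order=1 gives linear interpolation,
    suitable for images; order=0 would suit label masks.
    """
    h, w = image.shape
    grid_y, grid_x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Sample the input at (identity grid + displacement).
    coords = np.stack([grid_y + dvf[0], grid_x + dvf[1]])
    return map_coordinates(image, coords, order=order, mode="nearest")

# A zero displacement field leaves the image unchanged.
img = np.arange(16, dtype=float).reshape(4, 4)
assert np.allclose(warp_with_dvf(img, np.zeros((2, 4, 4))), img)
```

    In the paper's setting the DVF is 3D and predicted by the network, and the same field is applied to both the CT image and the hematoma mask.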

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Validation is strong: Multiple evaluation measures, valuable ablation study, valuable multi-center evaluation
    • Next to imaging data, the method integrated clinical meta data.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The deep learning network proposed is relatively standard. The paper is not very innovative on the methodology side.
    • What remains unclear to me is why the authors address this “new clinical problem in predicting the hematoma growth in ICH”. Why haven’t others addressed this previously: was it impossible before (because of a lack of available data/methods), or is it not very clinically relevant?
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Data and code seem not to be available, but the work is well described.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • Results: “Sex is the top factor. One possible explanation is the difference in living habits may cause that sex differences.” Is there any proof that the sex difference in hematoma growth is related to living habits? If not, remove this claim. If yes, please be more specific.
  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I really appreciate all the different aspects that are addressed in the validation study. However, on a methodology side there is not something I learned from this work.

  • What is the ranking of this paper in your review stack?

    4

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain



Review #3

  • Please describe the contribution of the paper

    This paper presents a method for predicting intracerebral hemorrhage (ICH) growth from a baseline CT, hematoma mask, and clinical metadata. The method uses a deep learning registration framework to predict the displacement vector fields (DVF), so that the predicted CT image and mask are computed from the warped baseline images. The approach is tested using ~400 subjects from multiple imaging sites, using a single split for comparison to baseline methods and ablation studies. Leave-one-site-out experiments to test generalization and an analysis of important clinical measures are also presented.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper proposes a new application of registration-geared deep networks for prediction of ICH growth.

    2. The method combines imaging with clinical measures to perform the prediction.

    3. The authors performed tests for generalization using leave-one-site-out experiments.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. There are details missing in the methods section, particularly in explaining the implementation of the primary competing approach (Unet).

    2. The experimental evaluation is missing comparisons that would highlight the advantage of their model design choices.

    3. There are no tests for significant differences for the experimental results, making it difficult to assess improvements.

    4. The experimental results do not really support the stated conclusions, in my opinion.

    Please see Q7 for details on these points

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Dataset is private, from multiple institutions.

    It appears that the code will not be shared (responded N/A to all code questions).

    While authors checked “yes” for “Details on how baseline methods were implemented and tuned,” I am not seeing these details.

    Authors did not perform significance testing to analyze differences between methods.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Comments in order of the paper:

    1. The resampling approach along the z-dimension seems strange, where the authors “duplicate slices 1-2 times before resampling”. Why not just interpolate and resample the data as given?

    2. The dataset of ~400 patients undergoes one split for the experiments comparing to baseline methods and ablation study. Results would have been stronger/more convincing if a k-fold cross-validation approach were used.

    3. The authors refer to the use of the hematoma mask at baseline as the predicted mask at follow-up as the “benchmark” approach. This does not seem like a real benchmark.

    4. One of the baseline approaches for comparison was “a standard 3D Unet for segmentation”. No other details were given about this comparison method. Was the architecture the same as the encoder/decoder portion of the proposed model? What are the optimization details? Was it trained for segmentation using both time 1 and time 2 images (which seems fairer, so all data is used), or only time 2?

    5. Another value that would have been good to use for comparison is the “ground truth” result obtained by using the displacement field from the image registration to warp the ground-truth mask at baseline. While of course this cannot be used for prediction in reality, the results from this “oracle” method would give readers a sense of the upper-bound, best-case overlap that may be obtained.

    6. The ablation studies presented do not test the combinations that would help demonstrate the advantage of their design choices. For example, an experiment with Img alone should have been tested, since this is one of the 3 main components of the loss function. Also, there should have been an experiment with DVF + Img without clinical metadata, since one of the main contributions proposed is the inclusion of such clinical data. These 2 experiments, plus the one the authors did perform using DVF alone, would have been the most informative.

    7. An additional experiment that would have been nice to see is testing the model for metadata fusion against e.g. simple concatenation of the metadata information to the encoded image information, since the authors claim in the introduction that such an approach is not sufficient, presumably leading the authors to the proposed fusion method.

    8. In Table 2, the Dice values appear largely the same, given the standard deviation values and the lack of significance testing for differences.

    9. For experiments on external test set, it would have been nice to see the number of subjects for each center.

    10. For the external test set, the authors note that “the model performs worst on center 5” because of larger z slices. While this is likely contributing, the strange resampling used may also be playing a role in the degradation.

    11. The AAEV in Table 4 for the external test sets is roughly 2 times larger than in the results using all sites in Table 1. To me this suggests that the out-of-site generalization is not as good, contrary to the “strong generalization ability” concluded by the authors.

    12. While overall fairly well written, the paper would benefit from a review for language/clarity. Some examples: p.2 “wrap” –> warp; p.3, Eq. 3: I wonder if the notation is swapped, and P_i should be the predicted image while I_2 was meant to be the ground truth; p.4, last line, “averagely group slices” – it is not clear what this means.

  • Please state your overall opinion of the paper

    probably reject (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors are addressing an interesting and challenging problem of predicting hematoma progression for stroke prediction, and they apply registration-style methods to this problem. However, I trend toward reject due to: the lack of some major details about the comparison baselines; missing important ablation experiments that would help justify the presented model design; low strength of evidence for the experimental results (with no significance testing and no obvious differences); and unsupported conclusions (e.g., the large increase in error with external tests and the large drop in Dice for 1 site do not support good generalization).

  • What is the ranking of this paper in your review stack?

    4

  • Number of papers in your stack

    6

  • Reviewer confidence

    Very confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper is praised by all reviewers for the novelty of the framework proposed to combine imaging and non-imaging information. However, the limited improvement, lack of proper validation of the improvement, and absence of statistical testing do not allow the conclusions of the paper to be properly justified.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    5




Author Feedback

We thank all reviewers for the acknowledgement of the novelty of our paper. The main criticism is from Reviewer #4 on experiments, with major concerns explained as follows:

Q1, 7.3, and 7.4: Lack of details of comparison baselines. Answer: Taking the CT1 hematoma shape as the prediction result is a reasonable baseline, as in most cases the hematoma on CT2 shows only slight changes compared to CT1’s hematoma; thus CT2 appears much the same as CT1. The architecture of the baseline 3D U-Net is the same as our encoder/decoder portion, except for the last output layer of the decoder. Its output is a segmented mask. Both CT1 and CT2 are used to train the 3D U-Net.

Q2 and 7.6: Two missing ablation experiments. Answer: Each loss term is relatively independent in our method. Our ablation study, which removes each term one by one, is adequate to prove the effectiveness of each component.

Q3 and 7.8: No statistical testing in Tab. 2. Answer: We add a two-sample t-test. We abbreviate the 4 ablation experiments as 1-4 and use Pi(x,y) to represent the p-value of methods x and y on metric i; a p-value below 0.05 indicates a significant difference. For Dice, P1(1,2)=0.094, P1(2,3)=0.037, P1(3,4)=0.016. For Doc, P2(1,2)=0.039, P2(2,3)=0.048, P2(3,4)=0.0019. For AEV, P3(1,2)=0.38, P3(2,3)=0.51, P3(3,4)=0.024. For AAEV, P4(1,2)=0.53, P4(2,3)=0.027, P4(3,4)=0.019. These results demonstrate that: 1) fusing clinical metadata improves Dice, Doc, and AAEV significantly; 2) further adding image prediction significantly improves all metrics; 3) data augmentation is relatively less effective, significantly improving only Doc.
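A two-sample t-test of this kind can be reproduced with SciPy's `ttest_ind`; the per-case metric values below are synthetic placeholders (the paper's per-case results are not public), so only the procedure, not the numbers, matches the rebuttal.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic per-case Dice scores for two ablation settings
# (placeholder values, not the paper's data).
rng = np.random.default_rng(42)
dice_setting_1 = rng.normal(loc=0.70, scale=0.05, size=30)
dice_setting_2 = rng.normal(loc=0.73, scale=0.05, size=30)

# Two-sample (independent) t-test, as used for the reported p-values;
# p < 0.05 is taken as a significant difference.
t_stat, p_value = ttest_ind(dice_setting_1, dice_setting_2)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, significant: {p_value < 0.05}")
```

If the same cases are evaluated under both settings, a paired test (`scipy.stats.ttest_rel`) would arguably be the more appropriate choice.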

Q4, 7.1, and 7.10: Z-axis resampling may degrade performance in Tab. 3, C5. Answer: We tried interpolation for data resampling, which resulted in worse performance. This can be explained as follows: interpolation can generate fake image patterns, such as a blurred transition at the edge of the hematoma, which do not physically exist and may confuse model training. Our method avoids this problem. The performance degradation in Tab. 3 is because the model does not perform well on thick-slice CTs, which contain less detailed information compared to thin-slice CTs. In fact, we tried to model thick and thin CTs separately, but the thick-CT model still suffers.
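The two z-axis resampling strategies under debate can be sketched on a toy volume: slice duplication keeps only intensities that were actually acquired, while linear interpolation synthesizes intermediate values. This is an illustrative sketch, not the authors' preprocessing code.

```python
import numpy as np
from scipy.ndimage import zoom

vol = np.random.default_rng(0).random((10, 64, 64))  # toy thick-slice volume: 10 axial slices

# Slice duplication: each slice is repeated, so every voxel value in the
# output already exists in the input -- no synthetic intensities.
duplicated = np.repeat(vol, 2, axis=0)        # shape (20, 64, 64)

# Linear interpolation along z: smoother, but creates intermediate
# intensities (e.g. blurred hematoma edges) that were never acquired.
interpolated = zoom(vol, (2, 1, 1), order=1)  # shape (20, 64, 64)
```

The rebuttal's argument is that the blurred transitions produced by the second variant may confuse training, which is why duplication was preferred.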

Minor concerns are explained as follows:

7.2: No cross-validation. Answer: As stated in Sec. 3, Para. 2, our results were averaged over 4 runs. For each run, the training/validation/test sets were randomly split. Our evaluation approach is also very common, especially when the dataset is relatively large.

7.5: Lack of comparison with the “ground truth” obtained by using the displacement field from the image registration. Answer: The reviewer misunderstood our method as a “registration-geared deep network”. Our method and registration share the DVF component, as a form of spatial transform; however, the two are essentially different, since the input to registration is a data pair, while the input to our method is only CT1, as a prediction task. Therefore, such a comparison is unnecessary.

7.7: Meta-feature concatenation. Answer: Empirically, simple feature concatenation is not as effective as our fusion method. We did not include details due to limited space. A major drawback of feature concatenation is that the network may choose to ignore the metadata and rely entirely on image information, as metadata is more abstract and useful features may be hard to extract from it at an early stage of training. In contrast, we force each image feature to be associated with meta-features to ensure the utilization of metadata.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The proposed solution appears novel and the experimental issues currently present in the manuscript have been well justified in the rebuttal. The inclusion of statistical testing of the results further strengthens the claims of the paper

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    8



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I follow reviewer #4’s comments and feel that the rebuttal clarified some questions from #4, but the problems are still there:

    1. The two methods compared (i.e., the benchmark and 3D U-Net; the first is a copy of the CT1 mask, and the second is a segmentation network, presumably learned from CT1 and CT2) are really not “benchmarks” to compare with: they do not have any “prediction” components.
    2. Ablation study: the current ablation study really does not focus on the major contribution and cannot persuade the audience. The important contributions are the prediction of the DVF and the use of meta-data for prediction, but the ablation study mostly compared the use of images based on segmentation. Naturally, img+seg should be the baseline, to which augmentation and meta-data are added for fair comparison. From the new statistical test results, it is still difficult to figure out the final contribution of meta-data. Really, a fine-tuned network with all the features but without meta-data, and another one with meta-data, are what is needed.
    3. The duplication of slices before interpolation is really weird to me; it possibly needs some justification, or the authors could try some other interpolation, e.g. spline-based.
    4. The paper lacks some details, such as “co-registration using ANTs”: please mention whether only a rigid or global transformation is used.
    5. Another fundamental question about this topic is: “hemorrhage could be bleeding within tissues, not deformation of a hematoma”; thus the motivation and rationale may need to be addressed well. The cropping of the brain resulting from the augmentation in Fig. 2 can be avoided by first performing orientation and then cropping.
  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    8



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper presents a novel end-to-end prediction network for modeling hemorrhage growth, which has not been explored in the existing literature. The integration of DVF and the fusion block for meta data are novel ideas. The authors have addressed the major concerns raised by the reviewers. Though the clarity and details of the paper could be improved in the final version (e.g., the number of runs for cross-validation; the definition of Doc seems wrong, where (Pm-M2) should be (Pm-M1), as this metric is the larger the better), the overall quality and contribution of this paper are relatively high.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    4


