
Authors

Sudhir Suman, Gagandeep Singh, Nicole Sakla, Rishabh Gattu, Jeremy Green, Tej Phatak, Dimitris Samaras, Prateek Prasanna

Abstract

With more than 60,000 deaths annually in the United States, Pulmonary Embolism (PE) is among the most fatal cardiovascular diseases. It is caused by an artery blockage in the lung; confirming its presence is time-consuming and prone to over-diagnosis. The utilization of automated PE detection systems is critical for diagnostic accuracy and efficiency. In this study we propose a two-stage attention-based CNN-LSTM network for predicting PE, its associated type (chronic, acute) and corresponding location (left-sided, right-sided or central) on computed tomography (CT) examinations. We trained our model on the largest available public Computed Tomography Pulmonary Angiogram PE dataset (RSNA-STR Pulmonary Embolism CT (RSPECT) Dataset, N=7279 CT studies) and tested it on an in-house curated dataset of N=106 studies. Our framework mirrors the radiologic diagnostic process via a multi-slice approach so that the accuracy and pathologic sequelae of true pulmonary emboli may be meticulously assessed, enabling physicians to better appraise the morbidity of a PE when present. Our proposed method outperformed a baseline CNN classifier and a single-stage CNN-LSTM network, achieving an AUC of 0.95 on the test set for detecting the presence of PE in the study.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87234-2_34

SharedIt: https://rdcu.be/cyl8u

Link to the code repository

N/A

Link to the dataset(s)

https://www.kaggle.com/c/rsna-str-pulmonary-embolism-detection/data


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents an attention-based two-stage network for CTPA classification. The authors tested their framework on a large public dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The problem is interesting and the authors tested their approach on a large dataset.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There is no comparison with prior methods, although this research problem has a rich body of recent related work.

    The proposed framework is incremental: a combination of well-known, established components (EfficientNet [21], BiLSTM [25], and an attention model).

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The work can be reproduced.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Please compare with state of the art methods

  • Please state your overall opinion of the paper

    borderline reject (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Weak experimental evaluation and limited contribution.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    This paper introduces a novel deep learning architecture for detection of pulmonary embolism from CTPA images. It describes the architecture, and then the performance on validation and external test datasets. Results appear to achieve state-of-the-art AUC values, with the caveat that a direct comparison with other algorithms was not possible.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper is very well written, and the methods are well-described. The architecture appears to be novel in that it uses a combination of traditional CNNs with LSTM and attention-gating and combines slice level predictions with image level predictions, thereby taking advantage of different annotation types.

    Another advantage is high clinical applicability of this method, as this is an algorithm that can really be used to improve radiological workflows and patient care rather than become another headache for the clinician.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weakness is that there is no head-to-head comparison with another architecture type, which would have shown the relative benefit of the proposed design.

    Another major weakness of the paper is that some of the results appear to improve dramatically between the validation and testing. There should be more discussion about why this may be happening, as it may indicate that the test dataset is biased or highly unbalanced. In light of this, it might not be appropriate to claim such an improved performance over PENet without a head-to-head comparison. At least mention that in the discussion.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    There is no provided code, but the network could be reproduced by a graduate student, and the training and validation data are open-source.

    How many slices were used for each training instance? Usually this has to be downsampled, as all slices can’t fit on the GPU. Was the CT cropped to the lungs and sampled at regular intervals? Not knowing this detail would make it hard to reproduce.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    “This CNN classifier is trained for multi-label prediction on the RSPECT [6] dataset to capture the properties for different study labels.”

    What labels? This should be specified earlier in the paper. Many readers are not familiar with the annotation labels used for PE and why they are important.

    These 3 sentences are redundant: “The Sequence model, consisting of a bidirectional long short-term memory (BiLSTM) [25] and a dense layer, is used to capture long-range dependencies in CT scans. It makes use of the extracted features from study slices using the trained CNN classifier from Stage 1 and passes them through a Bi-LSTM and a dense layer to provide the network additional contextual information about the global changes around the slices. The features extracted using CNN-LSTM network capture spatio-temporal information in the CT scan volume. The ‘temporal’ aspect refers to the global relationship between successive slices.”

    I wouldn’t choose to use the words “image” and “study” to describe a slice and the set of slices making a volume. “Study”, in DICOM terminology, can refer to multiple scans within the same imaging session. And sometimes “image” refers to the whole volume. Better to say “slice” and “series”.

    “ 3 expert readers” What is their specific expertise? Radiologist, Cardiologist, Pulmonologist?

    On figure 3, the colors used for part A and B should be the same for the same labels to avoid confusion.

    Figure 2 and the section on label consistency were a little confusing. Were these results implemented at inference time, or were they somehow included in the network training?

    In the validation dataset, the RV/LV ratio > 1 AUC is 0.700, but in the test dataset it is 0.957. How did it get so much better? That seems unlikely.

    Why is RV/LV > 1 and RV/LV < 1 both being considered categories since they are complementary (if one is true, the other must be false)?

    What is the difference between “PE present on image” and “study positive for PE”?

    “The results of the challenge are also suggestive of the fact that a 2D-CNN with a Sequence model works better than a 3D-CNN for this task.” I would try to not make strong conclusions here without a head-to-head comparison

    How many slices were used for each training instance? Usually this has to be downsampled, as all slices can’t fit on the GPU. Was the CT cropped to the lungs and sampled at regular intervals? This is important for reproducibility.

  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Important application, great writing and communication overall and especially in the methods, and strong performance compared to the state-of-the-art (even given the already mentioned caveats).

    Could have been slightly better in terms of some pre-processing details and explanation between differences in validation and test results.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #3

  • Please describe the contribution of the paper

    This paper proposes a two-stage approach to predict pulmonary embolism in CTPA images. Instead of 3D networks, the authors used a two-stage 2D CNN-LSTM to obtain better results. The proposed method was trained on a public dataset consisting of 7279 studies and was tested on an in-house dataset consisting of 106 studies. In addition, multi-instance learning is used as an attention module to enable prediction of study-level labels.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors propose a novel approach for detecting pulmonary embolism. They combine the multi-instance learning mechanism and the label-consistency constraint to generate clinically meaningful labels that characterize pulmonary embolism, not only its presence but also its location, associated type, etc. It is practical in clinical applications.
    2. The proposed method is evaluated on the largest publicly available PE dataset and an AUC score of 0.82 is obtained in detecting PE.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The comparison experiments are not well organized and are not strong. The authors only compared two baseline methods, i.e., a CNN classifier and a CNN-LSTM without attention, which can be regarded as an ablation study. Although PENet is mentioned as the state-of-the-art method in this paper, it is not compared under the same experimental settings.
    2. The authors only reported the AUC of the three compared methods on the in-house dataset (D3). Given that the RSPECT dataset (N=7279) is much larger than the in-house dataset (N=106), it makes more sense to report performance on the RSPECT dataset.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It is clear for readers to reproduce the paper based on its detailed description.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    1. The authors could partition an internal test dataset from the RSPECT dataset to evaluate the compared methods.
    2. The proposed method should be compared with PENet within consistent experimental settings.
    3. Since the Kaggle challenge uses loss (weighted log loss) as the main indicator, it is best to report the loss for a comprehensive assessment of all categories.
  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors propose a clinically meaningful method. However, the experiments need further improvement.

  • What is the ranking of this paper in your review stack?

    4

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Reviews are split and a rebuttal will be needed. The proposed "Attention based CNN-LSTM Network" has been exploited in the last few years by many previous works, so the technical novelty is quite limited. Using a large public dataset is a plus, but the final decision will be made after the rebuttal, based on whether it adequately addresses the most critical reviews.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    8




Author Feedback

We thank the reviewers for the excellent suggestions and for appreciating our framework’s clinical applicability to improve current workflows. We have provided responses to the major critiques.

R1, R2, R3: Comparison with state of the art methods A. As mentioned in the paper, our study-level performance is similar to PENet (state of the art); however, unlike PENet, our model provides slice-level predictions as well as predictions for associated PE attributes. Unlike previous works which have dealt with presence/absence of PE in a CT volume, we have multiple labels corresponding to each study and individual slices; therefore, we haven’t directly compared our slice-level results. As suggested by R2, we will mention this as a limitation. We have now compared our method against one of the best performing methods [https://www.kaggle.com/c/rsna-str-pulmonary-embolism-detection/discussion/193401] in the RSNA PE detection challenge. On D3, this method achieves AUCs of 0.66, 0.58, 0.61, 0.6, and 0.6 for prediction of study-level PE presence, left sided PE, right sided PE, RV/LV > 1, and RV/LV < 1, respectively. Our corresponding metrics were 0.95, 0.96, 0.94, 0.96, and 0.74, respectively. EfficientNet (for feature extraction) was found to be optimal and resulted in lower loss when compared to ResNet.

R1 1) Proposed framework is incremental. A. We respectfully disagree. As mentioned by R2, the hierarchical ‘combination of traditional CNNs with LSTM and attention-gating’ is novel. We believe the contribution is significant not only due to the large-scale multi-institutional validation, but also because of how the framework is designed to explicitly obey a clinically-defined label hierarchy. Our framework is among the first to identify forms of PE on each CTPA slice as well as other attributes such as laterality, chronicity, and the RV/LV ratio which can prove clinically useful in determining patient risk. Our model is also designed to mimic the human cognitive process when examining cross-sectional scans. Furthermore, the generated CAMs have been validated by radiologists.
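To make the attention-gated aggregation concrete, the following is a minimal, purely illustrative sketch of attention-based multi-instance pooling over per-slice features, in the spirit of the study-level aggregation described above. It is not the authors' implementation; the function name and the plain-Python representation of features are assumptions for illustration.

```python
import math

def attention_pool(slice_scores, slice_feats):
    """Illustrative attention-based MIL pooling (not the paper's exact layer):
    softmax-normalize per-slice attention scores, then compute the weighted
    sum of slice feature vectors to obtain one study-level representation."""
    # Softmax over the per-slice attention scores.
    exps = [math.exp(s) for s in slice_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of feature vectors -> study-level feature vector.
    dim = len(slice_feats[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, slice_feats))
              for d in range(dim)]
    return weights, pooled
```

In this sketch, slices the attention mechanism scores highly dominate the pooled study-level feature, which is how slice-level evidence can drive a study-level prediction.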

R2 1) Discussion regarding differences between validation/testing - may indicate that the test set is highly unbalanced. A. The test set is well balanced as shown in Table 1. As mentioned in the paper, a possible reason for improved performance may be attributed to the difference in slice thickness and the higher number of slices per study in D3 as compared to those in D1 and D2; this potentially provides more contextual information to the model. 2) Expertise of readers? A. The first two readers (>15 and >10 years of experience) are board-certified radiologists; the third reader (>3 years of experience) is a radiology resident. 3) Confusion re: label consistency. Results implemented at inference or included in training? A. Our model ensures logical consistency of the predicted labels using a constraint-based modified activation function. It is included in the network both during training and inference, as the modified activation is one layer of the model.
4) Why RV/LV > 1 and < 1 both considered categories since they are complementary? A. The RSPECT dataset provides two different labels - RV/LV > 1 and RV/LV < 1. It can be only one of these and must be present if at least one image is positive for PE in the study. 5) #slices used? Were CTs cropped/sampled at intervals? A. We use all the slices present for a given study. CTs were neither cropped nor sampled.
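As a rough sketch of the kind of label-consistency constraint discussed above (the paper's exact modified activation is not reproduced here; the function name and dictionary interface are hypothetical), one simple way to enforce the hierarchy is to cap each child-label probability by the parent study-level PE probability:

```python
def consistent_probs(p_pe, p_children):
    """Hypothetical label-consistency sketch: child-label probabilities
    (e.g. left-sided, right-sided, central, chronic, acute) are capped by
    the parent study-level PE probability, so the network can never predict
    a PE attribute more confidently than the presence of PE itself."""
    return {name: min(p, p_pe) for name, p in p_children.items()}
```

For example, with a study-level PE probability of 0.3, a child probability of 0.9 would be clipped down to 0.3, while probabilities already below the parent's are left unchanged. Because such a clipping operation is differentiable almost everywhere, it can sit inside the network as a layer during both training and inference, consistent with the authors' description.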

R3 1) AUC reported on in-house dataset. Makes more sense to report performance on the larger RSPECT dataset. A. As suggested, we have also reported our result on the validation set (D2 N=1455) which is a subset of RSPECT not used for training (see Fig 3). 2) Kaggle challenge uses weighted log loss; report this for a comprehensive assessment. A. We will include this in the updated manuscript. We achieved a loss of < 0.18 on the Kaggle test set and were ranked among the top X teams (anonymized) in the final leaderboard.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper has sufficient technical novelty (not extremely novel but sufficient). Both the method section and experimental section are well presented with adequate details and clarity. The evaluation performance is sufficiently good and compared against other methods/options. Overall it is a solid poster paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    3



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The main issue was novelty, because previous works have proposed CNN-LSTM models. The authors addressed this by explaining the architectural difference: the proposed method uses a hierarchical "combination of traditional CNNs with LSTM and attention-gating".

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    4



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper is well organized and written. The 2D CNN-LSTM structure with attention-gating is novel. I think the strengths outweigh the weaknesses, and the paper is adequate for MICCAI publication.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    8


