
Authors

Aishik Konwer, Joseph Bae, Gagandeep Singh, Rishabh Gattu, Syed Ali, Jeremy Green, Tej Phatak, Prateek Prasanna

Abstract

COVID-19 image analysis has mostly focused on diagnostic tasks using single time point scans acquired upon disease presentation or admission. We present a deep learning-based approach to predict the lung infiltrate progression from serial chest radiographs (CXRs) of COVID-19 patients. Our method first utilizes convolutional neural networks (CNNs) for feature extraction from patches within the concerned lung zone, and also from neighboring and remote boundary regions. The framework further incorporates a multi-scale Gated Recurrent Unit (GRU) with a correlation module for effective predictions. The GRU accepts CNN feature vectors from three different areas as input and generates a fused representation. The correlation module attempts to minimize the correlation loss between hidden representations of concerned and neighboring area feature vectors, while maximizing the loss between the same from concerned and remote regions. Further, we employ an attention module over the output hidden states of each encoder timepoint to generate a context vector. This vector is used as an input to a decoder module to predict patch severity grades at a future timepoint. Finally, we ensemble the patch classification scores to calculate the patient-wise grades. Specifically, our framework predicts zone-wise disease severity for a patient on a given day by learning representations from the previous temporal CXRs. Our novel multi-institutional dataset comprises sequential CXR scans from N=93 patients. Our approach outperforms transfer learning and radiomic feature based baselines on this dataset.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87240-3_79

SharedIt: https://rdcu.be/cyl6W

Link to the code repository

https://github.com/AishikKonwer95/Prog_Cxr_corrGRU

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This study proposes a multi-scale GRU model that takes in serial chest X-ray data of COVID-19 patients to predict the severity of COVID-19 lesions in 6 spatial lung regions.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • the paper is well presented, the data set well defined, and the model is described appropriately.
    • this model would also be generalizable to other situations with serial lung imaging data or beyond.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • it’s not clear how clinically useful the model would be if several time points are needed to predict the severity.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    the authors have a proper evaluation method, the results appear robust and as such are likely to be reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    I would recommend that the authors put Figure 2 at the beginning of the methods section and refer to it before describing the model in detail; this will make it easier for the reader to understand the method.

  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • the paper is explained well and the clinical use case is correct.
    • the results are interesting and properly presented.
  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain



Review #2

  • Please describe the contribution of the paper

    The paper presents a method for predicting COVID-19 severity scores for 6 lung zones given a time sequence of chest images. The method is composed of the following key components: (1) an extended GRU architecture that operates on a sequence of patch triplets coming from different lung zones, (2) a correlation loss for the patches, and (3) an attention module on the time sequence.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Based on the authors’ claim, and a small search I did, there is no previous study that classifies the progression of COVID-19 based on a sequence of chest X-ray images.
    2. The presented architecture seems sound, as well as the overall analysis.
    3. The correlation loss seems novel.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The overall idea of applying CNN-RNN (with LSTM/GRU) for learning from longitudinal image data is not novel, e.g., see Xu, Yiwen, et al. “Deep learning predicts lung cancer treatment response from serial medical imaging.” Clinical Cancer Research 25.11 (2019): 3266-3275.
    2. The use of attention module with RNN for learning the importance of timepoints is also quite prevalent.
    3. The experiments in the study are based on a small dataset of 93 patients.
    4. The labels, which are based on human annotations, do not seem standard, and there is no information on them apart from the fact that they correspond to 3 levels of severity. Therefore, it may be difficult to reproduce this study or compare its results to other studies.
    5. Finally, there are some unclear details on the method and experiments, which are elaborated below.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    There are missing/unclear details in the description of the method, and therefore it would be difficult to implement it.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Major comments:

    1. Can you provide more details on the meaning of the severity, e.g., the associated risk with each severity level?
    2. The handling of time values was unclear to me at the beginning, and only at Section 3.2 (implementation details), the following information appears: “We used pack padded sequence to mask out all losses that surpassed the required sequence length. Thus, we could nullify the effect of missing timesteps for a patient in the dataset.” I am familiar with pack_padded for handling sequences of unequal size, but how do you handle missing observations that are in the middle of the sequence? If there are no missing observations in the middle (i.e., the observations for each patient are from consecutive days without gaps), then why do you use the notation t_1, t_{d} to denote time values, and not simply 1, …d ? Another option is that the value of the timepoint is taken into account by the model, but this is not mentioned. Please revise the text to explain this.
    3. Patch selection: the definition of neighbor and remote zones is unclear. Why is R1 considered a neighbor of L1, but R2 is not considered a neighbor of L2? Also, what are the remote zones for the middle area L2?
    4. Section 2.4, Multi-Scale GRU Section: need to update that w^i_t, i=1,2,3, are also learned parameters.
    5. Multi-Scale GRU Section: “X_t is the feature vector of each patch”. X_t is not mentioned. Maybe you intended to explain X^i_t, i=1,2,3, instead of (the non-existent) X_t?
    6. Section 2.4, Correlation module: how is the correlation computed? As Pearson correlation (i.e., normalized dot product)?
    7. Equation 9: The equation should specify on which elements the maximum and minimum are computed. After reading the paper I assume that these are computed on all patients, independently for each patch.
    8. Section 2.4, Attention module: the reference to the number of analyzed time points is confusing. For example, consider the phrase “the available t_{d-1} timepoints”: isn’t the number of processed time points equal to (d-1)? If you do process t_{d-1} timepoints, what is done for timepoints with unobserved images?
    9. Decoder Section: “For each patient, we predict 16 such patch classification scores”. To make this clearer I suggest rephrasing: “For each patient AND ZONE, we predict 16 …”
    10. Why doesn’t the method (encoder/decoder) use the image from timepoint t_d?
    11. Section 3.1 Dataset description: can you provide some statistics on the labels and sequence length?
    12. First baseline is unclear: at the start it is mentioned how to extract a feature vector from time points t_1, …, t_{d-1}: “we obtained a P X 4096 feature vector where P denotes the total number of patches extracted for a patient from the L1 zones of images collected from multiple timepoints t_1, t_2,…,t_{d-1}.” These features, from P patches, are aggregated with a simple average. Therefore, it is unclear to what the majority voting is applied in the next step. Maybe the P patches include only one patch per timepoint, and the majority vote is done on patches? In this case P would be t_{d-1}.
    13. Second baseline is unclear as well: were the radiomics features extracted only from the last timepoint? If not, how did you handle the variability in sequence length?

    Minor comments:

    1. Methodology Section: “The images corresponding to these D timepoints” ==> “d timepoints”
  • Please state your overall opinion of the paper

    borderline reject (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The application of the proposed study seems novel (COVID-19 related predictions from time sequence of images), and the overall architecture and analysis seem sound, although not very novel. As there are some unclear parts in the paper, its presentation should be improved.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    3

  • Reviewer confidence

    Somewhat confident



Review #3

  • Please describe the contribution of the paper

    This paper presents a deep learning-based approach to predict the lung infiltrate progression from serial chest radiographs (CXRs) of COVID-19 patients.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This proposed algorithm exploits the temporal and spatial dependencies of CXR findings to predict COVID-19 progression
    2. The multiscale approach allows for this method to accept inputs from different regions at the same timepoint
    3. Overall, this method outperformed the radiomics and transfer learning approaches
    4. This method is very robust and doesn’t require image registration for images at different time points
    5. The research addresses a very relevant topic in COVID-19 progression and prognosis, as it has affected the world and needs as much research as possible to combat it
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. A comparison between the computational complexities (or time spent running the models) between the proposed and the baseline methods would help show another strength of the model, or weakness
    2. Limitations and weaknesses of the model should be given to show what could be improved upon for future research
    3. Accuracy is a good metric, but it isn’t the only metric that tells the story of the strength of a method
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors give the structure of the network, the parameters used, and the folds of cross validation.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    This is a very good paper, and it touches on a topic that has crushed the world. This work could be accepted off the strength of the topic alone, as well as the substance of the algorithm. I would’ve liked to see more data and comparisons between the proposed method and the baselines.

  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The topic alone is worth fast tracking the paper. It stands on its own merit as an algorithm, notwithstanding. I enjoyed the paper, and outside of my criticisms it needs to be displayed.

  • What is the ranking of this paper in your review stack?

    4

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper incorporates a multi-scale GRU module, a correlation loss, and an attention module for COVID-19 progression prediction based on a sequence of X-ray images. Although not very novel, the overall method is sound, and the results outperform those of radiomics and transfer learning. The overall methodological contribution is positive, but the critiques from the reviewers need to be addressed properly.

    Please refer to the detailed constructive comments of Reviewer #4 for possible improvement.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    3




Author Feedback

We thank the reviewers for their insightful comments and feedback on improving the quality of our manuscript. Below we respond to the major concerns. We will incorporate these responses in the paper/supplementary as necessary.

1) The labels do not seem standard. More details on severity levels (R4) A. Each score was determined by agreement among three expert readers (≥15, ≥3, and ≥2 years of experience, respectively). This system mirrors the formulation of other scoring systems (Kwon et al. Radiology: AI 2020, Balbi et al., Eur Rad 2020). 0 is assigned for a lung zone in which there are no radiographic findings; 1 - existence of ground glass opacities; 2 - opacities with confluent bronchograms.

2) How to handle missing observations in the middle of the sequence? (R4) A. Timepoints t_1, t_2, …, t_d are not consecutive timepoints. The durations between these timepoints are not factored into our model. We only handle unequal sequence lengths, as described in the Implementation section.

3) Definition of neighbor, remote zones is unclear. (R4) A. Since imaging findings suggest that COVID infiltrates spread gradually from lower to upper zones, we considered L1 and L3 as neighbors of L2. Neighbor patches (Np) are patches from the closer boundaries of the neighboring zones, and remote patches (Rp) are patches from their far-off boundaries. For example, Np of L2 were extracted from the closer boundaries of L1 and L3, whereas Rp of L2 were extracted from the far-off boundaries of L1 and L3.

4) Updates in the equations of Multi-Scale GRU Section (R4) A. We shall update that w^i_t, i=1,2,3, are also learned parameters. Also, X^i_t, i=1,2,3 are the CNN feature vectors of patches from the three primary, neighbor, and remote zones.

5) How is the correlation computed? (R4) A. Pearson correlation coefficient has been used. For all patients, independently for each patch from Pp and Np zones, we maximized the correlation function. Similarly we minimized the correlation function for each patch from Pp and Rp zones.

6) Why doesn’t the method use the image from timepoint t_d? (R4) A. We use the encoded representation from the first d-1 images to predict the severity scores at timepoint t_d.

7) Dataset statistics on labels and sequence length? (R4) A. We will include a figure in the Supplementary showing the normalized distribution of severity grades across all timepoints.

8) Attention module: the reference to the number of analyzed time points is confusing (R4) A. The number of processed timepoints equals ‘d-1’ and not ‘t_d-1’. This will be corrected in the manuscript. All observed timepoints t_1,…,t_d are associated with an image. As mentioned in #2 above, these timepoints are not equally spaced.

9) First and second baselines not clear. (R4) A. In Baseline 1, each patient had 16 sequences since an image from one timepoint was divided into 16 grids. Majority voting is done on the output severity scores for each such sequence in order to obtain the final patient-level severity score. We utilized only one timepoint per patient as input in our radiomic approach. Since the approach was not temporal, we did not have to handle variability in sequence length.

10) The experiments are based on a small dataset of 93 patients. (R4) A. Public datasets have limited temporal information. Our dataset is unusual in that it comprises temporal scans; the total number of unique images from the N=93 patients is 621. Owing to the limited N, we evaluated our methodology in a cross-validated fashion.

11) A comparison between the computational complexities (R5) A. We implemented our framework on a server with an 11 GB NVIDIA RTX 2080 Ti GPU. Each model in the proposed approach was trained in ~3.4 hours for 30 epochs. Baselines 1 and 2 took ~2 hours and ~1.25 hours, respectively.

12) Acc. is a good metric, but should not be the only metric. (R5) A. Besides accuracy, our analysis included commonly used metrics such as Precision and Recall. We also provided the kappa scores.
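
As a reading aid for responses 5 and 9 above, the following is a minimal sketch of a Pearson-based correlation objective and of majority voting over patch-level grades. It is not taken from the authors' repository: the function names (pearson, correlation_loss, majority_vote), the way the two correlation terms are combined into a single scalar, and the toy vectors are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors.

    Assumes neither vector is constant (non-zero variance).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_loss(h_primary, h_neighbor, h_remote):
    """Hypothetical scalar objective (an assumption, not the paper's Eq. 9).

    Minimizing it raises the primary/neighbor correlation and lowers the
    primary/remote correlation, matching the direction stated in response 5.
    """
    return pearson(h_primary, h_remote) - pearson(h_primary, h_neighbor)


def majority_vote(patch_grades):
    """Return the most frequent severity grade among patch predictions."""
    return Counter(patch_grades).most_common(1)[0][0]


# Toy hidden states: neighbor perfectly correlated, remote anti-correlated.
h_p = [1.0, 2.0, 3.0]
h_n = [2.0, 4.0, 6.0]
h_r = [3.0, 2.0, 1.0]
loss = correlation_loss(h_p, h_n, h_r)

# Toy patch-level grades on the 0/1/2 scale described in response 1.
patient_grade = majority_vote([0, 1, 1, 2, 1])
```

In this toy case the loss reaches its lowest possible value, since the neighbor state is perfectly correlated with the primary state and the remote state is perfectly anti-correlated. In the actual model the correlation terms would be computed on GRU hidden representations and combined with the classification loss, with details that are not specified on this page.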


