
Authors

Benjamin Hou, Georgios Kaissis, Ronald M. Summers, Bernhard Kainz

Abstract

Chest radiographs are one of the most common diagnostic modalities in clinical routine. They can be acquired cheaply, require minimal equipment, and can be interpreted by any radiologist. However, the number of chest radiographs obtained on a daily basis can easily overwhelm the available clinical capacities. We propose RATCHET: RAdiological Text Captioning for Human Examined Thoraces. RATCHET is a CNN-RNN-based medical transformer that is trained end-to-end. It is capable of extracting image features from chest radiographs, and generates medically accurate text reports that fit seamlessly into clinical workflows. The model is evaluated for its natural language generation ability using common metrics from the NLP literature, as well as for its medical accuracy through a surrogate report classification task. The model is available for download at: http://www.github.com/farrell236/RATCHET.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87234-2_28

SharedIt: https://rdcu.be/cyl8o

Link to the code repository

http://github.com/farrell236/RATCHET

Link to the dataset(s)

https://physionet.org/content/mimic-cxr/2.0.0/


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents a new system named RATCHET to automatically generate free-text radiological reports for chest X-rays. The system consists of a CNN followed by an RNN architecture and uses ideas from transformers. The experiments and evaluation are done using ~300K images from the MIMIC database. According to the metrics used in the paper, RATCHET produces better reports than other competitors.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The system integrates in a novel way the image features extracted with the CNN into the pipeline of a transformer-based RNN, where each predicted token has a corresponding response in the image feature maps. This enables the system to provide, for each new word, an associated attention map indicating which part of the image contributed to generating that word.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    While the paper is really well written and easy to follow, the authors skipped most of the details about the implementation of their approach (see list below). In addition, it is not clear whether at test/inference time the authors also used the report as an input alongside the image. This is key to understanding whether the system is useful: a system that predicts the radiological report using the radiological report itself is not useful. In addition, the evaluation (or at least the discussion) does not seem fit for purpose. There is no mention of the fact that, in some examples, key radiological findings are missing from the predicted reports. For example, in case 2, RATCHET missed edema, and in case 3, it misses lung metastases.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors used a publicly available database and its predefined splits. However, it is impossible to reproduce the experiments since there are not enough details about the different parts of the system, e.g., the size of the vocabulary used.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    While the paper is really well written and easy to understand, the authors skipped the details of what exactly they did. Here is a list of missing (key) information:

    • How do you apply your system at inference time? It seems you always use the radiological report as an input.
    • What is the vocabulary size?
    • How do you determine the sequence length to be predicted? When do you stop generating new words/tokens?
    • What is d_k in equation (2)? What are d_{model} and d_v in the Masked Multi-Head Attention module?
    • How did you choose the unique AP/PA views used in the experiments?

    The claim about the images being symmetric is completely wrong. The human chest is not symmetric at all; for example, the heart is usually positioned on the left side, which makes the left lung smaller than the right lung. Some humans have a condition called dextrocardia, meaning that the heart is on the right; as a consequence, all their organs are placed on the opposite side compared to most humans. By flipping the images horizontally, your system will never be able to detect this condition.

    There is a mention about measuring the clinical accuracy of the generated reports, however, such an evaluation is not present. The word “clinical” suggests that a physician has checked the prediction, or that the evaluation has been produced in a clinical setting. I would remove the word “clinical” and make clear that the evaluation is only against NLP-generated labels.

    There are several key findings missed in the examples shown. There should be a mention to this.

    While the idea of generating an associated attention map to each predicted word seems really useful, it looks weird that words like “a” or “of” have a strong associated activation map (Figure 3).

    The conclusions don’t match the paper content. There is no proof of the “generalization abilities for captioning natural images” of the model presented.

  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The lack of clarity about the input at test time and the many missing implementation details. If the authors used the original radiological report at test time, I don’t see the benefit of the system presented. In addition, they omit to comment on the clinical relevance of the errors.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    3

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    This paper proposes a CNN-RNN model that learns from chest radiographs and radiology reports and at inference time generates image captions from radiographs.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    A straightforward method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) I don’t see technical contributions in this paper. The authors discussed other prior work in image-to-text generation in the last paragraph of the Introduction section, but didn’t point out how this work differs from the others. To me, the proposed model is pretty much akin to TieNet. 2) The goal/motivation of this work is unclear to me. Are we looking for a good text generator or a good pathology classifier for chest radiographs? Which columns should we focus on in Table 1? 3) The formulation and notation in the Method section are incoherent and uninformative. I cannot even tell which variable denotes the input image.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    After reading this paper I don’t think I can reproduce the proposed method.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    1) Point out how this work differs from prior ones. 2) Be clear about the goal/motivation of this work. 3) Make the Method section a coherent flow instead of a patchwork of different technical components, which were not even originally proposed in this paper.

  • Please state your overall opinion of the paper

    probably reject (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    As specified in my comments above, I don’t think I learned anything after reading this paper.

  • What is the ranking of this paper in your review stack?

    4

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain



Review #3

  • Please describe the contribution of the paper

    This work describes an attention based network for automatically generating reports from x-ray images. Attention is used to automatically localize disease ROIs. The method is validated on a large public dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This is a paper that demonstrates a new strategy for performing a challenging task: automatic report generation from images. The primary strengths are that it is well written and the methods are well described. Additionally, the results show an improvement in the state-of-the-art, which the authors went through the trouble of (re-)implementing so that a head-to-head comparison could be made.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Although the authors list the contributions of this work, and also include a short summary of prior work, they don’t explicitly contrast the current work with prior work, so it is difficult to understand how it is better or novel. However, this is minimized by showing that results compared to prior methods are improved.

    One major oversight was the assertion that the human body is symmetric and the training strategy being designed off of that. It simply isn’t true, especially for the heart in an x-ray image. The authors need to address this somehow.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The use of a large open-source dataset is a good step in the direction of reproducibility. The authors state the code will be shared if published, but I didn’t see any mention of that in the paper. The methods are described well enough that this research should be reproducible, but it would likely require a lot of effort without the code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Starting a sentence with a [reference] is poor writing style. It is better to say “X et al. [ref]” than just “[ref]”. For example: “[25] introduced TieNet” → “TieNet was introduced in [25]”; “[14] later proposed …” → “… was later proposed in [14]”.

    Similarly, “We have re-implemented [23] as the baseline architecture” isn’t descriptive enough. Say, “we implemented the original transformer method presented in [23]”

    “Data augmentation include random left and right flipping only as the body is symmetrical”: the human body is NOT symmetrical. This would have important consequences for a cardiomegaly diagnosis, for example.

    Figure 3 is difficult to view, the images are tiny.

  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The clear communication, good description of methods, accurate validation with head-to-head results, and enlightening discussion make this a paper that many readers will be interested in investigating and learning more about.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    5

  • Reviewer confidence

    Somewhat confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Please see strengths and weaknesses of the paper summarized below. Please try your best to address the items under weaknesses and answer reviewer questions in your rebuttal.

    Strengths:

    • The system integrates in a novel way the image features extracted with the CNN into the pipeline of a transformer-based RNN, where each next token predicted has a corresponding response in the image feature maps. This makes the system able to provide a new word and an associated attention map to indicate which part of the image contributed to generate that word.

    Weaknesses:

    • Need clarity on the input at test time
    • The authors discussed other prior work in image-to-text generation in the last paragraph of the Introduction section, but didn’t point out how this work differs from the others. One reviewer commented that the proposed model is like TieNet. There is no direct comparison between this method and other related prior works.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    5




Author Feedback

We are pleased that R1 and R4 agree on the clarity and organisation of the paper, as well as its methodological contribution and performance gain upon existing state-of-the-art methods. We will address their main concerns as follows:

Input at test time: [AC, R1] – At inference time, only a CXR image and the <BOS> (Beginning-Of-Sentence) token are passed into the model. The model is then run in an autoregressive fashion, where each newly predicted token is concatenated with the input tokens to form the next input sequence, stopping only when an <EOS> (End-Of-Sentence) token is predicted or the maximum sequence (report) length of 128 tokens is reached (128 is the maximum length in training but could easily be changed to allow longer sequences). The entire radiological report is only used as an input during the training phase, as the learning objective is to predict the next token of the sequence. The benefit of the transformer architecture is that the model can learn the next-token prediction at every position of the sequence simultaneously. We will further clarify the autoregressive nature of the model in the final revision, as well as release all the code including the vocab dictionary upon acceptance [R1/R4]. The vocab size in this work is 17,734. d_k, d_v and d_model are 64, 64 and 512 respectively [R1].
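The decoding loop described above amounts to standard greedy autoregressive generation. The following is a minimal editor's sketch of that loop, not the released RATCHET code; `model`, `bos_id`, and `eos_id` are hypothetical placeholders for the trained network (returning next-token scores over the vocabulary) and its special-token IDs:

```python
def greedy_decode(model, image_features, bos_id, eos_id, max_len=128):
    """Greedy autoregressive decoding: condition on the image and the
    tokens generated so far, append the argmax token each step, and
    stop at <EOS> or when max_len tokens have been produced."""
    tokens = [bos_id]
    for _ in range(max_len - 1):
        # model returns a list/array of next-token scores over the vocab,
        # conditioned on the image features and the current token sequence
        logits = model(image_features, tokens)
        next_token = max(range(len(logits)), key=logits.__getitem__)
        if next_token == eos_id:
            break
        tokens.append(next_token)
    return tokens
```

Beam search or sampling could be substituted for the argmax step; the stopping conditions (EOS or length cap) remain the same.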

Direct comparison between this method and others: [AC, R3] – The ‘proposed model is pretty much akin to TieNet’ and ‘didn’t point out how this work differs from the others’ [R3] – this is untrue: we have cited numerous image captioning methods/techniques for CXR in the related works section, and have compared directly to ‘Show, Attend and Tell’ [24] and TieNet [25], as they fall under the same family of CNN-RNN architectures while being the most state-of-the-art (following a search of recent literature on Google Scholar, including arXiv). This has been acknowledged by [R1], stating “RATCHET produce better reports than other competitors”, and [R4], stating “the results show an improvement in the state-of-the-art, which the authors went through the trouble of (re-)implementing so that a head-to-head comparison could be made”. Furthermore, we have provided citations where comparable results are presented for similar CXR report generation experiments in the literature [1, 14]. We aim to show that results are not artificially inflated by special conditioning of the evaluation metrics, as the reports are evaluated in their entirety. The minute differences in scores are attributed to factors such as the tokenization of the language and/or the number of report examples, etc. It is also confusing that [R3] states “a straightforward method” as a strength and then states that the Method section is “incoherent and uninformative” and that they would not be able to “reproduce the proposed method” as a weakness.

‘The human body is asymmetric’ [R1/R4] – In our experiments, we only used AP/PA views from the MIMIC-CXR dataset, as specified by the provided metadata. We omitted the lateral images, as their image features are likely to differ drastically from those of AP/PA views (i.e., an AP/PA view is more symmetric than a lateral one, but not completely symmetric). To complement the complete clinical pathway, we can re-train the model with only AP (150,488 images) or PA (103,363 images) views alone.

‘clinical accuracy of the generated reports’ [R1] – The output of our method has been validated by a clinician, but not through a crafted evaluation setting involving multiple clinicians. By running Stanford’s NLP disease label extractor on the generated text and comparing the performance to CheXNet, we have shown that RATCHET is just as accurate, with the added bonus that the predicted outcome is in the form of a radiology report.
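The surrogate evaluation above reduces to comparing disease labels extracted from the generated reports against reference labels. A minimal editor's sketch of the comparison step, under the assumption that a CheXpert-style labeler has already been run separately to produce binary label dicts per report (function and field names are illustrative, not from the paper):

```python
def per_label_f1(pred_labels, true_labels):
    """Per-finding F1 between predicted and reference binary labels.

    pred_labels, true_labels: equal-length lists of dicts mapping
    each finding name to 0 or 1, one dict per report.
    """
    findings = true_labels[0].keys()
    scores = {}
    for f in findings:
        pairs = list(zip(pred_labels, true_labels))
        tp = sum(p[f] == 1 and t[f] == 1 for p, t in pairs)
        fp = sum(p[f] == 1 and t[f] == 0 for p, t in pairs)
        fn = sum(p[f] == 0 and t[f] == 1 for p, t in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[f] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores
```

The resulting per-finding F1 scores can then be compared against an image-only classifier such as CheXNet on the same test split.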

We hope we have addressed the main concerns. We believe that our work would be of interest for the attendees of MICCAI 2021 and beyond.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors’ rebuttal provided more details on the input at test time and other implementation details. Although TieNet may not be considered the current state-of-the-art, authors’ reply addressed the concern regarding the comparison to other methods reasonably well. Such discussions should be included in the revised manuscript. More discussions on the validation results by a clinician should also be included in the final version, and the usage of horizontal flipping augmentation is not appropriate as even AP/PA views are not symmetric. I recommend acceptance of the paper, but authors should try to address the remaining concerns in the final version if accepted.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    12



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I share the concerns brought up by the reviewers. One of the reviewers asked for clarification of the differences from TieNet, but in my opinion, the rebuttal does not adequately answer this point. It points to the related work section, but that section states that TieNet is a CNN-RNN architecture just like the RATCHET algorithm. The difference between the two is not explicitly stated as far as I can tell. The rebuttal also points to the comparison to TieNet in the experiments section, but this is unrelated to the question brought up by the reviewer. I also share the concerns about the motivation of the paper. Radiologists typically dictate their report as speech while analyzing the image, so dictating the report does not take much additional time. It is stated that a radiologist would take ultimate responsibility, which would require analyzing the image; therefore, I don’t really see the time savings that can be gained with this method. This would need to be studied in a clinical study involving radiologists preparing reports with and without the help of RATCHET.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    14



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper addresses an important problem in medical image analysis. Unfortunately, I agree with the reviewers that the reproducibility of this work is still a concern, as there are still many sections that require further clarification. However, I think that generating radiology reports is an important and challenging task that requires further investigation, and therefore the paper could be interesting for many other researchers in the MICCAI community.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    5


