
Authors

Ivona Najdenkoska, Xiantong Zhen, Marcel Worring, Ling Shao

Abstract

Automating report generation for medical imaging promises to reduce workload and assist diagnosis in clinical practice. Recent work has shown that deep learning models can successfully caption natural images. However, learning from medical data is challenging due to the diversity and uncertainty inherent in the reports written by different radiologists with discrepant expertise and experience. To tackle these challenges, we propose variational topic inference for automatic report generation. Specifically, we introduce a set of topics as latent variables to guide sentence generation by aligning image and language modalities in a latent space. The topics are inferred in a conditional variational inference framework, with each topic governing the generation of a sentence in the report. Further, we adopt a visual attention module that enables the model to attend to different locations in the image and generate more informative descriptions. We conduct extensive experiments on two benchmarks, namely Indiana U. Chest X-rays and MIMIC-CXR. The results demonstrate that our proposed variational topic inference method can generate novel reports rather than mere copies of reports used in training, while still achieving comparable performance to state-of-the-art methods in terms of standard language generation criteria.
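As a rough sketch of what a conditional variational inference framework of this kind usually optimizes, the standard evidence lower bound is written out below. It assumes x is the image, y a sentence of the report, and z the latent topic, with a variational posterior q_φ(z | y) as quoted in Review #1 below and an image-conditioned prior p_θ(z | x). The prior and the exact form of the objective are assumptions here; the paper's own formulation may differ.

```latex
% Generic conditional ELBO (illustrative; not taken verbatim from the paper)
\log p_\theta(y \mid x)
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid y)}\big[\log p_\theta(y \mid z, x)\big]
  \;-\;
  \mathrm{KL}\big(q_\phi(z \mid y)\,\|\,p_\theta(z \mid x)\big)

% The gap between the two sides is the KL divergence to the true posterior,
% so maximizing the bound over \phi minimizes that divergence:
\log p_\theta(y \mid x) - \mathrm{ELBO}
  \;=\;
  \mathrm{KL}\big(q_\phi(z \mid y)\,\|\,p_\theta(z \mid y, x)\big)
```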

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87199-4_59

SharedIt: https://rdcu.be/cyl4T

Link to the code repository

https://github.com/ivonajdenkoska/variational-xray-report-gen.git

Link to the dataset(s)

N/A

Reviews

Review #1

Please describe the contribution of the paper

The authors present a novel probabilistic model for automated report generation for chest X-ray images, cast as variational topic inference. In more detail, the architecture contains a transformer-based encoder of imaging information followed by an LSTM-based decoder of textual information. The encoder-decoder design is equipped (during training) with a textual signal reconstruction branch.

The method achieves convincing performance on two benchmark datasets under a range of metrics and qualitatively allows more diverse text generation.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

It is a well-written paper describing a novel methodology. A thorough comparison with existing methods is provided showcasing superiority of the proposed method over multiple evaluation criteria.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

I did not spot much except some minor typos mentioned below
Please rate the clarity and organization of this paper

Excellent
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The reproducibility checklist seems to correspond to the manuscript.

Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

“We define a variational approximate posterior q_φ(z) to approximate the intractable true posterior p_φ(z | y, x) by minimizing the KL divergence between them”. Here, it should be p_θ(z | y, x).

“During training, the samples are drawn from the variational posterior distribution z^(l) ∼ q_θ(z | y)”. Here, it should be q_φ(z | y).

“which yields a sequence of n word embeddings {e_[SENT], e_1, e_2, …, e_n}, where e_i ∈ R^{d_e}”. I guess e_[SENT] does not belong in these brackets, as it represents the whole-sentence token. Could you also specify what d_e is, for the sake of completeness?

Please state your overall opinion of the paper

strong accept (9)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

It is a paper that demonstrates a methodology reaching the SOTA performance over several metrics with a very clear and smooth narrative.
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

5
Reviewer confidence

Very confident

Review #2

Please describe the contribution of the paper

The authors develop a network that generates radiological interpretations of chest x-rays. The network explicitly models each sentence as a different topic, each with its own visual attention head with respect to the image. Using variational inference, the network can generate diverse interpretations of an image. The paper demonstrates that this method is at least competitive with state of the art deterministic approaches.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

Well written and well structured. Clear motivation and background, detailed methods and experiments, and great figures. Novel approach in modeling each sentence as a separate topic, each with its own attention head. Good experiments and selection of results showing that their approach grants the advantages of probabilistic modeling without compromising performance.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

I would strongly suggest adding future work. Only minor suggestions otherwise.
Please rate the clarity and organization of this paper

Excellent
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Public data, private code
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

This work does leave one wondering how the diversity and uncertainty captured by this model could realistically be used by clinicians or clinical researchers. I don’t believe generating several different reports is a very practical way to capture uncertainty - e.g., it’s hard to imagine that parsing 100 different generated reports is the best way to measure the probability of a given abnormality in a given CXR. This seems like a hard problem/limitation that could be interesting to tackle in future work.

Fig. 3 seems to show some nonsensical generated sentences like “Hyperexpanded low
Please state your overall opinion of the paper

strong accept (9)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Very solid writing and presentation, and enjoyable to read. Clear and important contributions backed up by good results. No glaring weaknesses.
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

5
Reviewer confidence

Somewhat confident

Review #3

Please describe the contribution of the paper

In this paper, the authors proposed a chest x-ray report generation model using variational topic inference. Transformer models are adopted for both visual and textual inputs for topic inference and alignment in the latent space. Each sentence in the generated report is conditioned on a topic vector inferred from the visual input. Experiments show promising results on two benchmarks compared to previous methods.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

1) The paper is well written and easy to follow. The presentation is clear. 2) The idea of using variational inference in report generation seems to be interesting. As illustrated in Fig. 3, using variational inference can generate various reports given the same image input. Using transformer models to infer topic vectors for both images and reports also seems to be relatively novel. 3) Comprehensive experiments are conducted, and the accuracy evaluation of generated reports using CheXpert labeler assesses the clinical usefulness of different models.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

1) The authors assume that each sentence is conditionally independent. In this case, how can the coherence of a generated report consisting of multiple sentences be guaranteed? In the variational topic inference, how are the inferred topic vectors connected to each other? 2) For the design of the model, an ablation study is lacking to justify the choice of components. For instance, what if one replaces the transformer in the visual topic inference branch with another network? Did the BioWordVec embedding help in the training process? 3) The authors mention that the proposed method can ‘tackle the uncertainty in the chest X-ray interpretation process’, as reports with different descriptions of the same area or object can be generated. How would the authors select the final generated report in this case? How can the quality of reports generated from different topics be assessed? The uncertainty in the generation should be analyzed further.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Based on the reproducibility checklist, the reproducibility of the paper seems to be satisfactory.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

1) Although showing promising results, the proposed VTI method does not indicate significant improvements over the CC-Transformer [20], which also leverages the transformer model in the generation process, on several evaluation metrics including clinical efficacy metrics. Could the authors elaborate more on the comparison with CC-Transformer in terms of the quality of generated reports? 2) The Grad-CAM heatmap is helpful. As attention is also utilized in the generation of words, showing attention maps during generation could be valuable as well. 3) More ablation studies can be provided. In-depth discussion and analysis regarding the uncertainty in the generated report are clearly lacking, which could have been a strength of the paper.
Please state your overall opinion of the paper

Probably accept (7)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This paper presents an interesting idea of using variational inference in report generation and shows promising experimental results. Yet I believe a more in-depth analysis of the uncertainty in the generated report can be added.
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

1
Reviewer confidence

Confident but not absolutely certain

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.
In this paper, the authors proposed a chest x-ray report generation model using variational topic inference. All three reviewers recommended accept for this paper. The paper is well written. The idea is novel. Experimental evaluation is comprehensive.

Comments on further improvements:
- A more in-depth analysis of the uncertainty in the generated report can be added
- Clarify or discuss in future work how the coherence of the generated report consisting of multiple sentences can be guaranteed, and how to connect each inferred topic vector in the variational topic inference.
- Fix the typos or mistakes pointed out by Reviewer #1, and address other constructive comments by all reviewers.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

1

Author Feedback

We thank all reviewers and the AC for their acknowledgement and constructive comments. We appreciate the positive comments such as “well-written paper describing a novel methodology” and “very solid writing and presentation, enjoyable to read.”

Next, we clarify the points raised in common by the reviewers.

Diversity and uncertainty (R3, R4). We thank the reviewers for the insightful comments. Our probabilistic modeling is motivated by the observation that sentences in a report can be diverse in terms of sentence structures and styles because they are written by doctors with different experience and expertise. By framing chest X-ray report generation as a variational inference problem, our model manages to capture this uncertainty to better handle the diversity in the training data. Additionally, our probabilistic model has the innate ability to avoid overfitting to training data, therefore potentially offering better generalization performance of report generation. We believe this is crucial when there is a significant distribution shift from training data to test data. The diverse reports shown in Fig. 3 demonstrate that our method can provide different variants of reports with different Monte Carlo samples. We can see that different variants describe similar topics but with different sentence structures. As to the usage by clinicians and clinical researchers, our method can produce a single best report by combining the most probable sentences according to the Bayesian averaging principle. We will add these discussions about diversity and uncertainty to our final version, and we would like to explore this further in our future work.
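To make the Monte Carlo sampling and averaging described above more concrete, here is a minimal sketch. It assumes the latent topic follows a diagonal Gaussian and that each sample yields a probability for every candidate sentence; both are assumptions for illustration. All function names (`sample_topic`, `monte_carlo_topics`, `average_scores`, `pick_sentence`) are hypothetical and are not taken from the authors' code, and the averaging step is only one possible reading of the "Bayesian average" mentioned in the response.

```python
import math
import random


def sample_topic(mu, log_var, rng):
    """Draw one latent topic z = mu + sigma * eps, with eps ~ N(0, I).

    Assumes a diagonal Gaussian parameterized by mean and log-variance.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def monte_carlo_topics(mu, log_var, num_samples, seed=0):
    """Draw several topic samples from the same distribution."""
    rng = random.Random(seed)
    return [sample_topic(mu, log_var, rng) for _ in range(num_samples)]


def average_scores(per_sample_scores):
    """Average candidate-sentence probabilities over Monte Carlo samples.

    per_sample_scores is a list of dicts (one per sample) that all share
    the same candidate sentences as keys. This is a hypothetical input format.
    """
    n = len(per_sample_scores)
    return {sentence: sum(s[sentence] for s in per_sample_scores) / n
            for sentence in per_sample_scores[0]}


def pick_sentence(per_sample_scores):
    """Return the candidate with the highest averaged probability."""
    averaged = average_scores(per_sample_scores)
    return max(averaged, key=averaged.get)
```

In this reading, each sample z would be decoded into a report variant, and the single report shown to a clinician would be assembled from the sentences that score highest after averaging.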

Attention maps (R3, R4). The suggestion to visualize attention maps in addition to Grad-CAM is appreciated, and these visualizations will be included in the final version. In particular, we can visualize the attention maps obtained when generating each word in the sentence.

In addition to the above, we would like to address separate comments of the reviewers:

Review #1 Thank you for the helpful comments about the typos; we will fix them in the final version.

Review #3 Nonsensical sentences. This is likely due to the relatively weaker LSTM decoder, which sometimes has difficulty handling long-term dependencies in sequences and produces nonsensical sentences in a few cases. We expect this to be resolved by using a more powerful decoder such as a Transformer in future work, which we will discuss in our paper.

Clinical efficacy metrics. These additional metrics actually serve to evaluate how clinically accurate the generated reports are, by comparing the extracted CheXpert labels between the generated report and the ground-truth reports. We appreciate this suggestion and we will add the explanation in the final paper.
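The comparison described above can be sketched as follows. The snippet assumes that labels have already been extracted from each generated and ground-truth report and simplified to binary values (1 = finding mentioned as present, 0 = otherwise). The real CheXpert labeler also distinguishes uncertain and unmentioned findings, which this sketch ignores. The function name `micro_prf` and the dictionary input format are hypothetical, and micro-averaging is one common choice rather than necessarily the one used in the paper.

```python
def micro_prf(generated_labels, reference_labels):
    """Micro-averaged precision, recall and F1 over binary finding labels.

    Each argument is a list of dicts (one per report) mapping a finding
    name to 0 or 1. Findings missing from a generated dict count as 0.
    """
    tp = fp = fn = 0
    for gen, ref in zip(generated_labels, reference_labels):
        for finding, ref_value in ref.items():
            gen_value = gen.get(finding, 0)
            if gen_value == 1 and ref_value == 1:
                tp += 1  # finding correctly reported
            elif gen_value == 1 and ref_value == 0:
                fp += 1  # finding reported but absent in the reference
            elif gen_value == 0 and ref_value == 1:
                fn += 1  # finding missed by the generated report
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example with made-up labels for a single report pair.
generated = [{"Cardiomegaly": 1, "Edema": 0, "Pneumothorax": 1}]
reference = [{"Cardiomegaly": 1, "Edema": 1, "Pneumothorax": 0}]
print(micro_prf(generated, reference))  # (0.5, 0.5, 0.5)
```

Because the score depends only on which findings are asserted, two reports with very different wording receive the same clinical efficacy score if they agree on the findings.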

Review #4

Coherence. The sentences are conditionally independent while they are governed by topic vectors that depend on the same image. In this sense, the generated sentences are coherent. We will add this discussion in the paper. Regarding the question about connecting the inferred topics, we would like to mention that each head in the multi-head attention in the Transformer encoder corresponds to a separate latent topic. Following such an approach helps in connecting the inferred topics since each head depends on the whole input with a focus on a certain part.

Ablations. In our preliminary experiments, we observed the benefits of using Transformers and BioWordVec in improving the overall performance and we will add those results in the paper.

Comparison to CC-Transformer in terms of the quality of generated sentences. Thank you for this great suggestion. The CC-Transformer is a deterministic method that produces comparable performance with our model as shown in the paper. Following your suggestion, we will elaborate more on the comparison with more metrics, for instance, the lengths of generated sentences and the semantic closeness to the ground truth.


Variational Topic Inference for Chest X-Ray Report Generation