
Authors

Di You, Fenglin Liu, Shen Ge, Xiaoxia Xie, Jing Zhang, Xian Wu

Abstract

Recently, medical report generation, which aims to automatically generate a long and coherent descriptive paragraph for a given medical image, has received growing research interest. Compared with general image captioning tasks, medical report generation is more challenging for data-driven neural models, mainly due to 1) the serious data bias: normal visual regions dominate the dataset over abnormal visual regions, and 2) the very long target sequences. To alleviate the above two problems, we propose an AlignTransformer framework, which includes the Align Hierarchical Attention (AHA) and Multi-Grained Transformer (MGT) modules: 1) the AHA module first predicts the disease tags from the input image and then learns multi-grained visual features by hierarchically aligning the visual regions and disease tags. The acquired disease-grounded visual features can better represent the abnormal regions of the input image, which alleviates the data bias problem; 2) the MGT module effectively uses the multi-grained features and the Transformer framework to generate long medical reports. Experiments on the public IU-Xray and MIMIC-CXR datasets show that AlignTransformer can achieve results competitive with state-of-the-art methods on both datasets. Moreover, a human evaluation conducted by professional radiologists further proves the effectiveness of our approach.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87199-4_7

SharedIt: https://rdcu.be/cyl3J

Link to the code repository

N/A

Link to the dataset(s)

IU-Xray: https://openi.nlm.nih.gov/

MIMIC-CXR: https://physionet.org/content/mimic-cxr/2.0.0/


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents an approach to generate radiological reports for a given radiological image. The authors used two previously developed models, one to extract visual features and another to extract the disease tags associated with an input image. These two sets of features are the input for the work presented here, which first has a module (AHA) that “aligns” the two sets of features, and a second module (MGT) that reconstructs the report. The approach is evaluated on two publicly available databases, showing higher performance in report generation than previous models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Of the two main components presented in the paper, the AHA and MGT modules, the AHA seems to be just an iterative process that uses the previously developed Multi-Head Attention module. The MGT module seems to be the only novel module presented here. The evaluation seems fair, with small increments over the state of the art, and the manual evaluation by radiologists on 200 generated reports is a plus.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There are missing details that make it difficult to evaluate whether the paper is strong or not. The authors specified that each database (2 in total) was split into train/val/test, and it is the reader who must infer that only the train sets were used for training the proposed model; however, this is not clear. In addition, they relied on 2 pre-trained models to extract visually relevant features and associated tags, and it is not clear whether the data from the databases used in this work was also used in the pre-training. Which data was used, how it was used, and where it was used should be made clear.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The 2 databases used for evaluation are publicly available, and the authors provided the code of their experiments. However, the pre-trained models used to extract the “visual” and “tag” features are not available. Several implementation details are missing, but with the code released this should be possible to overcome.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    The paper reads well and the general idea can be easily followed; however, many essential details are missing or unclear. Following my previous comments, the data used to train the different components is not clear. Moreover, it is not clear how the written reports were translated into a set of embeddings (word embedding and position embedding).

    The increment obtained in the different performance metrics is quite small with respect to the state of the art. The standard deviation or variance should be added to enable comparison against other approaches. Moreover, how the results for the other models were obtained should be explained. The results are exactly the same as the ones reported in [8]; if they were extracted from there, this should be specified.

    Regarding the manual evaluation, the results do not correspond with the method described. From the explanation, I understood that, for a set of 200 images, the associated report was generated with the proposed approach and with the best state-of-the-art approach, and the radiologists then had to choose which report was best. From this, one would expect the “hit” column to sum to 100% for each DB, but it does not.

  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The fact that several key elements are unclear makes it difficult to decide whether this is a strong paper or not. The model seems novel but reuses several pre-existing modules. In addition, the fact that it is not clear which data is used in every part is critical.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    3

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper
    1. This paper proposes the Align Hierarchical Attention Multi-Grained Transformer framework to overcome the data bias and very long sequence challenges in the medical report generation task.
    2. The experimental results are sufficient and well demonstrate the proposed method’s effectiveness.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed framework AHA-MGT seems to resolve the stated challenges to a certain extent.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The main weakness lies in the novelty of the proposed framework, which simply combines the multi-head attention mechanism and the Transformer architecture, both of which already exist and are popularly used in the image captioning domain.
    2. The long sequence challenge is already addressed in existing image paragraph generation works [1] and storytelling works [2].
      • “A Hierarchical Approach for Generating Descriptive Image Paragraphs”
      • “Visual Storytelling”
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Since this paper provides the original code, the reproducibility of the paper is fine.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    1. I suggest that the paper review more image paragraph generation works and make a full comparison with them.
  • Please state your overall opinion of the paper

    borderline reject (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The limited novelty of the proposed framework.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    3

  • Reviewer confidence

    Confident but not absolutely certain



Review #3

  • Please describe the contribution of the paper

    This paper proposes an Align Hierarchical Attention-Multi-Grained Transformer framework (AHA-MGT), in which the Align Hierarchical Attention (AHA) module first predicts the disease tags from the input image and then learns multi-grained visual features by hierarchically aligning the visual regions and disease tags. The experiments on the public IU-Xray and MIMIC-CXR datasets show that the proposed AHA-MGT framework can achieve state-of-the-art results on the two datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The idea of Align Hierarchical Attention, which performs the alignment between visual regions and disease tags, is novel. The results on both the IU-Xray and MIMIC-CXR datasets show that the model achieves significant performance.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    N/A

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors submitted the code, and the code is clean and clear.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    N/A

  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The idea is nice, and the results are significant.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Somewhat confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Please see strengths and weaknesses of the paper summarized below. Please try your best to address the items under weaknesses and answer reviewer questions in your rebuttal.

    Strengths:

    • The two proposed modules, AHA, which “aligns” the two sets of features, and MGT, which reconstructs the report, are interesting and appear to be novel.

    Weaknesses:

    • Some details are missing. Two pre-trained models are used to extract visually relevant features and associated tags, but it is not clear whether the data from the databases was used for fine-tuning. Which data was used, how it was used, and where it was used should be made clearer.
    • The novelty needs further clarification, and the authors should respond to Reviewer #2’s comments about existing related works.
  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    4




Author Feedback

We sincerely thank all the reviewers and chairs for their time, effort, and valuable comments.

Response to Reviewer #1: Thanks!

Q1: Pre-trained models to extract visually relevant features and associated tags? A1: As stated in the first paragraph of page 4, the pre-trained models are built in the following three steps: 1) We adopt a ResNet-50 backbone pre-trained on ImageNet and further pre-trained on the ChestX-ray8 dataset, which consists of 108,948 X-ray images, each labeled with 14 common radiographic observations. Existing works like [21][15][38][20] also use ChestX-ray8/ChestX-ray14/CheXpert to pre-train their CNN encoders. ChestX-ray8 is independent of the two datasets, IU-Xray and MIMIC-CXR, used for evaluation in this paper. 2) For both the IU-Xray and MIMIC-CXR datasets, each image has been labeled with multiple tags. Therefore, we follow [16][38] to add a multi-label classification (MLC) network to the ResNet-50 and fine-tune it on the training set of each dataset. 3) After training the ResNet-50 + MLC for each dataset, we use the output of the MLC to extract the associated tags and use the last average pooling layer of the ResNet-50 to extract 2,048 7x7 visual feature maps, which are further projected into 512 7x7 visual feature maps.

It is worth noting that we only use the training set to build the pre-trained model. During validation and testing, we directly input the test images into the ResNet-50 + MLC to extract the visual features and associated tags.
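[Editor's note] For concreteness, the following is a minimal PyTorch sketch of the extractor described above, assuming a torchvision ResNet-50, a single linear MLC head, and a 1x1 convolution for the 2,048-to-512 projection; the class and layer names are illustrative, and the authors' exact architecture (e.g., the ChestX-ray8 pre-training stage and MLC design) may differ.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet50

    class TagAndFeatureExtractor(nn.Module):
        """Illustrative ResNet-50 + MLC extractor (names are hypothetical)."""
        def __init__(self, num_tags: int, d_model: int = 512):
            super().__init__()
            # Stand-in for the paper's backbone, which is further pre-trained on ChestX-ray8.
            backbone = resnet50(weights="IMAGENET1K_V1")
            self.trunk = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool and fc
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.mlc = nn.Linear(2048, num_tags)                 # multi-label classification head
            self.proj = nn.Conv2d(2048, d_model, kernel_size=1)  # 2048x7x7 -> 512x7x7

        def forward(self, images: torch.Tensor):
            # images: (B, 3, 224, 224) -> feature maps: (B, 2048, 7, 7)
            fmap = self.trunk(images)
            tag_logits = self.mlc(self.pool(fmap).flatten(1))      # (B, num_tags); sigmoid for MLC training
            regions = self.proj(fmap).flatten(2).transpose(1, 2)   # (B, 49, 512) region features
            return regions, tag_logits

At inference time, the top-scoring tags from tag_logits would be embedded and passed, together with the region features, to the AHA module.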

Q2: Word embedding, position embedding? A2: We adopt randomly initialized word embeddings of dimension d and use sine and cosine functions of different frequencies as fixed positional embeddings of dimension d, as in [32].
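[Editor's note] This is the standard construction from [32]; a minimal PyTorch sketch follows, assuming d = 512 and a maximum report length of 100 tokens (both values are illustrative, not taken from the paper).

    import math
    import torch
    import torch.nn as nn

    def sinusoidal_positions(max_len: int, d: int) -> torch.Tensor:
        """Fixed sine/cosine positional embeddings, as in the Transformer [32]."""
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
        div = torch.exp(torch.arange(0, d, 2, dtype=torch.float) * (-math.log(10000.0) / d))
        pe = torch.zeros(max_len, d)
        pe[:, 0::2] = torch.sin(pos * div)  # even dimensions
        pe[:, 1::2] = torch.cos(pos * div)  # odd dimensions
        return pe

    class ReportEmbedding(nn.Module):
        """Randomly initialized word embedding plus fixed positional embedding."""
        def __init__(self, vocab_size: int, d: int = 512, max_len: int = 100):
            super().__init__()
            self.tok = nn.Embedding(vocab_size, d)
            self.register_buffer("pe", sinusoidal_positions(max_len, d))

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
            # token_ids: (B, T) -> embeddings: (B, T, d)
            return self.tok(token_ids) + self.pe[: token_ids.size(1)]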

Q3: Human evaluation. A3: The reported results in Table 1 of our paper indicate the winning percentage; the difference between 100% and the sum of the two “hit” values therefore represents the percentage of ties (e.g., if our model wins 55% of the comparisons and the baseline wins 30%, the remaining 15% are ties).


Response to Reviewer #2: Thanks!

Q1: Multi-head attention method and Transformer method. A1: In our work, we focus on extracting disease-grounded visual features to alleviate the data bias problem and thereby boost the performance of the medical report generation task. To this end, in the proposed model, we use Multi-Head Attention (MHA) as a basic component to align the visual regions with the disease tags and acquire disease-grounded visual features. Therefore, the novelties of the proposed model are: (1) To alleviate the data bias problem, we conduct the alignments between visual regions and disease tags with MHA in an iterative manner and show that increasing the depth improves the results, achieving the best performance (Table 2). (2) We propose a multi-grained Transformer to combine the disease-grounded visual features from different depths.

In brief, although the basic component is adapted from existing work, the objective and motivation of our work are unique and novel. Thus, using the multi-head attention method in our work is non-trivial, and it is a suitable means to our goal.
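[Editor's note] To make the iterative alignment concrete, below is a minimal PyTorch sketch of the idea, assuming d = 512, 8 heads, and a depth of 3; the exact update rule (choice of queries, residual connections, normalization) and the way the MGT decoder consumes the multi-grained features are simplified assumptions, not the paper's verbatim architecture.

    import torch
    import torch.nn as nn

    class AlignHierarchicalAttention(nn.Module):
        """Illustrative iterative visual-tag alignment with stacked MHA layers."""
        def __init__(self, d: int = 512, heads: int = 8, depth: int = 3):
            super().__init__()
            self.layers = nn.ModuleList(
                nn.MultiheadAttention(d, heads, batch_first=True) for _ in range(depth)
            )

        def forward(self, regions: torch.Tensor, tags: torch.Tensor):
            # regions: (B, 49, d) visual region features; tags: (B, num_tags, d) tag embeddings
            feats, grained = regions, []
            for mha in self.layers:
                # Tags attend over the current features, yielding disease-grounded features.
                feats, _ = mha(query=tags, key=feats, value=feats)
                grained.append(feats)  # one grain per alignment depth
            return grained

    # An MGT-style decoder would then cross-attend to each grain in `grained`
    # (e.g., one cross-attention sub-layer per grain) while generating the report.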

Q2: Long sequence generation. A2: Regarding the long sequence challenge in existing image paragraph generation and storytelling tasks, the task in this paper is significantly different from them. As stated in the related works, in paragraph generation for medical images the correctness of generated abnormalities should be emphasized more than normalities, while in paragraphs about natural images each sentence has equal importance. As a result, due to the data bias problem in the medical domain, the models widely used in image paragraph generation do not perform very well in medical report generation and tend to generate plausible general reports with no prominent abnormal narratives [15][21][36]; e.g., they generate repeated sentences describing normalities and fail to depict rare but important abnormalities.


Response to Reviewer #3: Thank you for the positive comments!




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    In the rebuttal, the authors provided more pretraining and implementation details, and commented on the novelty concern raised by reviewers. I think the rebuttal is reasonably convincing, and I recommend acceptance of the paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    3



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The reviewers raised several concerns regarding the clarity of the subcomponents in the initial submission. The rebuttal has addressed most of those concerns. The authors are strongly encouraged to address them in the final paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    4



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    For medical imaging report generation, there is no way to bypass extracting the correct disease tags, learning to classify these critical tags correctly, and ideally returning their locations for follow-up verification. It is like generating a structured report first, with all facts checked. This paper does not follow this approach, and it is too complex to learn “sentences” that can hardly be fact-verified.

    All experimental results reported in Table 1 and Table 2 use BLEU scores, and these BLEU scores provide only a very weak quantitative assessment of how accurate/precise the image diagnosis actually is and at which level it is obtained. BLEU is a biased metric with respect to the most important aspect of imaging diagnosis: getting the correct diagnosis in the first place, rather than producing natural-sounding reports.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    19


