
Authors

Bo Liu, Li-Ming Zhan, Xiao-Ming Wu

Abstract

One of the primary challenges facing medical visual question answering (Med-VQA) is the lack of large-scale well-annotated datasets for training. To overcome this challenge, this paper proposes a two-stage pre-training framework by learning transferable feature representations of radiology images and distilling a lightweight visual feature extractor for Med-VQA. Specifically, we leverage large amounts of unlabeled radiology images to train three teacher models for the body regions of brain, chest, and abdomen respectively via contrastive learning. Then, we distill the teacher models to a lightweight student model that can be used as a universal visual feature extractor for any Med-VQA system. The lightweight feature extractor can be readily fine-tuned on the training radiology images of any Med-VQA dataset, saving the annotation effort while preventing overfitting to small-scale training data. The effectiveness and advantages of the pre-trained model are demonstrated by extensive experiments with state-of-the-art Med-VQA methods on existing benchmarks. The source code and the pre-training dataset can be downloaded from https://github.com/awenbocc/cprd.
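
Below is a minimal sketch of the first stage (per-region contrastive pre-training of a teacher model), assuming a SimCLR-style InfoNCE objective for brevity; the paper itself builds on MoCoV2, whose momentum encoder and negative queue are omitted here, and the encoder choice and hyper-parameters are illustrative rather than the authors' actual configuration.

```python
import torch
import torch.nn.functional as F
from torchvision import models

def info_nce(z1, z2, temperature=0.07):
    """InfoNCE loss over two augmented views of the same image batch."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature      # (B, B) cosine-similarity matrix
    targets = torch.arange(z1.size(0))      # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

# One teacher per body region (brain, chest, abdomen), each trained only on that
# region's unlabeled radiology images.
teacher = models.resnet18(num_classes=128)  # final fc layer doubles as a projection head

# Two random augmentations (crop, flip, color jitter, ...) of the same image batch.
view1 = torch.randn(32, 3, 224, 224)
view2 = torch.randn(32, 3, 224, 224)
loss = info_nce(teacher(view1), teacher(view2))
loss.backward()
```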

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87196-3_20

SharedIt: https://rdcu.be/cyl1K

Link to the code repository

https://github.com/awenbocc/cprd

Link to the dataset(s)

https://github.com/awenbocc/cprd


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a novel pre-training method that uses easily acquired, unannotated radiology images to pre-train a lightweight visual feature extractor from multiple feature extractors trained on larger datasets. To retain the structural information of the larger models, the KLD loss is combined with a contrastive loss. This lightweight pre-trained model can adapt to smaller datasets to perform the VQA task without over-fitting; this claim is justified by experiments on two Med-VQA benchmarks (VQA-RAD and SLAKE).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • the paper is very clearly written.
    • there are comparisons with state-of-the-art work
    • the strategy of using knowledge distillation for pretraining the visual feature extractor in medical domain is novel
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • What are the datasets used to train the teacher models? Are they publicly available?
    • How are the images preprocessed when training the teacher visual models? Are there any data-augmentation transformations for contrast, brightness, and saturation, together with normalization, so that the input is uniform across multiple datasets?
    • How are the values of the temperature parameter (T) and number of images (K) decided while training the student model? Are there any ablation studies for deciding these parameters?
    • Why is an LSTM used as the text encoder, and why are newer encoders like BERT/ClinicalBERT not considered?
    • In Tables 1 and 2, for the results in the Open-Ended (Open) column, where the output text is free-flowing, what was the structure of the network used?
    • Equation 7 looks like it is used for predicting “Close Ended” (Close) answers; what is the loss function for predicting the answers to “Open Ended” (Open) questions? Is it still Equation 7? If yes, how?
    • For open-ended questions, is accuracy a correct evaluation measure? Could other scores such as BLEU, ROUGE, CIDEr, etc. be calculated?
    • From figure 1c, what is the definition/equation of Lvqa? Is it Equation 7?
    • In my opinion, to correctly evaluate the efficacy of the representations from the student model, one more experiment, such as image-to-text/text-to-image retrieval involving both text and image representations, should have been performed and the recall results reported.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • the pre-training datasets used for teacher training are not mentioned.
    • there is a lack of clarity on how the loss function is modelled for free-flowing text. Apart from the above two points, the paper is mostly reproducible.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • If possible, please provide the datasets used for teacher training.
    • report the results using state-of-the-art text encoders
    • mention the augmentation and normalization techniques, as this will help reproducibility
    • multimodal experiments other than classification are needed to demonstrate the efficacy of the representations
  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • the approach is novel
    • classification accuracy is much better than related work
    • accuracy alone is not sufficient, since the answer distribution (data bias) of the closed-ended questions is not reported; more metrics such as precision and recall, and additional tasks such as image-to-text and text-to-image retrieval, are needed.
  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain



Review #2

  • Please describe the contribution of the paper

    This paper addresses a medical visual question answering (Med-VQA) task using self-supervised learning coupled with a knowledge distillation framework. The proposed method was evaluated using 2 public datasets. The results show that the proposed method had higher accuracies compared to other existing self-supervised learning methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The idea of incorporating contrastive learning into a knowledge distillation framework was interesting.
    2. The collection of different medical imaging modalities including X-ray, computed tomography (CT) and magnetic resonance imaging (MRI) was good.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The authors’ idea of collecting images from only 3 human body regions (brain, chest, and abdomen) was not clearly justified. I agree that many existing studies focus on these 3 regions, but this does not mean the proposed method will be beneficial for all radiology images.
    2. The authors collected different types of medical images, including X-ray, CT, and MRI. However, these modalities have unique characteristics that can affect the learning process or the subsequent Med-VQA task. Did the authors consider this?
    3. Some of the Method/experiment setup descriptions are not clear or are missing important details.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors provided most of the implementation details of their proposed method.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    1. The authors did not properly define the acronyms throughout the paper. Acronyms need to be defined when they are mentioned for the first time. For example, what is MEVF? 
    2. The technical motivation for incorporating contrastive learning into a knowledge distillation framework was not clearly explained. It would be great to include how the proposed method differs from existing knowledge distillation methods.
    3. The performance of the proposed method seems to depend heavily on the parameter α in Equation 6. What value did you use in your experiments, and how does the use of different values affect the results?
    4. The description of how the 2 public datasets were split into training, validation, and testing sets is missing.
    5. The descriptions of other VQA methods such as SAN and BAN are missing. What are they? Why does your proposed method perform better?
    6. There are a few typos and grammatical mistakes throughout the paper. 
    
  • Please state your overall opinion of the paper

    borderline reject (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed idea was interesting, but the benefits and justification of the idea were presented poorly. Some descriptions of the proposed method are missing. There was no clear discussion of why the proposed method performed better than other state-of-the-art self-supervised learning methods.

  • What is the ranking of this paper in your review stack?

    3

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #3

  • Please describe the contribution of the paper

    This paper presents a two-stage pre-training framework to tackle the challenge of data scarcity in the Med-VQA domain. More specifically, large-scale unlabeled data is used for pretraining for each domain (brain, chest, and abdomen). The three pretrained networks are then used to distill the final network for feature extraction.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This work is well-motivated. Leveraging large-scale unlabeled datasets is a hot and important topic in medical image analysis.
    • This work is well-written.
    • The experiments demonstrate the effectiveness of the proposed method.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Unfair comparison with the existing methods
    • The unlabeled setting is fundamentally flawed
    • Lack of novelty
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    As the authors are going to release the code to the public, the work is reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • While the proposed method uses large-scale domain-specific datasets to pretrain the network, the baseline methods use only ImageNet-pretrained weights.

    • While the visualization in the experiments clearly demonstrates that the network can distinguish different human body parts, I am concerned that this is because the authors used labels to train the network. According to Eq. 5, the authors know the label of each sub-dataset and used it to distill the final ensemble network. Why not use a pretrained ResNet, fine-tune it on this classification task, and use the representation for VQA?

    • This work is essentially a combination of existing methods, which lacks novelty.

  • Please state your overall opinion of the paper

    borderline reject (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Unfair comparison with the existing methods
    • The unlabeled setting is fundamentally flawed
    • Lack of novelty
  • What is the ranking of this paper in your review stack?

    4

  • Number of papers in your stack

    8

  • Reviewer confidence

    Very confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Summary: A novel pre-training method that uses easily acquired, unannotated radiology images to pre-train a lightweight visual feature extractor from multiple feature extractors trained on larger datasets. Structural information is retained from the larger models by combining the KLD loss with a contrastive loss. Experiments on two public datasets show that this lightweight pre-trained model can adapt to smaller datasets to perform VQA tasks without over-fitting.

    Positives:

    • Clearly written, but with some mistakes.
    • The strategy of using knowledge distillation to pretrain the visual feature extractor in the medical domain is novel, with an interesting idea of incorporating contrastive learning into a knowledge distillation framework.
    • Comparisons against state-of-the-art work, using a variety of image modalities, demonstrate the effectiveness of the proposed method.
    • Code to be made available.

    Negatives:

    • Some important details are missing. These could be added at a rebuttal stage.
    • Might need further experiments, such as assessing on a wider variety of organs - although not for a rebuttal.
    • Unfair comparison because baseline methods only used ImageNet training, whereas the proposed method was pre-trained in a domain-specific way. Try to justify why this should be a fair comparison.
    • May lack novelty because it is a combination of existing methods. Explain whether the combination of existing methods is integrated in a way that is significantly novel.
  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    6




Author Feedback

We thank all the reviewers and clarify some confusion/misunderstanding below.

  • Fair Comparison (R4)

Our experiments (Tables 1 & 2) are intended to compare our pretraining and distillation method with MEVF ([17], MICCAI 2019), which is the only baseline that uses a small model and pretrains with medical images, and with other baselines (BAN fw., etc.) that use large models (e.g., ResNet) and pretrain on ImageNet.

The results clearly demonstrate the superiority of our method over MEVF in both accuracy and model size, and show that the other baselines severely overfit, indicating the necessity of pretraining with domain knowledge and representation distillation, which is the point of our paper.

– Important details

Due to space limitations, we cannot explain every detail in the paper. We provide more information below. We ensure that all experiments are reproducible and will release all code and datasets (collected from public resources, e.g., medicaldecathlon.com) after acceptance.

[R1] Preprocessing: We use the same transformation group as in MoCoV2 [6].
[R1] Open-ended questions: In current research, both open- and closed-ended questions are formulated as classification tasks with the same loss as in Eq. 7. The answers cannot be generated; they must appear in the training set.
[R1] Evaluation metric: We follow current practice and use classification accuracy as the evaluation metric, but we agree that more reasonable metrics should be devised/used.
[R1] BERT as text encoder: Large models severely overfit on the small-scale training data.
[R1] Lvqa is the loss in Eq. 7.
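
As a concrete illustration of the classification formulation described in the [R1] response above, the following is a minimal sketch assuming a standard cross-entropy objective over a fixed answer vocabulary (consistent with how Eq. 7 is described); the layer sizes and example answers are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

# Every distinct answer string in the training set becomes one class, so open-
# and closed-ended questions share the same classification head and loss.
answer_vocab = {"yes": 0, "no": 1, "left lung": 2, "pleural effusion": 3}  # illustrative

classifier = nn.Linear(1024, len(answer_vocab))      # fused image-question feature -> answer logits
criterion = nn.CrossEntropyLoss()

fused = torch.randn(16, 1024)                        # multimodal features from the reasoning module (e.g., BAN/SAN)
target = torch.randint(0, len(answer_vocab), (16,))  # ground-truth answer indices
loss = criterion(classifier(fused), target)          # same objective for "open" and "close" questions
```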

[R1, R2] Parameter analysis: All hyper-parameters are chosen by cross-validation based on observing the losses in Eqs. 2 & 6. We wanted to include these studies but could not due to limited space.

[R2] Body regions: Current Med-VQA datasets mainly contain radiology images of these three body regions, and hence our method is beneficial for most Med-VQA tasks.
[R2] Acronyms: Thank you. We will define them properly in the final version.
[R2] Data split: As mentioned in Sec. 5.1, we follow the settings of previous papers. VQA-RAD: 85% training, 15% test. SLAKE: 70% training, 15% validation, 15% test.
[R2] α in Eq. (6) is 0.9.
[R2] SAN and BAN are two different reasoning modules for VQA. BAN fw. and SAN fw. are two VQA models with the corresponding reasoning modules, ResNet as the visual extractor, and an LSTM as the text encoder; they often overfit on Med-VQA datasets.
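
Since Eq. 6 itself is not reproduced on this page, the following is only a hedged sketch of an α-weighted objective consistent with the descriptions above and with Review #1 (a KL-divergence distillation term combined with a contrastive term, α = 0.9); the exact terms, the temperature value, and the function name are assumptions.

```python
import torch.nn.functional as F

def combined_loss(student_logits, teacher_logits, contrastive_term, alpha=0.9, T=4.0):
    """alpha-weighted mix of a soft-label KLD term and a contrastive term (assumed form of Eq. 6)."""
    kld = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # standard temperature scaling used in distillation
    return alpha * kld + (1.0 - alpha) * contrastive_term
```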

  • Novelty (R4)

Our contrastive pre-training and representation distillation (CPRD) method is proposed to address two critical issues in Med-VQA: the lack of training images and the diversity of images in terms of different modalities and organs. Existing self-supervised methods cannot be straightforwardly applied to address these problems. Our CPRD is designed based on the statistics of the Med-VQA domain: the datasets mainly consist of radiology images of three body regions. For each region, we collect publicly available unlabeled data and use contrastive learning to learn intra-region prior knowledge. Then, we use contrastive learning again to distill intra-region knowledge and learn inter-region knowledge, which yields a lightweight model.
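
To make the second stage concrete, here is a hedged sketch that routes each unlabeled image to the teacher of its body region and aligns the student embedding with the matching teacher embedding via an InfoNCE-style objective; this is an assumed simplification for illustration, not the authors' exact Eqs. 5 and 6, and the function and variable names are not from their code.

```python
import torch
import torch.nn.functional as F

def contrastive_distill(student_z, teacher_z, temperature=0.07):
    """Student embedding i is pulled toward teacher embedding i and pushed away from the rest."""
    s = F.normalize(student_z, dim=1)
    t = F.normalize(teacher_z, dim=1)
    logits = s @ t.t() / temperature
    return F.cross_entropy(logits, torch.arange(s.size(0)))

def distill_step(student, teachers, images, regions):
    """teachers: frozen region-specific encoders from stage 1; student: one lightweight encoder."""
    with torch.no_grad():  # teachers are fixed during distillation
        teacher_z = torch.cat(
            [teachers[r](img.unsqueeze(0)) for img, r in zip(images, regions)]
        )
    return contrastive_distill(student(images), teacher_z)
```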

As pointed out by R1 and R2, the strategy of using knowledge distillation to pretrain the visual feature extractor in the medical domain is novel and can benefit other medical applications such as medical image-text retrieval and medical image captioning.

  • Unlabeled setting (R4)

The unlabeled setting is valid. We collect radiology images for each body region (Head, Chest, or Abdomen) from public resources (the images are already grouped by regions). The collected images are unlabeled without any annotations of organs or modalities.

Also see Fig. 2 (left), where Brain CT and Brain MRI are well separated. This shows that even with unlabeled data, the distilled student model can capture the differences between image modalities.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I’m satisfied with the rebuttal to R4’s concern about unfair comparisons, given that the authors compared against MEVF, which had been pretrained using medical image data. One aim of the work was to assess whether pre-training on medical imaging data was better than pre-training on ImageNet, hence the other comparisons. In answering the concerns about missing details, I would rather the authors had said something about fixing this in the manuscript, rather than simply providing the missing information for the reviewers. R4 considered the work not to be sufficiently novel, but the other reviewers commented on the novelty.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    10



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This is a novel method. The idea for pretraining is convincing, and the method of using the visual “question answer” in a contrastive setting is a nice application to clinical data. The experimental results are sufficient and convincing.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    3



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    After reading the paper, the reviews, and the rebuttal, the main criticism of the paper concerns the baselines used. Indeed, R2/R3 bring up an important point which strongly limits the claims of the paper and method. While I appreciate the authors’ responses to these points in the rebuttal, they are insufficient to support the claims formulated.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    16


