
Authors

Ruizhi Liao, Daniel Moyer, Miriam Cha, Keegan Quigley, Seth Berkowitz, Steven Horng, Polina Golland, William M. Wells

Abstract

We propose and demonstrate a representation learning approach by maximizing the mutual information between local features of images and text. The goal of this approach is to learn useful image representations by taking advantage of the rich information contained in the free text that describes the findings in the image. Our method trains image and text encoders by encouraging the resulting representations to exhibit high local mutual information. We make use of recent advances in mutual information estimation with neural network discriminators. We argue that the sum of local mutual information is typically a lower bound on the global mutual information. Our experimental results in the downstream image classification tasks demonstrate the advantages of using local features for image-text representation learning.
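The local objective described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes an InfoNCE-style estimator with a plain dot-product critic (the paper uses trained neural network discriminators), one sentence feature per report, and a max over image regions as the region-selection rule. The function names `infonce_lower_bound` and `local_mi_estimate` are hypothetical.

```python
import numpy as np

def infonce_lower_bound(scores):
    """InfoNCE-style lower bound on MI from an (N, N) score matrix.

    scores[i, j] is the critic score of image i against sentence j;
    matched image/sentence pairs sit on the diagonal.
    """
    n = scores.shape[0]
    # Row-wise log-softmax, shifted by the row maximum for numerical stability.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Mean log-probability of the matched pair, plus log N.
    return float(np.mean(np.diag(log_softmax)) + np.log(n))

def local_mi_estimate(region_feats, sentence_feats):
    """Estimate with region selection (illustrative assumption).

    region_feats:   (N, R, D) array, R local region features per image.
    sentence_feats: (N, D) array, one sentence feature per report.
    """
    # scores[i, j, r] = dot product of region r of image i with sentence j.
    scores = np.einsum("ird,jd->ijr", region_feats, sentence_feats)
    # For each image/sentence pair, keep the best-matching region.
    best_region_scores = scores.max(axis=2)
    return infonce_lower_bound(best_region_scores)
```

The estimate can never exceed log N for a batch of N pairs, which is a known property of InfoNCE-style bounds and one reason batch size matters when such estimators are used for training.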

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87196-3_26

SharedIt: https://rdcu.be/cyl2w

Link to the code repository

N/A

Link to the dataset(s)

N/A

Reviews

Review #1

Please describe the contribution of the paper

• The authors proposed a representation learning approach by maximizing the mutual information between local features, instead of global features, of images and text.
• The authors experimentally demonstrated that maximization of local mutual information has advantages over its counterpart with global information.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

• Multimodal representation learning with both image and text is not new, but learning with local features, instead of global features, is novel. Furthermore, maximizing local mutual information is better suited to radiology reports, in which different sentences usually describe different regions of an image.
• The authors provide a brief argument that the sum of local mutual information is a lower bound on the global mutual information, supporting their methodology from a theoretical perspective.
• With two downstream classification tasks, the authors experimentally demonstrate the advantages of learning with local features, especially when they are fine-tuned.
• The manuscript is well written and organized.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

• The size of local features is hard to define. As illustrated in Fig. 1, different sentences in the report describe different local regions with different locations and sizes. Therefore, how to get local features with proper sizes may be critical. However, the local features utilized in this manuscript are all from block 5 in ResNet with fixed sizes. There is no justification or ablation study regarding this setting.
• It is not clear how the image encoder, especially the one used in the “image-only” experiment, is initialized. If all image encoders are randomly initialized, is there any advantage of the proposed method compared with fine-tuning from ImageNet pre-trained models?
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

This paper is reasonably reproducible. Although the code is not provided, the datasets are publicly available, and the authors provide sufficient details regarding experimental designs.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

• It would be more interesting and reasonable if the local features could dynamically cover regions of different sizes. Alternatively, the proposed fixed-size setting should be justified, theoretically or experimentally, as effective enough.
• In Tables 2 and 3, “local-mi” has clear advantages when the features are fine-tuned, but generally has lower performance than “global-mi” when the features are frozen. What is the possible reason that “local-mi” performs differently in the two settings?
Please state your overall opinion of the paper

Probably accept (7)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- The paper is well written and organized.
- The idea to maximize local mutual information is novel and insightful.
- The proposed approach is supported in both theoretical and experimental perspectives.
- Some important justifications/explanations are missing (e.g., size of local features, different performance of fine-tuned and frozen features).
What is the ranking of this paper in your review stack?

4
Number of papers in your stack

8
Reviewer confidence

Confident but not absolutely certain

Review #2

Please describe the contribution of the paper

In this paper, the authors proposed a multimodal representation learning framework for images and text by maximizing the mutual information between their local features. The experiments on public datasets showed that the local mutual information approach can improve the image classification results.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The paper is well written.
2. The proposed method is interesting and shows good results in the experiments.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The experimental section is weak. The validation is too simple. There is much existing research on image and text integration, especially in the computer vision community, and the paper does not compare against these methods.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The reproducibility looks fine.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

I am positive on this paper, but the authors should consider strengthening the validations, especially comparing to more related methods.
Please state your overall opinion of the paper

borderline accept (6)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The proposed method is interesting and shows good performance on the public datasets. But the experimental section should be further improved.
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

5
Reviewer confidence

Confident but not absolutely certain

Review #3

Please describe the contribution of the paper

The authors proposed the first attempt to exploit the image spatial structure and sentence-level text features with MI maximization to learn image and text representations that are useful for subsequent analysis of images.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. Learning representations from clinical free texts is a promising direction.
2. Maximizing the local mutual information between image regions and sentences is an interesting idea.
3. The results of the proposed method are impressive.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. The key idea behind Local-MI is somewhat confusing. The proposed method first randomly picks a sentence and maximizes the mutual information between a matched image region (feature) and this sentence. What if the sentence does not match any region (for example, “there is no pleural effusion”)? How is this case handled?
2. The analysis of the experimental results is insufficient, which is the reason I lowered my rating of this paper. Based on my first comment, I can understand that there are cases where Global-MI performs better than Local-MI, but the authors fail to give any explanation or visualization of this phenomenon.
3. The reported experimental results are not reliable: neither standard deviations nor p-values are reported.
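The statistics requested in the last item could be computed along the following lines. This is a sketch under stated assumptions, not a description of what the authors did: it assumes each variant has been trained several times (for example with different random seeds) to give paired per-run scores such as AUC values, and it uses a sign-flip permutation test as one reasonable choice of paired test. The function names are hypothetical.

```python
import numpy as np

def mean_and_std(values):
    """Mean and sample standard deviation (ddof=1) of per-run scores."""
    values = np.asarray(values, dtype=float)
    return float(values.mean()), float(values.std(ddof=1))

def paired_permutation_pvalue(a, b, n_perm=10000, seed=0):
    """Two-sided paired sign-flip permutation test on the mean difference.

    a, b: per-run scores of two methods, paired by run (same length).
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    observed = abs(diff.mean())
    # Under the null hypothesis the sign of each paired difference is arbitrary.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diff.size))
    permuted = np.abs((signs * diff).mean(axis=1))
    # Small tolerance guards against floating-point ties; the add-one
    # correction keeps the p-value strictly positive.
    count = np.sum(permuted >= observed - 1e-12)
    return float((count + 1) / (n_perm + 1))
```

With only a handful of runs the smallest attainable p-value is limited by the number of distinct sign patterns, so reporting the standard deviation alongside the p-value remains informative.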
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The proposed method is reproducible based on given details.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

The authors should address the problems raised in the weaknesses section.
Please state your overall opinion of the paper

borderline accept (6)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The major factor which makes me give this rating is that the proposed method outperforms image-only models by large margins, although the underlying reason is still not clear (what factors make proposed approach so strong).
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

1
Reviewer confidence

Very confident

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.
Please see strengths and weaknesses of the paper summarized below. Please try your best to address the items under weaknesses and answer reviewer questions in your rebuttal.

Strengths:
- Maximizing the local mutual information between image regions and sentences is an interesting idea.
- Multimodality representation learning with both image and text is not new. But learning with local features, instead of global features, is novel. Furthermore, maximizing local mutual information is more reasonable to handle a radiology report, in which different sentences usually describe different regions of an image.
Weaknesses:
- Validation is too simple
- Lacking comparison to other image and text integration methods
- Analysis of the results could be improved. Standard deviations or p-values should be reported.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

3

Author Feedback

We thank the reviewers for their feedback. We address their questions here and will revise the paper to improve the discussion of the advantages of local MI to reflect our response.

R1 raised questions about image feature size and whether it should be identified during training. We agree! In this first investigation of the benefits of local MI, we fixed the feature size; enabling the network to learn optimal feature size is an exciting future direction.

R1 and R3 wondered why the local MI offers substantial improvement in performance when the features are fine-tuned with the downstream model, while its performance is comparable with global MI if the features are frozen for the subsequent classification. In our experiments, training jointly with the downstream classifier (fine-tuning) typically improves performance of all tasks, with greater benefits for local MI. This suggests that local MI yields more flexible representations that adjust better for the downstream task. Our results are also supported by the analysis in Section 3 that shows that under the Markov assumption, the sum of local MI values is a lower bound on the global MI.

In response to R1’s question about random initialization vs. employing pretrained models, we agree that investigating initialization schemes is an important part of model optimization but is not related to the key innovation of our paper. We emphasize that all variants in our experiments were trained using the same strategy, thus enabling us to isolate the effects of employing local MI.

R2 requested that we include comparison to existing methods for image-text joint learning. We emphasize that our goal is to investigate the advantages of using local features for image-text representation learning and thus we focus on comparing local MI with global MI while keeping the architecture and the downstream task the same across all approaches. To the best of our knowledge, this is the first empirical demonstration and theoretical analysis of advantages offered by local MI for modeling joint image-text structure. Representation learning is an active research area with contrastive and MI-based approaches leading the field (Hjelm et al. (2018), Zhang et al. (2020)). Any state-of-the-art representation learning framework, such as CNN-RNN joint embedding, can be readily improved by employing local MI in its loss function and the feature selection as we explain in the paper.

R3 raised a question about sentences that do not match any image region (e.g., “no pleural effusions”). As we explain in the paper, for each sentence, we select an image region that has maximal MI with it. A sentence about (present or absent) pleural effusions in a radiology report is more correlated with the appearance of pleural space below the lungs (brighter due to effusion or not) than with any other regions in the image. Thus pleural space will be selected for the local MI maximization for “pleural effusions” (present or absent).

R3 requested that we include further discussion of factors that contribute to the strength of our approach. We will improve the discussion to clarify that the advantages of the local MI are threefold: 1) better fit to image-text structure: each sentence is typically a minimal and complete semantic unit that describes a local image region (Fig. 1) and therefore learning at the level of sentences and local regions is more efficient than learning global descriptors; 2) better optimization landscape: the dimensionality of the representation is lower and every training image-report pair provides more samples of image-text descriptor pairs; 3) better representation fit to downstream tasks: as Hjelm et al. (2018) demonstrated, image classification usually relies on local features (e.g., pleural effusion detection based on the appearance of the region below the lungs) and thus by learning local representations local MI improves classification performance.

We will include standard deviations and p-values requested by R3.
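The Markov argument mentioned in the response can be sketched for a single local pair. The notation here is an assumption, not taken from the paper: $x$ is the whole image, $y$ the whole report, $x_i$ a local image feature computed from $x$, and $y_j$ a sentence feature computed from $y$. Since $x_i$ depends on $y$ only through $x$, and $y_j$ depends on $x$ only through $y$, the data processing inequality applies twice:

```latex
x_i \;\longleftrightarrow\; x \;\longleftrightarrow\; y \;\longleftrightarrow\; y_j
\quad\Longrightarrow\quad
I(x_i;\, y_j) \;\le\; I(x;\, y_j) \;\le\; I(x;\, y).
```

This only shows that each local term is bounded by the global MI. The stronger statement about the sum of local terms relies on the additional assumptions given in Section 3 of the paper, which are not reproduced here.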

Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The idea of maximizing the local mutual information is novel and clearly motivated. The authors addressed most of the concerns in the provided rebuttal, and I recommend acceptance of the paper. If accepted, the authors should try to include additional results to support the claim that “any state-of-the-art representation learning framework, such as CNN-RNN joint embedding, can be readily improved by employing local MI in its loss function and the feature selection”. Authors should also include the promised further discussion on strengths of the paper in the final version.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

2

Meta-review #2

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This paper proposes a multimodal representation learning method by maximizing the local mutual information between image and text features. The idea is interesting. In the rebuttal, the authors have satisfactorily addressed the following concerns raised by the reviewers, e.g., why local MI gives substantial improvement, random initialization, comparison for image-text joint learning, matching between image and sentences, and discussion section extension.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

2

Meta-review #3

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
I read all reviews and the primary AC’s previous meta-review, and I also read the paper. This paper looks unfinished. The proposed method has demonstrated some interesting novelty, but the experimental results and validation are unacceptably poor. To be fair, we have to make decisions based on what the authors have submitted. For a journal submission, this paper would be rated as “resubmit as new”, but this is a conference submission. Based on the very poor and clearly inadequate experimental evaluation, I would recommend “reject”.

“Weaknesses:
- Validation is too simple
- Lacking comparison to other image and text integration methods”
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Reject
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

12


Multimodal Representation Learning via Maximization of Local Mutual Information