
Authors

Yang Yu, Peng Hu, Jie Lin, Pavitra Krishnaswamy

Abstract

Content-based image retrieval (CBIR) is of increasing interest for clinical applications spanning differential diagnosis, prognostication, and indexing of electronic radiology databases. However, meaningful CBIR for radiology applications requires capabilities to address the semantic gap and assess similarity based on fine-grained image features. We observe that images in radiology databases are often accompanied by free-text radiologist reports containing rich semantic information. Therefore, we propose a Multimodal Multitask Deep Learning (MMDL) approach for CBIR on radiology images. Our proposed approach employs multimodal database inputs for training, learns semantic feature representations for each modality, and maps these representations into a common subspace. During testing, we use representations from the common subspace to rank similarities between the query and database. To enhance our framework for fine-grained image retrieval, we provide extensions employing deep descriptors and ranking loss optimization. We performed extensive evaluations on the MIMIC Chest X-ray (MIMIC-CXR) dataset with images and reports from 227,835 studies. Our results demonstrate strong performance gains over a typical unimodal CBIR strategy. Further, we show that the performance gains of our approach are robust even in scenarios where only a subset of database images are paired with free-text radiologist reports. Our work has implications for next-generation medical image indexing and retrieval systems.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87240-3_58

SharedIt: https://rdcu.be/cyl6y

Link to the code repository

N/A

Link to the dataset(s)

https://physionet.org/content/mimic-cxr/2.0.0/


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors present a Multimodal Multitask Deep Learning (MMDL) approach for content-based image retrieval. Compared to a standard unimodal retrieval approach and the unimodal version of the proposed approach, MMDL shows better performance, even in scenarios where only part of the database images come with corresponding free-text reports.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. A new Multimodal Multitask Deep Learning model is proposed.
    2. The model is flexible: it handles databases in which all images are paired with reports, as well as databases in which only a subset of images have corresponding reports.
    3. The proposed method may be potentially useful for next-generation medical image retrieval.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Lack of performance comparison between the proposed multimodal deep learning model and other multimodal deep learning methods in the literature, such as https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4533857/ and https://paperswithcode.com/task/medical-image-retrieval/codeless.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    No response received for the questions related to code release

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    For future work, I would recommend

    1. Comparison of performance between the proposed model and state-of-the-art multimodal deep learning models
    2. Extension to CT and other modalities
  • Please state your overall opinion of the paper

    probably reject (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The performance comparison is not fair. The proposed model is compared with a standard unimodal CBIR method instead of state-of-the-art multimodal deep learning models. The authors did not respond to the questions related to code release.

  • What is the ranking of this paper in your review stack?

    4

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain



Review #2

  • Please describe the contribution of the paper

    The paper proposes a multitask learning approach for X-ray image retrieval by combining features from radiology images and semantic features from their associated reports.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    – Application of multimodal learning, combining features from both radiology images and reports, for X-ray image retrieval.
    – The paper reports the performance of the method on incomplete datasets, by considering scenarios where a subset of images lack associated textual reports.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    – Not enough baseline comparison. To compare with other methods, the authors compared their unimodal model’s performance with a DenseNet network (how the images were retrieved from DenseNet was not reported). However, there are existing approaches for chest X-ray image retrieval with which the authors could have compared their unimodal model, such as:

    Chen et al., Order-sensitive deep hashing for multimorbidity medical image retrieval, MICCAI 2018, pp. 620–628.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    – Public dataset used.
    – Model description included.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    – Comparison with other X-ray retrieval methods should be included

  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    – The application of multimodal multitask learning for X-ray retrieval is new.
    – The performance of the method on incomplete datasets is reported.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain



Review #3

  • Please describe the contribution of the paper

    The paper proposes an effective framework for CBIR in radiology that uses both radiological images and textual reports.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is very interesting as it uses a multimodal approach for CBIR. There is a good comparison between different models, such as DenseNet121, and also across changing inputs.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Although the method uses textual reports in the retrieval, it is not very clear how this is carried out. Fig. 1 summarises the approach, but a description of it is not provided; for example, how is the coupling matrix created?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    reproducible

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    The paper proposes a framework for CBIR in radiology that uses both images and textual reports. It is an interesting paper. Below are my comments:

    How is the text data (the report) handled in the analysis? The figure is nice, but a description of it is not provided; for example, how is the coupling matrix created? How is the text data helping overall, and why is it helping? Is there some case where the retrieval is wrong just because of the report?

    Minor comment: In the paper, it is mentioned “We accomplish this with two feature extraction and encoder networks to project the data inputs from each modality into a common subspace where the intra-class variation is minimized and intra-class variation is maximized.” Check this statement (presumably one of the two should read inter-class variation). Please provide more details about how this subspace is created.

  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Interesting paper with good results

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper needs a rebuttal process to address the following constructive comments from the reviewers (R1 and R2). I think they are highly consistent and important to address successfully in the rebuttal.

    “Lack of comparison of performance among the proposed multimodal deep learning model and other multimodal deep learning methods in literature such as https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4533857/, https://paperswithcode.com/task/medical-image-retrieval/codeless.

    – Not enough baseline comparison. To compare with other methods, the authors compared their unimodal model’s performance with DenseNet network (how the images were retrieved from DenseNet was not reported). However, there are existing approaches for chest X-ray image retrieval with which the authors could have compared their unimodal model, such as : Chen et al., Order-sensitive deep hashing for multimorbidity medical image retrieval, MICCAI 2018, pp. 620–628.”

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    7




Author Feedback

We proposed a Multimodal Multitask Deep Learning (MMDL) approach for Content-Based Medical Image Retrieval (CBMIR). Our approach leverages multimodal databases comprising radiology images and text reports, and performs robustly even when only a subset of database images is paired with reports. We thank the reviewers for their careful reading and enthusiasm for our work. The two main feedback points pertain to limited baseline comparisons and the need for further methodological details; we address these below.

Baseline Comparisons: We consider deep learning-based baseline methods for unimodal and multimodal medical image retrieval, in turn. Prior studies proposed deep learning methods for unimodal chest X-ray retrieval (Chen et al. (MICCAI 2018, DOI: 10.1007/978-3-030-00928-1_70); Fang et al. (MedIA 2021, DOI: 10.1016/j.media.2021.101981)). However, Chen 2018 used a custom split of the NIH-14 X-ray dataset and did not release code, precluding comparison. Fang 2021 focused on the MIMIC-CXR dataset that we used and also released code. To address reviewer feedback, we analyzed our approach with respect to Fang 2021. Similar to our work, Fang 2021 used a deep triplet loss and attention mechanism. Unlike our work, they considered 4 common classes out of the 14 MIMIC-CXR classes on a different network. For a rigorous comparison of the value-add of our multimodal approach, we integrated image features from their method with textual features from the reports (as in our method) and used the multimodal feature vector for retrieval. The inclusion of multimodal features provided a 3% Mean Average Precision (mAP) improvement (from 0.812 to 0.844) on their test set. We can include a comparison with this latest baseline in a revision. We further highlight that our multimodal CBIR framework is a general modular framework that can be adapted to enhance other leading-edge unimodal retrieval approaches; upgrades to any component of our approach are also possible.

There has been limited focus on multimodal deep learning for CBMIR. An older study (Cao et al. (Cancer Inform. 2014, DOI: 10.4137/CIN.S14053)) introduced probabilistic Latent Semantic Analysis (pLSA) for retrieval of images accompanied by text tags. Their approach employs feature fusion and uses a deep Boltzmann network, but is limited to short text descriptors and focuses on single-task learning. In contrast, our approach leverages deep learning to (a) scale to full radiology text reports containing detailed findings and associated degrees of confidence, and (b) uniquely leverage multi-task learning (model fusion) to enable retrieval with a unimodal query and a multimodal database. As such, our MMDL approach targets distinct and more complex retrieval tasks (e.g., for differential diagnosis) that by definition do not allow multimodal query inputs. We can include an expanded discussion in related work. As Cao 2014 considered a distinct type of retrieval task and dataset and did not release code, a baseline comparison is not possible.

Methodological Details: We clarify the details requested by the reviewers. For the DenseNet121 baseline, we used features from the average pooling layer, computed similarity using cosine distance, and retrieved the relevant database images. For text handling, we trained a Doc2Vec model on MIMIC-CXR text reports to extract a 300-dimensional text representation. The text data contains expert observations on image features, and hence enhances focus on critical visual information to improve performance on the whole.
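To make the baseline description above concrete, the following is a minimal sketch of pooled DenseNet121 features with cosine-similarity retrieval. This is an illustrative reconstruction, not the authors' released code: the function names, input sizes, and the torchvision weights argument are assumptions.

    import torch
    import torch.nn.functional as F
    from torchvision import models

    # DenseNet121 convolutional trunk; we apply the average pooling ourselves
    # to obtain the 1024-d features described in the rebuttal.
    backbone = models.densenet121(weights="IMAGENET1K_V1").features.eval()

    @torch.no_grad()
    def extract_features(images):
        """images: (N, 3, 224, 224) -> (N, 1024) pooled feature vectors."""
        fmap = F.relu(backbone(images))                   # (N, 1024, 7, 7)
        return F.adaptive_avg_pool2d(fmap, 1).flatten(1)  # (N, 1024)

    @torch.no_grad()
    def retrieve(query_feat, db_feats, k=5):
        """Rank database images by cosine similarity to the query feature."""
        sims = F.cosine_similarity(query_feat, db_feats)  # (N,)
        return sims.topk(min(k, sims.numel())).indices

    # Toy usage with random tensors standing in for preprocessed X-rays.
    db_feats = extract_features(torch.randn(8, 3, 224, 224))
    query_feat = extract_features(torch.randn(1, 3, 224, 224))
    top_idx = retrieve(query_feat, db_feats, k=5)

The Doc2Vec step could similarly be sketched with gensim as below; the preprocessing and all hyperparameters other than the 300-dimensional vector size are illustrative guesses.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy stand-ins for MIMIC-CXR free-text reports.
    reports = ["No acute cardiopulmonary process.",
               "Mild cardiomegaly without pulmonary edema."]
    docs = [TaggedDocument(words=text.lower().split(), tags=[i])
            for i, text in enumerate(reports)]

    # 300-dimensional report embeddings, matching the rebuttal.
    d2v = Doc2Vec(docs, vector_size=300, min_count=1, epochs=20)
    report_vec = d2v.infer_vector("no acute cardiopulmonary process".split())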
The coupling matrix P defines the common subspace and is learned during network weight training (Hu et al. (SIGIR 2019, DOI: 10.1145/3331184.3331213)). To map the input data into the common subspace, the encoder networks extract features from each data input, and the fully-connected layers reduce the feature dimensionality and project it into the common subspace. We can highlight these details in an improved Fig. 1 and implementation section.
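For readers who want a concrete picture of the common-subspace construction, below is a minimal, hypothetical sketch in which two encoder branches project each modality's features through fully-connected layers into a shared subspace, and a single shared linear layer stands in for the coupling matrix P by tying both branches to the same label space. The layer sizes, loss, and class count are assumptions for illustration, not the exact architecture of this paper or of Hu et al. (2019).

    import torch
    import torch.nn as nn

    class CommonSubspaceModel(nn.Module):
        """Two-branch sketch: modality features are reduced by fully-connected
        layers into a shared subspace; one shared linear layer (whose weight
        plays the role of the coupling matrix P) maps both branches to the
        same label space."""
        def __init__(self, img_dim=1024, txt_dim=300, common_dim=128, n_classes=14):
            super().__init__()
            self.img_proj = nn.Sequential(
                nn.Linear(img_dim, 512), nn.ReLU(), nn.Linear(512, common_dim))
            self.txt_proj = nn.Sequential(
                nn.Linear(txt_dim, 512), nn.ReLU(), nn.Linear(512, common_dim))
            self.coupling = nn.Linear(common_dim, n_classes, bias=False)

        def forward(self, img_feat, txt_feat):
            z_img, z_txt = self.img_proj(img_feat), self.txt_proj(txt_feat)
            return self.coupling(z_img), self.coupling(z_txt), z_img, z_txt

    # Training sketch with random stand-ins for DenseNet/Doc2Vec features;
    # supervising both branches with the same labels pulls paired image and
    # text embeddings toward the same region of the common subspace.
    model = CommonSubspaceModel()
    criterion = nn.BCEWithLogitsLoss()   # MIMIC-CXR labels are multi-label
    img_feat, txt_feat = torch.randn(8, 1024), torch.randn(8, 300)
    labels = torch.randint(0, 2, (8, 14)).float()
    logits_i, logits_t, _, _ = model(img_feat, txt_feat)
    (criterion(logits_i, labels) + criterion(logits_t, labels)).backward()

At retrieval time, the embeddings z_img and z_txt (rather than the logits) would be compared with cosine similarity, which is what lets a unimodal query be ranked against a multimodal database.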




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposes to address the multimodal information retrieval problem for a chest X-ray image dataset. According to at least two reviewers, the authors did not provide enough comparison against related work. In the rebuttal, the authors addressed these concerns partially. Moreover, the methodological novelty is limited, as presented in the Methods section. Another major concern is that this paper may not really help with the diagnosis problem that matters most, namely giving the correct disease labels and localizing the suspicious disease area. The results shown in Figure 3 do not demonstrate anything strongly clinically relevant. In the examples on the right, the top five retrieved images have quite different disease label distributions. For chest X-ray images, diseases can have subtly different and ambiguous visual appearances. From the methods proposed in this paper, it is unclear how this problem can be addressed.

    ” – Not enough baseline comparison. To compare with other methods, the authors compared their unimodal model’s performance with DenseNet network (how the images were retrieved from DenseNet was not reported). However, there are existing approaches for chest X-ray image retrieval with which the authors could have compared their unimodal model, such as :

    Chen et al., Order-sensitive deep hashing for multimorbidity medical image retrieval, MICCAI 2018, pp. 620–628.”

    Work on multimodal X-ray image retrieval has not been pursued much, probably because it is not an important task to solve.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    13



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposed a learning framework that jointly learns multimodal feature embeddings for X-ray image retrieval. It shows some promising properties of the proposed method, such as retrieval when part of the images lack corresponding reports. The major concern from the reviewers was the lack of baseline comparison. From the rebuttal, it seems this work emphasizes multimodality between images and text, rather than among different imaging modalities (I am OK with that response). However, from this perspective, this work seems to overlook the corpus of literature on joint learning of image-text feature embeddings (such as those used in image-text matching tasks, where the learned feature embeddings have been used for downstream applications such as classification, e.g., https://deepai.org/publication/weakly-supervised-feature-learning-via-text-and-image-matching and https://arxiv.org/abs/2010.03060). Therefore, my attitude toward this work is quite neutral: this is a borderline paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    11



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have adequately addressed the major concerns regarding comparison with baseline methods. The paper is novel and contributes to the MICCAI community. Nevertheless, I have a slight concern about reproducibility. Since the major reason this paper does not compare with Chen 2018 is that their code was not published, and the authors of this paper did not respond to the code publication question, the advancement of this research question might run into the same problem in the future.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    14


