Authors

Hang Li, Fan Yang, Xiaohan Xing, Yu Zhao, Jun Zhang, Yueping Liu, Mengxue Han, Junzhou Huang, Liansheng Wang, Jianhua Yao

Abstract

The fusion of heterogeneous medical data is essential in precision medicine to assist medical experts in treatment decision-making. However, there is often little explicit correlation between data from different modalities such as histopathological images and tabular clinical data. Besides, attention-based multi-instance learning (MIL) often lacks sufficient supervision to assign appropriate attention weights for informative image patches and thus to generate a good global representation for the whole image. In this paper, we propose a novel multi-modal multi-instance joint learning method, which fuses different modalities and magnification scales as a cross-modal representation to capture the potential complementary information and recalibrate the features in each modality. Furthermore, we leverage the information from tabular clinical data to optimize the MIL bag representation in the imaging modality. The proposed method is evaluated on a challenging medical task, i.e., lymph node metastasis (LNM) prediction of breast cancer, and achieves state-of-the-art performance with an AUC of 0.8844, outperforming the AUC of 0.7111 using histopathological images or the AUC of 0.8312 using tabular clinical data alone. An open-source implementation of our approach can be found at https://github.com/yfzon/Multi-modal-Multi-instance-Learning.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87237-3_51

SharedIt: https://rdcu.be/cymbd

Link to the code repository

https://github.com/yfzon/Multi-modal-Multi-instance-Learning

Link to the dataset(s)

N/A

Reviews

Review #1

Please describe the contribution of the paper

The paper is focused on learning fusion from multi-modal data. A new multi-modal learning technique is provided that can integrate information from image data and text information. A particularly interesting contribution is that the images used are whole slide images (WSI), which often need to be analysed as patches (treated as multiple instances).
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

Learning from multi-modal data is an important motivation, particularly in precision medicine where data may come from a variety of sources. The paper is hence solving an important problem.

A logical formulation of the instance learning process as a weakly supervised learning problem that accumulates the information learned from each patch of the WSI. This allows each WSI to be represented as a bag of patch-based image information.

Thorough experimental protocol with held out validation and test data, an ablation study, and comparison with other work from the literature. Multiple metrics were used.

Good use of figures to support text - conceptual methodology is clearly communicated.

The results show that the method achieves its goals. For example, Fig 3 shows spatially varying attention, which indicates that the model is learning by considering different aspects from different parts of the WSI
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The results, while thorough, could benefit from some additional statistical information (standard deviation of the metrics, confidence intervals, p values).

The text in the figures is difficult to read at standard magnification.

Some errors of expression throughout the text.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors will provide the code. The dataset used was acquired from a collaborating hospital. Labelling was done by the consensus of two observers.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
Overall, a clear paper that addresses an important problem. A few suggestions for improvement:
1. While there is a good link of the first technical contribution to the challenges for multi-modal learning, there isn’t much related work listed. Data fusion has been an important research goal in the last few years and one suggestion would be to indicate recent work in this domain (not necessarily in depth but references so that an interested reader can see more broadly).
2. For clinical information it would be good to state the text pattern matching algorithm used to generate a structured representation, and justify the reason why it was chosen.
3. The text in Fig 2 is difficult to see at 100% magnification. In particular the white text in the orange boxes. Suggest that the image is resized (perhaps lay out the image again using the full width).
4. A confidence interval or p value should be included.
5. Please do a quick proofreading pass to fix some errors of expression. For example: modal training -> model training
Please state your overall opinion of the paper

strong accept (9)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This paper is solving an important problem of data fusion from medical imaging data. It presents, in sufficient detail, a deep learning methodology that attempts to address this problem in a novel way. The experimental results demonstrate that the methodology works as intended. There are a few minor issues that can be addressed, but overall the intellectual quality of the work means that with a bit of extension this could be a journal paper.
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

3
Reviewer confidence

Very confident

Review #2

Please describe the contribution of the paper

This paper proposes a multi-modal and multi-instance learning network for lymph node metastasis prediction of breast cancer. There are two contributions: (1) combining histopathological images and tabular clinical information; (2) combining images from different magnification scales.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The strengths of the paper include: (1) propose a multi-modal multi-instance fusion module to generate a cross-modal representation of different modalities. (2) utilize an attention based method to facilitate the informative instances.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The weaknesses include: (1) This paper lacks novelty: the combination of image and non-image modalities seems to be very common, and the feature fusion methods are generally based on concatenation and attention, which are also very common. (2) The introduction is a little unclear and lacks logic; it should focus on the key points. For example, combining histopathological images and tabular clinical information is not stressed in the contribution part.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

(1) How are the images of different magnification scales obtained? (2) The authors mention that the number of instances is dynamic, which may cause unstable model training. It is unclear how the authors avoid that. (3) Are the feature extractors for images of different scales shared?
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

(1) Clarify the focus of the paper, especially in the Introduction. (2) Explain the data used more clearly. All WSIs are scanned at 20x magnification; how are the 10x and 5x images obtained? By resizing? (3) Why use different learning rates for different parts? Is there any experimental evidence to verify that? (4) More feature fusion methods should be explored.
Please state your overall opinion of the paper

borderline reject (5)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

(1) The novelty, (2) clearness of Introduction part, and (3) the completeness of experiment
What is the ranking of this paper in your review stack?

2
Number of papers in your stack

4
Reviewer confidence

Confident but not absolutely certain

Review #3

Please describe the contribution of the paper
- This work explores the use of multi-instance learning (MIL) for lymph node metastasis (LNM) prediction. By incorporating clinical information, the authors hope to better guide the MIL model to discover what’s important for the LNM prediction.
- Further, they make use of multiple scales to capture the information from the multiple scales of the histology image to get better representations.
- The authors also make use of a tabular model to process the clinical data prior to the fusion.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- This paper targets building a complete model, which makes use of all the available information, different modalities and scales, to comprehensively combine them for prediction tasks.
- Each of the key modules are separately described, making it intuitive to follow.
- A thorough experimental evaluation and comparison with different state of the art baselines is presented.
- The multi-scale MIL framework can naturally be extended to allow post-hoc interpretability by understanding importance of the different locations. This allows for an understanding of what the model is focusing on, yielding some interpretability.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- The rationale behind the use of a separate clinical-based prediction model is unclear. How are the outputs of the two classifiers combined?
- The tabular model is not described sufficiently. How is this different from regular networks? What’s the intuition behind these models?
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
- Some details of the model training are described, including the number and type of GPUs used, the optimization algorithm, etc. The initialization of the model is not mentioned.
- The tabular model is not sufficiently described, and cannot be reproduced as is. What are the 18 clinical attributes extracted?
- Was the best model chosen with early stopping, or at the best epoch on the validation set, or after a fixed number of epochs on the validation set? How were the models initialized? Were the models trained end-to-end, or module-wise?
- How long does it take for the model to converge?
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
- The aggregation of the patch-level features across all selected patches of the WSI results in the loss of spatial information, which could prove useful in a complex task like LNM prediction.
- The performance of the different methods across different runs/folds is critical to completely evaluate and compare them.
- How are the patches selected for experimentation?
Please state your overall opinion of the paper

Probably accept (7)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- Despite the mentioned weaknesses, the paper presents a novel idea which tries to take in all the available information across modalities and scales in an effort to build a comprehensive model. Such efforts are crucial to improve performance of machine learning models.
What is the ranking of this paper in your review stack?

3
Number of papers in your stack

5
Reviewer confidence

Confident but not absolutely certain

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

This paper addresses an important problem: taking advantage of all available information across modalities and scales for lymph node metastasis (LNM) prediction. Major concerns include: 1) the need for additional statistical information (such as standard deviations and confidence intervals); 2) a lack of clarity on how to avoid the variance caused by the dynamic number of instances; 3) a lack of details on model training and experimental setup; and 4) typos and grammatical errors.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

3

Author Feedback

We sincerely thank all the reviewers and ACs for their constructive comments and will address major concerns in the following.

AC&R#1&R#3: Include some additional statistical information. We calculated 95% confidence intervals (CI) for the proposed method. The 95% CIs of our method are AUC (0.8605-0.9083), F1-score (0.7645-0.8280), Precision (0.7122-0.7967), Recall (0.8001-0.8774). We will complete the statistics of all methods in the final version.
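The rebuttal does not say how these intervals were computed. One common choice is a percentile bootstrap over the test cases, and the following is only a minimal sketch of that idea under that assumption. The function names (`auc`, `bootstrap_ci`), the toy labels and scores, and the number of resamples are all illustrative and do not come from the paper.

```python
import random


def auc(labels, scores):
    """Rank-based AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed procedure, not the authors')."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases with replacement
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain a single class
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data for illustration only.
toy_labels = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
toy_scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.65, 0.55, 0.70, 0.30]
toy_interval = bootstrap_ci(toy_labels, toy_scores)
```

The same resampling loop could be reused for F1-score, precision and recall by swapping the metric function.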

AC&R#2: How to avoid the variance caused by dynamic number of instances? For a dynamic number of instances, each instance is assigned with an attention value, and then aggregated by this value to generate a global embedding. During this process, the dimension representing the numbers vanishes and therefore has no significant impact on the aggregation process.
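The argument that the instance dimension vanishes can be illustrated with a small attention-pooling sketch. This is not the authors' implementation: the scoring function below (a dot product of a weight vector `w` with `tanh` of the instance features) is a simplified stand-in assumed for illustration, and `attention_pool` is a hypothetical name.

```python
import math


def attention_pool(instances, w):
    """Aggregate a bag of instance embeddings into one embedding.

    instances: list of feature vectors, one per patch (any number of patches).
    w: scoring vector with the same length as each feature vector (assumed form).
    Returns (bag_embedding, attention_weights).
    """
    # One scalar score per instance.
    scores = [sum(wj * math.tanh(hj) for wj, hj in zip(w, h)) for h in instances]
    # Numerically stable softmax over the instances in the bag.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum over instances: the instance axis disappears here.
    dim = len(instances[0])
    bag = [sum(a * h[j] for a, h in zip(weights, instances)) for j in range(dim)]
    return bag, weights


w = [0.5, -0.25, 1.0]
small_bag = [[0.1, 0.2, 0.3], [0.9, 0.1, 0.4], [0.3, 0.3, 0.3]]
large_bag = small_bag + [[0.2, 0.8, 0.5], [0.7, 0.7, 0.1], [0.0, 0.4, 0.9], [0.6, 0.2, 0.2]]
emb_small, _ = attention_pool(small_bag, w)
emb_large, _ = attention_pool(large_bag, w)
```

Both bags yield an embedding of length 3 even though they hold 3 and 7 instances, because the softmax weights sum to one and the weighted sum runs over the instance axis.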

AC&R#3: Details in model training and experimental setup. The model was initialized with Xavier initialization, and the best model was chosen at the best epoch on the validation set. The parameters of the image feature extraction part are fixed, and the rest of the model is trained end-to-end. The training process takes about 10 hours to converge.

R#1: Details about the text pattern matching algorithm used to generate a structured representation. The original medical records are given as semi-structured natural language descriptions in Chinese. We wrote a set of matching rules based on regular expressions to extract the structured information, e.g., age, tumor side, histological classification, etc.
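The actual matching rules are not given, and the source records are in Chinese. The sketch below only illustrates the general idea of rule-based extraction with regular expressions. The English record text, the field names and every pattern in `RULES` are hypothetical examples, not the authors' rules.

```python
import re

# Hypothetical rules: one regular expression per structured field.
RULES = {
    "age": re.compile(r"age[:\s]+(\d{1,3})", re.IGNORECASE),
    "tumor_side": re.compile(r"\b(left|right)\s+breast", re.IGNORECASE),
    "histological_type": re.compile(
        r"histolog\w*\s+(?:type|classification)[:\s]+([A-Za-z ]+?)(?:[.;,]|$)",
        re.IGNORECASE,
    ),
}


def extract(record):
    """Return a dict of field -> extracted string, or None when a rule does not match."""
    out = {}
    for field, pattern in RULES.items():
        match = pattern.search(record)
        out[field] = match.group(1).strip().lower() if match else None
    return out


# Made-up record for illustration only.
example_record = (
    "Female, age: 52. Mass in the left breast. "
    "Histological classification: invasive ductal carcinoma; grade II."
)
example_fields = extract(example_record)
```

Fields with no matching rule come back as `None`, which makes missing attributes explicit before they are encoded as tabular features.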

R#2: Lack novelty and clearness of Introduction part. We respectfully disagree. Due to the super large dimensions, the multi-magnification nature of WSI images, and the domain disparity of WSI and tabular data, conventional multi-instance learning (MIL) methods were not designed for this specific scenario and thus resulted in limited performance. We proposed a multi-modal multi-instance method to handle the cross-modality fusion of pathological images and tabular data. The novelty of our proposed method lies in four aspects: 1) A well-designed module employing cross-modal learning to integrate complementary information between imaging and tabular data. Multi-modality fusion helps pathologists to simultaneously identify useful information in different modalities during diagnosis. 2) A multi-scale mechanism to fuse global and local features from large pathological images. 3) A novel MIL framework that aggregates instances at different scales. 4) A visualization module to interpret multi-modal multi-scale results. These novel components are validated by extensive ablation studies and outperform the state-of-the-art methods (AUC 88.44% vs 85.70%). Both R#1 and R#3 agree with the novelty of our approach and rate the clarity and organization as very good.

R#2: How to get the 10x and 5x images? Digital pathology images are stored in pyramidal structures, and the files are internally organized with multiple magnifications. The 20x images mentioned in the paper refer to the highest magnification; the 5x and 10x images are retrieved directly from the source file and are not obtained by resizing.
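As a rough illustration of how a magnification maps onto a stored pyramid level: readers such as OpenSlide expose one downsample factor per level, and the level for a target magnification is the one whose factor equals the base magnification divided by the target. The helper below is a sketch under that assumption. The downsample list is made up, `level_for_magnification` is a hypothetical name, and the rebuttal does not say which reader was used.

```python
def level_for_magnification(base_mag, level_downsamples, target_mag, rel_tol=0.01):
    """Return the pyramid level stored at target_mag, or None if no level matches.

    base_mag: magnification of level 0 (20 for the slides described here).
    level_downsamples: downsample factor of each level relative to level 0.
    """
    wanted = base_mag / target_mag  # e.g. 20x -> 5x needs a downsample of 4
    for level, factor in enumerate(level_downsamples):
        if abs(factor - wanted) <= rel_tol * wanted:
            return level
    return None


# Hypothetical pyramid: each level halves the resolution of the previous one.
downsamples = [1.0, 2.0, 4.0, 8.0]
level_20x = level_for_magnification(20, downsamples, 20)
level_10x = level_for_magnification(20, downsamples, 10)
level_5x = level_for_magnification(20, downsamples, 5)
```

A return value of `None` signals that the requested magnification is not stored in the file and would have to be produced by resizing instead.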

R#2: Why use different learning rates for different parts? Different modules in our proposed method are responsible for the feature extraction of different modalities. The MIL module processes image-based features, and the tabular module processes table-based features. Our method allows the components for different modalities to have different learning rates, which is verified by experiments to be beneficial for the network to efficiently select optimal parameters and improve convergence speed.

R#3: The rationale behind the use of a separate clinical-based prediction model. The separate prediction model is inspired by Deeply Supervised Nets and can help improve the stability of the model training process. This separate classifier G_aux is only used during the training process, and only the multi-modal branching classifier is used for the inference.
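The training-only role of G_aux can be sketched as an auxiliary loss term that is added during training and ignored at inference. The weighting of the two terms is not stated in the rebuttal, so `aux_weight` below is an assumed hyperparameter, and the function names are illustrative.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and label y in {0, 1}."""
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def training_loss(p_main, p_aux, y, aux_weight=0.5):
    """Loss used during training: main multi-modal branch plus auxiliary branch.

    p_main: probability from the multi-modal classifier.
    p_aux: probability from the auxiliary clinical classifier (G_aux).
    aux_weight: assumed weighting of the auxiliary term (not given in the source).
    """
    return bce(p_main, y) + aux_weight * bce(p_aux, y)


def predict(p_main, threshold=0.5):
    """Inference uses only the multi-modal branch; G_aux is not consulted."""
    return int(p_main >= threshold)


loss_example = training_loss(p_main=0.9, p_aux=0.6, y=1)
label_example = predict(0.9)
```

Because the auxiliary term only contributes gradients during training, dropping it at inference changes nothing about the deployed classifier.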

Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

All major concerns have been well addressed. This paper deserves to be published in MICCAI.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

1

Meta-review #2

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The author has clarified the main issues raised by the reviewers. Considering the proposed methodological contributions, such as learning fusion from cross-modal data and combining images from different magnification scales, I think this paper is acceptable.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

5

Meta-review #3

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The reviewers agree on some merits of this paper. The rebuttal has also clarified some concerns about the model training, experimental setting, and results. The major point of disagreement is the significance of the paper's novelty, which is somewhat mild. The clarification of the novelty in the rebuttal sounds acceptable to me for MICCAI publication. There is just a minor point for the authors' consideration. In the introduction, the advantages of the proposed method were discussed against MMCNN-MIML[17]. This discussion would be more convincing if a comparison with MMCNN-MIML were also reported.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

2

Multi-modal Multi-instance Learning using Weakly Correlated Histopathological Images and Tabular Clinical Information