
# Authors

Yao Zhang, Jiawei Yang, Jiang Tian, Zhongchao Shi, Cheng Zhong, Yang Zhang, Zhiqiang He

# Abstract

Liver cancer is one of the most common cancers worldwide. Due to inconspicuous texture changes of liver tumor, contrast-enhanced computed tomography (CT) imaging is effective for the diagnosis of liver cancer. In this paper, we focus on improving automated liver tumor segmentation by integrating multi-modal CT images. To this end, we propose a novel mutual learning (ML) strategy for effective and robust multi-modal liver tumor segmentation. Different from existing multi-modal methods that fuse information from different modalities by a single model, with ML, an ensemble of modality-specific models learn collaboratively and teach each other to distill both the characteristics and the commonality between high-level representations of different modalities. The proposed ML not only enables the superiority for multi-modal learning but can also handle missing modalities by transferring knowledge from existing modalities to missing ones. Additionally, we present a modality-aware (MA) module, where the modality-specific models are interconnected and calibrated with attention weights for adaptive information exchange. The proposed modality-aware mutual learning (MAML) method achieves promising results for liver tumor segmentation on a large-scale clinical dataset. Moreover, we show the efficacy and robustness of MAML for handling missing modalities on both the liver tumor and public brain tumor (BRATS 2018) datasets. Our code is available at https://github.com/YaoZhang93/MAML.

SharedIt: https://rdcu.be/cyhMz

# Reviews

### Review #1

• Please describe the contribution of the paper

This article presents a deep learning approach to liver tumor segmentation with multiple modalities. The model can cope with missing modalities. Its main contribution is the presented modality-aware (MA) module, which fuses deep learning features from different modalities to produce better segmentation results.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The design of the MA module is novel and the details are clearly presented.
2. The presentation is clear and the presented techniques are easy to follow.
3. The intuition behind the methodological designs is well explained. For example, from the abstract: “with ML, an ensemble of modality-specific models learn collaboratively and teach each other to distill both the characteristics and the commonality between high-level representations of different modalities.”
• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

There is no obvious weakness in the presented work. I only have some minor comments:

1. The phrase “denoted as F_dual” seems to be a typo, as “F_dual” is already used right before it, on page 4.

2. On page 5, “A_x” seems to be a typo; it should be “A_i”.

3. In Table 1, the item “MS+MA” is not explained.

4. On page 8, “PAM” has not been defined before.

• Please rate the clarity and organization of this paper

Very Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Code will be released upon acceptance. The presented method has been validated on a public dataset. Therefore, the experimental results should be reproducible.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

Probably accept (7)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This article has clear presentation and logic. Its proposed method is sound in technical design. However, it is a modification or upgrade of existing work, i.e., HeMIS [6]. Thus, I rate it as probably accept.

• What is the ranking of this paper in your review stack?

2

• Number of papers in your stack

5

• Reviewer confidence

Confident but not absolutely certain

### Review #2

• Please describe the contribution of the paper

This paper addresses mutual learning between multi-phase (or multi-parametric) images for segmentation purposes. An ensemble of modality-specific models with an attention mechanism learn collaboratively to distill the characteristics between high-level representations from different modalities. The framework is optimized using both intra- and inter-phase losses. Experiments performed for multi-phase liver and multi-parametric brain tumor segmentation demonstrate the ability of the proposed model to deal with missing modalities.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
• By providing a visual interpretation of the contribution of each phase, attention maps can provide useful guidance for clinical practice
• The methodology is able to handle missing modalities, a frequent but not widely investigated scenario
• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
• No clear innovation in relation to existing works on cross-modality learning and attention mechanisms
• The “multi-modal” terminology is somewhat misleading since only dynamic contrast-enhanced CT or multi-parametric MR images are exploited
• Both methodological and experimental sections lack precision
• Please rate the clarity and organization of this paper

Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The code will be made publicly available in case of acceptance. The abdominal CT dataset is private, unlike the BraTS 2018 data, which can be used to reproduce experiments.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

Although not very innovative in relation to existing works on cross-modality learning and attention mechanisms, the methodology is of interest for the medical community since it: 1- exploits both complementarity and redundancy between multiple images, 2- provides a visual interpretation for clinical guidance and 3- handles missing modalities. However, a number of elements needed to meet the MICCAI requirements are missing. Details are provided below.

• Formulation. The “multi-modal” terminology is somewhat misleading since you process Dynamic Contrast-Enhanced (DCE-) CT scans or multi-parametric MR images only. In particular, “multi-phase” or “DCE-CT” would be more suited than “multi-modal CT” in my opinion.
• Globally, the methodological section (Sect. 2) is hard to follow and the writing could be improved and simplified. Be careful to define all mathematical formulations!
• Modality-specific model (Sect.2.1). You use nnUnet using dual-phase CT volumes (individually) as inputs to provide high-level semantic embeddings of each specific phase. It is not clear how you obtain such high-level semantic embeddings. At which network layer do you stop to obtain the representation $\textbf{F}_i$? Moreover, you should define C, D, H and W with respect to source image dimensions.
• Modality-aware (MA) module (Sect.2.2). I understand that you need both $\textbf{F}_{dual}$ and $\textbf{F}_i$ to estimate attention weights (Eq.1). However, you should explain why you need to concatenate $\textbf{F}_i$ and $\textbf{F}_{dual}$ as input of the convolutional stream since $\textbf{F}_{dual}$ already includes $\textbf{F}_i$. In addition, $\textbf{F}_{dual}$ seems “visually” smaller than the concatenation of the FCN outputs in Fig.1. Why? Finally, 1- “which indicates the significance of the features in $\textbf{F}_{dual}$, denoted as $\textbf{F}_{dual}$” is unclear, 2- $f_a$ involved in Eq.1 is not defined.
• Mutual learning (ML) strategy (Sect.2.3). The teacher-student learning formulation where each modality-specific model interacts as a teacher and a student mutually is not convincing…
• Experiments. Is there any cross-validation strategy involved in your experiments?
• Baseline methods. Results arising from a single network with as inputs a concatenation of both (registered) arterial and venous phase images should appear in Tab.1. Do you employ this strategy when you run nnUNet in Tab.1? More globally, I suggest you explain exactly how you use the nnUNet and OctopusNet baseline methods!
• Ablation study. I do not fully understand the strategy which consists in applying MA without ML. Does it mean that you use $L_{inter}$ only as loss function?

• To provide a better overview of related works, I suggest you position your contributions with respect to other existing studies integrating multi-phase abdominal CT scans for liver tumor segmentation using learning techniques: “Automatic segmentation of liver tumors from multiphase contrast-enhanced CT images based on FCNs” (AIIM 2017), “Scale-adaptive supervoxel-based random forests for liver tumor segmentation in dynamic contrast-enhanced CT scans” (IJCARS 2017), “Liver tissue segmentation in multiphase CT scans using cascaded convolutional neural networks” (IJCARS 2019)…
• I do not fully agree with the “CT imaging is recommended for better diagnosis of liver cancer” assumption provided in the abstract. In some cases, MR images allow a better distinction between healthy and hepatocellular carcinoma (HCC) tumoral tissues.
• Contrary to what you state in the sentence “we compare MAML with recent advanced multi-modal methods, nnUNet […]”, nnUNet is not natively designed to process multi-modal data!
• Acronyms: please define MA (modality-aware) and ML (mutual learning) in Sect.1 (only mentioned in the abstract).
• When you describe the different multi-modal feature fusion strategies, you could indicate for the first one that multi-modal images must be registered before being provided as inputs of the network.
• Dataset description. What is a “top hospital”? You should indicate 327 contrast-enhanced CT examinations instead of 654 volumes. Is there any fusion of annotations from the 3 experienced clinicians?
• Visual results on BraTS dataset would be very appreciated!
• Statistical analysis. You could use t-tests to further prove the superiority of MAML with respect to baseline methods.
• Typos: “computed” instead of “Computed” in the abstract, “, even by” instead of “by even” in Sect.1, “issue is how” instead of “issue is that how” in Sect.1, “computations” instead of “computation” in Sect.1, “multi-modal” instead of “mutli-modal” in Sect.1, “state-of-the-art” instead of “arts” in Sect.1, “where” instead of “that” in the last sentence of Sect.1, “as inputs” instead of “as the inputs” in Sect.2.2, “ML” instead of “MT” in Sect.2.3.

borderline reject (5)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
• (-) No clear innovation
• (-) Lack of precision for both methodological and experimental sections
• What is the ranking of this paper in your review stack?

4

• Number of papers in your stack

5

• Reviewer confidence

Very confident

### Review #3

• Please describe the contribution of the paper

This paper introduces a novel framework to perform multi-modal image segmentation. Instead of concatenating the image modalities as input, the authors propose to use an ensemble of modality-specific networks. This allows for performing image segmentation with only one image modality at inference (missing modalities scenario). To perform multi-modal image segmentation, the outputs of the modality-specific networks are fused using an attention module. The networks are trained simultaneously to perform image segmentation both independently and collaboratively.

Experiments are performed on two tasks: liver tumor segmentation using CT scans (private dataset) and brain tumor segmentation using only T1 contrast-enhanced scans (public dataset).
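The fusion step described in this summary can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: features are flattened to plain Python lists instead of C×D×H×W tensors, a single scalar weight per modality stands in for the spatial attention map, and the linear scoring function, the sigmoid gating, and the use of one parameter vector per modality are all assumptions made for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attention_weight(f_i, f_dual, theta):
    """Hypothetical stand-in for the attention function:
    a linear score on the concatenation [F_i ; F_dual], squashed to (0, 1)."""
    concat = f_i + f_dual  # list concatenation mirrors channel-wise concatenation
    score = sum(w * x for w, x in zip(theta, concat))
    return sigmoid(score)

def fuse(features, thetas):
    """Weight each modality's features by its attention value and sum them."""
    # F_dual: concatenation of all modality-specific features
    f_dual = [x for f in features.values() for x in f]
    weights = {m: attention_weight(f, f_dual, thetas[m]) for m, f in features.items()}
    size = len(next(iter(features.values())))
    fused = [sum(weights[m] * features[m][k] for m in features) for k in range(size)]
    return fused, weights

# Made-up two-element features for the two CT phases.
features = {"arterial": [1.0, 0.0], "venous": [0.0, 1.0]}
# One parameter vector per modality (length = len(F_i) + len(F_dual) = 6); zeros are arbitrary.
thetas = {"arterial": [0.0] * 6, "venous": [0.0] * 6}
fused, weights = fuse(features, thetas)
```

With all-zero parameters every score is 0, so both modalities receive weight 0.5 and the fused vector is the plain average of the two inputs. In the single-modality scenario the fusion step would simply be skipped and the corresponding modality-specific output used directly.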

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
• The paper is clear and well motivated
• The proposed technique is simple and easy to implement.
• Extensive experiments demonstrate the effectiveness of the approach on two different scenarios (multi-modal and uni-modal image segmentation) and on two clinical problems (CT liver tumor and MRI brain tumor segmentation)
• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
• The analysis of the limitations of existing techniques for missing modalities is imprecise and not convincing.
• Confusing notations with some errors
• There are missing references. The authors use a standard spatial attention module without referencing existing work.
• The technical novelty is limited. The core of the framework is to fuse the modality-specific features using a modality-aware attention module. This attention module has been previously proposed in other contexts (e.g., [1], [2])
• There are missing implementation details (see below)
• The authors do not discuss the limitations of their method

[1] Zhang, et al. “ET-Net: A Generic Edge-aTtention Guidance Network for Medical Image Segmentation”. MICCAI 2019

[2] Wang, et al. “Automatic Segmentation of Vestibular Schwannoma from T2-Weighted MRI by Deep Spatial Attention with Hardness-Weighted Loss”. MICCAI 2019

• Please rate the clarity and organization of this paper

Very Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Authors used a private dataset for their main set of experiments. However, the authors have answered “Not Applicable” to the questions related to the dataset. Will the authors release their CT dataset?

There are some missing implementation details (see section 7). This work is not reproducible in the current state.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

The authors claim that their approach allows for missing modalities. In practice, the proposed framework allows either for the full set of modalities or only one modality as input. In particular, the framework doesn’t handle an incomplete set of modalities with more than 1 modality as input. Consequently, the authors shouldn’t describe their method as able to “handle missing modalities”. This should additionally be discussed as a key limitation of the approach in the discussion section.

Analysis of the limitations of the existing techniques:

• Dorent, et al do not synthesize images to perform image segmentation with missing modalities. Image generation is a byproduct of their framework. Consequently, the comment “Synthesizing missing modalities is inclined to heavy computation” is not relevant.
• Hu, et al use an additional teacher model which indeed “brings up extra computation costs and limits the multi-modal representation”. However, the technique proposed by the authors is more computationally expensive. Indeed, at the training stage, Hu, et al use 2 networks (student and teacher), while the proposed framework employs N modality-specific networks with: 1/ N>=2; 2/ an extra attention module. Moreover, at inference time, only 1 network is employed by Hu, et al for both unimodal and multimodal scenarios, while the proposed framework uses either 1 network (unimodal) or N networks (multimodal).

Notations:

1/ Introduction:

• The acronyms MA and ML are not defined in the main manuscript (only in the abstract, which is not enough)

2/ Section 2.1:

• What is C,D,W,H? Isn’t it C=1?

3/ Section 2.2:

• Paragraph 3: which indicates the significance of the features in $F_{dual}$, denote as $F_{dual}$?
• Eq (1): Are the modality-aware modules the same for each modality, i.e. the parameters $\theta$ are the same for $i \in \{arterial, venous\}$
• Non defined $x$ in attention map $A_x$
• Eq (2): the index notation in the sum is incorrect → $\sum_{i \in \{arterial, venous\}}$

4/ Section 2.3:

• Eq (3): the index notation in the sum is incorrect → $\sum_{i \in \{arterial, venous\}}$
• I found the term “inter” confusing because the loss isn’t between. Proposition: “joint”?

5/ Section 4 (page 8): what is PAM?

Missing references: The approach is very similar to “AMC: Attention guided Multi-modal Correlation Learning for Image Search” (CVPR, 2017). Both approaches use attention modules to fuse unimodal feature maps and the learning strategy is comparable. The proposed attention module is a relatively standard spatial attention module (e.g., [1,2])

Experiments: Liver tumor: Are the improvements statistically significant? (e.g. with a Wilcoxon signed-rank test)
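For reference, the kind of paired test suggested here can be sketched in a few lines of plain Python. The function below is a minimal exact two-sided Wilcoxon signed-rank test for small samples; the function name is hypothetical and the per-case scores are made-up numbers, not results from the paper. For real analyses a library routine such as `scipy.stats.wilcoxon` would be the more usual choice.

```python
from itertools import product

def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples (small n).

    Zero differences are dropped and tied |differences| get average ranks.
    Returns (W_plus, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda k: abs(diffs[k]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    expected = sum(ranks) / 2.0
    observed = abs(w_plus - expected)
    # Enumerate all 2^n sign assignments, equally likely under the null hypothesis.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - expected) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n

# Hypothetical per-case Dice scores (in %) for two methods on the same 8 cases.
method_a = [82, 79, 91, 85, 88, 77, 90, 84]
method_b = [80, 75, 85, 86, 83, 70, 87, 76]
w_plus, p_value = wilcoxon_signed_rank(method_a, method_b)
```

The enumeration over sign assignments grows as 2^n, so this exact version is only practical for a few dozen cases at most; larger test sets would call for the normal approximation.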

Implementation details:

• Could the authors provide some details on the implementation of the nnUnet used as a modality-specific feature extractor?
• The proposed technique has twice as many parameters as a multi-modal nnUnet. Could the authors provide a fair comparison with a larger multi-modal nnUnet?
• Brain tumor: Could you please provide some details on the training procedure for each nnUnet? Was lambda set to 0.5?

Limitations of the proposed technique:

• The authors claim that their approach allows for missing modalities. However, the proposed framework allows either for the full set of modalities or only one modality as input. Consequently, the authors shouldn’t describe their method as “handling missing modalities” but as “handling a single modality”. This should additionally be discussed as a key limitation of the approach in the discussion section.

Others:

• Page 3: “mutli-modal” → “multi-modal”
• Section 2.3: MT → ML
• It would be nice to study other forms of correlation.

accept (8)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The proposed technique is novel for multimodal image segmentation. Experiments are extensive and demonstrate the benefits of the proposed technique compared to traditional techniques.

• What is the ranking of this paper in your review stack?

1

• Number of papers in your stack

5

• Reviewer confidence

Very confident

# Primary Meta-Review

• Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

This paper proposes a method to perform tumor segmentation with multi-modal images. The core of the method is to fuse the modality-specific features using a modality-aware attention module. This attention module has been previously proposed in other contexts. However, applications in the context of missing modalities are new and very interesting. Experiments performed for multi-phase liver and multi-parametric brain tumor segmentation demonstrate the excellent results of the proposed model in dealing with missing modalities. However, the authors need to take all reviews into account to improve the quality of the article, especially to make the methodological and experimental sections clearer.

• What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

1

# Author Feedback

We would like to express our sincere gratitude to all the ACs and the reviewers for their time and efforts spent on our paper. The comments are very valuable for us to improve the paper. We promise a careful revision including but not limited to the formulation, the details of methodology and experiments, the analysis of existing works and references, the discussion of limitations, and proofreading.

1. On the design of the modality-specific model: The high-level semantic embeddings are taken from the last convolution layer of the modality-specific model (i.e., nnUNet). This design has two advantages: (a) the embeddings from the deepest layer have the largest receptive field; (b) they have the same spatial dimensions as the input image, and spatial information is key for segmentation.

2. On the design of the modality-aware module: $\textbf{F}_i$ and $\textbf{F}_{dual}$ are concatenated to estimate the attention weights. Although $\textbf{F}_{dual}$ already includes $\textbf{F}_i$, this design is intended to measure the correlation both within $\textbf{F}_i$ and across $\textbf{F}_i$.

3. On the experiments: In Table 1, nnUNet takes a concatenation of both (registered) arterial and venous phase images as input, as in its original paper. We also apply OctopusNet in its original form, but change the backbone network to nnUNet for a fair comparison. “MS+MA” indeed means using only $L_{inter}$ as the loss function. We will make this clear in the camera-ready version.
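To make the ablation naming concrete, the sketch below combines per-modality losses ($L_{intra}$) with the loss on the fused prediction ($L_{inter}$). It is only an illustration under stated assumptions: the weighting form `lam * L_intra + (1 - lam) * L_inter`, the soft Dice loss, the function names, and the toy predictions are all hypothetical; the text above only states that “MS+MA” uses $L_{inter}$ alone.

```python
def soft_dice_loss(pred, target, eps=1e-6):
    """1 - Dice overlap between predicted probabilities and a binary mask."""
    overlap = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * overlap + eps) / (denom + eps)

def total_loss(modality_preds, fused_pred, target, lam=0.5, use_intra=True):
    """Combine per-modality (intra) and fused-output (inter) losses.

    use_intra=False mimics the "MS+MA" ablation, where only L_inter is used.
    The convex weighting with lam is an assumption, not taken from the paper.
    """
    l_inter = soft_dice_loss(fused_pred, target)
    if not use_intra:
        return l_inter
    l_intra = sum(soft_dice_loss(p, target) for p in modality_preds.values())
    return lam * l_intra + (1.0 - lam) * l_inter

# Made-up voxel probabilities for a 4-voxel toy mask.
target = [1, 0, 1, 0]
preds = {"arterial": [0.9, 0.1, 0.8, 0.2], "venous": [0.7, 0.3, 0.9, 0.1]}
fused = [0.9, 0.1, 0.9, 0.1]

full = total_loss(preds, fused, target, lam=0.5)
ablation = total_loss(preds, fused, target, use_intra=False)
```

Here `ablation` depends only on the fused prediction, while `full` also penalizes each modality-specific prediction, which is what lets a single branch be used on its own at inference time.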