
Authors

Yulu Guan, Hui Cui, Yiyue Xu, Qiangguo Jin, Tian Feng, Huawei Tu, Ping Xuan, Wanlong Li, Linlin Wang, Been-Lirn Duh

Abstract

Radiotherapy plays a vital role in treating patients with esophageal cancer (EC), whereas potential complications such as esophageal fistula (EF) can be devastating and even life-threatening. Therefore, predicting EF risk prior to radiotherapy for EC patients is crucial for their clinical treatment and quality of life. We propose a novel method that combines thoracic Computed Tomography (CT) scans and clinical tabular data to improve the prediction of EF risk in EC patients. The multimodal network includes encoders to extract salient features from images and clinical data, respectively. In addition, we devise a self-attention module, named VisText, to uncover the complex relationships and correlations among different features. The associated multimodal features are then aggregated with the clinical features to further enhance prediction accuracy. Experimental results indicate that our method classifies EF status for EC patients with an accuracy of 0.8426, F1 score of 0.7506, specificity of 0.9255, and AUC of 0.9147, outperforming other methods in comparison.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87240-3_69

SharedIt: https://rdcu.be/cyl6M

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    In “Predicting Esophageal Fistula Risks Using a Multimodal Self-Attention Network,” the authors propose a self-attention model for combining 3D CT volumes with tabular clinical data. A U-Net encoder is applied to the CT volumes, while clinical encodings are obtained through convolution and then expanded to a 3D array that can be concatenated with the imaging features prior to the self-attention module. They evaluate the ability of this model to predict esophageal fistula from CT and clinical data.
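
    A minimal sketch (not the authors' code) of the fusion step described above, with hypothetical tensor names and toy sizes: the clinical feature vector is tiled over the spatial grid of the 3D image features and concatenated channel-wise before self-attention.

        import torch

        B, C_V, D, H, W = 2, 64, 3, 3, 3       # assumed toy batch/feature/grid sizes
        C_C = 16                                # assumed clinical feature channels

        f_v = torch.randn(B, C_V, D, H, W)      # encoded CT features (image encoder output)
        f_c = torch.randn(B, C_C)               # encoded clinical features

        # Broadcast the clinical vector over every spatial position so it can be
        # concatenated channel-wise with the image features.
        f_c_tiled = f_c.view(B, C_C, 1, 1, 1).expand(B, C_C, D, H, W)
        f_fused = torch.cat([f_v, f_c_tiled], dim=1)   # (B, C_V + C_C, D, H, W)
        print(f_fused.shape)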

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The application of a self-attention model to fuse imaging and clinical data for outcome prediction is exciting.
    • The problem addressed (EF prediction) is of interest and has high clinical need.
    • Ablation study helps demonstrate the contribution of various model components, and the authors compare against a number of baselines.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Some key information on the imaging data and clinical cohort is missing. For instance, when were scans acquired? Prior to treatment, or had patients already received treatment? Had EF already developed? The authors state that their approach is predictive, but this is hard to assess without knowing when the data were collected.
    • Similarly, very little detail is given on how selection was done for negative and positive cases. For instance, did the negative patients all receive the same types of treatment that led to EF in the positive group, or were they chosen from the general EC population? Given that the clinical features included treatment information, such selection biases could sway the model.
    • If the objective is to predict EF risk pre-treatment, why does the model include precise information on the chemoradiotherapy regimen? Could this actually be leveraged before the start of treatment? The clinical model should only include information that would be available at the time a prediction needs to be made.
    • The ablation study is very nice, but it seems to be missing a very important baseline: what is the performance of a clinical-only deep model? Given the weak performance of imaging alone, it could be that non-linear operations on the clinical data are doing the heavy lifting.
    • The input volumes are VERY small (48x48x27) and the images are downsampled considerably (from what I can tell, the final resolution seems to be nearly 5 mm^3 per voxel). I am skeptical about what predictive information there could be at this scale.
    • The choice of metrics seems somewhat conspicuous. The authors report high specificity and a significantly reduced F1 score, which implies the model has rather low sensitivity. Sensitivity should be included in the results so readers do not have to infer it on their own.
    • It is not clear how the approach differs materially from [17], a cross-modal self-attention model incorporating image and text data that the authors use as a comparison model. The authors point out what seems to be a small difference related to the dimensions of the key-query-value triad, but it is not clear whether this is the extent of the changes or what the impact of this change would be. Currently, it is not clear why the authors' model performs better than [17]. The authors should better explain how their model improves upon other self-attention models such as [17] to allow the reader to better understand their contribution.
    • The performance of the authors’ model seems to be below previous imaging-only models (Cui et al., MICCAI 2020 reported an accuracy >85% with a multi-view CNN in a similarly sized dataset), which seems to undercut the demonstrated benefit of integrating clinical and imaging data.
    • Similarly, given those results, I am not sure the unimodal imaging comparison model selected from [25] is representative of the potential performance of an imaging-only model.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility seems relatively strong, except the description of the dataset is not currently sufficiently detailed.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • It would be helpful if Table 3 included what modalities were used by each method
    • “Results of both 250 and epochs are given in Section 4.3.” — this sentence appears to be missing a number.
    • I was confused by the reference to the authors' model as “PM”; it is not clear where the acronym comes from.
    • Were metrics assessed at the patient or sample level? How were cutoffs for accuracy/specificity/F1 determined?
    • It would be helpful to have a better sense of what the timeline for EF looks like. Do patients develop EF during treatment, or later on? How much time was between the scans and the development of EF?
    • The name VisText seems like a bit of a misnomer, since the model does not use any text data.
  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    There were some notable weaknesses that reduced enthusiasm for the study, including a lack of sufficient detail in the dataset description, the choice of variables in the clinical model, and the level of performance relative to previous studies that did not include clinical features. Nonetheless, I believe the novelty of a self-attention model for clinical and imaging fusion makes this paper worthy of acceptance.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain



Review #2

  • Please describe the contribution of the paper

    The paper presents a new architecture for multi-modality classification that utilizes self-attention for integrating image and non-image features. This architecture is tested and compared to other methods in a case study of predicting esophageal fistula (EF) from 3D CT scans and clinical data from questionnaires.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Novelty: to the best of my knowledge, there are no similar works that apply a self-attention mechanism for multi-modality classification in imaging. The authors do compare their method to the method in [17], designed for segmenting natural images given a reference text, which also uses self-attention. However, it is not explained how the method in [17] was adapted for the current problem, or how it differs from the proposed method. That said, the adaptation of [17] to this problem could perhaps be considered a novelty as well.
    2. Strong empirical results: the authors compare their method to several alternatives, and their method outperforms all of them. However, there are some unclear issues in the experiments and the results, which I elaborate on below.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. There are several unclear issues, as well as comments and questions, regarding the method and the experiments. These will be detailed below.
    2. The authors claim that their method extracts “visual-clinical multimodal correlations which further improved EF prediction”. Therefore, I would expect them to provide some demonstration of detected correlations. This would support their claim, provide insight into how the method works, and potentially even lead to interesting discoveries. As an example, ref [17] provided a nice demonstration of detected correlations between word pairs, between image-region pairs, and between (word, image region) pairs.
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    While the key idea of using self-attention for integrating different modalities is clear (and interesting), it will be hard to implement the specific proposed method as its description is lacking, with several parts being unclear. I describe the unclear issues in my detailed comments to the authors.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Major comments

    1. The exact details of the method are unclear to me. If I understood correctly, then (1) the non-image features form a 1D vector (of size C_M+C_O), where each feature is on a different channel, and (2) the image features form a 4D tensor with C_V channels. The non-image features are extended to a 4D tensor of the same shape as the image features by duplicating them across the depth x width x height axes, so each (depth, width, height) position has different image features but the same non-image features. Is the self-attention applied independently for each (x=width, y=height, z=depth) spatial position? I.e., does it not test correlations between image/visual features at different spatial positions? (See the generic sketch after this list of major comments.)
    2. Equation 1 seems erroneous (or is unclear): if I understood the method, the self-attention runs independently per (x,y,z) spatial position over the channels, and hence the number of elements is C_M+C_O+C_V. On the other hand, Equation 1 presents an unclear summation over C_M x C_O x C_V elements, as if summing over all triplets of (numeric, categorical, visual) features.
    3. The following sentence is unclear: “The following step converts \hat{f}_{VOM} to the same shape as f_{VOM} via a linear layer with ReLU activation function”. What is the shape of \hat{f}_{VOM}? What is the shape of the linear layer (i.e. input/output), and how is it applied to \hat{f}_{VOM}?
    4. Section 4.2: Please provide more details on the difference between the self-attention module from [17] and your method.
    5. “We ran each experiment three times and reported the average and standard deviation for both comparative and ablative studies.” What was the difference between the three runs? Only different initializations? Or also different train-test partitions? Also, to get a more reliable measure of the standard deviation I would suggest running the experiment at least 5 times and using different train/test partitions (e.g., with cross-validation).
    6. Ablation study: did you test running with only multi-modal features (i.e., without the final concatenation with the non-image features)?
    7. Tables 2, 3: were accuracy, F1 score and specificity computed w.r.t. a threshold of 0.5? Note that in this case non-calibrated models will yield worse results. Therefore, I would suggest using the AUC as the primary measure.
    8. Table 3 would be more readable if it specified for each model whether it used non-image data and/or image data.
    9. When focusing on the AUC measure, the second-best performing method is a simple logistic regression with clinical features ([6]). This is very surprising given that the other compared models used both image and non-image data. Also, I would expect a comparison with a stronger non-image model, such as XGBoost.
    10. There should be a discussion with possible explanations for the inferiority of the compared methods. In particular, please provide some explanation for the poor AUC of Chauhan et al. 2020 [20], which uses both image and non-image data, where the latter are processed with a self-attention model (BERT).
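
    A minimal, generic sketch (not the authors' implementation) of scaled dot-product self-attention over a fused multimodal feature map, with hypothetical names and toy sizes; this version flattens spatial positions into tokens, and whether the VisText module instead attends only within each position is exactly what major comments 1-2 ask to be clarified.

        import torch
        import torch.nn.functional as F

        B, C, D, H, W = 2, 80, 3, 3, 3           # C stands in for C_V + C_M + C_O (assumed)
        f_vom = torch.randn(B, C, D, H, W)        # fused visual + non-image feature map

        N = D * H * W                             # one token per spatial position
        x = f_vom.flatten(2).transpose(1, 2)      # (B, N, C)

        d_k = 32                                  # assumed key/query dimension
        W_q, W_k = torch.nn.Linear(C, d_k), torch.nn.Linear(C, d_k)
        W_v = torch.nn.Linear(C, C)

        q, k, v = W_q(x), W_k(x), W_v(x)
        attn = F.softmax(q @ k.transpose(1, 2) / d_k ** 0.5, dim=-1)    # (B, N, N)
        out = (attn @ v).transpose(1, 2).reshape(B, C, D, H, W)          # restore map shape
        print(out.shape)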

    Minor comments/questions:

    1. Dataset section: the following sentence is unclear to me: “Negative cases were then matched with positive cases to the ratio of 2:1 by the diagnosis time of EC, marriage, gender, and race”. Is this a preprocessing step where cases are excluded to ensure a 1:2 match ratio for all cases, or a validity test? Why is it important to verify such matching? I.e., would the results change if such matching did not exist?
    2. Table 1: if I understood it correctly, each image feature (output of the image encoder) maps to a large “3D patch” representing 1/9 of the entire 3D space? Did you test finer (e.g., 3x3x3, 1x5x5) or coarser (e.g., 1x1x1) resolutions?
  • Please state your overall opinion of the paper

    borderline reject (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, I found the approach of multi-modality learning with self-attention novel and very interesting, in particular since it has the potential to provide insight into, or explanations of, correlations in the data. However, there are many missing details and unclear issues in the description of the method and the experiments, which raise some concerns about the motivation for the method and its superiority over existing multi-modality methods.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    3

  • Reviewer confidence

    Confident but not absolutely certain



Review #3

  • Please describe the contribution of the paper

    This paper presents a neural network method for predicting esophageal fistula (EF) prior to radiotherapy using thoracic CT and 34 clinical variables. The network includes encoders for image and clinical feature extraction and a self-attention module. Image and clinical features are aggregated. The method was tested on internal data from 553 patients (186 with EF, 367 without EF), with decent accuracy.

    The method was compared to five methods: gated attention, logistic regression using only clinical data, multimodal fusion, multimodal representation learning with a joint loss, and cross-modal attention. An ablation study was also performed. The results demonstrated both the good performance and the value of the elements of the technique.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The multimodal network architecture is novel. The clinical application and data are interesting. The evaluation is nicely done, with good performance, superiority over comparison methods, and an ablation analysis.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    No example images are shown to give the reader an idea of EF appearance in CT. No examples of successes or failures are shown.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The method appears to be reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Images are needed in order to give the reader an idea of the problem and its difficulty. A brief description of the clinical data indicating at least the general categories (e.g., clinical history, treatment history, etc.) in the body of the paper would be helpful.

  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Well executed with novelty and good results on an interesting problem.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Strengths: The application of a self-attention model to fuse imaging and clinical data for outcome prediction is exciting. There have been no similar works.

    • The problem addressed (EF prediction) is of interest and has high clinical need.
    • Ablation study helps demonstrate the contribution of various model components, and the authors compare against a number of baselines. The authors do compare their method to the method in [17], designed for segmenting natural images given a reference text, which also uses self-attention.
    • Strong empirical results: the evaluation is nicely done, with good performance, superiority over comparison methods, and an ablation analysis.

    Weakness:

    • No example images are shown to give the reader an idea of EF appearance in CT. No examples of successes or failures are shown.
    • The authors claim that their method extracts “visual-clinical multimodal correlations which further improved EF prediction”. Therefore, I would expect them to provide some demonstration of detected correlations, which are missing.
    • Some key information on the imaging data and clinical cohort is missing.
    • Similarly, very little detail is given on how the selection was done for negative and positive cases.
    • If the objective is to predict EF risk pre-treatment, why does the model include precise information on chemoradiotherapy regimen?
    • The ablation study is very nice, but it seems to be missing very important details.
    • The input volumes are VERY small (48x48x27) and the images are downsampled considerably. I am skeptical about what predictive information there could be at this scale.
    • It is not clear how the approach differs materially from [17], a cross-modal self-attention model incorporating image and text data that the authors use as a comparison model.
    • The performance of the authors’ model seems to be below previous imaging-only models (Cui et al., MICCAI 2020).
  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    10




Author Feedback

We sincerely thank the Area Chair and R2–R4 for providing constructive and insightful comments. Please kindly refer to our responses below to the reviewers’ major concerns.

R2: More information on imaging and clinical data (i.e., selection criteria, EF timeline); the chemoradiotherapy (CRT) regimen may not be available at the time of prediction.
• We confirm that all data, excluding those on treatment, were collected before treatment.
• Since positive cases were rare compared to negative ones, we started by collecting positive cases and then matched each of them with two negative cases, a design commonly adopted in existing clinical research. As described in Section 2, the matching criteria include diagnosis time of EC, marriage, gender, and race, but not treatment type. All treatments administered to positive and negative cases followed the National Comprehensive Cancer Network (NCCN) guideline.
• The CRT regimen is available in this retrospective study. Integrating CRT information into a prediction model can assist clinicians in devising treatment plans and estimating the probability of developing a fistula.
• The time from scan to development of EF ranged from 0 to 1401 days.
• The above information and CT examples (requested by R4) will be added to Section 2.

R2: The input volumes are small.
• We confirm that the cropped volumetric patches of 200 × 200 × 130 mm³ (see Section 2) are adequately large to encapsulate the region of interest, since the CT images consist of 512 × 512 pixels with a voxel size of 0.8027 × 0.8027 × 5 mm³, representing 410 × 410 mm². When training the model, we rescaled the 3D patches to 48 × 48 × 27 due to limited GPU memory. We appreciate the comment on different rescaling sizes and will include them in our future study.
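
A quick arithmetic check behind these numbers (not from the paper): cropping 200 × 200 × 130 mm patches and rescaling them to a 48 × 48 × 27 grid gives an effective spacing of roughly 4-5 mm per voxel side, consistent with the estimate in Review #1.

    crop_mm = (200.0, 200.0, 130.0)   # cropped patch size in millimetres
    grid = (48, 48, 27)               # rescaled voxel grid used for training
    spacing = [mm / n for mm, n in zip(crop_mm, grid)]
    print(spacing)                    # ~[4.17, 4.17, 4.81] mm per voxel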

R2 and R3: Recognition of the ablation study, with suggestions for further experiments: a clinical-only XGBoost model and multimodal-features-only results.
• The accuracy, specificity, and AUC of the multimodal features alone (VisText self-attention module) are 0.725, 0.890, and 0.785, while those of the clinical-data-only XGBoost are 0.773, 0.872, and 0.830.
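
A minimal sketch of a clinical-only XGBoost baseline of the kind reported above, using synthetic stand-in data since the real 34 clinical variables and EF labels are not public; all names and hyperparameters here are assumptions, not the authors' configuration.

    import numpy as np
    from xgboost import XGBClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score, accuracy_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(553, 34))        # 553 patients, 34 clinical features (synthetic)
    y = rng.integers(0, 2, size=553)      # EF label (synthetic)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
    clf = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1, eval_metric="logloss")
    clf.fit(X_tr, y_tr)

    prob = clf.predict_proba(X_te)[:, 1]
    print("AUC:", roc_auc_score(y_te, prob), "ACC:", accuracy_score(y_te, prob > 0.5))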

R2 and R3: The difference between our model and [17].
• Our model is a binary classification model, whereas [17] was originally proposed for semantic segmentation. We adopt the self-attention module of [17] to solve a new multi-modality information integration task.
• In our model, the imaging, numerical, and categorical clinical data each have their own encoder and are sent to self-attention only once, whereas [17] projected the reference text of natural images into a compact word embedding via a lookup table before concatenation with three levels of visual features. The output of our self-attention module is sent to our aggregation module, while [17] fed three-level self-attention outputs to a gated multi-level fusion module.

R2: The result seems to be below a previous imaging-only model (Cui et al., MICCAI 2020), which achieved an accuracy of 0.857 by integrating multi-view information (CT, segmented tumor, and anatomical surroundings).
• Compared to Cui et al. (MICCAI 2020), our model is more general and does not require segmented tumors as input. Moreover, when using CT alone without multiple views, our model (0.8366) achieves higher accuracy than that paper (best accuracy of 0.802), demonstrating the benefit of integrating clinical and imaging data. In addition, our model was trained from scratch.

R3: Further explanation of the implementation and results of [20].
• We did not use the same image and text encoders as [20] because our input data are 3D images and non-text clinical data, different from the 2D images and free text in [20]. As noted in Paragraph 1 of Section 4.2, we only applied the ranking-dot loss method of [20] in our classification model. We will further clarify this in Section 4.2.

R3: Visualizing correlations.
• The visualization can be achieved by interpreting the attention scores from Eq. (1). We will include the visualization in Section 4.
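
One hedged way such a visualization could look (hypothetical names and toy sizes, not the authors' code): average the attention each spatial position receives and display it as an axial heatmap over the feature grid.

    import numpy as np
    import matplotlib.pyplot as plt

    D, H, W = 3, 3, 3                         # toy feature-grid size
    N = D * H * W
    attn = np.random.rand(N, N)               # stand-in for the softmax scores of Eq. (1)
    attn /= attn.sum(axis=-1, keepdims=True)  # rows sum to one, as after softmax

    received = attn.mean(axis=0).reshape(D, H, W)   # mean attention received per position
    plt.imshow(received[D // 2], cmap="hot")        # middle axial slice
    plt.colorbar(label="mean attention")
    plt.savefig("attention_slice.png")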




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors addressed R3’s concern about missing details. Some details still need to be included in the revised paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    8



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper presents a radiomic method for predicting esophageal fistula (EF) using thoracic CT and clinical variables. The application of a self-attention model to merge imaging and clinical data for outcome prediction is interesting and somewhat novel. The authors answer most of the questions in the rebuttal. However, the authors did not give an adequate answer to an important comment: “The authors claim that their method extracts visual-clinical multimodal correlations which further improved EF prediction. Therefore, I would expect them to provide some demonstration of detected correlations, which are missing. Why?” Overall it is a good work. My recommendation is “accept”.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    7



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The idea of applying self-attention in a multi-modal model seems to be novel and interesting. In the rebuttal, the authors did not provide an adequate response to concerns such as the input volume size and comparison to [17, 20]. The authors’ response to the clinical-only ablation study is somewhat confusing. It is not clear what the difference is between multi-modal features with the VisText Self-attention module and the result presented in the paper. As presented in the rebuttal, the XGBoost with clinical only data seems to have a similar or even better performance than the multi-modal model with the VisText Self-attention module. Clinical only XGBoost also has better performance than all baseline models, making the choice of baselines less favorable. Overall, I think the paper needs to be further improved before publication.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    12


