Authors
Zelin Qiu, Yongsheng Pan, Jie Wei, Dijia Wu, Yong Xia, Dinggang Shen
Abstract
Liver cancer is the third leading cause of cancer death in the world, and hepatocellular carcinoma (HCC) is the most common type of primary liver cancer. In routine diagnosis, accurate prediction of HCC grades greatly aids subsequent treatment and improves the survival rate. Rather than predicting HCC grades directly from images, it is more clinically interpretable to first predict the symptoms and then derive the HCC grades from the Liver Imaging Reporting and Data System (LI-RADS). Accordingly, we propose a two-stage method for automatically predicting HCC grades from multiphasic magnetic resonance imaging (MRI). The first stage uses multi-instance learning (MIL) to classify the LI-RADS symptoms, while the second stage applies LI-RADS to grade from the predicted symptoms. Since our method provides diagnostic evidence in addition to the grading results, it is more interpretable and closer to the clinical process. Experimental results on a dataset of 439 patients indicate that our two-stage method is more accurate than direct HCC grading.
Link to paper
DOI: https://doi.org/10.1007/978-3-030-87240-3_42
SharedIt: https://rdcu.be/cyl6g
Link to the code repository
N/A
Link to the dataset(s)
N/A
Reviews
Review #1
- Please describe the contribution of the paper
The authors propose a two-stage method to automatically grade hepatocellular carcinoma from MRI. The first stage is concerned with extracting so-called “symptoms” from the MRI slices with clinically motivated aggregation of the features from the different MRI phases. In the second stage, the grade is inferred based on the symptoms. The authors demonstrate higher performance of the proposed method when compared to baseline methods without attention and guidance.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- description of standard in diagnosis (LI-RADS)
- describes prior work
- motivation with interpretability of two-stage method
- Figures are well-prepared and contribute to the understanding
- sensible motivation for the aggregation structure
- ablation study
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- Missing information about the dataset: The dataset description lacks very basic information. How was it obtained, how was the ground truth obtained, how were the segmentations performed?
- Missing information about the training/validation/test scheme. I suspect the authors used some form of cross-validation, since the performance metrics are listed with standard deviations.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
The experimental setup is clearly described. Adding information about the dataset and validation scheme would be absolutely crucial to increase the reproducibility.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
The authors clearly describe their proposed method and support the explanation with well-prepared figures. The motivation for the two-stage method, the aggregation and overall method is nicely set up and allows the reader to follow the reasoning. Prior work is described. The authors performed an ablation study to show the benefit of adding the proposed attention and guidance modules to the baseline method.
Providing a better dataset description is absolutely crucial, alongside how the ground truth was obtained. Furthermore, the paper is lacking necessary information about the validation scheme (was it cross-validation?) to build trust in the presented results. If the information about the dataset, ground truth, and validation scheme were available, I would recommend this contribution for acceptance.
- Please state your overall opinion of the paper
borderline reject (5)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I appreciate the clear motivation of the experimental setup and the ablation study performed by the authors. Major negative points for me are the lacking dataset description and that no information on the validation scheme is included. I wonder why no ethics approval was necessary as indicated on the reproducibility checklist.
- What is the ranking of this paper in your review stack?
4
- Number of papers in your stack
4
- Reviewer confidence
Confident but not absolutely certain
Review #2
- Please describe the contribution of the paper
The authors propose a deep learning-based framework for automatically predicting hepatocellular carcinoma (HCC) grades using multiphasic magnetic resonance imaging (MRI).
The proposed framework is used to extract features for the classification task. In the context of this work, the term “symptoms” describes a typical HCC lesion’s dynamic presentation, and it is subsequently used in the automatic grading system.
The proposed architecture is trained on a large-scale database, ImageNet, and subsequently tested on a medium-sized HCC MR imaging dataset.
The paper appears well written and of general interest.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The proposed architecture includes an aggregation module. This enables handling multiple dynamic images phases which is very useful for this application, but also has potential beyond liver imaging.
The proposed work outperforms previously published work.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
It appears that the proposed work is not able to identify healthy cases. It should be stated as a limitation of the proposed method. How are slices where no pathology is identified dealt with?
While this is a very complex task, the overall accuracy of the grading system (Table 2) is below what is expected in a clinical practice context.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
Based on the information provided by the authors upon submission the proposed work appears reproducible at this stage.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
The authors state in the Abstract that the proposed work is tested on 439 patients. However, on page 6, section 3.1 Dataset and Experimental Setup, the authors state that the method was tested on 493. Can the authors please clarify the difference?
Can the authors please provide details on the scanner manufacturer? Is the data from a single center, or multi-center? Is it acquired using a single scanner manufacturer or multiple scanner manufacturers?
Page 5 (section “Guided Attention Based Multi-instance Classifier”: “(or a paired of slices)” should be “(or a pair of slices)”
Page 6 loss function equation description: “corss-entropy loss” should be “cross-entropy”
I recommend the authors provide more details on the achieved accuracy score for HCC grading (Table 2) compared with the classification of each symptom (Table 3). I believe providing more context between Tables 2 and 3 would help guide the reader, which is especially important for direct application to clinical practice.
- Please state your overall opinion of the paper
accept (8)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I believe the proposed work is in line with the MICCAI areas of interest and has academic merit.
- What is the ranking of this paper in your review stack?
2
- Number of papers in your stack
4
- Reviewer confidence
Confident but not absolutely certain
Review #3
- Please describe the contribution of the paper
This paper compares a new method to other methods for analyzing dynamic contrast-enhanced (DCE) MRI of the liver for classification into one of three LI-RADS grades for HCC. The results for the new method show improvement over the other four methods discussed when trained and evaluated using a set of images from 493 subjects. The new method is a two-stage method, with a multi-instance learning first stage to detect “symptoms” and a second stage to provide the LI-RADS grade based on the symptoms.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Strengths include:
- Applying machine learning techniques to detect features normally observed by a radiologist.
- Improvement on prior work.
- Good sized data set for exploratory work.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
Weaknesses include:
- (major) lack of description of how data is divided into training and testing sets or even if that was done.
- (trivial) assumption that higher F1 scores mean improvement without any statistical testing or power analysis.
- Please rate the clarity and organization of this paper
Very Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
The hardest pieces to reproduce will be the data and the cropping algorithm, as the data is not released and the cropping algorithm details are missing (how much of the surrounding area outside the tumor should be included, how big is a cropped patch). Most if not all parameter values are noted.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
Please address weaknesses to improve paper.
- Please state your overall opinion of the paper
accept (8)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The methods describe a way to automate detection of the LI-RADS features used to classify HCC.
- What is the ranking of this paper in your review stack?
4
- Number of papers in your stack
5
- Reviewer confidence
Very confident
Primary Meta-Review
- Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.
Given three inconsistent reviews, you are accordingly invited to submit your rebuttal to address the major comments, especially to: 1) amend essential descriptions of the dataset and validation scheme, including how the data was collected (scanner manufacturer? single center or multi-center?), how the ground truth was obtained, how the segmentations were performed, and how the training/validation/test sets were divided and cross-validation performed; 2) explain whether the proposed work is unable to identify healthy cases and how slices in which no pathology is identified are dealt with; 3) explain whether the overall accuracy of the grading system (Table 2) meets clinical needs; 4) clarify how many patients were in this study, 439 or 493; 5) provide more details on the achieved accuracy score for HCC grading (Table 2) compared with the classification of each symptom (Table 3).
- What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).
4
Author Feedback
We appreciate the efforts of all reviewers and ACs.
Q1: Descriptions of dataset and validation scheme (R1, R2, R3) A1: The data was collected from 439 untreated patients with suspected HCC (‘493’ in Section 3.1 was a typo) in the same hospital, using either a 1.5 T (Magnetom Aera, Siemens Healthcare, Erlangen, Germany) or a 3.0 T (uMR 770, United Imaging Healthcare, Shanghai, China) MR scanner with Gd-EOB-DTPA as the contrast agent. Three radiologists with over 5 years of experience were invited to provide the manual ground truth. Specifically, they first made their own decisions individually, including framing out tumors, measuring tumor sizes, and labeling the symptoms, and then reached a consensus whenever they were inconsistent. For each multi-tumor case, only the largest tumor is considered. Since our goal is HCC grading, segmentation is an upstream task that provides bounding boxes for tumors using a 3D U-Net. As the bounding box of each tumor is enlarged (see Section 2.3) to keep information about the tumor boundary and surrounding liver parenchyma, a very precise segmentation is not necessary. In fact, in clinical practice it is easy for radiologists to draw a rough bounding box containing the tumor when measuring tumor size. We divided the dataset into 5 folds with a similar class-wise distribution for cross-validation. In each trial, there are 351 training data (4 folds) and 88 validation data (1 fold).
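A minimal sketch of the stratified 5-fold split described above, using scikit-learn. The variable names (`patient_ids`, `grades`) and random labels are illustrative placeholders, not the authors' actual code or data.

```python
# Stratified 5-fold cross-validation with a similar class-wise distribution
# per fold, as described in the rebuttal (assumed setup, not the authors' code).
import numpy as np
from sklearn.model_selection import StratifiedKFold

patient_ids = np.arange(439)                        # 439 untreated patients
grades = np.random.randint(0, 3, size=439)          # placeholder LR-3/4/5 labels

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(patient_ids, grades)):
    # Each trial: ~351 training patients (4 folds) and ~88 validation patients (1 fold).
    print(f"fold {fold}: train={len(train_idx)}, val={len(val_idx)}")
```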
Q2: How to deal with normal cases or slices (R2) A2: Normal and benign cases (i.e., LR-1 and LR-2) can be identified relatively easily without pathologic proof (see [6], [8]). Therefore, we focus our research on HCC cases (i.e., LR-3 to LR-5), which is more significant in clinical practice. Thus, the dataset for this study is specifically designed for classifying tumors within LR-3 to LR-5, as is our method. Nonetheless, considering all cases is a good suggestion for our future work. Our grading result is at the subject level, i.e., a diagnosis label is assigned to each subject, not to each slice. Since each subject has multiple slices, we formulate this grading task as a multi-instance learning problem, where each pair of features is treated as an instance and each subject is treated as a bag of instances. The model is supervised only by the bag-level labels (i.e., the grade of each subject). Every negative instance (i.e., the feature pairs from slices with no pathology) is taken into account during labeling, but the bag is labeled negative if and only if it contains no positive instance.
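A hypothetical sketch of the multi-instance formulation described above: each subject (bag) holds per-slice feature-pair instances and only the bag-level label supervises training. This is not the authors' code; the max-pooling aggregation shown here is just one standard MIL choice (the paper itself uses guided attention).

```python
# Bag-level MIL classification sketch (assumed feature dimension and pooling).
import torch
import torch.nn as nn

class MILBagClassifier(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.instance_scorer = nn.Linear(feat_dim, 1)    # one score per instance

    def forward(self, bag: torch.Tensor) -> torch.Tensor:
        # bag: (num_instances, feat_dim) -- feature pairs from one subject's slices
        scores = self.instance_scorer(bag).squeeze(-1)   # (num_instances,)
        # A bag is positive if any instance is positive; negative instances
        # (slices with no pathology) contribute but cannot make the bag positive.
        return scores.max()                              # bag-level logit

model = MILBagClassifier()
bag_features = torch.randn(20, 256)                      # 20 slice-pair instances
bag_logit = model(bag_features)
# Loss is computed only against the subject-level (bag) label.
loss = nn.functional.binary_cross_entropy_with_logits(bag_logit, torch.tensor(1.0))
```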
Q3: Does the accuracy meet clinical needs (R2) A3: HCC diagnosis requires specific expertise and is easily influenced by the radiologist’s experience. Its misdiagnosis rate in clinical practice is relatively high (N Mehta, et al. Clinical Transplantation 2017). Our method can predict the HCC grade from MR imaging with only an extra rough bounding box, requiring less expertise. In our experiment, only 12.5% of the cases have predicted grades lower than the ground truth, meaning that for over 87.5% of patients the true condition is either the same as or less serious than the predicted one, which, according to our clinical collaborator, is acceptable as a supplement in clinical practice. Besides, since deep learning-based models usually benefit from a large training set, our model has the potential to achieve higher performance as more data are collected.
Q4: Context between Table 2 and 3 (R2) A4: The grading result is directly derived from the symptom predictions according to the LI-RADS diagnostic table (see Table 1). LI-RADS uses ‘APHE’ as the first index and the count of the other major features (Capsule and Washout in this paper) as the second index. Given the predictions of these symptoms, we can obtain the corresponding HCC grade by looking up Table 1.
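A hypothetical sketch of the second-stage lookup described above: the predicted symptoms index into a LI-RADS-style table keyed by APHE and the count of the other major features (Capsule, Washout). The grade entries below are placeholders; the real mapping is the paper's Table 1 (LI-RADS), which also depends on tumor size and is not reproduced here.

```python
# Placeholder rule-based grading from predicted symptoms (not the real LI-RADS table).
def count_other_features(capsule: bool, washout: bool) -> int:
    return int(capsule) + int(washout)

# grade_table[aphe][count of other major features] -> grade (illustrative values only)
grade_table = {
    False: {0: "LR-3", 1: "LR-3", 2: "LR-4"},
    True:  {0: "LR-4", 1: "LR-5", 2: "LR-5"},
}

def grade_from_symptoms(aphe: bool, capsule: bool, washout: bool) -> str:
    return grade_table[aphe][count_other_features(capsule, washout)]

print(grade_from_symptoms(aphe=True, capsule=True, washout=False))  # e.g. "LR-5"
```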
Q5: Inadequate analysis of F1 score (R3) A5: The F1 score jointly considers precision and recall, and is thus more convincing than other single metrics.
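For context, the F1 score referred to above is the harmonic mean of precision and recall; this is standard and not specific to the paper.

```python
# F1 as the harmonic mean of precision and recall, computed from confusion counts.
def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(f1_score(tp=80, fp=10, fn=20))  # ~0.842
```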
Post-rebuttal Meta-Reviews
Meta-review # 1 (Primary)
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
Basically, the reviews are positive and consistent. The novelty is recognized, and the authors’ response explains some essential details especially on the dataset, validation scheme, and normal cases. In summary, I agree to accept this paper.
- After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.
Accept
- What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).
3
Meta-review #2
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
This paper proposes a deep learning-based framework for automatically predicting hepatocellular carcinoma (HCC) grades using multiphasic magnetic resonance imaging (MRI). Two reviewers give high scores. The major concerns are the unclear descriptions of the experimental setting and the evaluation criteria. In the rebuttal, the authors addressed these concerns, and I suggest the authors add these descriptions to the final version of the MICCAI paper.
- After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.
Accept
- What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).
2
Meta-review #3
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
Well-written paper on an interesting topic, proposing a new method that captures temporal enhancement information. I agree the validation scheme was very short on details. But with the information provided in the rebuttal regarding the dataset description and validation scheme, addressing the specific requests from R1, my recommendation is to accept, provided the details given in the rebuttal are indeed included in the revised paper.
- After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.
Accept
- What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).
3