
Authors

Hao-Hsiang Yang, Fu-En Wang, Cheng Sun, Kuan-Chih Huang, Hung-Wei Chen, Yi Chen, Hung-Chih Chen, Chun-Yu Liao, Shih-Hsuan Kao, Yu-Chiang Frank Wang, Chou-Chin Lan

Abstract

Pulmonary nodule detection from lung computed tomography (CT) scans has been an active clinical research direction, benefiting the early diagnosis of lung cancer-related diseases. However, state-of-the-art deep learning models require instance-level annotation for the training data (i.e., a bounding box for each nodule), which is costly to obtain and might not always be available. On the other hand, during clinical diagnosis of lung nodules, radiologists provide electronic medical records (EMR), which contain information such as the malignancy, number, and texture of the detected nodules, and the slice indices at which the nodules are located. Thus, the goal of this work is to utilize EMR information for learning pulmonary nodule detection models, without observing any nodule annotation during the training stage. To realize the above weakly supervised learning strategy, we extend multiple instance learning (MIL) and specifically incorporate the presence and number of nodules in each CT scan, as well as the associated slice information, into our proposed deep learning framework. In our experiments, we present proper evaluation metrics for assessing and comparing the effectiveness of state-of-the-art models on multiple datasets, which verify the practicality of our proposed model.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87234-2_24

SharedIt: https://rdcu.be/cyl8k

Link to the code repository

N/A

Link to the dataset(s)

N/A

Reviews

Review #1

Please describe the contribution of the paper

The authors present a novel learning framework for weakly supervised pulmonary nodule detection, which leverages auxiliary weak labels (the number and slice indices of the observed nodules) from electronic medical records. The weak labels are integrated into a deep MIL-based framework with the associated objectives, while no ground-truth instance-level nodule information is utilized during training. A pseudo scoring assignment is proposed to supervise these weak labels. Both public and private datasets are used for evaluation in the experiments, with a number of evaluation metrics (in both supervised and weakly supervised settings). The effectiveness of the proposed framework design has been successfully confirmed, and the model is shown to perform favorably against state-of-the-art WSOD methods.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

A novel framework for weakly supervised pulmonary nodule detection is proposed, which leverages auxiliary weak labels. Both public and private datasets are used in the experiments, and the results confirm that the proposed model for pulmonary nodule detection outperforms the state-of-the-art WSOD methods.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

1) The auxiliary features, including the number of nodules and the associated slice indices, were used. However, the slice indices did not significantly improve the performance. 2) Weakly supervised object detection is an interesting research topic; however, there are sufficient publicly available datasets with good annotations for pulmonary nodule detection. Supervised deep learning methods trained on these data have achieved outstanding performance. Thus, applying WSOD to pulmonary nodule detection is trivial.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors didn’t provide code. In the experiments, both public and private data are used. The reproducibility should be OK.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

Extending weakly supervised object detection to other related topics that lack well-annotated data would enhance the study.
Please state your overall opinion of the paper

accept (8)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

A novel framework for weakly supervised pulmonary nodule detection is proposed, and both public and private datasets are used in the experiments, demonstrating the state-of-the-art performance over other WSOD methods.
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

3
Reviewer confidence

Confident but not absolutely certain

Review #2

Please describe the contribution of the paper

This paper describes a method to detect nodules in 3D medical images without instance-level annotation. Instead, electronic medical records (EMR) are used, which contain numbers of nodules and slice-level annotations of the nodules, to train a detector in a weakly-supervised manner.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

Existing EMR, which only contain slice-level annotations, can also be used to train the proposed model. This increases the amount of annotated data significantly.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Some steps of the proposed method may still need instance-level annotation.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Datasets are clearly described. For method and evaluation, some details are missing.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
- In Sec. 3.1: a pre-trained 3D-FPN is used to extract features and bounding boxes of the nodules. The performance of the proposed method depends decisively on the 3D-FPN. However, this 3D-FPN is trained in a supervised way, so instance-level annotation is still necessary.
- In Sec. 3.2: The description of AU-HIROC is not clear.
- In the References section: some references are incomplete, for example, [2, 26, 27].
Please state your overall opinion of the paper

Probably accept (7)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The proposed method does not only use information from EMR, as the title “weakly supervised” suggests. Instead, instance-level annotation is still necessary to perform a fully-supervised training for the 3D-FPN, which is a very important component of the proposed method.
What is the ranking of this paper in your review stack?

3
Number of papers in your stack

5
Reviewer confidence

Very confident

Review #3

Please describe the contribution of the paper

This paper studies how to leverage coarsely labeled EMR data for the pulmonary nodule detection task when finely labeled (instance-level) data are limited. It proposes a weakly supervised deep learning framework that can efficiently exploit the information in the binary label and the “slice” of interest. Extensive experiments demonstrate that it can outperform existing SOTA approaches.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. introducing weakly supervised learning in nodule detection task does make much sense as labeling “objects” in medical images at instance level is expensive and sometimes not viable
2. leveraging slice number as the weak supervision delivers much value because such information is not hard to collect
3. experiment and evaluation metrics are well designed, covering both the fully and weakly supervised settings
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. in the full supervision setting (Table 1), the private dataset is too small (only 60), making it less persuasive. Why not evaluate with Tianchi, as it is public and contains more images? Even though the weakly supervised training is on the private dataset, which makes evaluating on the same dataset straightforward, evaluating on a public dataset would give a sense of how good the model is when the domain is transferred. More importantly, it would make it easy for follow-up work to make a comparison.
2. how well does the pre-trained model perform? Was it well trained? If the pre-trained model is poorly trained, a large performance gap may not be meaningful. Reporting performance on the training dataset should be enough.
3. in Table 1, slice index information does not provide much help in AU-HIROC evaluation. This is a little surprising, as this metric is specifically designed for it.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Results should be partially reproducible, as a large weakly labelled private dataset is used to train the model.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
1. Please address my concerns listed in weakness section
2. in general, this paper is well organized and not hard to follow. However, I had a hard time with the score definition (on page 5), e.g., P(h_i; 1, k) vs. P(h_i; p_min, p_max). I would suggest the authors revise that part and make it clearer.
Please state your overall opinion of the paper

borderline accept (6)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Some results are on a very small dataset, making it less convincing.
What is the ranking of this paper in your review stack?

4
Number of papers in your stack

1
Reviewer confidence

Very confident

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

All reviews are consistently positive and the averaged score is above the cutoff threshold. The approach of using or converting EMR for weak supervision of object detection is interesting to some degree. However, the nodule detection examples shown in the paper can be somewhat misleading, as the nodules detected in the examples are very tiny and probably not clinically significant.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

2

Author Feedback

We thank all the reviewers for their remarks and suggestions, which help us strengthen our work. We now clarify the issues raised by the reviewers as follows:

[R1 & R4] Use of slice indices does not seem to help a lot. We thank the reviewers for pointing this out. In fact, as listed in Table 1, we consider both AU-(HI)ROC and (W)CPM as the evaluation metrics. Although the AUROC improvements using slice indices are not as significant as those on CPM (e.g., 0.664 vs. 0.66 on AU-HIROC, 0.771 vs. 0.758 on WCPM), we need to point out that CPM denotes the detection accuracy at the nodule instance level, while AUROC only reflects the accuracy at the CT slice level (i.e., no bounding box info). Thus, the improvements on CPM/WCPM are not only more significant but also more practical.
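For readers unfamiliar with the instance-level metric, the following is a minimal illustrative sketch, not taken from the paper or the rebuttal. It assumes that CPM follows the LUNA16 definition (mean sensitivity at 1/8, 1/4, 1/2, 1, 2, 4 and 8 false positives per scan); the weighted variant WCPM is defined in the paper and is not reproduced here. The function name `cpm`, its arguments, and the simplification that every true-positive detection hits a distinct nodule are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Assumed operating points: the seven FP-per-scan rates used by LUNA16.
LUNA16_FP_RATES = (0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0)


def cpm(
    detections: List[Tuple[float, bool]],
    num_nodules: int,
    num_scans: int,
    fp_rates: Sequence[float] = LUNA16_FP_RATES,
) -> float:
    """Mean sensitivity over the given false-positive-per-scan rates.

    detections: (score, is_true_positive) pairs pooled over all scans.
    Simplifications (assumptions): each true positive matches a distinct
    nodule, and tied scores are not handled specially.
    """
    if num_nodules <= 0 or num_scans <= 0:
        raise ValueError("num_nodules and num_scans must be positive")
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    sensitivities = []
    for rate in fp_rates:
        fp_budget = rate * num_scans  # total false positives allowed
        tp = fp = 0
        for _, is_tp in ranked:
            if is_tp:
                tp += 1
            elif fp + 1 > fp_budget:
                break  # admitting this detection would exceed the budget
            else:
                fp += 1
        sensitivities.append(tp / num_nodules)
    return sum(sensitivities) / len(sensitivities)


# Hypothetical toy data: 8 scans, 4 nodules, 5 pooled detections.
toy = [(0.9, True), (0.8, False), (0.7, True), (0.6, False), (0.5, True)]
print(round(cpm(toy, num_nodules=4, num_scans=8), 4))
```

Because every detection must be matched against a nodule before it counts, a metric of this kind rewards correct localization, which a scan-level or slice-level AUROC does not measure.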

[R1] Public datasets are sufficiently large and thus WSOD is trivial. We only partially agree with this remark. Although a number of large-scale public datasets are available, models trained on such datasets typically do not generalize well to target datasets of interest due to the domain gap (i.e., dataset bias). Since annotating target datasets would be expensive in practice, WSOD would be of sufficient interest in practical scenarios.

[R2] Some steps of the proposed method still need instance-level annotation. We apply LUNA with instance-level annotation to pretrain our model. When adapting our model to Tianchi or our private datasets, only EMR information is observed. This is why our model is trained in a WSOD fashion. This is consistent with the learning strategy of most image classification works, which pretrain their CNNs on ImageNet before moving on to downstream learning tasks.

[R2] Description of AU-HIROC needs to be improved. Some references are not complete, e.g., [2, 26, 27]. The standard AUROC considers the presence of nodules in a 3D CT scan. As for our AU-HIROC, it calculates the area under the ROC with hybrid information, which additionally takes slice index and number of nodules into consideration (see our eqn (3) and the associated remarks). We also thank the reviewer for pointing out the references to be edited/completed.
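As a point of reference, the standard scan-level AUROC mentioned above can be sketched as follows. This is an illustration only: it does not implement AU-HIROC, whose hybrid use of slice indices and nodule counts is given by eqn (3) of the paper and is not reproduced here. Aggregating proposal scores into a scan score by taking the maximum is an assumption in the spirit of MIL, and the names `scan_score` and `auroc` are hypothetical.

```python
from typing import List, Sequence


def scan_score(proposal_scores: Sequence[float]) -> float:
    """Scan-level nodule-presence score: the highest proposal score.

    Max aggregation is an assumption for this sketch; a scan with no
    proposals gets a score of 0.0.
    """
    return max(proposal_scores, default=0.0)


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a positive scan outscores a negative one.

    Ties between a positive and a negative scan count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative scan")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: per-scan proposal scores and presence labels.
proposals: List[List[float]] = [[0.9, 0.2], [0.7], [0.6, 0.4], [0.1]]
labels = [1, 0, 1, 0]
print(auroc(labels, [scan_score(p) for p in proposals]))
```

Only the binary presence label of each scan enters this computation, which is why a hybrid metric is needed if slice indices and nodule counts are to influence the evaluation.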

[R4] The private dataset is too small (only 60), making it less persuasive. Why not evaluate with Tianchi as it is public and contains more images? Actually, Table 1 (on the private dataset) is for ablation study purposes. We did present experiments on the Tianchi dataset and compare to SOTAs in Table 2.

[R4] How well does the pre-trained model perform? Was it well trained? In Table 1, we listed the performance of the pre-trained model (on LUNA with 0.89 CPM), which is comparable to that reported in the recent MICCAI’18 work of Khosravan & Bagci and the IEEE Access ’20 work of Gong et al. For fair comparison purposes, all WSOD models in our experiments are finetuned from this pre-trained model.

[R4] What’s the difference between the score definitions P(h; 1, k) and P(h; p_min, p_max) on page 5?

In Equation (2), P is used to include the detection proposals of different rankings, and h is the detection score for the corresponding proposal. More precisely, P(h; 1, k) denotes the proposal subset with top-1 to top-k detection scores. Alternatively, P(h; p_min, p_max) indicates the subset with proposal scores ranking from p_min to p_max. We will revise the description and make it clearer in future versions.
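The ranking-based subsets described above can be illustrated with a short sketch. It is not the authors' implementation: the function name `proposal_subset`, the 1-indexed ranks, and the inclusive upper bound p_max are assumptions made for illustration.

```python
from typing import List, Sequence


def proposal_subset(scores: Sequence[float], p_min: int, p_max: int) -> List[int]:
    """Indices of proposals whose score rank lies in [p_min, p_max].

    Rank 1 is the highest detection score; both bounds are inclusive
    (an assumption of this sketch). Tied scores keep their original order.
    """
    if p_min < 1 or p_max < p_min:
        raise ValueError("require 1 <= p_min <= p_max")
    # Proposal indices sorted by descending detection score h.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[p_min - 1:p_max]


# Hypothetical detection scores h for four proposals in one scan.
h = [0.2, 0.9, 0.5, 0.7]
top_k = proposal_subset(h, 1, 2)    # P(h; 1, k) with k = 2
middle = proposal_subset(h, 3, 4)   # P(h; p_min, p_max) with p_min = 3, p_max = 4
print(top_k, middle)
```

Under these assumptions, P(h; 1, k) is simply the special case of P(h; p_min, p_max) with p_min = 1 and p_max = k.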


Leveraging Auxiliary Information from EMR for Weakly Supervised Pulmonary Nodule Detection