
Authors

Luyang Luo, Hao Chen, Yanning Zhou, Huangjing Lin, Pheng-Ann Heng

Abstract

Chest X-ray (CXR) is the most common diagnostic X-ray examination for screening various thoracic diseases. Automatically localizing lesions from CXR is promising for alleviating radiologists’ reading burden. However, CXR datasets often come with abundant image-level annotations but scarce lesion-level annotations, and, more often still, no annotations at all. Thus far, unifying different supervision granularities to develop thoracic disease detection algorithms has not been comprehensively addressed. In this paper, we present OXnet, to the best of our knowledge the first deep omni-supervised thoracic disease detection network, which uses as much available supervision as possible for CXR diagnosis. We first introduce supervised learning via a one-stage detection model. Then, we inject a global classification head into the detection model and propose dual attention alignment to guide the global gradient to the local detection branch, which enables learning lesion detection from image-level annotations. We also impose intra-class compactness and inter-class separability with global prototype alignment to further enhance the global information learning. Moreover, we leverage a soft focal loss to distill the soft pseudo-labels of unlabeled data generated by a teacher model. Extensive experiments on a large-scale chest X-ray dataset show that the proposed OXnet outperforms competitive methods by significant margins. Further, we investigate omni-supervision under various annotation granularities and corroborate that OXnet is a promising choice to mitigate the plight of annotation shortage for medical image diagnosis.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87196-3_50

SharedIt: https://rdcu.be/cyl21

Link to the code repository

https://github.com/LLYXC/OXnet

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors perform omni-supervised learning by combining fully supervised, weakly-supervised, and unsupervised learning modules toward classifying CXRs into multiple disease labels.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. A global alignment module is proposed (global = across diseases/manifestations) that imposes intra-class compactness and inter-class separation to improve classification.
    2. The focal loss is extended to distill knowledge from a teacher model.
    3. Visualization experiments are performed.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    A primary weakness is the lack of statistical validation of the results. Other weaknesses include insufficient comparison with prior art (e.g., an unsupervised contrastive loss), a lack of detail on how the consensus annotations were arrived at, and an explanation of the CAM results that does not go beyond observational statements.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Due to the lack of detail, I would rate the reproducibility as moderate to poor.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    The authors mention that “To further enhance global information learning, we impose intra-class compactness and inter-class separability with a global prototype alignment module.” How does this compare to using an unsupervised contrastive loss for multi-class classification tasks?

    The authors mention that “For unsupervised data learning, we extend the focal loss to be its soft form to distill knowledge from a teacher model. Extensive experiments show the proposed OXnet outperforms competitive methods with significant margins.” However, the performance of the models shown in Table 1 and Table 2 is not evaluated for statistical significance. The authors have evaluated the models with a fixed train/validation/test split. Statistical significance analysis with K-fold cross-validation would help establish whether there is a significant difference in the models’ performance.

    It is not clear how the authors arrived at the consensus annotations. Details are not provided about the expertise of the radiologists and/or the annotation tools used in this study. It is recommended to refer to literature such as https://pubmed.ncbi.nlm.nih.gov/33180877/ to develop an annotation consensus and, if possible, compare its performance with the proposed approach.

    The choice of RetinaNet with a ResNet-101 backbone as the baseline needs justification. How did the authors optimize the model and its backbone for the current object detection task? What difference in performance is observed when using Faster R-CNN and/or other object detection models? Optimal baseline selection is indispensable, since the authors compare the proposed approach against the baseline model and investigate statistical significance.

    In supplementary Fig. 1, the authors mention that “local attention often helps refine CAM (row 1 and 2), but sometimes CAM covers more accurate lesion regions (row 3).” Such an observation needs more evaluation and discussion. How do the authors generalize this approach to real-time applications? What are the average IoU and mAP obtained for class-specific labels using global attention- and local attention-based ROI detection? Did global attention or local attention help achieve better performance? Image-level analysis may not be a good representation of the efficacy of the localization approach.

    The idea behind the element-wise AND is not clear. If the local attention is better than the global attention, an element-wise AND would retain only the pixels common to both attention maps, and the benefit of the local attention may not be exploited. What about choosing an average attention map under these circumstances? How do these various element-wise operations impact the localization performance? Fig. 2 and supplementary Fig. 1 show that the extent of the annotations made by the radiologists (is it a consensus annotation?) is larger than the regions localized through global or local attention.

    It is not clear why the authors used the AP-40 and AP-75 measures for the current task. What is the mean average precision over IoU thresholds from 0.5 to 0.95 at intervals of 0.05? How does this mAP (0.5:0.95) differ from that obtained with the AP-40 and AP-75 metrics? Is the difference statistically significant?

    The authors have not specified the source of the data and how it was acquired. Is it a private dataset? Why have they taken only nine diseases into account? How was the data annotated? What is the impact of changing the input image resolution on the proposed approach? Did the radiologists annotate the resized images or the original images? How did the authors scale the annotations in that case? Did the authors work with the resized CXRs or the lung-segmented ROIs?

  • Please state your overall opinion of the paper

    probably reject (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Too many unanswered questions. See constructive comments above.

  • What is the ranking of this paper in your review stack?

    4

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    The authors propose a unified deep framework for omni-supervised chest X-ray disease detection. It can utilize limited bounding-box annotations, weakly labeled data, and unlabeled data at the same time.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors propose many interesting ideas, e.g., the local-global attention alignment module and the soft focal loss for Mean Teacher training. Many experiments are conducted to demonstrate the effectiveness, and the experimental results are very detailed.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The novelty of this paper is limited. The authors combine many different methods in the proposed framework, e.g., CAM for attention and Mean Teacher for training on unlabeled data. It reads less like a unified overall framework than a combination of different methods. On the experimental side, they do not compare the localization performance with SOTA methods ([12,15,17,33] in their paper) on open-source datasets, e.g., NIH Chest X-ray14 or Stanford CheXpert.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors will release the training code and models, which is great for reproducibility. However, the data for this paper was collected by the authors themselves and will not be released.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    I appreciate that the authors propose a complex framework with many components for disease localization. They also conduct detailed experiments on the collected dataset. However, I still have some concerns about this paper:

    1. The whole framework consists of many existing methods (RetinaNet, CAM, metric learning, Mean Teacher), and the authors combine them with some new modules (e.g., a modified focal loss and the attention module). I believe this framework can work for disease localization with different kinds of labeled data. However, similar attention modules have been shown to be useful in many recent papers, Mean Teacher is a popular mechanism for unlabeled data, and metric learning and focal loss are well-studied methods. Therefore, the paper feels more like incremental work combining many different methods, which makes the novelty weak.
    2. Although the authors conduct many experiments on their collected dataset, they do not compare the performance with SOTA methods on open-source datasets. They list many SOTA methods ([12,15,17,33] in their paper) for NIH Chest X-ray14, and all these methods report localization performance under weakly supervised settings on the NIH Chest X-ray14 dataset. I believe the authors should conduct similar experiments on NIH Chest X-ray14 to prove the disease localization performance of their method.
    3. The authors should report quantitative results for the local and global attention, which would give a better understanding of the effectiveness of these attention modules. Such results could also be used for localization when no bounding boxes are available in a dataset.
    4. I think the statement about the related work on Page 2 is wrong: “Note that the works on CXR14 [26] that leverage both image-level and lesion-level annotations [12,15,17,33] only care whether the attentions cover a target single lesion, which often does not hold in real-world scenarios where multiple lesions could exist.” I believe these methods can deliver multiple lesions for different diseases. For more details, the authors can refer to the work [17].
    5. All the mAP results in this paper are low (around 22), even for the models trained with 2,725 fully annotated images. The authors could try other detectors for this problem, e.g., Cascade Faster R-CNN.
  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper reads more like incremental work combining many different methods, which makes the novelty weak. The authors do not compare the localization performance with SOTA methods ([12,15,17,33] in their paper) on open-source datasets, e.g., NIH Chest X-ray14 or Stanford CheXpert.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    3

  • Reviewer confidence

    Very confident



Review #3

  • Please describe the contribution of the paper

    The authors propose a CXR classification framework that combines supervised, semi-supervised, and weakly-supervised learning methods, both to mitigate the need for labeled data and to optimally use all available labeled data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    - Clear motivation and description of the methods. The combination of multiple levels of annotation, unified in one model, is very relevant to the medical domain.

    - The authors present a very well written and explained paper. Both the methodology and the experiments are well explained and supported with tables and figures, and the text is easy to follow throughout.

    - Elaborate comparison to the results of previous work, both in terms of mAP and visually, showing the benefit of the proposed method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    - Some information on the implementation of the framework and the training settings is missing. This includes hyperparameters such as the learning rate and optimizer, as well as the computational training time/cost.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Experimental details needed for reproduction, such as hyperparameters, are missing. The authors indicated that the code will be made available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Interesting work; including more details on the implementation, training settings, and computational cost would further improve the paper.

  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall a very nice study with solid results.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    4

  • Reviewer confidence

    Very confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Reviews were mixed, with R1 giving a negative assessment, R3 a positive assessment, and R2 a borderline one. However, R2 expressed some strong misgivings despite the borderline accept rating. While I agree with R2 that comparisons on open-source datasets are welcome, NIH CXR14 localizations are not very precise, so I think comparisons using that dataset for localization are not very informative. Nonetheless, I do agree that there are other important questions that should be addressed.

    In the rebuttal, the authors should focus on R1’s and R2’s criticisms, in particular questions on:

    – statistical significance of improvements
    – justification of experimental settings: baseline choice, metric choice (AP-40/AP-75)
    – R1 asks: “The authors have not specified the source of the data and how it was acquired. Is it a private dataset? Why have they taken only nine diseases into account? How was the data annotated? What is the impact of changing the input image resolution on the proposed approach? Did the radiologists annotate the resized images or the original images? How did the authors scale the annotations in that case? Did the authors work with the resized CXRs or the lung-segmented ROIs?”
    – dataset construction (annotation process, reader expertise)
    – lack of comparisons with SOTA (detectors on fully supervised images, [12,15,17,33])
    – justification of design choices: local vs. global attention; choice of element-wise AND
    – questions as to novelty.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    6




Author Feedback

We thank all reviewers for the constructive comments. Replies to major concerns are listed below, and we will address other minor issues in the final version.

1. Statistical significance of improvements [R1]: Bootstrapping on the test set (n = 13,180) with a two-sided Student's t-test shows that OXnet’s mAP (mean ± std: 22.4 ± 0.3) is significantly higher (p-values < 0.0001) than RetinaNet (18.5 ± 0.3), the best semi-supervised model MeanTeacher (20.1 ± 0.3), the best multi-task model MeanTeacher+MT (20.5 ± 0.3), SFL (20.5 ± 0.3), DAA (21.2 ± 0.3), and DAA+GPA (21.4 ± 0.3) in Table 1.
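The following is a minimal, self-contained sketch of this bootstrap-plus-t-test procedure, not the authors' actual evaluation code. The per-image score arrays are synthetic placeholders; in the paper, each resample would re-run the full detection-mAP evaluation on the resampled test images rather than averaging per-image scores.

```python
# Hedged illustration of bootstrap significance testing for two models' scores.
import numpy as np
from scipy import stats

def bootstrap_scores(per_image_scores, n_boot=1000, seed=0):
    """Bootstrap the mean of per-image scores over the test set."""
    rng = np.random.default_rng(seed)
    n = len(per_image_scores)
    idx = rng.integers(0, n, size=(n_boot, n))   # resample images with replacement
    return per_image_scores[idx].mean(axis=1)    # one aggregate score per resample

# Synthetic per-image scores standing in for the real detection evaluation.
rng = np.random.default_rng(42)
oxnet_ap = rng.normal(0.224, 0.15, size=13180).clip(0, 1)
retina_ap = rng.normal(0.185, 0.15, size=13180).clip(0, 1)

oxnet_boot = bootstrap_scores(oxnet_ap)
retina_boot = bootstrap_scores(retina_ap)
t, p = stats.ttest_ind(oxnet_boot, retina_boot)  # two-sided Student's t-test
print(f"OXnet {oxnet_boot.mean():.3f}+/-{oxnet_boot.std():.3f} vs "
      f"RetinaNet {retina_boot.mean():.3f}+/-{retina_boot.std():.3f}, p = {p:.1e}")
```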

2. Experimental settings [R1,2,3]: (1) Baseline choice: RetinaNet achieves 18.4 mAP (Res101; input size 512), 17.3 mAP (Res50; input size 512), and 18.1 mAP (Res101; input size 448). We chose Res101 with input size 512 for the subsequent experiments. (2) Metric choice: Drawing perfect bounding boxes for lesions is hard, so we selected AP40, AP75, and mAP (40-75), following [4] and the Kaggle Pneumonia Detection Challenge (https://www.kaggle.com/c/rsna-pneumonia-detection-challenge).

3. Dataset construction [R1]: (1) Annotations: Images were kept at the same size as the original DICOM files. A cohort of 10 radiologists (4-30 years of experience) was involved in labeling. Each image was labeled by two radiologists with reference to the text report. If the initial labelers disagreed on an annotation, a senior radiologist (>= 20 years of experience) made the final decision. Nine common thoracic diseases were chosen by the radiologists’ consensus. (2) Image size: Model inputs are resized images without cropping, and the annotations are rescaled by the image resizing ratio.
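For concreteness, a small hypothetical sketch of the annotation rescaling step described above; the function name and the per-axis ratio convention (new size divided by original size) are assumptions, not taken from the paper.

```python
# Rescale a bounding box from original-image pixels to resized-input pixels.
def rescale_box(box, orig_hw, new_hw):
    # box: (x1, y1, x2, y2) in original-image pixel coordinates
    ry, rx = new_hw[0] / orig_hw[0], new_hw[1] / orig_hw[1]
    x1, y1, x2, y2 = box
    return (x1 * rx, y1 * ry, x2 * rx, y2 * ry)

# e.g., a box on a 2048x2500 (HxW) original image mapped to a 512x512 model input
print(rescale_box((1000, 800, 1400, 1200), orig_hw=(2048, 2500), new_hw=(512, 512)))
```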

4. Comparison with other methods [R1,2]: (1) SOTA semi-supervised and omni-supervised detectors: In Table 1, we have shown the superior performance of OXnet against 4 SOTA semi-supervised detectors [14,23,25,32], plus 4 omni-supervised detectors modified from [14,23,25,32]. (2) Fully supervised detectors: The one-stage RetinaNet and FCOS achieved 18.4 and 19.6 mAP, respectively, and the two-stage Faster R-CNN reached 20.1 mAP, whereas OXnet achieved 22.3 mAP. The choice of baseline is orthogonal to our main contributions on omni-supervised learning, and our method can be generalized to other detectors. (3) Other methods on NIH CXR14: Following [12] with 5-fold cross-validation, OXnet reached localization accuracies of 0.70, 0.56, 0.43, and 0.21 under IoU 0.1, 0.3, 0.5, and 0.7, respectively, for an average accuracy of 0.48, whereas the average accuracies of [26], [12], and [17] are 0.22, 0.40, and 0.48, respectively. As the meta-reviewer mentioned, evaluation on NIH CXR14 may not be very informative. Nevertheless, these results demonstrate the competitiveness of our method.

5. Design choices [R1,2]: (1) Attention: The localization accuracy (defined in [12]) of the local attention (0.38/0.43) is better than that of the global attention (0.23/0.31) without/with the proposed dual attention alignment (DAA). Aligning the attentions via DAA benefits the localization ability of both. (2) Element-wise operation: We use element-wise MULTIPLICATION, not the AND mentioned by R1, to let the local attention weigh the contribution of each pixel from the global branch, with a motivation similar to that in the multiple-instance learning literature [8,22].
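To make the element-wise multiplication concrete, here is an illustrative sketch of local attention weighting a global class-activation map before pooling. This is an assumption-laden toy version, not the authors' implementation: the function name, tensor shapes, and the mean-pooling step are all placeholders.

```python
# Hedged sketch: local attention re-weights the global CAM via element-wise
# multiplication, so gradients from the image-level loss also reach the
# detection (local) branch through the attention map.
import torch

def fuse_attention(global_feat, local_attn, classifier_weight):
    # global_feat:       (B, C, H, W) features feeding the global classification head
    # local_attn:        (B, 1, H, W) attention from the detection branch, values in [0, 1]
    # classifier_weight: (K, C) classifier weights used to form class activation maps
    cam = torch.einsum("kc,bchw->bkhw", classifier_weight, global_feat)  # (B, K, H, W)
    fused = cam * local_attn            # element-wise multiplication, not logical AND
    logits = fused.flatten(2).mean(-1)  # (B, K) pooled logits for image-level supervision
    return logits, fused

# Toy usage: batch of 2, 256 channels, 16x16 feature map, 9 disease classes.
feat = torch.randn(2, 256, 16, 16)
attn = torch.rand(2, 1, 16, 16)
w = torch.randn(9, 256)
logits, fused = fuse_attention(feat, attn, w)
```

Multiplication (rather than a hard intersection) keeps the operation differentiable and lets a soft local attention down-weight, rather than discard, regions the global branch highlights.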

6. Novelty [R2]: To the best of our knowledge, we propose the FIRST unified omni-supervised detection framework for thoracic disease detection in chest X-rays. To unify supervision from weakly labeled data, we construct a novel attention alignment module that elegantly enables top-down gradient propagation from the global path to the local path. Owing to this module, we further enhance learning from weakly labeled data via a novel global prototype alignment on category-disentangled features. To unify learning from unlabeled data, we introduce a soft focal loss to efficiently learn from soft targets generated by a teacher detector. The proposed OXnet makes full use of all kinds of supervision in a unified framework, and extensive experiments demonstrate its efficacy.
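As a rough illustration of the soft focal loss idea (replacing the hard targets of the standard focal loss with the teacher's soft probabilities), a minimal sketch follows. The exact formulation in the paper may differ; the function name, the sigmoid/per-anchor setup, and the default gamma/alpha values are assumptions.

```python
# Hedged sketch of a "soft" focal loss distilling soft targets from a teacher detector.
import torch

def soft_focal_loss(student_logits, teacher_probs, gamma=2.0, alpha=0.25):
    # student_logits: (N, K) raw class predictions of the student for N anchors
    # teacher_probs:  (N, K) soft class targets produced by the teacher
    p = torch.sigmoid(student_logits)
    # Binary cross-entropy with soft targets instead of hard {0, 1} labels.
    ce = -(teacher_probs * torch.log(p.clamp(min=1e-6))
           + (1 - teacher_probs) * torch.log((1 - p).clamp(min=1e-6)))
    # Modulating factor and alpha weighting, also computed against soft targets.
    p_t = teacher_probs * p + (1 - teacher_probs) * (1 - p)
    alpha_t = teacher_probs * alpha + (1 - teacher_probs) * (1 - alpha)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Toy usage: 100 anchors, 9 disease classes.
logits = torch.randn(100, 9)
teacher = torch.sigmoid(torch.randn(100, 9))
loss = soft_focal_loss(logits, teacher)
```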




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors convincingly addressed many reviewer concerns in their rebuttal - in particular by showing statistical significance and clearing up details on the experimental settings. I feel the paper provides a convincing demonstration of its approach to thoracic disease detection.

    Should the paper receive acceptance, I encourage the authors to fill in the critical details as to experimental settings, dataset construction, and design choice either in their main body or, when appropriate, in the supplemental (note the supplemental only allows tables, figures, and theorem proofs, https://miccai2021.org/en/PAPER-SUBMISSION-GUIDELINES.html).

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    2



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper presents a neural network to address the omni-supervised learning problem, making use of both labeled and unlabeled data.

    1. As recognized by the reviewers, the experimental data used in this study are not clearly described, as reflected in the questions raised by Reviewer #1. The rebuttal does not seem to sufficiently answer these questions and concerns.

    2. In addition, the presented neural network is a fairly straightforward combination of several well-known neural network subcomponents, so the novelty of this paper is questionable.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    10



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Overall, this is a technically incremental but very nicely designed and principled paper. The technical aspects are very solid and reasonable. The problem addressed here, using all available supervision for a noisy image classification task, is well motivated as well. The experimental results are also sufficient for acceptance. R1’s complaint about statistical testing is well addressed in the rebuttal. This will make a solid MICCAI poster paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    8


