
Authors

Zhenjie Cao, Zhicheng Yang, Yuxing Tang, Yanbo Zhang, Mei Han, Jing Xiao, Jie Ma, Peng Chang

Abstract

Inspired by the recent success of self-supervised contrastive pre-training on ImageNet, this paper presents a novel framework of Supervised Contrastive Pre-training (SCP) followed by Supervised Fine-tuning (SF) to improve mammographic triage screening models. Our experiments on a large-scale dataset show that the SCP step can effectively learn a better embedding and subsequently improve the final model performance in comparison with the direct supervised training approach. Superior results of AUC and specificity/sensitivity have been achieved for our mammographic screening task compared to previously reported SOTA approaches.
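
For concreteness, below is a minimal PyTorch sketch of the two-stage recipe described above: supervised contrastive pre-training (SCP) of an encoder with a projection head, followed by supervised fine-tuning (SF) of a classifier head. The backbone, projection dimension, temperature, learning rates, and dummy data are illustrative assumptions, not the authors' actual configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.1):
    # features: (N, D) L2-normalized embeddings; labels: (N,) integer class labels.
    sim = features @ features.t() / temperature
    sim = sim - sim.max(dim=1, keepdim=True).values.detach()        # numerical stability
    logits_mask = ~torch.eye(len(labels), dtype=torch.bool, device=features.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & logits_mask
    exp_sim = torch.exp(sim) * logits_mask                          # exclude self-pairs from the denominator
    log_prob = sim - torch.log(exp_sim.sum(dim=1, keepdim=True))
    pos_counts = pos_mask.sum(dim=1)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_counts.clamp(min=1)
    return loss[pos_counts > 0].mean()

# Stand-in backbone, projection head (used only in pre-training), and classifier head.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(224 * 224, 128), nn.ReLU())
proj_head = nn.Linear(128, 64)
cls_head = nn.Linear(128, 2)

x = torch.randn(16, 1, 224, 224)      # dummy single-channel image batch
y = torch.randint(0, 2, (16,))        # dummy normal/abnormal labels

# Stage 1: Supervised Contrastive Pre-training (SCP) of the encoder.
opt = torch.optim.Adam(list(encoder.parameters()) + list(proj_head.parameters()), lr=1e-3)
opt.zero_grad()
z = F.normalize(proj_head(encoder(x)), dim=1)
supervised_contrastive_loss(z, y).backward()
opt.step()

# Stage 2: Supervised Fine-tuning (SF) of encoder + classifier head with cross-entropy.
opt = torch.optim.Adam(list(encoder.parameters()) + list(cls_head.parameters()), lr=1e-4)
opt.zero_grad()
F.cross_entropy(cls_head(encoder(x)), y).backward()
opt.step()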

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87234-2_13

SharedIt: https://rdcu.be/cyl78

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents an improved normal/abnormal discriminator for mammography images using a supervised contrastive loss-based embedding followed by a projection network for the binary classification. The contrastive loss is based on L2 distance. The results show nearly a 6% improvement in AUC and a 0.6% improvement in specificity. This approach is well established in the computer vision literature, and the authors have shown the validity of the approach on the mammography task.
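
    For reference, a minimal sketch of a pairwise contrastive loss based on L2 distance, in the spirit of the loss the review describes (a Hadsell-style margin loss); the margin value and the pair construction are illustrative assumptions rather than the paper's exact formulation.

    import torch

    def l2_contrastive_loss(z1, z2, same_label, margin=1.0):
        # z1, z2: (N, D) embedding pairs; same_label: (N,) 1.0 if the pair shares a label, else 0.0.
        d = torch.norm(z1 - z2, p=2, dim=1)                               # L2 distance per pair
        pos = same_label * d.pow(2)                                       # pull same-label pairs together
        neg = (1 - same_label) * torch.clamp(margin - d, min=0).pow(2)    # push other pairs past the margin
        return 0.5 * (pos + neg).mean()

    z1, z2 = torch.randn(8, 64), torch.randn(8, 64)
    same = torch.randint(0, 2, (8,)).float()
    print(l2_contrastive_loss(z1, z2, same))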

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The approach taken by the authors is sensible and consistent with the performance improvements seen when clustering ideas are adopted in deep learning classifiers for broader visual and text analytics tasks as well.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    A problem I have with the paper is the use of BI-RADS 1 as the cutoff point for normal/abnormal discrimination. Typically, BI-RADS 3 is a borderline case, and discriminating around it for normal/abnormal would be a more interesting and realistic use case. Can the authors show a performance improvement in this case as well using the contrastive approach?

    Also, there were earlier dataset releases on which this problem could have been attempted (the DREAM challenge), and a few datasets are still available for experimentation: wiki.cancerimagingarchive.net/display/Public/CBIS-DDSM (a new release with format issues cleaned up), http://peipa.essex.ac.uk/info/mias.html (which has normal/abnormal information as well, although the imaging may be in other formats), and related datasets on Kaggle.

    The ablation study on contrastive loss functions tested cosine distance but not the more commonly adopted variants such as triplet loss, magnet loss, or softmax-based formulations. Adding these would also strengthen the paper.
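
    For illustration, a minimal sketch of the triplet-loss variant suggested above, using PyTorch's built-in TripletMarginLoss; the margin and the random triplets are placeholder assumptions, and a real ablation would mine triplets from the labeled mammogram embeddings.

    import torch
    import torch.nn as nn

    triplet = nn.TripletMarginLoss(margin=1.0, p=2)   # L2-based triplet margin loss
    anchor   = torch.randn(8, 64)    # embeddings of anchor images
    positive = torch.randn(8, 64)    # embeddings with the same normal/abnormal label
    negative = torch.randn(8, 64)    # embeddings with the opposite label
    print(triplet(anchor, positive, negative))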

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Possibly not, since the dataset seems to be proprietary, but the methods are straightforward to implement with existing contrastive-embedding code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Please see the two suggestions made above to improve the paper. The current SOTA in industrial-strength software for normal/abnormal discrimination in mammograms appears to be 99.9% at 20% sensitivity. The DREAM challenge results reported specificity at 86% sensitivity of 0.692 (individual) and 0.761 (ensemble) methods. So these results for the easier task attempted in the paper are not surprising. If the sensitivity were similarly raised to 86%, how would this change for the proposed method? Is there any aspect of the contrastive learning that would improve the selectivity at higher sensitivities? Answering these in the rebuttal would also help convince the reviewer of the incremental contribution here.
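
    For reference, a hedged sketch of how a specificity-at-fixed-sensitivity operating point (such as the 86% sensitivity figure quoted above) can be read off a model's ROC curve; the labels and scores below are dummy placeholders, and "abnormal" is treated as the positive class.

    import numpy as np
    from sklearn.metrics import roc_curve

    def specificity_at_sensitivity(y_true, y_score, target_sensitivity=0.86):
        # y_true: 1 = abnormal (positive), 0 = normal; y_score: predicted abnormality score.
        fpr, tpr, _ = roc_curve(y_true, y_score)
        idx = np.argmax(tpr >= target_sensitivity)   # first operating point reaching the target sensitivity
        return 1.0 - fpr[idx]                        # specificity at that operating point

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, 1000)
    y_score = y_true * 0.4 + rng.random(1000) * 0.8  # dummy scores weakly correlated with the label
    print(specificity_at_sensitivity(y_true, y_score, 0.86))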

  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The approach taken is reasonable. However, the current experiments, the criteria set for normal/abnormal, and the evaluation at 20% sensitivity are all issues I see, given that considerable work on much harder mammography evaluation tasks, even within the normal/abnormal categories, was done during the earlier DREAM effort.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    The paper proposes a framework of Supervised Contrastive Pre-training (SCP) followed by Supervised Fine-tuning (SF) to improve mammographic triage screening models. The goal is to triage cancer-free mammograms in order to reduce radiologists’ workload and improve efficiency and specificity without sacrificing sensitivity.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • uses a relatively large dataset of cases
    • the method has some relatively novel aspects, combining Supervised Contrastive Pre-training (SCP) with Supervised Fine-tuning (SF)
    • addresses a clinically relevant problem
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • it is not clear whether the gold standard is based on the radiology report or on biopsy data - it sounds like the radiology report, which is not adequate
    • it is also not clear what the gold standard for a normal case is - usually at least 2 years of normal reports are required, but here it sounds like a single report was used
    • the data in Table 2 need to be compared statistically for significant differences
    • I would like to see more discussion of cases where the system failed - both normals called abnormal and abnormals called normal - for example, were there patterns that could help guide future research?
    • the conclusion needs to discuss the study limitations
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Overall, the methods seem clear enough for others to replicate the results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • the data in Table 2 need to be compared statistically for significant differences (a sketch is given after this list)
    • I would like to see more discussion of cases where the system failed - both normals called abnormal and abnormals called normal - for example, were there patterns that could help guide future research?
    • the conclusion needs to discuss the study limitations
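
    For illustration, one way the Table 2 comparisons could be tested statistically is a paired bootstrap over test cases comparing two models' AUCs (a DeLong test would be an alternative); the labels and scores below are dummy placeholders, not the paper's data.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def paired_bootstrap_auc_pvalue(y_true, scores_a, scores_b, n_boot=2000, seed=0):
        rng = np.random.default_rng(seed)
        n = len(y_true)
        observed = roc_auc_score(y_true, scores_a) - roc_auc_score(y_true, scores_b)
        diffs = []
        for _ in range(n_boot):
            idx = rng.integers(0, n, n)                  # resample test cases with replacement
            if len(np.unique(y_true[idx])) < 2:          # AUC needs both classes present
                continue
            diffs.append(roc_auc_score(y_true[idx], scores_a[idx]) -
                         roc_auc_score(y_true[idx], scores_b[idx]))
        diffs = np.asarray(diffs)
        # Approximate two-sided p-value: fraction of bootstrap differences crossing zero.
        p = min(1.0, 2 * min((diffs <= 0).mean(), (diffs >= 0).mean()))
        return observed, p

    y = np.random.default_rng(1).integers(0, 2, 500)
    s_a = y * 0.5 + np.random.default_rng(2).random(500)   # dummy scores for model A
    s_b = y * 0.3 + np.random.default_rng(3).random(500)   # dummy scores for model B
    print(paired_bootstrap_auc_pvalue(y, s_a, s_b))
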
  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A good study with a large dataset. It needs a bit of statistical work to strengthen it, but it is overall solid and of interest to attendees.

  • What is the ranking of this paper in your review stack?

    4

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #3

  • Please describe the contribution of the paper

    The authors present a method for classifying normal (BI-RADS 1) vs abnormal (other BI-RADS) mammograms based on supervised contrastive pre-training (SCP) followed by supervised fine-tuning (SF). Using a large-scale dataset (30k patients), the authors demonstrate that SCP learns a better embedding, leading to better model performance on the final task compared to other SOTA approaches.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Strengths:

    • The paper represents a novel approach to triaging normal mammograms. Specifically, the use of supervised contrastive pre-training has demonstrated utility in the computer vision community (eg, based on ImageNet results), but hasn’t been applied to the task of triage in screening mammography.
    • Furthermore, this approach may be applicable to additional tasks (such as malignant vs non-malignant classification) in screening mammography.
    • The authors evaluate both single-view and dual-view architectures, and demonstrate improvement for the dual-view approach.
    • The performance is superior to other SOTA methods that were implemented by the authors.
    • The paper is clearly described.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Weaknesses:

    Description of the data set:

    • It’s unclear if the MG images in the train/test dataset all correspond to screening exams. Typically in screening, BI-RADS assessments are 1, 2, or 0 (although there are often exceptions), whereas in the diagnostic setting BI-RADS is typically 2, 3, 4, 5, or 6. In particular, I find it surprising to have BI-RADS 6 (known cancer) in the screening scenario. If the dataset contains both screening and diagnostic studies, the task of separating normal from abnormal studies could be significantly easier, because diagnostic studies typically contain additional views (spot compression / magnification). How were screening studies (and not diagnostic studies) selected for this work?
    • The introduction references “biopsy-proven ground truth”; however, the description of the dataset (Section 4 - Datasets) does not reference biopsy, or break down the cases into the biopsy outcome (positive, negative, high risk).

    Definition of normal vs abnormal:

    • In clinical practice, the distinction between BI-RADS 1 and BI-RADS 2 is highly subjective and based on the reporting preferences of the reading radiologist. For example, both BI-RADS 1 and BI-RADS 2 correspond to an assessment of essentially 0% chance of malignancy. The primary difference is that if the radiologist chooses to describe a benign finding in the report (which can, for example, be implants or the presence of a marker), the study becomes BI-RADS 2. As such, I would expect the separation of BI-RADS 1 vs BI-RADS 2 to be extremely noisy. A better definition of normal vs abnormal would be either BI-RADS 1 and 2 vs other BI-RADS (0,3,4,5,6), or, even better, a separation based on biopsy outcome (or normal follow-up, in the case of normal exams that did not result in biopsy).

    Limited evaluation:

    • There is no discussion of limitations of the method, or failure modes that were observed.
    • There is also no evaluation of other clustering tasks, such as the separation of biopsy positive (malignant) vs other (non-malignant) studies, which is an important problem.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The method is clearly described; however, as it is a relatively new method, and the data set cannot be made public, reproducibility would be greatly improved if the training and evaluation code were made public.

    A few additional points:

    • The authors state in the reproducibility checklist that they demonstrate statistical significance of their results; however, I do not see this evaluation in the paper.
    • The authors provide a breakdown (by BI-RADS value) of the studies incorrectly predicted by the algorithm as normal in Table 2; however, a more detailed analysis of the failure modes would be helpful. Similarly, are there certain properties of normal images (eg, high density, or other artifacts), that lead them to be falsely categorized as abnormal?
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Clustering of sub-categories:

    • A case that has been assessed as BI-RADS 5 is significantly more abnormal than a case with a BI-RADS 2 or BI-RADS 3 assessment. It would be interesting to color each BI-RADS category separately in Figure 3, to see whether there is more separation between BI-RADS 5 vs 1 than between BI-RADS 3 vs 1 (a plotting sketch is given below).
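
    For illustration, a minimal plotting sketch of the per-category coloring suggested above, assuming 2-D embedding coordinates (for example, a t-SNE projection of the learned features) and per-study BI-RADS labels are available; all data here is randomly generated.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    emb_2d = rng.normal(size=(600, 2))                 # stand-in for projected embeddings
    birads = rng.choice([1, 2, 3, 4, 5], size=600)     # stand-in for per-study BI-RADS labels

    fig, ax = plt.subplots()
    for cat in np.unique(birads):
        pts = emb_2d[birads == cat]
        ax.scatter(pts[:, 0], pts[:, 1], s=8, label=f"BI-RADS {cat}")  # one color per category
    ax.legend()
    ax.set_title("Embedding colored by BI-RADS category")
    plt.savefig("embedding_by_birads.png")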

    Evaluation metric:

    • The choice of sensitivity to refer to identifying “normal” studies might lead to confusion. Typically an abnormal case is considered positive, and sensitivity refers to the ability to identify abnormal cases. I would recommend flipping the definition of positive vs negative cases and measuring the specificity of the algorithm as its ability to identify/triage normal cases.
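
    Under the suggested convention (abnormal as the positive class), the two quantities would read, in standard notation:

    % abnormal = positive class; triaging normal studies is then measured by specificity
    \mathrm{sensitivity} = \frac{TP}{TP + FN} \quad \text{(abnormal cases correctly flagged)}, \qquad
    \mathrm{specificity} = \frac{TN}{TN + FP} \quad \text{(normal cases correctly triaged)}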

    Minor edits:

    • Page 2: the authors refer to “healthy and at risk populations”. BI-RADS 1 vs other is not a good separator of healthy vs at risk. In particular, BI-RADS 2 is an assessment with essentially 0% chance of malignancy and is therefore not “at risk”.
    • Page 6: “we set the sensitivity (recall rate of normal images)…”. This could be confusing, as “recall” in screening mammography refers to an abnormal assessment (i.e., recalling the patient for diagnostic imaging).
  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The major factors that led to my recommendation were:

    • the novelty of the approach.
    • the potential wide applicability of this method to related tasks in medical imaging.
    • the large training and validation data set used to validate the approach
    • implementation and comparison of other SOTA methods on the same data set.
  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes supervised contrastive loss-based embedding learning followed by a binary classification of mammograms. Results show a 6% AUC improvement and a 0.6% specificity improvement. The paper received positive reviews and I’m recommending its provisional acceptance. I encourage the authors to address the following questions in the final submission: 1) why is BI-RADS 1 used as the cutoff point?; 2) does contrastive learning improve the selectivity at higher sensitivities?; 3) is the ground truth based on the radiology report or on biopsy data?; 4) please add statistical significance tests for Table 2; 5) please add a discussion of failure cases; and 6) please clarify whether the mammograms in the dataset correspond to screening exams.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    2




Author Feedback

We thank the reviewers for all their constructive and insightful comments. Due to space limitations, we address some common questions here.

Regarding the use of a proprietary dataset: to our knowledge, no existing public mammogram dataset (including DDSM) has a patient distribution in accordance with the clinical screening scenario in terms of BI-RADS levels. We therefore built our own proprietary dataset for the experiments.

Regarding the statistical significance of our results, we will provide these tests in the camera-ready version.

Regarding the details of the failure cases in Table 2, for our best model: of the seven BI-RADS 2 mammograms, one has a mass lesion and the rest contain scattered benign calcifications; of the five BI-RADS 3 mammograms, two have mass lesions and three have small calcifications.

Regarding defining the problem with another, borderline BI-RADS category as the cutoff, we have conducted experiments cutting off at BI-RADS 4. Our best model raises the AUC from 0.9061 to 0.9270 compared with Wu et al.[25]’s 4-view model, which was the best known existing model on this task prior to our approach.


