
Authors

Xi Ouyang, Jifei Che, Qitian Chen, Zheren Li, Yiqiang Zhan, Zhong Xue, Qian Wang, Jie-Zhi Cheng, Dinggang Shen

Abstract

Microcalcification (MC) clusters in mammograms are one of the primary signs of breast cancer. In the literature, most MC detection methods follow a two-step paradigm: segmenting each MC and analyzing their spatial distributions to form MC clusters. However, segmentation of MCs cannot avoid low sensitivity or high false positives due to their variability in size (sometimes < 0.1 mm), brightness, and shape (with diverse surroundings). In this paper, we propose a novel self-adversarial learning framework to differentiate and delineate the MC clusters in an end-to-end manner. The class activation mapping (CAM) mechanism is employed to directly generate the contours of MC clusters with the guidance of MC cluster classification and box annotations. We also propose the self-adversarial learning strategy to equip CAM with better detection capability of MC clusters by using the backbone network itself as a discriminator. Experimental results suggest that our method can achieve better performance for MC cluster detection with the contouring of MC clusters and classification of MC types.
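
To make the self-adversarial idea concrete, below is a minimal PyTorch-style sketch of the erasing step as the abstract describes it: high-activation CAM regions are erased and the same backbone must then call the erased patch benign. The function name, threshold, and label convention are illustrative assumptions, not the authors' released implementation (no code is provided).

```python
import torch
import torch.nn.functional as F

def self_adversarial_loss(backbone, images, cams, benign_label=0, thresh=0.5):
    """Erase high-activation CAM regions from malignant patches and ask the
    SAME backbone (acting as its own discriminator) to call the erased
    result benign. All names and values here are illustrative."""
    # Upsample CAMs (B, 1, h, w), assumed normalized to [0, 1], to image size.
    cams = F.interpolate(cams, size=images.shape[-2:], mode="bilinear",
                         align_corners=False)
    keep = (cams < thresh).float()   # 1 where pixels are kept
    erased = images * keep           # zero out the suspected MC-cluster region
    logits = backbone(erased)        # reuse the classification head
    target = torch.full((images.size(0),), benign_label,
                        dtype=torch.long, device=images.device)
    return F.cross_entropy(logits, target)
```

In this reading there is no separate discriminator network: the adversarial flavor comes from the backbone being penalized whenever malignant evidence survives its own CAM-guided erasure.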

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87234-2_8

SharedIt: https://rdcu.be/cyl73

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a self-adversarial learning framework to differentiate and delineate MC clusters on mammograms in an end-to-end manner, covering both identification of MC status and delineation (via segmentation) for localization purposes.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Fairly large dataset used
    • Methods relatively novel
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • well-worn topic area with lots of prior research
    • not clear that it really adds much, as there are already so many studies/approaches on this topic; MC detection is a relatively easy task, so many approaches already achieve excellent results
    • there are multiple images per patient, randomly divided into the various sets; thus images from the same patient could be in the same set without being treated as dependent
    • Tables 1 & 2 need statistical comparisons to test for significant differences
    • visualization results not very objective
    • lesion characteristics & difficulty not provided
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Methods provided in enough detail for others to replicate

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • there are multiple images per patient, randomly divided into the various sets; thus images from the same patient could be in the same set without being treated as dependent
    • Tables 1 & 2 need statistical comparisons to test for significant differences
    • visualization results not very objective
    • lesion characteristics & difficulty not provided
  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Not much new here overall on a well-studied topic

  • What is the ranking of this paper in your review stack?

    5

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    The paper proposes a multi-task U-Net architecture to jointly solve the tasks of MC clustering, segmentation, and classification into “benign” or “malignant”. The network is trained jointly with the corresponding loss functions for each task. The paper also proposes a “self-adversarial” loss that uses the same backbone to classify erased image patches as “benign”, helping the network capture malignant MC cluster shapes rather than only small discriminative regions. Using the collected annotated data, the experiments in the paper suggest that each of the components (related to classification, segmentation, clustering, and the adversarial part) in the training objective has a positive effect on MC classification and clustering.
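
    As a rough illustration of such a jointly weighted objective, a sketch follows; the loss terms, their combination, and the weights (`lambdas`) are hypothetical placeholders rather than the paper's exact formulation.

    ```python
    import torch.nn.functional as F

    def total_loss(seg_logits, seg_target, cls_logits, cls_target, sa_loss,
                   lambdas=(1.0, 1.0, 0.1)):
        """Weighted sum of per-task losses; the weights are placeholders."""
        l_seg = F.binary_cross_entropy_with_logits(seg_logits, seg_target)
        l_cls = F.cross_entropy(cls_logits, cls_target)
        return lambdas[0] * l_seg + lambdas[1] * l_cls + lambdas[2] * sa_loss
    ```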

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper has the following strengths:

    +The paper formulates a multi-task problem to better solve MC clustering compared to the traditional methods

    +The proposed “self-adversarial” objective that uses the same backbone seems to be of vital importance to capture the MC cluster shapes

    +Visualized results show the robustness of the network performance even for very small MC clusters and show how the proposed method can provide more accurate contours compared to the method based only on segmentation

    +The MC clustering results are compared against relevant ablations made by the authors, showing the effect of each component of the objective function

    +Using relevant metrics for MC clustering and malignancy classification support the evaluation procedure used in the paper

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I think the paper could have been improved by taking into account the following notes and elaborating on them:

    • Evaluations are made on a private dataset collected by the authors. How does the network perform if it is evaluated on external datasets such as DDSM [1] and BCDR [2]?

    • How does the network's performance compare against other works in the literature (e.g., [3,4])? Also, how does it compare to an off-the-shelf instance segmentation network, e.g., Mask R-CNN?

    • The paper proposes an “erasing” mechanism to provide input patches for the discriminator. Since all of these input patches will be labelled as “benign”, does this make the dataset imbalanced? If so, are any measures taken to deal with this?

    • The paper shows the “Our-Seg + Our-CAM” model has the best performance for MC clustering. Can you explain in more detail how the segmentation results are added to the CAM results? And how do you combine them when there is disagreement between the two results (e.g., some high-confidence cluster contours from the segmentation results are not in the same region as the CAM)? Does it increase the false positive rate? (A possible fusion rule is sketched after the reference list below.)

    [1] The Digital Database for Screening Mammography
    [2] Discovering Mammography-based Machine Learning Classifiers for Breast Cancer Diagnosis
    [3] Microcalcification detection in full-field digital mammograms: A fully automated computer-aided system
    [4] Global detection approach for clustered microcalcifications in mammograms using a deep learning network
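
    Following up on the “Our-Seg + Our-CAM” question above, one plausible fusion rule is a simple union of the two binarized maps; this is purely an assumption made to render the question concrete, not the paper's method. Note that a plain union keeps every region proposed by either branch, which is exactly why the false-positive question matters.

    ```python
    import numpy as np

    def fuse_seg_and_cam(seg_prob, cam, seg_thresh=0.5, cam_thresh=0.5):
        """Union of the binarized segmentation and CAM maps. A region found
        by only one branch is kept anyway, so fusion could raise the false
        positive rate whenever the two branches disagree."""
        return np.logical_or(seg_prob >= seg_thresh, cam >= cam_thresh)
    ```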

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    I believe some aspects with respect to reproducibility could have been improved in the paper:

    • In the paper, it is mentioned that the final checkpoint is chosen based on the evaluation performance. However, it seems unclear to me how the evaluation metrics explained in the paper were combined to choose a checkpoint. Does the chosen checkpoint have the best MC clustering performance (related to Table 1), the best classification performance (Table 2), or a combination of both?
    • Which deep learning framework has been used in this work?
    • Which GPU model, and how many GPUs, have been used in this paper?
    • Will the code/data/models be shared?
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    It would have been nice if some more implementation details had been provided.

    • For instance, given the criteria for creating patches, how many malignant/benign training and validation patches were generated?
    • Given the criteria for patch selection, is the train/validation data imbalanced? If so, what measures are taken to deal with that?
    • Would it be possible to show the grid search results for finding good weighting factors of the loss function?
    • Are the erased patches fed to the network in the same batch as the original input patches (before erasing) when training the model?

    Other notes:

    • The use of the word “discriminator” here is confusing. In the paper, the “discriminator” is part of the backbone that is used to classify erased patches as “benign.” Readers will expect the discriminator to be used in a minimax setup such as in a GAN. Is there any motivation for calling this part a “discriminator”?
    • On the first line of Section 3.2, the Adam paper [5] could have been cited
    • On the second line of Section 3.3, “we” should be changed to “We”
    • On multiple occasions, the authors refer to losses as “constraints”, which will confuse readers. In optimization, constraints usually refer to e.g. limitations in the search space.
    • Several typos, including “copes”, which I suppose was meant to be “copies” but would be better written as “shared parameters” … and “bottle layer”, which should be “bottleneck”

    [5] Adam: A Method for Stochastic Optimization

  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes a multi-task network for classification and clustering of MCs. Although the model proposed by the authors seems to benefit from the novel “self-discriminator” task and performs well on the dataset provided by the authors, it is not shown how it performs on other datasets or how it compares against other available methods in the literature (e.g., those mentioned in the previous part or the ones referenced by the authors in the paper). The results in the paper suggest that the self-discrimination task, along with CAM, can help the network identify MC clusters more accurately than previous methods, but this has not been supported by comparative experiments. The same could have been done in the visualized results, to see how other methods perform in the hard scenarios with very small MC clusters and what their weaknesses are.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #3

  • Please describe the contribution of the paper

    The authors focus on the clinically relevant task of detecting microcalcifications in the context of breast cancer screening and propose a method combining multiple tasks, namely segmentation, classification, attention, and self-adversarial learning. The proposed weighted combination of these tasks allows for impressively high performance compared to the baseline. A decent amount of ablation studies is done to illustrate the benefits and effects of each component of the proposed framework. Setting aside that the evaluation is done on a private dataset, the claimed results appear promising, i.e., sensitivity as high as 0.98 with 0.6 false positives per image.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed method efficiently combines several approaches from the state of the art [1-3] using a loss function weighting the contribution of each component. Based on the presented experiments, it leads to a noticeable gain in performance compared to a baseline segmentation-only approach, and performance gradually improves with the addition of the proposed steps.

    Overall, the paper is well written and contains a relatively clear presentation of the method and evaluation, as well as of the hyper-parameters and post-processing used.

    [1] Choukroun, Yoni, Ran Bakalo, Rami Ben-ari, Ayelet Askelrod-ballin, Ella Barkan, and Pavel Kisilev. 2017. “Mammogram Classification and Abnormality Detection from Nonlocal Labels Using Deep Multiple Instance Neural Network.” Eurographics Proceedings. https://doi.org/10.2312/VCBM.20171232.
    [2] Wu, Eric, Kevin Wu, David Cox, and William Lotter. 2018. “Conditional Infilling GANs for Data Augmentation in Mammogram Classification.” Lecture Notes in Computer Science, Vol. 11040 LNCS. https://doi.org/10.1007/978-3-030-00946-5_11.
    [3] Zamir, Roee, Shai Bagon, David Samocha, Yael Yagil, Ronen Basri, Miri Sklair-Levy, and Meirav Galun. 2021. “Segmenting Microcalcifications in Mammograms and Its Applications.” February, 101. https://doi.org/10.1117/12.2580398.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I would note a few weaknesses. Many of the state-of-the-art detection and segmentation methods [1-4] are trained and/or validated against publicly available datasets. Some of the works [2,3] claim a similarly high level of performance compared to the results presented in the paper. The use of private datasets alone does not allow for a fair comparison to the other state-of-the-art methods [3].

    The dataset is composed of MC-containing cases only (as stated in 3.1). Therefore, the overall evaluation, without benign cases, appears to be biased.

    Finally, the proposed evaluation lacks some clarity. That is, while the training is performed patch-wise, it is not clear whether the full-sized images are fed into the network at inference time (as suggested in the Introduction, page 2: “delineate the MC clusters at image level in one shot”) and how the classification and detection metrics are generated in the case of multiple true-positive regions per image.

    [1] Jung, Hwejin, Bumsoo Kim, Inyeop Lee, Minhwan Yoo, Junhyun Lee, Sooyoun Ham, Okhee Woo, and Jaewoo Kang. 2018. “Detection of Masses in Mammograms Using a One-Stage Object Detector Based on a Deep Convolutional Neural Network.” PLoS ONE 13 (9): e0203355. https://doi.org/10.1371/journal.pone.0203355.
    [2] Ribli, Dezso, Anna Horváth, Zsuzsa Unger, Péter Pollner, and István Csabai. 2018. “Detecting and Classifying Lesions in Mammograms with Deep Learning.” Scientific Reports 8 (1): 4165. https://doi.org/10.1038/s41598-018-22437-z.
    [3] Zamir, Roee, Shai Bagon, David Samocha, Yael Yagil, Ronen Basri, Miri Sklair-Levy, and Meirav Galun. 2021. “Segmenting Microcalcifications in Mammograms and Its Applications.” February, 101. https://doi.org/10.1117/12.2580398.
    [4] Zhang, Fandong, Ling Luo, Xinwei Sun, Zhen Zhou, Xiuli Li, Yizhou Yu, and Yizhou Wang. 2019. “Cascaded Generative and Discriminative Learning for Microcalcification Detection in Breast Mammograms.” In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019-June:12570–78. IEEE Computer Society. https://doi.org/10.1109/CVPR.2019.01286.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors provide a decent amount of details on the experimental setup:

    • the dataset composition is defined
    • the training setup, including learning parameters and epochs, is described
    • the detection results are given with several thresholds (i.e., 0.2, 0.4, and 0.6 false positives per image)
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    1. In Section 2, the authors state “random cropping of … image patches” for each training sample. Since this could result in a significantly unbalanced dataset, it would be useful for the reader if the authors could provide more details on the strategy for generating the patches from the whole image (and whether imbalance is indeed the case). Moreover, it would be useful to know if any masking strategy is applied [1].

    2. In 3.4, as well as in Table 1, the authors provide results for “U-Net” and “Ours-Seg”. Could the authors clarify the difference between the two more explicitly (i.e., whether it is related to the GC layers only)?

    3. Similarly, in Table 1 there are several settings for “Multi-task U-Net” in the CAM section, separated from the Segmentation section, which is slightly confusing. That is, from the description in 3.4, the “Multi-task U-Net w/o sa” is still trained with l_seg. It would be helpful if the authors could clarify this and state explicitly in which cases only the losses vary, and in which cases some post-processing is involved (as in Ours-Seg+Ours-CAM). I might suggest a table with a column per loss l_i to show which loss is activated (and eventually its weight). Moreover, I would suggest a more explicit definition of the combination of Ours-Seg and Ours-CAM: whether it is a sum or something different, and what thresholds are chosen.

    4. In Table 1 and Table 2, the results suggest that the key component in the proposed framework is the benign self-adversarial part, as it brings the biggest improvement; however, earlier in 3.2 the reader can see that the \lambda_5 loss has the lowest weight. Could the authors discuss this phenomenon and, if available, provide the results of the weight selection (e.g., what are the results with \lambda_i = 1, \forall i \in \{1,\dots,5\})?

    5. In 3.2, the authors state “we randomly sample the 512 × 512 image patch from the entire mammography images”. Similar to the first comment above, could the authors provide more details on the sampling strategy being used, as well as on the inference-time processing? (A possible sampling strategy is sketched after the reference list below.)

    6. In “Comparison of Classification Performance” in 3.4, the authors state that processing is done at the patch level (“total 305 patches”). It would be useful if the authors 1) could specify the distribution of the dataset (i.e., the number of benign and malignant patches) and 2) could state “patch-wise” in the caption of Table 2, to prevent confusion and save time for fast readers who look at the numbers first.

    7. In the visualization results in 3.4, some methods (Ours-Seg, “CAM w/o sa”) appear to have considerably lower detection performance, while Ours-CAM is close to perfect. The absence of false positives is surprising. Could the authors specify whether any thresholds are involved (as suggested by the earlier statement “We add the detected cluster contours with high confidence”)?

    8. It would be useful if the authors could discuss the existence (or absence) of cases where the proposed method fails, in particular concerning false positives.

    [1] Choukroun, Yoni, Ran Bakalo, Rami Ben-ari, Ayelet Askelrod-ballin, Ella Barkan, and Pavel Kisilev. 2017. “Mammogram Classification and Abnormality Detection from Nonlocal Labels Using Deep Multiple Instance Neural Network.” Eurographics Proceedings. https://doi.org/10.2312/VCBM.20171232.
    [2] Jung, Hwejin, Bumsoo Kim, Inyeop Lee, Minhwan Yoo, Junhyun Lee, Sooyoun Ham, Okhee Woo, and Jaewoo Kang. 2018. “Detection of Masses in Mammograms Using a One-Stage Object Detector Based on a Deep Convolutional Neural Network.” PLoS ONE 13 (9): e0203355. https://doi.org/10.1371/journal.pone.0203355.
    [3] Ribli, Dezso, Anna Horváth, Zsuzsa Unger, Péter Pollner, and István Csabai. 2018. “Detecting and Classifying Lesions in Mammograms with Deep Learning.” Scientific Reports 8 (1): 4165. https://doi.org/10.1038/s41598-018-22437-z.
    [4] Wu, Eric, Kevin Wu, David Cox, and William Lotter. 2018. “Conditional Infilling GANs for Data Augmentation in Mammogram Classification.” Lecture Notes in Computer Science, Vol. 11040 LNCS. https://doi.org/10.1007/978-3-030-00946-5_11.
    [5] Zamir, Roee, Shai Bagon, David Samocha, Yael Yagil, Ronen Basri, Miri Sklair-Levy, and Meirav Galun. 2021. “Segmenting Microcalcifications in Mammograms and Its Applications.” February, 101. https://doi.org/10.1117/12.2580398.
    [6] Zhang, Fandong, Ling Luo, Xinwei Sun, Zhen Zhou, Xiuli Li, Yizhou Yu, and Yizhou Wang. 2019. “Cascaded Generative and Discriminative Learning for Microcalcification Detection in Breast Mammograms.” In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019-June:12570–78. IEEE Computer Society. https://doi.org/10.1109/CVPR.2019.01286.
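
    Regarding comments 1 and 5 above, a lesion-aware random-crop strategy of the kind being asked about might look like the following sketch; the 512 × 512 patch size matches the paper, but the positive-sampling ratio and box format are assumptions introduced here for illustration.

    ```python
    import random

    def sample_patch_origin(img_h, img_w, boxes, patch=512, pos_ratio=0.5):
        """Pick the top-left corner of a patch-sized crop. With probability
        pos_ratio, center the crop on a random annotated MC-cluster box so
        the positive/negative patch mix stays roughly balanced. Assumes the
        image is at least `patch` pixels in each dimension; boxes are
        (left, top, right, bottom)."""
        if boxes and random.random() < pos_ratio:
            x0, y0, x1, y1 = random.choice(boxes)
            cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
            top = min(max(cy - patch // 2, 0), img_h - patch)
            left = min(max(cx - patch // 2, 0), img_w - patch)
        else:
            top = random.randint(0, img_h - patch)
            left = random.randint(0, img_w - patch)
        return top, left
    ```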

  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the paper is clear and pleasant to read. The addressed task is clinically relevant, and the proposed method appears to improve on state-of-the-art performance. However, the lack of clarity and discussion on the evaluation, i.e., how the full images are fed into the network (and whether full images are actually fed), and how the output is interpreted later (i.e., whether any thresholds are involved), prevents me from firmly accepting the paper. A clear discussion and explicit statements concerning the experimental and validation setup, as well as the method's limitations, would allow accepting the paper.

  • What is the ranking of this paper in your review stack?

    3

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper proposes a self-adversarial learning method to detect and segment microcalcification clusters on mammograms. The method is based on a multi-task U-Net that solves the tasks of MC clustering, segmentation, and classification. All reviewers recommended that the paper be accepted given its novelty and results. There are a few points that I recommend the authors address prior to publication: 1) add significance tests to Tables 1 & 2; 2) can the paper show a comparison with the methods suggested by Reviewer 2?; 3) how is the segmentation result combined with the CAM result?; and 4) how does the method perform on publicly available datasets?

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    3




Author Feedback

N/A


