
Authors

Yoel Shoshan, Aviad Zlotnick, Vadim Ratner, Daniel Khapun, Ella Barkan, Flora Gilboa-Solomon

Abstract

Detecting the specific locations of malignancy signs in a medical image is a non-trivial and time-consuming task for radiologists. A complex, 3D version of this task was presented in the DBTex 2021 Grand Challenge on Digital Breast Tomosynthesis Lesion Detection. Teams from all over the world competed in an attempt to build AI models that predict the 3D locations that require biopsy. We describe a novel method to combine detection candidates from multiple models with a minimum of false positives. This method won second place in the DBTex competition, by a very small margin from first place and a clear margin above the rest. We performed an ablation study to show the contribution of each of the new components in the proposed ensemble method, including additional performance improvements made after the competition.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87240-3_74

SharedIt: https://rdcu.be/cyl6R

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes methods to detect breast lesions in DBT cases, providing the location and size of a bounding box along with a confidence score. The work was part of the DBTex Grand Challenge, so it used those cases, and it performed well.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Used a large standard dataset in a major competition, so comparisons are all on the same images
    • Combines 2D model predictions to produce 3D box predictions
    • An ablation study was performed on the contribution of the different components of the algorithm, and the method was compared to state-of-the-art methods in the field
    • Novelty is pretty good, especially the combined 2D & 3D use
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Are the results in Table 1 statistically significant?
    • Would like to see a breakdown by lesion type, size, difficulty, etc.
    • Would like to see results broken down in terms of FNs & FPs as a function of lesion type, etc.
    • The limitations need to be discussed
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The methods are provided in enough detail overall for others to replicate.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • Are the results in Table 1 statistically significant?
    • Would like to see a breakdown by lesion type, size, difficulty, etc.
    • Would like to see results broken down in terms of FNs & FPs as a function of lesion type, etc.
    • The limitations need to be discussed
  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Solid paper; as it was part of a challenge, it used a large common dataset, so the comparison with others is justified.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    The authors propose an ensemble model with a set of new components to detect lesions in DBT volumes. They have studied the impact of each proposed component and their method attained 2nd place in the recently conducted DBTex 2021 Grand Challenge.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Strengths: They have demonstrated the impact of each proposed component. The ablation study performed in Section 5.2 is strong.

    The proposed model achieves the 2nd best result in the challenge.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Weakness:

    Lack of clarity in the description of some components. In particular, in Section 4.2 it is not clear how the 2D prediction boxes are combined to form the 3D box. The predictions in each slice need not be perfectly aligned, so how are the consecutive slices combined to obtain the 3D box?

    Lack of evaluation on the standardized challenge dataset. The proposed method, as shown in Table 1, obtains vastly improved performance; however, it is unclear how much of this is the contribution of the additional in-house training data. If the authors evaluated their method using only the publicly available data, it would be a fair comparison.

    Details are missing at various places regarding the training of the network. In particular, in Section 4.1, please also mention what pre-processing steps were applied to the MG and DBT slices before feeding the data into the network (zero mean, 0 to 1 range, no normalization, etc.).

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Due to the large number of tuning parameters in the proposed method, it would be preferable if end-to-end training and evaluation code were shared, especially since the challenge dataset is publicly available, so that other authors can build upon the results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    The formatting needs to be corrected in Figure 1. Text formatting also needs to be improved in multiple places where paragraph breaks have been inserted unnecessarily, in my opinion.

    It is unclear what kind of pre-processing steps were applied to the mammogram and DBT image slices in the training of the networks. Data pre-processing is a crucial step and should be included.

    It is unclear how the 2D findings are combined into a 3D bounding box: the 2D bounding boxes in each slice might not be perfectly aligned, so how were the coordinates of the 3D box obtained?

    The comparison in Table 1 is not entirely fair, as the proposed method uses additional in-house data while other participants might not have access to such private data. It would have been better if the results were compared with and without the additional in-house dataset.

    It would be preferable if one sample result after each step were shown in Figure 2, which would greatly help understanding, particularly for steps 5, 6, 7, and 8.

  • Please state your overall opinion of the paper

    probably reject (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Performance-wise the proposed model has good results; however, the weaknesses of this paper are the lack of clarity in the description and the lack of a major novelty component. Also, the model’s performance needs to be evaluated with the challenge dataset alone to get a fair sense of its ranking in the challenge.

  • What is the ranking of this paper in your review stack?

    4

  • Number of papers in your stack

    4

  • Reviewer confidence

    Confident but not absolutely certain



Review #3

  • Please describe the contribution of the paper

    The authors use tomosynthesis images (DBTex 2021 grand challenge) to detect breast lesions and provide the location and size of a bounding box. Ground truth information was available for the DBTex train set. The authors use a variety of methods to generate candidates and introduce several algorithms to reduce false positive candidates.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strength of this submission is in the results. Test results impressively range from 86% to 98%, with an average of 92%.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Even though the results are impressive on a technical level, propagating these results toward clinically acceptable accuracy is still very much needed. A clear path leading to these improvements is not provided.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper is well referenced (for the parts where established methods are used). However, the rationale for some choices in the newly proposed methods is not provided; e.g., the RBSM choice of alpha = 0.685 is not explained or motivated.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    The paper provides a collection of methods and proposes algorithms to combine these methods into viable lesion candidates. Even though the results are impressive on a technical level, propagating these results toward clinically acceptable accuracy is still very much needed.

  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The main contribution (as presented by the authors) appears to be the final position in the challenge. That is overemphasised in the writing.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    3

  • Reviewer confidence

    Confident but not absolutely certain




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Three experienced reviewers evaluated the paper. Two of them recommend acceptance, while the third recommends rejection. All reviewers agree on the quality of the results on a public challenge and on the adequacy of the ablation experiments to support the technical contributions. However, R1 requests a more meaningful analysis of the results, while R2 expresses serious doubts about the actual empirical improvement in the absence of the additional in-house data. These points should be thoroughly addressed in the rebuttal.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    4




Author Feedback

We sincerely thank you for your work and effort in reviewing the manuscript. Your comments and remarks have been very helpful in improving the quality of the paper.

Reviewer 1: Statistical significance in Table 1: We agree, and we provide an updated Table 1 containing confidence intervals and the statistical significance of the differences (two-proportion z-test) between the other results and ours (3rd place already falls outside the confidence interval of our performance):

Place  | Prim. score                 | Sec. score
1      | 0.912 (0.867-0.957, p=0.95) | 0.912 (0.867-0.957, p=0.82)
2 (us) | 0.91 (0.865-0.955)          | 0.904 (0.857-0.951)
3      | 0.853 (0.797-0.909, p=0.15) | 0.868 (0.814-0.922, p=0.35)
4      | 0.814 (0.752-0.876, p=0.02) | 0.831 (0.772-0.890, p=0.08)
5      | 0.811 (0.749-0.873, p=0.02) | 0.801 (0.738-0.864, p=0.02)
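
To make the test concrete, here is a minimal sketch of a two-proportion z-test and a Wald confidence interval, the statistics named above. The per-case counts behind these scores are not given in the rebuttal, so `n` below is a hypothetical test-set size, not the actual DBTex one.

```python
# Editor's illustration of the statistics named in the rebuttal; not the
# authors' code. Assumes the scores are proportions over n test cases.
import math

def wald_ci(p: float, n: int, z: float = 1.96):
    """95% Wald confidence interval for a proportion p estimated from n cases."""
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-sided p-value for H0: both samples share one underlying proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                 # pooled proportion under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Phi(z) = 0.5 * (1 + erf(z / sqrt(2))); two-sided tail probability:
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

n = 200  # hypothetical number of test cases
print(wald_ci(0.91, n))                                        # CI around a score
print(two_proportion_z_test(round(0.912 * n), n, round(0.91 * n), n))
```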

Breakdown of the results into sub-categories: This is a great idea! This information isn’t available on the test set, but some of it is available on the validation set (lesion size and biopsy result). We plan to add it as supplementary material. Since it doesn’t involve retraining or inference, we are certain it will be ready for the camera-ready deadline.

Limitations: The dataset is focused on tumors and architectural distortions, and does not contain calcifications.

Reviewer 2: Usage of external training data: It’s a great idea to test our algorithm on the competition data alone, but as this involves retraining multiple models, we don’t have the computational resources and budget to guarantee it will be ready by the camera-ready deadline. We would like to emphasize that our main contribution – sharing the method that we created and used, and the ablation study showing the contribution of each of its sub-parts – still holds, regardless of the usage of external training data. We’d also like to point out that other groups (at least the top 3) used additional data as well.

Preprocessing steps: We added the following: “Intensity values were linearly scaled to the range [-1.0, 1.0]. Images were downsampled by a factor of 2 on the xy axes, but not on the slice axis.”
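
As a concrete reading of that added sentence, here is a minimal sketch, assuming the volume is a (slices, height, width) array; the simple strided subsampling stands in for whatever resampling the authors actually used, and the function name is illustrative.

```python
# Editor's sketch of the stated preprocessing: scale intensities linearly to
# [-1.0, 1.0] and downsample by 2 on the xy axes only (slice axis untouched).
import numpy as np

def preprocess_dbt_volume(volume: np.ndarray) -> np.ndarray:
    vol = volume.astype(np.float32)
    lo, hi = vol.min(), vol.max()
    vol = 2.0 * (vol - lo) / (hi - lo) - 1.0   # linear rescale to [-1, 1]
    return vol[:, ::2, ::2]                    # halve height and width, keep slices
```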

2D to 3D boxes process: Thank you for drawing our attention to this incomplete explanation. The 2D boxes do not need to be aligned – we choose the best ones and expand them to a pre-set depth. We modify Section 4.2 as follows: a. At the end of the step 1 text, “Aggregation of prediction bounding boxes”, we add: “Per 2D predicted box, we store the slice it originated from, to be used in the final 3D predicted box.” b. We change step 9, “3D prediction boxes generation”, into: “Taking the boxes that ‘survived’ all steps up to this point, we take for each box the origin slice which we stored (mentioned in step 1), and build a 3D prediction box by expanding the 2D prediction box along the depth axis symmetrically in two directions (+z and -z). We use a fixed size of 25% of the total volume depth, since the competition evaluation metric, by design, only considers the center slice (z axis) of the predicted 3D box when deciding whether a prediction counts as a hit, unlike the xy axes, for which the min and max x and y of the predicted box are all considered. The rationale for this metric is that the artifacts in the DBT modality only allow proper annotation of the central slice of lesions.”
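
The clarified step 9 reduces to a small geometric operation. A hedged sketch follows, with illustrative container types that are not the authors' code:

```python
# Editor's sketch of step 9: expand each surviving 2D box symmetrically along
# z around its stored origin slice, to a fixed 25% of the volume depth.
from dataclasses import dataclass

@dataclass
class Box2D:
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    origin_slice: int   # stored in step 1, when the 2D box was predicted
    score: float

@dataclass
class Box3D:
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    z_min: int
    z_max: int
    score: float

def expand_to_3d(box: Box2D, n_slices: int, depth_fraction: float = 0.25) -> Box3D:
    half_depth = int(round(depth_fraction * n_slices / 2))
    z_min = max(0, box.origin_slice - half_depth)             # clamp to volume
    z_max = min(n_slices - 1, box.origin_slice + half_depth)
    return Box3D(box.x_min, box.y_min, box.x_max, box.y_max,
                 z_min, z_max, box.score)
```

Since the metric only checks the center slice of the predicted 3D box, the symmetric expansion keeps the origin slice at the center regardless of the chosen depth (up to clamping at the volume boundary).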

Sharing code: We agree and have started the legal approval process.

About novelty, we respectfully disagree – a prior-art search did not yield any of our key novel components, e.g., the heatmap generation used to combine detector predictions.
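
Since the heatmap aggregation is named as a key component but not spelled out here, the following is only a generic sketch of the idea: each detector's boxes paint their confidence into a shared map, so regions supported by several detectors accumulate evidence. The accumulation rule and threshold below are assumptions, not the authors' exact method.

```python
# Editor's generic sketch of combining detector predictions via a heatmap;
# the authors' actual accumulation rule and thresholds are not given here.
import numpy as np

def boxes_to_heatmap(boxes, shape):
    """boxes: iterable of (x_min, y_min, x_max, y_max, score); shape: (H, W)."""
    heat = np.zeros(shape, dtype=np.float32)
    for x0, y0, x1, y1, score in boxes:
        heat[int(y0):int(y1) + 1, int(x0):int(x1) + 1] += score  # paint confidence
    return heat

def keep_supported_boxes(boxes, heat, threshold):
    """Keep boxes whose region holds enough accumulated evidence, i.e. boxes
    reinforced by overlapping predictions from other detectors."""
    return [b for b in boxes
            if heat[int(b[1]):int(b[3]) + 1, int(b[0]):int(b[2]) + 1].max() >= threshold]
```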

Reviewer 3: Regarding clinically acceptable accuracy: we provided a metric that measures sensitivity at an average of 1 FP per image. Our method achieves 93% sensitivity on this metric, which is above acceptable accuracy. We are aware of multiple CAD tools used in practice with significantly lower performance.
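
For reference, "sensitivity at an average of 1 FP per image" is a point on a FROC-style curve. A minimal sketch, assuming each prediction has already been matched to ground truth upstream (one TP per lesion), e.g. by the center-slice criterion described above:

```python
# Editor's sketch of sensitivity at a fixed FP rate; TP/FP matching is
# assumed done upstream, and each TP is assumed to hit a distinct lesion.
import numpy as np

def sensitivity_at_fp_rate(scores, is_tp, n_lesions, n_images, target_fp=1.0):
    order = np.argsort(scores)[::-1]              # sweep threshold high -> low
    hits = np.asarray(is_tp, dtype=bool)[order]
    tp = np.cumsum(hits)
    fp = np.cumsum(~hits)
    fp_per_image = fp / n_images                  # non-decreasing along the sweep
    idx = np.searchsorted(fp_per_image, target_fp, side="right") - 1
    return float(tp[idx]) / n_lesions if idx >= 0 else 0.0
```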

Hyper-parameter choice: We would like to add the following to clarify: “All compared methods’ hyper-parameters were selected by a grid search on the validation set, and the best-performing hyper-parameters were used for testing.”
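
A minimal sketch of that selection procedure, with a hypothetical `evaluate_on_validation` standing in for the competition metric; the grid values below (including the alpha = 0.685 a reviewer asked about) are illustrative only:

```python
# Editor's sketch of exhaustive grid search over hyper-parameters on the
# validation set; parameter names and values are illustrative assumptions.
from itertools import product

def grid_search(param_grid, evaluate_on_validation):
    best_score, best_params = float("-inf"), None
    keys = list(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = evaluate_on_validation(params)    # e.g. sensitivity at 1 FP/image
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

grid = {"alpha": [0.6, 0.65, 0.685, 0.7], "heatmap_threshold": [0.3, 0.5, 0.7]}
```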




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal adequately addresses most concerns pointed out by the reviewers. The availability of public code would significantly increase result reproducibility and is strongly encouraged. The final version should include all reviewer suggestions.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    9



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors present an evaluation of their ensembling technique for detecting lesions from DBT. While the performance does seem quite good, I still feel that there are too many shortcomings:

    – As one reviewer pointed out, the authors compare against the SOTA using their own in-house training data. While the authors respond by emphasizing their ablation study, the actual article text emphasizes their performance against competitors (e.g., stating in the abstract that they are in 2nd place). Relative performance against competitors that use different data is impossible to judge.

    – The article reads to me more as a description or listing of engineering and implementation details. The reasoning behind the inclusion of the steps in the list in Section 4.2 is not provided. Compounding this, there are many numerical choices, e.g., the filtering size, how many bboxes per detector are kept, how many detectors (and of what type) are ensembled, the heatmap filtering threshold, etc. This is just a sampling, and these choices seem to be purely empirically driven, with little to no scientific insight. With so many moving parts, the likelihood of implicit overfitting to the dataset in question is increased.

    – Related to the above, I agree with the reviewers that novelty is limited. The ideas of weighting confidence scores by their rank, or of aggregating scores by creating a heat map based on how many bboxes overlap each pixel, are not new, and besides, there is no insight into why this was needed.

    – RE: clinically acceptable accuracy, it may be true that there are solutions on the market that don’t achieve 93% sensitivity at 1 FP/image. However, the test set does not contain calcifications, and these could be a major source of FPs “in the wild”. Thus the 93% metric may not be a realistic measurement beyond the test data.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    18



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have satisfactorily addressed all major concerns and also provided additional results (significance tests) to support the obtained results. These justifications and updates should be included in the camera-ready.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    6


