
Authors

Akash Parvatikar, Om Choudhary, Arvind Ramanathan, Rebekah Jenkins, Olga Navolotskaia, Gloria Carter, Akif Burak Tosun, Jeffrey L. Fine, S. Chakra Chennubhotla

Abstract

High-risk atypical breast lesions are a notoriously difficult dilemma for pathologists who diagnose breast biopsies in breast cancer screening programs. We reframe the computational diagnosis of atypical breast lesions as a problem of prototype recognition on the basis that pathologists mentally relate current histological patterns to previously encountered patterns during their routine diagnostic work. In an unsupervised manner, we investigate the relative importance of ductal (global) and intraductal patterns (local) in a set of pre-selected prototypical ducts in classifying atypical breast lesions. We conducted experiments to test this strategy on subgroups of breast lesions that are a major source of inter-observer variability; these are benign, columnar cell changes, epithelial atypia, and atypical ductal hyperplasia in order of increasing cancer risk. Our model is capable of providing clinically relevant explanations to its recommendations, thus it is intrinsically explainable, which is a major contribution of this work. Our experiments also show state-of-the-art performance in recall compared to the latest deep-learning based graph neural networks (GNNs).

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87237-3_14

SharedIt: https://rdcu.be/cyl9T

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes an end-to-end computational pathology model that imitates pathologists' search patterns: they examine an entire duct (global) and then patterns within portions of the duct (local) to form mental associations with similar ducts and/or parts (prototypes) encountered previously. The model thereby provides a computational diagnosis of atypical breast lesions with explanations, by measuring the relative importance of prototypes (both global and local) for the differential diagnosis of breast lesions.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Very novel approach to a difficult clinical problem
    • Builds on observations of how pathologists approach the diagnostic process and incorporates that strategy into the computational system
    • Methods nicely detailed
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The number of slides is relatively low (n = 93), and even though multiple ROIs are used from each slide, it is not clear whether the analyses account for the fact that ROIs from the same core set of cases are correlated with each other
    • The data in Table 2 need to be analyzed statistically for significant differences
    • The study limitations need to be discussed
    • A better description of the case selection criteria would be welcome
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Very detailed methods so should be reproducible from that perspective. Would like to see better description of case selection criteria as that could impact reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • The number of slides is relatively low (n = 93), and even though multiple ROIs are used from each slide, it is not clear whether the analyses account for the fact that ROIs from the same core set of cases are correlated with each other
    • The data in Table 2 need to be analyzed statistically for significant differences
    • The study limitations need to be discussed
    • A better description of the case selection criteria would be welcome
  • Please state your overall opinion of the paper

    strong accept (9)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Nice study on an important topic, with nice novelty in the approach. Well written; it just needs minor additions.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    In this manuscript, a method to classify atypical breast lesions is presented. The method is based on building prototypes obtained by using pathologist knowledge acquired during training or clinical practice. The prototypes are built taking into account both global and local features. A model is then trained to learn the relative importance of global and local features.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strength of the paper is the building of prototypes that capture and insert pathologist knowledge into a machine learning model. In particular, the prototypes are built using 16 diagnostically relevant histological patterns, following the guidelines in the WHO classification of tumors of the breast.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weakness of the paper is the limited comparison with the state of the art. Why is a GNN used for the comparisons? Why didn't the authors also use a simple pretrained CNN for comparison? Moreover, the results in terms of F-measure are not very encouraging.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    A lot of details about reproducibility are present in the paper. However, information about the number of iterations and about the criteria to stop the training are not provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    A method for the classification of atypical breast lesions is proposed. The method is based on building prototypes obtained by using pathologist knowledge acquired during training or clinical practice. The prototypes are built taking into account both global and local features, using 16 diagnostically relevant histological patterns following the guidelines in the WHO classification of tumors of the breast. The proposed model is then trained to learn the relative importance of global and local features. The paper is well written and well organized, but unfortunately there are some points that should be improved. In detail:

    • the authors include usual ductal hyperplasia (UDH) in the Normal class, but UDH is a benign lesion; therefore it would be better to substitute the class “Normal” with “Others”;
    • the “Related work” section should be improved by introducing more papers that deal with atypical lesions;
    • in the “Methodology” section, a scheme of the method could help the reader better understand the whole pipeline;
    • more comparisons should be introduced in the results section, for instance with simple pretrained CNNs such as AlexNet or ResNet. In this way, it might also be possible to overcome the problem of the not very encouraging results in terms of F-measure;
    • more information about the hyperparameters used for training would be appreciated;
    • in the “Results and Discussion” section, the sentence “We generate prediction probabilities p by first applying a tanh activation to htest and then projecting it to the positive octant. If p ≥ 0.5, the diagnostic label is 1 and 0 otherwise.” is not very clear. If there are four classes, why is the diagnostic label 1 or 0? Please clarify this point.
  • Please state your overall opinion of the paper

    borderline reject (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Several points are not very clear and could be improved to emphasize the strong points of the method. In its present version, the paper could be rejected, but since the method is interesting, if the suggestions given in the previous section were followed, the paper could improve significantly.

  • What is the ranking of this paper in your review stack?

    3

  • Number of papers in your stack

    4

  • Reviewer confidence

    Confident but not absolutely certain



Review #3

  • Please describe the contribution of the paper

    The paper addresses the task of histopathology image classification in the context of breast cancer. The authors describe a trainable method extracting global and local features of findings in histopathology images. The method relies on the analytical modeling of the patterns used by pathologists. The learned patterns then allow for binary classification of the images. The method is evaluated on an unbalanced dataset, and the claimed results show the superiority of the proposed method over the state of the art.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper addresses a clinically relevant task, as breast cancer is an important public health issue. The authors approach the task of image classification through the detection of analytically modeled and trained patterns, which aligns with prior work [1] and is refreshing in the era of convolutional neural networks. Moreover, relying on patterns that are familiar to pathologists allows for a more explainable model and therefore eventually makes it more appealing to users. Additionally, the limited number of parameters allows more resource-efficient training.

    [1] Parvatikar A. et al. (2020) Modeling Histological Patterns for Differential Diagnosis of Atypical Breast Lesions. In: Martel A.L. et al. (eds) Medical Image Computing and Computer Assisted Intervention – MICCAI 2020. MICCAI 2020. Lecture Notes in Computer Science, vol 12265. Springer, Cham. https://doi.org/10.1007/978-3-030-59722-1_53

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Two main weaknesses stand out from my point of view. First, the paper lacks clarity of presentation (in organization and descriptions): the transitions do not flow easily (e.g., the step descriptions in 3.2), and some sections combine too many subjects (e.g., the Results and Discussion section contains the description of the experimental setup). The presentation of the equations and formulas is not quite clear (e.g., Eq. 2 does not appear to be formatted as a LaTeX equation, and it is unclear whether \sigma(\cdot) denotes tanh, or tanh is an operation applied to \sigma(\cdot), i.e., \tanh(\sigma(\cdot))). All this makes the reading more difficult. Second, the validation process appears to contain some randomness. That is, the three proposed prototype sets appear to be finding-specific (i.e., HR, ADH, FEA); moreover, there is no clear winner among global features, local features, and both combined. This complicates the comprehension of the superiority of the proposed method.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors give the values of the hyper-parameters being used in the work

    The authors describe the composition and the stratification of the dataset

    The authors provide standard deviation of the results using randomized image selection over multiple training runs

    The authors discuss the limitations and show examples of the failure of the methods.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    1. Overall, throughout the method description, multiple hyper-parameters are defined within the text (e.g., the cribriform pattern parameters in 3.2, the threshold value later in Step 3). By the end, it is hard for the reader to follow which parameters are fixed and which are trainable. Could the authors please rectify this, with more explicit explanations (e.g., one or a few tables would be helpful)?

    2. In 3.1, in the explanation of Eq. 2, the authors say “We use a tanh(\sigma)…”, while tanh is not included in Eq. 2. Could the authors make it clearer whether \sigma stands for tanh or not? Moreover, later in 4.2, the authors discuss a threshold on tanh (“If p >= 0.5, the diagnostic label is 1 …”). Could the authors discuss the motivation for using such a threshold (0.5) on tanh, which lies in (-1, 1)?

    3. In “Analytical model of a cribriform pattern” in 3.2, several hyperparameters are introduced (i.e., for the MoG, gamma distribution, and uniform distribution). To facilitate comprehension, as well as to save the reader from looking into Zhou et al., could the authors discuss/explain better how these parameters are obtained?

    4. In Step 2 of 3.2, c_k(x) and f_k(x) are introduced. However, the inline definition of c_k and the literal definition of f_k would be better replaced by a more formal definition for clarity. Moreover, f_k does not appear to be explicitly defined in the text. Could the authors fix this?

    5. In 4.1, the authors introduce three prototype sets, PS1-PS3. However, the differences between the sets do not stand out. A more detailed discussion of the actual differences between the sets would be helpful to the reader.

    6. In the classification results in 4.4, the authors use Recall and the weighted F-measure. Could the authors discuss/argue against the use of the AUC score, which might be handier for an unbalanced dataset?

    7. Similarly, in 4.4, could the authors clarify whether the weighted F-measure is the F1 score or another measure?

    Observations on the organization of the paper

    1. Section 3.2 contains the descriptions of the steps used in the method. The formatting of the paragraph titles (e.g., “Step 1”, “Analytical model of a cribriform pattern”, “Step 2”, etc.) is identical and therefore confuses the comprehension of each step. Could the authors separate the descriptions of the steps more clearly?

    2. Section 4 entitled Results and Discussion contains a big part of the experimental setup description, which is confusing for the reader. Could the authors split the parts more clearly?

    3. In Classification performance in 4.4 there is a part of the text that logically relates to the Discussion section (“We also observe that baseline models are performing better on detecting Normal ROIs…”). Separating the parts more clearly would facilitate comprehension.

    Minor observations:

    1. Eq. 2 does not appear to be formatted as a LaTeX equation, which prevents the reading from flowing. Could the authors fix it?
  • Please state your overall opinion of the paper

    probably reject (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper presents an interesting approach to trainable modeling of patterns in histopathology images, the results are questionable, as they appear to show some randomness between local vs. global features and between the proposed prototype sets. Moreover, the choice of metrics (i.e., sensitivity and weighted F-score; I believe it should be the F1 score) is unclear, as AUC might be a more popular metric for unbalanced datasets. Overall, the paper lacks clear organization and descriptions, making it difficult for the reader to walk through it. I would suggest a substantial revision before acceptance.

  • What is the ranking of this paper in your review stack?

    8

  • Number of papers in your stack

    8

  • Reviewer confidence

    Very confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper proposes a classification method that relies on patterns familiar to pathologists, allowing for a more explainable model. The approach is novel; however, the paper needs to address the weaknesses stated by the reviewers.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    10




Author Feedback

All (R)eviewers and the Meta-Reviewer agreed that our paper is well-motivated, addresses a challenging diagnostic task, is explainable to pathologists, and has novelty.

[R1-2] Cohort selection and labeling: [R1] We collected a cohort of 93 WSIs from a local hospital which were labeled by an expert pathologist on the team to contain at least one ADH ROI (Section(S) 4.1-‘Dataset’). This sentence will be added to the final version for clarity. [R1] Train and test sets were separated at WSI level to overcome the effect of correlated ROIs in the study. [R2] The diagnostic class ‘Normal’ used in our study includes usual ductal hyperplasia and very simple non-columnar ducts (S1-‘Our Approach’). [R2] We evaluate and compare all classification methods by implementing a one (positive class)-vs-rest (negative class) classifier for five models: high-risk vs low-risk, ADH-, FEA- (Table 2), CCC-, and Normal-vs-rest (Supp. (SI) Table 1). A positive class is assigned a diagnostic label “1” and a negative class gets assigned a diagnostic label “0” (S4.2).
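The one-vs-rest labeling described above can be sketched as follows (a minimal illustration with hypothetical ROI labels, not the authors' code):

```python
import numpy as np

# Hypothetical ROI labels drawn from the four diagnostic classes.
labels = np.array(["ADH", "FEA", "CCC", "Normal", "ADH", "CCC"])

def one_vs_rest(labels, positive_class):
    """Assign diagnostic label 1 to the positive class and 0 to the rest."""
    return (labels == positive_class).astype(int)

y_adh = one_vs_rest(labels, "ADH")  # array([1, 0, 0, 0, 1, 0])
```

Repeating this once per positive class yields the five binary classifiers (high-risk, ADH, FEA, CCC, Normal) that the rebuttal lists.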

[R2] Related work: Diagnosing atypical benign breast lesions remains a very challenging task, acknowledged by recent studies [4,8,10], and computational approaches are beginning to emerge to address this problem [12,13,18,19] (S2).

[R3] Choice of prototype sets (PS): ROIs belonging to PS1-PS3 were inspected by an expert pathologist on the team to confirm adequate diagnostic variability (S4.1). Indeed, as discussed in S4.5-‘Limitations’, we are currently pursuing a more sophisticated approach for selecting prototypes.

[R2-3] Hyperparameter tuning: [R2] Information about the number of iterations (~10000) and stopping criterion (tol=1e-3) are provided in SI Fig.1 and S4.2 respectively. [R3] The hyperparameters used for model training are provided in S4.2. For anonymity, a table of hyperparameters for modeling histological patterns, which was already published in [18], was not included in this manuscript. [R3] \sigma refers to tanh activation which was rescaled to range between 0 and 1 (S4.2).
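The rescaling of the tanh activation to the (0, 1) range described above can be written as p = (tanh(h) + 1) / 2, under which the p >= 0.5 threshold corresponds simply to tanh(h) >= 0. A minimal sketch (the function and variable names are ours, not the paper's):

```python
import numpy as np

def prediction_probability(h):
    """Rescale the tanh activation from the range (-1, 1) to (0, 1)."""
    return (np.tanh(h) + 1.0) / 2.0

def diagnostic_label(h, threshold=0.5):
    """Return 1 if the rescaled probability meets the threshold, else 0."""
    return int(prediction_probability(h) >= threshold)

# With this rescaling, p >= 0.5 holds exactly when the raw
# activation h is non-negative.
diagnostic_label(0.3)   # 1
diagnostic_label(-0.3)  # 0
```

Under this convention, the 0.5 threshold that Reviewer 3 asked about is just the natural zero-crossing of tanh.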

[R3] Choice of performance metric: We evaluated the performance of all models using recall and weighted F-scores (wF) (S4.4-‘Performance metrics’ for more details and justification). wF is computed using sklearn’s ‘f1_score’ function with ‘weighted’ average.
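The weighted F-score computation described above can be reproduced with scikit-learn's `f1_score` (toy labels for illustration, not the paper's data):

```python
from sklearn.metrics import f1_score

# Toy ground-truth and predicted labels for a one-vs-rest classifier.
y_true = [1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0]

# 'weighted' averages the per-class F1 scores, weighted by each
# class's support, which accounts for class imbalance.
wf = f1_score(y_true, y_pred, average="weighted")
```

Here the per-class F1 scores are 0.75 (class 0, support 4) and 0.5 (class 1, support 2), so the weighted average is (4 * 0.75 + 2 * 0.5) / 6.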

[R1-3] Analysis of classification results: [R1] We reported the statistical significance of our results (p<0.01) in S4.4-‘Classification performance’ and discussed study limitations in S4.5. [R2] A higher wF and low recall of the baseline methods shown in Table 2 indicate that they are more successful at classifying ROIs according to the majority class than at correctly recognizing each diagnostic category. [R3] Our proposed ML model is designed to investigate the importance of diagnostically relevant global and local information, independently and together, and how this might help reduce the discordance among pathologists in differentially diagnosing breast lesions.

[R2] Choice of baseline method: We previously showed superior performance of our approach with deep-learning architectures such as AlexNet [18]. GNN has recently emerged as the state-of-the-art method for exploring spatial organization of tissue architecture [13], a key element of the diagnostic process by expert pathologists. Hence, we used [13] for baseline comparison.

[R3] Clarity and Organization of the paper: We are grateful for the suggestions and happy to address the minor editorial revisions in the final version.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors say that their proposed method was already compared to AlexNet in previously published work [18]. This may raise issues with novelty. Also, the answer regarding the comparison to GNNs is not satisfactory.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    14



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I believe that the rebuttal addressed all the comments from the reviewers, while Reviewers 2 and 3 scored low but with less substantial critical comments. The paper as it stands is a bit cramped, but it presents a piece of solid work with sound motivation and solid experiments.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    8



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposes a very novel idea for breast lesion assessment. The reviewers pointed out several issues, mostly involving clarity, which I feel the rebuttal addresses and an updated version of the paper can include.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    7


