
Authors

Ashkan Khakzar, Sabrina Musatian, Jonas Buchberger, Icxel Valeriano Quiroz, Nikolaus Pinger, Soroosh Baselizadeh, Seong Tae Kim, Nassir Navab

Abstract

Convolutional neural networks are showing promise in the automatic diagnosis of thoracic pathologies on chest X-rays. Their black-box nature has sparked many recent works to explain the prediction via input feature attribution methods (aka saliency methods). However, input feature attribution methods merely identify the importance of input regions for the prediction and lack semantic interpretation of model behavior. In this work, we first identify the semantics associated with internal units (feature maps) of the network. We proceed to investigate the following questions: Does a regression model that is only trained with COVID-19 severity scores implicitly learn visual patterns associated with thoracic pathologies? Does a network that is trained on a weakly labeled dataset (e.g. healthy, unhealthy) implicitly learn pathologies? Moreover, we investigate the effect of pretraining and of considering data imbalance on the interpretability of learned features. We also observe the formation of concepts during training. In addition to the analysis, we propose semantic attribution to semantically explain each prediction. We present our findings using publicly available chest pathology datasets (CheXpert, NIH ChestX-ray8) and COVID-19 datasets (BrixIA and the COVID-19 chest X-ray segmentation dataset). The code is publicly available.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87199-4_47

SharedIt: https://rdcu.be/cyl4H

Link to the code repository

https://github.com/CAMP-eXplain-AI/CheXplain-Dissection

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors apply a well-established method for explaining the “black box” (neural network) to a regression network trained on the COVID-19 severity estimation task, and to a classification network trained on “healthy”/“unhealthy” lungs.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors carefully analyze the influence of different pre-training regimes and different loss functions on the number of emerging semantic concepts.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The paper is poorly written, which sometimes makes it hard to understand the authors’ ideas. For example, the central term of the paper, “semantic/individual concept detector,” is never formally introduced in the text.
    2. Some of the authors’ claims raise unanswered questions. For example, why is it important to have a large number of specific concept detectors (e.g. lungs)? Why is having a few not enough? Why does the number of detectors clearly associated with the task being solved decrease during training?
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors use publicly available datasets and provide code to reproduce their results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    1. From the text, it is not clear whether feature maps are stable, in the sense that if a feature map appears to be a lung detector for a training image, it will also be a lung detector for a test image.
    2. Since you use a pretty low IoU threshold for a feature map to be considered a detector, it might be useful to show “negative examples”, i.e., false-positive activations of a semantic detector.
    3. It would be useful to report (a) the percentage of “concept detectors” out of all feature maps and (b) the distribution of IoU scores. The latter could be useful for determining a “natural cut-off” for a feature map to be considered a semantic detector (instead of using the predefined 0.04; see the sketch after this list).
    4. Some typo corrections: p. 4, “activation map is first transformed in to” -> “into”; multiple semicolons are used where colons or simply periods belong, e.g. p. 2, “principal categories of methods for this purpose; Methods that”: use either “this purpose. Methods that…” or “this purpose: methods that…”.
    5. Formula (1): this loss is usually simply called WCE (weighted cross-entropy) or sometimes WBCE (weighted binary cross-entropy).
    6. Consider adding an explicit explanation of the term “individual concept detector”.
    7. How does the classification/regression network performance drop (or possibly rise) if you only use the selected semantic concept detectors and drop out the other network units (from the corresponding layer) during inference? Could this be a way of regularizing the network after several training steps?
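
    To make the detector criterion in items 2 and 3 concrete, here is a minimal sketch of a Network-Dissection-style IoU test (an illustration, not the authors' code; the function name and the 0.995 activation quantile are assumptions, the 0.04 IoU threshold is the one quoted above):

        import numpy as np

        def is_concept_detector(activations, concept_masks, quantile=0.995, iou_thresh=0.04):
            """A unit counts as a detector for a concept if the IoU between its
            binarized activation maps and the concept's annotation masks exceeds
            a threshold (0.04 in the setup discussed above).

            activations:   list of HxW activation maps of one unit,
                           upsampled to the annotation resolution
            concept_masks: list of HxW binary masks for one concept,
                           aligned image-by-image with `activations`
            """
            # Per-unit activation cut-off: keep only the top activations of
            # this unit across the whole annotation dataset.
            t = np.quantile(np.concatenate([a.ravel() for a in activations]), quantile)
            inter, union = 0.0, 0.0
            for a, m in zip(activations, concept_masks):
                binarized = a > t
                inter += np.logical_and(binarized, m).sum()
                union += np.logical_or(binarized, m).sum()
            iou = inter / max(union, 1.0)  # guard against an empty union
            return iou > iou_thresh, iou

    Collecting the returned IoU values over all units would directly give the distribution suggested in item 3.
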
  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The results are promising. However, the applicability of the method is not clear, and the text content and structure should be reworked.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    5

  • Reviewer confidence

    Not Confident



Review #2

  • Please describe the contribution of the paper

    The authors propose an application of the Network Dissection method to tag hidden units of a convolutional neural network with semantic concepts. The concepts are derived from the bounding-box annotations of the NIH ChestX-ray8 dataset and from the semantic segmentations of COVID-19 datasets. The paper presents extensive experiments that use such explanations to inform DL model design decisions, such as which dataset to use for model pre-training, weighted vs. unweighted loss functions, the interpretability benefits of different levels of supervision, and others.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper demonstrates that pre-training on domain-specific data such as a lung X-ray dataset helps improve the interpretability of the CNN model, as more concept detectors were identified.

    The paper demonstrates that the weighted cross-entropy loss improved the interpretability of the CNN model without any significant gain in prediction performance.

    The regression experiments demonstrated that higher supervision resulted in more semantic concepts.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The model relies on annotated data, in the form of bounding boxes or semantic segmentations, to provide explanations. Such annotated data is difficult to obtain, especially in the medical domain.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors did not provide many details on reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    The inference from Fig. 4 is not clear. In Fig. 4(a), the number of detectors for GGO changes a lot. Does this reflect that the model is uncertain about the detection?

    It’s not clear whether the proposed method provides a way to rank hidden units by importance. For example, if a hidden unit has a high IoU for multiple concepts, then this hidden unit is more important than a unit that is relevant for only a single concept. Given such a ranking, the authors could design an experiment to evaluate how much prediction accuracy depends on these important neurons: for example, if we remove the top-k neurons, how much does prediction accuracy drop (see the sketch below)? A significant drop would show that concept detectors not only exist in the intermediate representation, but also impact network accuracy. Reference: Understanding the role of individual units in a deep neural network.
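
    A minimal PyTorch sketch of the ablation experiment proposed above (an illustration under assumed names; the layer choice and the evaluate() helper are hypothetical, not from the paper):

        import torch

        def ablate_units(layer, unit_ids):
            """Zero out the given feature maps (channels) of `layer` during
            inference. Returns the hook handle so the ablation can be undone."""
            def hook(module, inputs, output):
                output = output.clone()
                output[:, unit_ids] = 0.0  # drop the selected units' activations
                return output              # replaces the layer's original output
            return layer.register_forward_hook(hook)

        # Hypothetical usage: remove the top-k "important" units, re-evaluate,
        # and compare against the unablated accuracy.
        # model = ...  # e.g. a DenseNet-121 chest X-ray classifier
        # handle = ablate_units(model.features.denseblock4, top_k_unit_ids)
        # acc_ablated = evaluate(model, test_loader)  # evaluate() is assumed
        # handle.remove()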

    Fig. 5 shows qualitative results for identifying the top contributing units for a prediction decision. A quantitative analysis of this finding is missing. An experiment similar to the one above could be done to validate the identified most-contributing hidden units.

    Tubes and airways cover a much smaller area than the bounding boxes for the different diseases. The authors should consider showing images for hidden units that detect such small features.

    In the experiments, most of the individual detectors are related to Cardiomegaly. An analysis of the mean area of the bounding-box annotations would help readers confirm that this finding is not simply due to large IoU over big bounding boxes.

    From the results, it’s not clear whether weakly labelled data helped the hidden units learn better concepts.

    There are multiple spelling and grammatical mistakes.

  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper provides a novel interpretability analysis for classification methods trained on chest x-ray images.

  • What is the ranking of this paper in your review stack?

    4

  • Number of papers in your stack

    7

  • Reviewer confidence

    Very confident



Review #3

  • Please describe the contribution of the paper

    The study focuses on a model interpretation technique. Unlike class activation maps (which are computed from the feature maps of the final convolutional layer), the current study observes the internal units and finds the images (or image regions) that activate neurons. The authors employed the Network Dissection method as the interpretation technique and applied it to CXRs for classification and regression problems.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    As far as I know, medical image analysis researchers mainly use class activation techniques (CAM, Grad-CAM, or the overlapping method) to interpret their neural network architectures. However, there are also claims that saliency maps do not accurately highlight the diagnostically relevant regions (https://www.medrxiv.org/content/10.1101/2021.02.28.21252634v1).

    Employing a different interpretation methodology from the computer vision discipline and applying it to CXR analysis is a contribution to the community. The methodology and its application to CXR are nicely formulated and illustrated.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The study does not introduce a new technique. The authors employed a technique proposed in the computer vision discipline and applied it to CXRs.

    The manuscript is hard to follow. The study introduces the use of a new interpretation concept in medical image analysis, the Network Dissection method. The subsequent contribution claims, such as “does a NN regression model that is only trained with COVID-19 severity scores implicitly learn visual patterns associated with thoracic pathologies” or “studying the effect of considering data imbalance on semantic internal units”, start to clutter the study and make it hard to follow. It would be much better to stick to the first contribution, present this idea as simply and understandably as possible, and compare its advantages with previous interpretation techniques.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The datasets used in the study (NIH CXR, CheXpert, and BrixIA) are publicly available. The Network Dissection method (from MIT) is publicly available on GitHub. The researchers of this study also share their code as a supplementary file (so presumably they will share it with the community after publication). The authors of this study nicely formulated the steps, and they used DenseNet-121. The analysis should be reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    1) From the study’s contribution list: “Does a network that is trained on a weakly labeled dataset implicitly learn pathologies?”

    Is this question already investigated for CXRs in the following article? “ChestX-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases.”

    2) Please see the following manuscript. It would be nice to compare Grad-CAM and the proposed interpretation technique on the same problem with the same dataset: https://www.medrxiv.org/content/10.1101/2021.02.28.21252634v1

    The researchers found a nice open area, focusing on interpretation techniques that would provide more understanding of which regions activate the black-box NN for its decision. For medical image analysis problems, we expect the decision to be driven mainly by the abnormality (the manifestations) of the disease. However, the researchers make several follow-up claims after introducing the initial methodology, the use of the Network Dissection method in CXR analysis. I would suggest the authors simplify the manuscript, stick to the main contribution, and analyze its advantages, disadvantages, and limitations in CXR analysis. That would be more useful for researchers interested in using Network Dissection in their own medical image analysis studies. The follow-up claims could be analyzed in another study.

  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    As far as I know, medical image analysis researchers mainly use class activation techniques (CAM, Grad-CAM, or the overlapping method) to interpret their neural network architectures. Employing a different interpretation methodology from the computer vision discipline and applying it to CXR analysis is a great contribution to the community. The methodology and its application to CXR are nicely formulated and illustrated. The analysis should be reproducible.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    3

  • Reviewer confidence

    Somewhat confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The authors describe an approach to explainability that aims to overcome the limitations of feature attribution methods. They identify the semantics associated with internal units (feature maps) of the network and apply their method to public datasets. The concept is novel and different from conventional approaches using class activation maps. However, there are some key issues raised by the reviewers that require clarification: the relevance to the medical domain and the soundness of the method description.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    5




Author Feedback

We thank all reviewers for their valuable feedback. All reviewers lean towards an “accept”. R2, R3, and AC point out the novelty. R3 states that our CXR interpretability analysis contributes to the community. The results are promising (R1), and the experiments/results demonstrate the effect of pretraining, imbalance loss, and supervision level on interpretability (R1,R2). The method and its application to CXR are well formulated (R3). R1 and R3 appreciate the reproducibility.

*

R1: What is an individual concept detector / a semantic unit?

A feature map (a detector, a unit) that is associated with an individual concept (e.g. consolidation) according to the method of [2], as described in Section 2.2.

*

R1 has questions/suggestions regarding the number of detectors and IoU threshold

Having a large number of detectors (or, equivalently, a high percentage) is not important in itself. What matters is comparing the number of unique detectors between models to compare their interpretability [2]. The IoU threshold affects the number of detectors similarly in all models and does not affect the comparison [2]; thus, selecting the threshold from the IoU distribution would not change the comparisons either.

*

R1: If a detector detects a concept in a training image, can it detect that concept in a test image?

We do not associate feature maps with concepts using training/test data. We use “unseen” data from annotation datasets, which is independent of the train/test split.

*

R1, R2 suggest experiments on the importance of semantic units

The computer vision work referenced by R2 shows that semantic units are important. Such an experiment would be an insightful addition, as it would show which pathologies are important for the model on the dataset.

*

R2: The method relies on annotated data, which is difficult to obtain in the medical domain

We show that it is feasible to combine annotations from multiple datasets and use them as a single annotation dataset. Moreover, with the growing number of public datasets, their combination will result in an ever more comprehensive annotation dataset. Nevertheless, our work in its current state demonstrates that it is feasible to perform the analysis on CXRs and derive novel insights.

*

R2: Quantitative analysis to check if units in Fig. 5 are important

Such an analysis is not needed, as we already select these units based on their high contribution to the prediction, which is computed by our method in Sec. 2.3.
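
For illustration, one common way to score per-unit contributions is gradient × activation summed over each feature map. The sketch below shows that approximation only; it is an assumption for exposition, not necessarily the attribution defined in the paper's Sec. 2.3:

    import torch

    def unit_contributions(feature_maps, score):
        """Approximate each unit's contribution to a scalar prediction `score`
        as (d score / d activation) * activation, summed spatially.
        feature_maps: tensor of shape (1, C, H, W) retained from the forward
        pass (e.g., via a forward hook), still attached to the graph.
        """
        grads = torch.autograd.grad(score, feature_maps, retain_graph=True)[0]
        contrib = (grads * feature_maps).sum(dim=(2, 3)).squeeze(0)  # shape (C,)
        return contrib  # e.g. rank units by contrib.abs(), descending

Units with the largest contributions would then be paired with their concept labels to produce a semantic explanation of the prediction.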

*

R2: Do the fluctuations in the number of GGO detectors in Fig. 4 reflect uncertainty?

We cannot infer probabilistic uncertainty. One conclusion from Fig. 4 is that concept detectors emerge during training (though the number fluctuates for some).

*

R2: Did the weakly labeled data help?

We observe that semantic detectors emerge in models trained on weak annotations (healthy/unhealthy).

*

R2: Most detectors are for Cardiomegaly. Is this due to large bounding boxes?

In Fig. 2, other pathologies (consolidation, infiltrate) also have equally high numbers. Furthermore, the left lung, which has large labels, has relatively few detectors. Thus, we cannot draw such a conclusion.

*

R3: Is “Does a network that is trained on a weakly labeled dataset implicitly learn pathologies” investigated in [23]?

No. [23] trains the classification model on pathology labels and then performs “weakly supervised localization”. They use the term “weak” because the localization is done without training on bounding boxes. We use “weak” because we have weaker classification labels (healthy/unhealthy) instead of pathology labels.

*

R3 considers applying [2] to CXR a novel contribution, yet also states that the technique is not novel

We also have technical novelty, namely our semantic attribution method (Sec. 2.3). Nevertheless, the analysis and insights are our main contributions.

*

R3: Difficult manuscript; better to investigate the questions (on pretraining, etc.) in follow-up work

We appreciate the suggestion. In this work, our main contribution is using [2] to investigate these questions (acknowledged by R1 and R2).




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have clarified the main misunderstandings around the method description and terminology. In the final version, the authors should consider adding the experiments suggested by R1 and R2.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    4



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper was reviewed mostly positively by three reviewers. I agree with the following statement from the primary AC’s meta-review: “They identify the semantics associated with internal units (feature maps) of the network and apply their method to public datasets. The concept is novel and different from conventional approaches using class activation maps.” The authors did a good job of clarifying concerns raised in the original reviews.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    6



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors tackle an important problem. While the degree of methodological novelty may be limited, the analysis and insights are very worthwhile. I especially like the analysis of visual feature attribution when performing simple healthy-vs-unhealthy classification, which raises the very interesting question of whether the classifier actually picks up disease-specific features.

    Minor:
    – It is best to describe the classes as disease patterns rather than pathologies; one cannot diagnose a specific disease from a CXR.
    – I would avoid the term “weakly labeled” for healthy vs. unhealthy, as it can be confusing: that term commonly means something else. Perhaps “coarsely labeled”?
    – There are plenty of typos; as these are extremely easy to catch with a spell checker, this is particularly egregious.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    4


