
Authors

Thomas Henn, Yasukazu Sakamoto, Clément Jacquet, Shunsuke Yoshizawa, Masamichi Andou, Stephen Tchen, Ryosuke Saga, Hiroyuki Ishihara, Katsuhiko Shimizu, Yingzhen Li, Ryutaro Tanno

Abstract

Machine learning models commonly exhibit unexpected failures post-deployment due to either data shifts or uncommon situations in the training environment. Domain experts typically go through the tedious process of inspecting the failure cases manually, identifying failure modes and then attempting to fix the model. In this work, we aim to standardise and bring principles to this process through answering two critical questions: (i) how do we know that we have identified meaningful and distinct failure types?; (ii) how can we validate that a model has, indeed, been repaired? We suggest that the quality of the identified failure types can be validated through measuring the intra- and inter-type generalisation after fine-tuning and introduce metrics to compare different subtyping methods. Furthermore, we argue that a model can be considered repaired if it achieves high accuracy on the failure types while retaining performance on the previously correct data. We combine these two ideas into a principled framework for evaluating the quality of both the identified failure subtypes and model repairment. We evaluate its utility on a classification task and an object detection task. Our code is available at https://github.com/Rokken-lab6/Failure-Analysis-and-Model-Repairment
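
As a rough illustration of the repairment criterion stated in the abstract (high accuracy on the failure types while retaining performance on the previously correct data), the sketch below shows how such a check might be implemented. It is not the authors' code: the `predict` interface, the threshold values, and the function name are assumptions for illustration only.

```python
import numpy as np


def is_repaired(model, failure_sets, correct_set,
                acc_threshold=0.9, retention_threshold=0.95):
    """Hedged sketch of the repairment criterion described in the abstract.

    failure_sets : dict mapping failure-type name -> (images, labels)
    correct_set  : (images, labels) the original model classified correctly
    The thresholds are illustrative, not values from the paper.
    """
    def accuracy(images, labels):
        preds = model.predict(images)  # assumed sklearn/keras-style API
        return float(np.mean(preds == labels))

    # (i) the repaired model should now succeed on each identified failure type
    fixes_failures = all(accuracy(x, y) >= acc_threshold
                         for x, y in failure_sets.values())

    # (ii) it should retain performance on data it previously got right
    retains_correct = accuracy(*correct_set) >= retention_threshold

    return fixes_failures and retains_correct
```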

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87199-4_48

SharedIt: https://rdcu.be/cyl4I

Link to the code repository

https://github.com/Rokken-lab6/Failure-Analysis-and-Model-Repairment

Link to the dataset(s)

https://medmnist.com/


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a general framework for analysing failures and repairing machine learning models in medical image analysis. It could inspire more work in this direction in the future.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Strength:

    1. A good attempt at validating failure cases in machine learning for medical image analysis.
    2. Although the main framework is simple, the results support the claims of this work well.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Weakness:

    1. The technical contribution seems insufficient.
    2. More experiments are needed to evaluate the proposed method.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of this work seems acceptable. The hyperparameters are also presented in detail.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    This paper proposes a simple framework to validate the discovery of failure cases and the repairment of these cases in medical image analysis based on deep learning. The idea is interesting and provides a good direction for future work. However, the main framework is a simple combination of intuitive metrics to identify typical failures. Regarding the ‘independence’ criterion, the authors claim that failure cases should be independent of each other. However, this may not hold in practice: the underlying factors are usually related, and improving on one factor may also help other cases. I also expect the authors to evaluate the proposed method on more datasets, since the claims of this work are strong. The experiments on two datasets are not sufficient.

  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Good innovation for failure identification in ML-based medical image analysis. However, the technical contribution is not sufficient.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain



Review #2

  • Please describe the contribution of the paper

    The authors propose an approach for addressing failure analysis of machine learning methods. Their method standardizes and automates a process that is traditionally tedious and ad hoc. It consists of first clustering incorrect classifications and then retraining on these examples to improve classification.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Failure analysis is an important part of machine learning model development and automating the process would be invaluable if it worked consistently. The idea of clustering failed cases is an interesting approach to the problem, which could allow the process to be automated without the expensive manual step currently required.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weakness of the paper is the unclear differentiation between training and testing data in the proposed approach. One subsection uses superscripts for training and testing, but the rest of the paper omits these superscripts and does not state whether the data used for fine-tuning is training or testing data. This is an important distinction if the approach were to be applied.

    Based on the results in Table 2, my impression is that the suggested fine-tuning is meant to use the testing data. If this is the case, then the proposed method would lead to overfitting in the classifier, since it uses direct knowledge of the testing labels to adjust the classifier. Such a process could be done on a validation set, but then evaluation would need to be performed on a held-out test set, which does not seem to be the case here.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The code will be made available and the data is publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    As mentioned above, the main issue I have is with using testing data for model fine-tuning. Performance is expected to increase when testing data is used to train the model. Such a performance increase is not expected to be robust, as it likely results in overfitting to the test data rather than in a generalised classifier. I would suggest performing this fine-tuning on a validation set and then evaluating both the original model and the fine-tuned model on a held-out test set to see whether this process yields a truly improved classifier.

    In the example where the model was fine-tuned on individual failure classes, the result would be multiple models, one per class (as shown in Figure 2, none of these models works well across classes). It is unclear how such a model would be used prospectively, as it requires identifying a failure type for the data without knowledge of the data. Would new cases first be assigned to a “failure” cluster (even if they are potentially correct) and then have that classifier applied to them?

  • Please state your overall opinion of the paper

    probably reject (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Proposing refinement of the model on test data raises questions about how robust the method would be in a prospective application. Evaluation would need to be performed on a held-out test set to validate the proposed approach.

  • What is the ranking of this paper in your review stack?

    4

  • Number of papers in your stack

    4

  • Reviewer confidence

    Somewhat confident



Review #3

  • Please describe the contribution of the paper

    The authors propose a method to discover a set of failure types automatically through gradient and feature space clustering, which they demonstrate can outperform manual inspection.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This is a promising framework that proposes two metrics to evaluate the meaningfulness of a set of failure types in an objective and quantifiable way. Moreover, they demonstrate that automatic clustering can recover relevant clinical failure types such as undetected catheters close to the ultrasound probe in ICE-CD or the distinction between the two malignant classes in PathMNIST.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main novelty of the proposed work lies in the two metrics for assessing failure types.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors use open-source datasets to test the feasibility of their method.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    This is a promising piece of work that has the potential to be clinically feasible. However, the approach is not clearly defined and the paper is confusing to read. It would be helpful to include clear example images of failure types in the paper, along with further details and justification of how “correct” and “failure” cases were validated.

  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A promising method that warrants further investigation, but it is not clearly described or presented.

  • What is the ranking of this paper in your review stack?

    3

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes an automated framework for detecting and repairing failures of machine learning methods for medical imaging. The strength of the paper is the idea of a framework based on different metrics and repairment strategies. However, as presented, there are several issues that the reviewers point out. Please address all their comments, especially those concerning test data and overfitting, clarity, and highlighting the technical contribution and novelty.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    5




Author Feedback

We thank the meta-reviewer and reviewers for their constructive feedback. Failure analysis and understanding of machine learning models are essential in practice for developing reliable medical imaging applications, yet understudied in the MICCAI community. Indeed, in the face of new failure cases, we typically employ ad-hoc procedures to understand and fix the failures. In this work, we aim to standardise and bring principles to this process by introducing the first framework for not only automatically deriving subtypes of failure cases and measuring their quality, but also repairing the models and verifying their generalisation. Finally, we will open-source the code upon publication.

The main criticisms are summarised as: (1) the need for clarification of the technical contributions and novelty (R1); (2) the concern that the same test data might be used to both repair and evaluate the model, i.e. overfitting the test data (R2); (3) the method’s exposition is hard to understand (R3).

Reg. (1), in addition to introducing a general framework for an under-explored but critical problem, we make several contributions that are worth highlighting. First, we put forward a set of desirable properties for meaningful failure types. Then, we propose novel surrogate metrics to assess those properties in practice. Moreover, we argue that model repairment should not only aim to fix the failures but also retain performance on the previously correct data. We show specific instantiations of our framework and demonstrate that clustering in feature and gradient space can automatically identify clinically important failures and outperform manual inspection. Additionally, we show that repairment can be done not only by fine-tuning but also by continual learning methods such as Elastic Weight Consolidation (EWC) [1]. Furthermore, we share insights such as the observation that fixing one failure type at a time is easier than fixing them all at once, which is intuitive and useful when some failure types have higher priority than others.
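
To illustrate the kind of failure-subtype discovery mentioned above, the sketch below clusters failure cases in the penultimate-layer feature space with k-means. This is a hedged example rather than the authors' implementation: the `model.features` attribute, the choice of `n_clusters`, and the use of PyTorch and scikit-learn are assumptions; the paper also describes clustering in gradient space, which is not shown here.

```python
import numpy as np
import torch
from sklearn.cluster import KMeans


def cluster_failures_by_features(model, failure_images, n_clusters=5, device="cpu"):
    """Group misclassified images into candidate failure types by clustering
    their penultimate-layer features (illustrative sketch only)."""
    model.eval().to(device)
    feats = []
    with torch.no_grad():
        for x in failure_images:               # each x: tensor [C, H, W]
            x = x.unsqueeze(0).to(device)
            f = model.features(x)              # assumed feature-extractor attribute
            feats.append(f.flatten(1).cpu().numpy())
    feats = np.concatenate(feats, axis=0)      # shape [N, D]

    # k-means in feature space; each cluster is a candidate failure subtype
    labels = KMeans(n_clusters=n_clusters, random_state=0, n_init=10).fit_predict(feats)
    return labels
```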

Reg. (2), we do not have this issue of test data leakage, and we will clarify this in the text. We want the repaired models to generalise, and thus each failure type is split into two sets: one is used to repair the model (e.g., by fine-tuning or EWC) and the other is used to evaluate generalisation. Our experiments show promising results in this direction. For example, Table 2 is derived from images that are not seen by the models during either the initial training or the repairment. As R2 also points out, this misunderstanding was likely due to the notation we used to denote these sets.
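
The split described here could look roughly like the following sketch, in which each failure type is divided into a repair subset and a held-out subset, and intra- versus inter-type generalisation is measured on the held-out data only. The 50/50 split ratio, the function names, and the `accuracy_fn` callable are illustrative assumptions, not details from the paper.

```python
import numpy as np


def split_failure_types(failure_types, repair_fraction=0.5, seed=0):
    """failure_types: dict name -> list of (image, label) pairs.
    Returns per-type repair and held-out evaluation subsets."""
    rng = np.random.default_rng(seed)
    repair, held_out = {}, {}
    for name, samples in failure_types.items():
        idx = rng.permutation(len(samples))
        cut = int(len(samples) * repair_fraction)
        repair[name] = [samples[i] for i in idx[:cut]]
        held_out[name] = [samples[i] for i in idx[cut:]]
    return repair, held_out


def generalisation_report(model_per_type, held_out, accuracy_fn):
    """Intra-type accuracy: model repaired on type t, evaluated on held-out t.
    Inter-type accuracy: the same model evaluated on the other types."""
    report = {}
    for t, model in model_per_type.items():
        intra = accuracy_fn(model, held_out[t])
        others = [accuracy_fn(model, held_out[u]) for u in held_out if u != t]
        inter = float(np.mean(others)) if others else float("nan")
        report[t] = {"intra": intra, "inter": inter}
    return report
```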

Reg. (3), we will review the exposition of our method. Please see our responses to (1) and (2) for clarification on validation details, notations, and our method’s contributions. For example, in relation to (2), we will clarify how the data is first divided into failure types and then into training, validation, and testing sets using a consistent notation.

We also address other points below:

R1: Complete independence of failure types may not be achievable in practice: We agree that the notion of independence is continuous: two failure types might not be 100% independent, but 80% independence is more informative than 30%. We will clarify this point.

R2: How can the models repaired on individual failure types be used in practice? How about selecting the model at test time by categorising the input image?: We think that R2’s suggestion is interesting. If the process of assigning each image to a failure cluster is consistent, the selected model would have better performance than a single general model, as shown in Table 2. In addition, we suggest that fixing mistakes one by one could lead to better models. We will add a discussion to Sec. 4.3.

R3: Clearer examples of failure types should be included: We already provide examples in Fig. 3 and Fig. 4. We will add more detailed descriptions in the captions.

Refs: [1] Kirkpatrick et al., PNAS 2017.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This work seems very interesting and relevant for the MICCAI audience. However, the clarity of the writing needs to be improved, since all reviewers expressed some confusion about details. It was good to clear up the misunderstanding about the test data that R2 pointed out. However, a lot of the writing still needs to be tightened, and I do not think the paper is in a state that is ready to be published.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    11



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This submission proposes an automated approach for analysing failure cases and identifying potential repairment of the model for medical image analysis. The reviewers agree that this is an interesting direction and that the idea has potential for future work. In the rebuttal, the authors addressed most of the concerns (e.g., the differentiation between training and testing data). I do agree with the reviewer who raised the issue that more experimentation would be needed to support the strong claims. However, I see the value of this submission as a great starting point that could foster more research in this direction, as it is currently “understudied in the MICCAI community” (rebuttal).

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    7



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper offers a new perspective that has been understudied in the MICCAI community - failure analysis and model repairment. While the technical innovation might not be very high in this paper, it will serve as a good pointer for more research in this direction.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    11


