
Authors

Yangsibo Huang, Xiaoxiao Li, Kai Li

Abstract

Data auditing is a process to verify whether certain data have been removed from a trained model. A recently proposed method (Liu et al.) uses Kolmogorov-Smirnov (KS) distance for such data auditing. However, it fails under certain practical conditions. In this paper, we propose a new method called Ensembled Membership Auditing (EMA) for auditing data removal to overcome these limitations. We compare both methods using benchmark datasets (MNIST and SVHN) and Chest X-ray datasets with multi-layer perceptrons (MLP) and convolutional neural networks (CNN). Our experiments show that EMA is robust under various conditions, including the failure cases of the previously proposed method. Our code is available at: https://github.com/Hazelsuko07/EMA.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87240-3_76

SharedIt: https://rdcu.be/cyl6T

Link to the code repository

https://github.com/Hazelsuko07/EMA

Link to the dataset(s)

MNIST: http://yann.lecun.com/exdb/mnist/

SVHN: http://ufldl.stanford.edu/housenumbers/

COVIDx: https://github.com/ieee8023/covid-chestxray-dataset

ChildX: https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia/version/1


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper aims to solve the problem of data auditing (verifying whether a set of data has been removed/memorized/forgotten). After spotting two limitations of the previous work, the authors propose an ensemble-based membership inference method to address these issues. Results on benchmark datasets and medical datasets show the superiority of the method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors clearly spotted the limitations of the previous SOTA on medical data auditing by analyzing the theory and the experimental results.
    2. The proposed method is technically sound. The authors used three metrics, i.e., correctness, confidence, and negative entropy, for sample-wise auditing, whereas [8] uses only confidence. Moreover, the authors mentioned that simple majority voting fails, which motivates the distribution-wise auditing with the K-S test (see the sketch after this list).
    3. Figure 1 is clear and helpful to understand the problem and method.
    4. The authors used two Chest X-ray datasets including a COVID dataset, which is a plus.
    5. The code is available at the anonymous link, which is good.
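
    For concreteness, here is a minimal sketch of what such per-sample metrics could look like, assuming the audited model outputs a softmax probability vector for each sample; the function and variable names are illustrative and not taken from the paper's code.

    ```python
    import numpy as np

    def sample_metrics(probs, label):
        """Per-sample membership features for a single query item.

        `probs` is assumed to be the model's softmax output (a 1-D probability
        vector) and `label` the true class index; names are illustrative only.
        """
        correctness = float(np.argmax(probs) == label)       # 1.0 if the prediction is correct
        confidence = float(probs[label])                      # probability assigned to the true class
        neg_entropy = float(np.sum(probs * np.log(probs + 1e-12)))  # negative Shannon entropy
        return correctness, confidence, neg_entropy

    # Illustrative call on a dummy 3-class prediction
    print(sample_metrics(np.array([0.1, 0.7, 0.2]), label=1))
    ```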
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The authors should also include the results of majority voting as an ‘ensembling baseline’. It is natural to extend MIA to dataset-wise auditing by checking whether each sample is in the training dataset and then doing majority voting. However, this has the problem of not considering the distribution-wise statistical overlap, which is the major difference between [8] and MIA. I also suggest the authors discuss this more to emphasize that, when auditing a dataset, simply extending MIA is problematic and hence the problem of ‘dataset auditing’ is not trivial.
    2. Membership inference attacks (MIA) are a popular and large research field, but the content in this paper about MIA is not self-contained. For example, why can the calibration model serve as a ‘reference model’ for the decision rule? I highly suggest the authors include more content about MIA, including the overall high-level idea and recent work, to better benefit the community.
    3. In terms of the quality of the calibration dataset, I suggest the authors also vary the number of data points. In this paper, low quality means ‘noisy’ or rotated. In [8], the authors assumed that the calibration dataset is large, as they can sample as much data as possible (which is doable in practice) to mimic the training dataset.
    4. Can you please elaborate more on the setting of the threshold ‘alpha’? In this paper, it is set to 0.1 by default. Is there any reason it is 0.1 rather than another value? Does this affect the accuracy/performance of the method? Should it be altered for different datasets?
    5. The authors claim that [8] fails when there are only a few classes, the reason being that the K-S distance is not a good measure in that regime. In the proposed method, the authors also use the K-S test. Why does the proposed method not fail in this case? I suggest the authors elaborate more on this. Alternatively, the authors should give clear definitions/equations of the K-S test, the MIA algorithms, confidence, negative entropy, etc. (see the sketch after this list).
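
    Regarding points 4 and 5, a minimal sketch of how a two-sample K-S test could turn per-sample membership scores into a dataset-level decision with a significance threshold alpha is shown below; it uses scipy's ks_2samp and is only one plausible instantiation, not necessarily the exact rule used by EMA or [8].

    ```python
    import numpy as np
    from scipy.stats import ks_2samp

    def dataset_forgotten(query_scores, calibration_scores, alpha=0.1):
        """Compare per-sample scores of the query set against scores from a
        calibration (assumed non-member) set with a two-sample K-S test.

        If the two score distributions are statistically indistinguishable
        (p-value above alpha), the query data are judged as not memorized.
        """
        statistic, p_value = ks_2samp(query_scores, calibration_scores)
        return p_value > alpha

    # Illustrative call with random scores (alpha = 0.1 mirrors the paper's default)
    rng = np.random.default_rng(0)
    print(dataset_forgotten(rng.random(200), rng.random(200)))
    ```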
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The code is available. This work is reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    See weaknesses.

  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    There are some problems in this paper (see weaknesses). Overall, I think it is a good paper on privacy/data auditing in the medical domain, considering the novelty of the proposed method, the quality of the writing, the experiments, and the good results. Hence, I recommend accepting this paper.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    The paper proposes a method for auditing a trained model to check whether data has been removed from the training set. The procedure involves estimating a number of metrics on each item in the ‘query dataset’ and inferring membership based on thresholds obtained from a calibration dataset that has a similar distribution to the training data. The method is evaluated on two classification tasks: digit classification with MNIST, and X-ray pneumonia classification.
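
    As a rough sketch of the procedure described above, thresholds for each metric might be derived from a calibration set and then applied to the query items; the percentile choice and all names below are assumptions for illustration, not the paper's actual decision rule.

    ```python
    import numpy as np

    def calibrate_thresholds(calibration_metrics, percentile=90):
        """Derive one threshold per metric from calibration (non-member) data.

        `calibration_metrics` is an (n_samples, n_metrics) array of per-sample
        scores; the 90th-percentile cut-off here is purely illustrative.
        """
        return np.percentile(calibration_metrics, percentile, axis=0)

    def infer_membership(query_metrics, thresholds):
        """Flag a query item as a likely training-set member if every metric
        reaches its calibration-derived threshold."""
        return np.all(query_metrics >= thresholds, axis=1)

    # Illustrative call with random metric values for 5 query items and 3 metrics
    rng = np.random.default_rng(0)
    thr = calibrate_thresholds(rng.random((100, 3)))
    print(infer_membership(rng.random((5, 3)), thr))
    ```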

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The method offers a substantial improvement over the method by Liu et al. in the results shown, and appears more robust to a poor-quality calibration set.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • I’m not convinced the evaluation represents a real-world scenario
    • There are so many variables that could be altered in the evaluation that it is hard to know just how well the method performs in practice
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Relatively good. Anonymous link to code is made available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    I would like to see how EMA performs as a function of the number of data points removed. I imagine it would often be the case that only a very small number would be removed, so the situation evaluated here (where 20% of the data is removed) doesn’t seem to represent a realistic scenario.

    I would be interested to understand how the performance of EMA varies with the performance of the trained model itself. For example, presumably the task is easier if the model is overfit to the training data. Does EMA perform worse on a model that generalises well to new datasets?

    I would like to see some justification for the 3 metrics chosen (correctness, confidence, and negative entropy). As written, they appear plucked out of thin air. Did you evaluate other metrics before you decided to use these three? What motivates these particular 3?

    There are many ways calibration sets can be degraded, and the methods used here (addition of Gaussian noise and rotation) represent a very small subset of those. I would really like to see a broader set of degradations applied, or even data directly taken from a different distribution. Of course, I understand such comprehensive evaluation might fall outside the limits of what is realistically achievable in a MICCAI paper, but it is something to bear in mind if this work is extended to a full paper.

    It would be informative to reduce the k parameter even further, below 50, to get an understanding of when EMA starts to break down.

    The claims made about the method in Table 1, that EMA does not require a high-quality calibration set and that it is robust to a query set similar to the training set, seem to me empirically determined results rather than properties that obviously emerge from the EMA algorithm. Consequently, it seems strange to claim these as features in Table 1 in the methods section.

  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The improvements in the method seem significant enough that I suspect this paper would be of interest to the community. I have some misgivings about how realistic the evaluation is, but understand that space for comprehensive evaluation is limited in a conference paper.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    4

  • Reviewer confidence

    Confident but not absolutely certain



Review #3

  • Please describe the contribution of the paper

    This paper proposes a novel method to detect whether certain training data has been memorised by a trained model and a 2-step method to remove it. The proposed method improves the cost efficiency of previous approaches and provides better resiliency against various practical challenges. The paper demonstrates the results using benchmark datasets and a Chest X-ray dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Data sensitivity is an ongoing challenge, and the prospect of having to retrain a model each and every time a data donor decides they would like to revoke their permission is certainly a painful one, so a method that can cleanly extract specific contributions could be a benefit to the community. It is also an interesting data science problem in its own right. The results appear to show that the method is effective at detecting whether a dataset has been removed from a trained model.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I’m not sure that it is realistic to require immediate extraction of a specific data contribution from a model. It is more likely that the next retraining would exclude any donors who have revoked their permissions in the meantime, in the same way as you might include new contributions. In the absence of a method to extract specific contributions without retraining, is it sufficient to provide a metric that predicts whether a dataset has been memorised? Data removal from a trained model is more about perception than data science, so the reality is that this method would never be a sufficient substitute for retraining.

    The method is initially validated using a 3-layer MLP, which is considerably less complex than today’s medical imaging models, raising the question of how robust this technique would be on real-world models. A later experiment does use ResNet, but has some other issues related to the ChildX dataset used.

    Whilst it is appreciated that the proposed method and the contrasting prior technique use different metrics, this makes the comparison rather more difficult for the reader to interpret. Figure 2 is rather difficult to decipher; the caption does little to explain it.

    The Chest X-ray experiment using the ChildX dataset is not a very credible evaluation, since there are fundamental differences in children’s anatomy that would make it substantially different from anything else in the training set.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Code is provided by URL in the paper

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    “The auditory institution…” I think the use of the word “auditory” is mistaken here - should probably be “auditing”. I would rephrase the parts that refer to “…method for data removal…”. It sounds as if the proposed method is doing the removal as well as the detection.

  • Please state your overall opinion of the paper

    borderline reject (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think that the motivation for this work is somewhat unrealistic. It seems unlikely that the proposed method would be a solution to anything of practical value. I am also not sure that the fact that one experiment was based on a COVID dataset makes this of sufficient interest to the MICCAI community.

  • What is the ranking of this paper in your review stack?

    4

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    All reviewers agree on the merits of the method and its improvements over the baseline. The method is well thought out, the article is well written, and the results are convincing. R1 and R3 are very positive about the article. R4 raises an important concern about the scenario in which the authors apply their method and whether the proposed method would have any practical value for medical image computing. I agree with these concerns and would like to invite the authors to respond to them in a rebuttal. In terms of topic, it is not a usual MICCAI topic, but I think it would lead to interesting discussions at the conference.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    8




Author Feedback

We thank all the reviewers. We are grateful that R1 & 3 appreciate the novelty and significance of our data auditing work. We are sorry that R4 misinterpreted our work.

1. Reality of the setup - R4: We believe that R4 misunderstood EMA as a data removal method, leading to the claim that our problem is not realistic. R4 summarized our work in Q1 as ‘proposes xxx a 2-step method to remove it (data)’ and stated that ‘the proposed method is doing the removal’ in Q6. These statements are INCORRECT. EMA is not a data removal process, but an auditing process. R4 also misinterpreted us as requiring immediate extraction of certain data, which R4 considered less realistic than retraining. We have no such requirement. In fact, EMA can audit the outcome of any data removal method.

In many parts of our paper, we stated that EMA ‘is a method to measure if certain data are memorized in the trained model’ or ‘to audit if data has been removed from the training set.’ The problem we considered seems to be clear to R1 & 3. Nevertheless, we will further clarify the problem setting in our revision.

  2. Practical value for medical image computing - R4: We study the problem against the real-world background stated in the introduction: ‘regulations require healthcare providers to allow individuals to revoke data.’ Sharing the same goal as a MICCAI 2020 paper [8], we provide a solution to audit whether data were used to train a model, for example when someone claims to have removed the data. R4 mistook our work for a data removal solution and thus challenged its practical value. To show the potential application in the medical imaging field, we validated our method using two large X-ray datasets, which R1 treated as a plus. We believe this topic is essential for today’s AI practitioners in the medical imaging field to ensure secure deployment.

  3. Rationality of using the ChildX dataset (children’s X-rays) as query data - R4: We agree that ChildX is different from COVIDx. This is by design: we aim to test EMA’s performance when the query dataset differs from the data used in training. This is similar to the design of the benchmark setting, where training uses MNIST but we also test SVHN as a query dataset. It is also consistent with the setting in the baseline [8].

  4. Model and data selection - R4: We designed a simple model for the benchmark experiment to verify the effectiveness of the proposed method. As R4 acknowledged, we used a more complex model for the real medical dataset. We clarified the rationale for selecting ChildX in Response 3.

  5. Discussion of parameters in realistic settings - R1 & 3: Due to space limitations, we mentioned in Section 5 that discussing more parameters is left as future work. Nevertheless, we provide a few more results here to show the method’s flexibility in practice.
  - Number of removed (query) data points M (R3): we varied M and found that EMA works robustly when M > 20.
  - Number of calibration data points N (R1): compared to [8], we used fewer calibration data (2000) but achieved better results. During the rebuttal period, we ran additional experiments varying N and found our method stably successful when N > 1000.
  - Alternative degradations of the calibration set (R3): we validated the feasibility of using calibration data that differ from the training data (e.g., FashionMNIST as calibration data while training uses MNIST), which is supported by Salem et al., ‘ML-Leaks: Model and Data Independent Membership Inference Attacks and Defenses on Machine Learning Models.’
  - alpha (R1): alpha is the threshold for the probability of forgetting, i.e., the p-value. We used 0.1 as it is a typical significance level. In real deployment, alpha can be selected according to the user’s preferred significance level.

  6. Comparison with majority voting (MV) - R1: We agree with R1 that MV is a natural baseline. We have results for MV, which gave all-false-positive detections on the dataset not used for training. This is why we state that ‘majority voting may not achieve reliable results’ in Section 3.2.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I agree with the authors’ rebuttal; they address R4’s comments well. While this article is a bit off from the main topics of MICCAI, it is an interesting contribution that may eventually become an important topic with increasing regulation.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    11



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This submission proposes a method for data auditing of a trained model. All reviewers agree that this is of major interest to the MICCAI community and that the results are impressive. In the rebuttal, the authors addressed the concerns raised by the reviewers. However, I would like to point out that not only R4, but also R3 raised the concern that the presented setup may not be realistic. Taking all reviews and the rebuttal into consideration, I’m inclined to accept the paper. It would be beneficial to discuss this a bit more in the manuscript and I would like to encourage the authors to make it very clear in the final version that the presented method is not for the removal of the data from the model to avoid confusion.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    6



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper addresses an interesting question which might become more important as machine learning models start to be used more widely in actual clinical practice. It is a relatively less explored direction and I believe it would be of interest to the MICCAI community. The rebuttal addressed the major concerns raised by one of the reviewers, in my opinion.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    7


