
Authors

Sergio Naval Marimont, Giacomo Tarroni

Abstract

We propose a novel unsupervised out-of-distribution detection method for medical images based on implicit field image representations. In our approach, an auto-decoder feed-forward neural network learns the distribution of healthy images in the form of a mapping between spatial coordinates and probabilities over a proxy for tissue types. At inference time, the learnt distribution is used to retrieve, from a given test image, a restoration, i.e. an image maximally consistent with the input one but belonging to the healthy distribution. Anomalies are localized using the voxel-wise probability predicted by our model for the restored image. We tested our approach in the task of unsupervised localization of gliomas on brain MR images, and compared it to several other VAE-based anomaly detection methods. Results show that the proposed technique substantially outperforms them (average DICE 0.640 vs 0.518 for the best performing VAE-based alternative) while also requiring considerably less computing time.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87196-3_18

SharedIt: https://rdcu.be/cyl1I

Link to the code repository

https://github.com/snavalm/ifl_unsup_anom_det

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents an anomaly detection method that relies on an implicit field learning mechanism. An auto-decoder is used to learn the latent distribution of intensity-encoded images and maps spatial coordinates to intensity cluster probabilities. At inference time, these probabilities are then used to compute a voxel-wise anomaly score.
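
    For concreteness, a minimal sketch of such an auto-decoder implicit field in PyTorch; the layer sizes, the number of intensity clusters and all names are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class ImplicitFieldAutoDecoder(nn.Module):
    """Maps a per-image latent code plus spatial coordinates to logits
    over intensity clusters (a proxy for tissue types)."""
    def __init__(self, latent_dim=128, coord_dim=3, n_clusters=7, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + coord_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_clusters),  # logits over intensity clusters
        )

    def forward(self, z, coords):
        # z: (B, latent_dim) per-image code; coords: (B, N, coord_dim),
        # where coords may be Fourier-encoded (coord_dim then grows accordingly).
        z = z.unsqueeze(1).expand(-1, coords.shape[1], -1)
        return self.net(torch.cat([z, coords], dim=-1))  # (B, N, n_clusters)
```

    In the auto-decoder setting there is no encoder: one latent code per training image is optimized jointly with the network weights, here via cross-entropy on sampled voxel coordinates.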

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The proposed adaptations of implicit field networks are interesting.
    • With an auto-decoder based model, restored images are expected to be maximally faithful to the input image while complying with the healthy distribution.
    • The implicit field mechanism, which learns voxel-wise mappings, decouples the trained model from the input image resolution and effectively increases the training set size (reported results are based on models trained without data augmentation); see the sketch after this list.
    • Comparative results with VAE-based methods are convincing.
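
    To illustrate the last point: because the network is queried per coordinate, every training iteration can draw a fresh set of voxel samples. A minimal sketch, where the sample count k and the normalisation convention are assumptions.

```python
import torch

def sample_training_points(volume_bins, k=10000):
    # volume_bins: (D, H, W) intensity-bin labels for one training image.
    d, h, w = volume_bins.shape
    idx = torch.stack([torch.randint(0, s, (k,)) for s in (d, h, w)], dim=-1)  # (k, 3)
    # Normalise integer voxel indices to continuous coordinates in [-1, 1].
    coords = idx.float() / torch.tensor([d - 1, h - 1, w - 1]) * 2 - 1
    targets = volume_bins[idx[:, 0], idx[:, 1], idx[:, 2]]  # (k,) bin labels
    return coords, targets
```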
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The paper lacks justification for design choices such as the need for coordinate encoding and the benefits of the auto-decoder versus an auto-encoder (given the computing-time discussion in the results section).
    • The proposed intensity encoding ignores spatial correlations, and relies on a subsequent denoising (mode-pooling).
    • There is no discussion of how hyperparameters were selected, e.g., the number of clusters for intensity encoding.
    • The paper does not report comparisons with GAN-based methods.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • The authors made code and data available.
    • The paper contains enough technical details for a reader to reproduce the method.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • How are hyperparameters selected?
    • What is the rationale behind the coordinate encoding function? Is using the coordinate values without encoding expected to have poor performance?
    • Why are different mode-pooling filter sizes used for training and testing/validation sets?
    • Is the proposed method expected to outperform GAN-based methods (e.g., AnoGAN, which relies on optimized latents)? Why or why not?
  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents an interesting adaptation of implicit field networks to the anomaly detection task. The results and comparison with VAE-based methods are convincing. However, the paper needs to be revised to further expand on the justification of the different design choices, the selection of hyperparameters, and comparisons with GAN-based methods.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    The paper proposes to use implicit representations for localized anomaly detection, which to the best of my knowledge is a novel approach to the problem. A distribution over healthy samples is learned via an auto-decoder. At test time, pixel-wise anomaly scores are constructed as negative log-likelihoods of the predicted intensity class probabilities (the authors bin intensities with k-means). Training on HCP and testing on BraTS 2018, the authors report impressive results.
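
    A minimal sketch of the anomaly score described here, assuming k-means binning with scikit-learn and a PyTorch model; the function names and the n_bins default (the author feedback mentions 7, 10 and 15 were evaluated) are assumptions.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def bin_intensities(volume, n_bins=7):
    # volume: numpy array of intensities; returns an integer bin label per voxel.
    km = KMeans(n_clusters=n_bins, n_init=10).fit(volume.reshape(-1, 1))
    return km.labels_.reshape(volume.shape)

def anomaly_score(logits, voxel_bins):
    # logits: (N, n_bins) predicted by the model at N voxel coordinates;
    # voxel_bins: (N,) long tensor with the observed bin of each voxel.
    log_probs = F.log_softmax(logits, dim=-1)
    return -log_probs.gather(1, voxel_bins[:, None]).squeeze(1)  # per-voxel NLL
```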

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The methodology is novel and the results are very good.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Some design decisions appear a bit arbitrary, and it is not entirely clear how much of the performance increase over the baselines is due to the methodology and how much is due to simpler steps like the post-processing. The method is only demonstrated on a single dataset.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors state that “Implementation, trained models and test sets are made publicly available” (although I don’t understand how the authors want to make BraTS data available without appropriate rights, and the data is available by request anyway)

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    I think your paper would benefit most from testing on more datasets. I trust that all baselines are tested on the binned intensities as well? What happens if you don't do this binning? It's not as if your method is incapable of working with continuous values; the AS would be different, so I get the feeling that the results would not have been as impressive. It would be interesting to at least see results with a much larger number of classes. Why do you sample K random points instead of using the entire image for the latent codes (especially at test time)? Some processing steps seem somewhat ad hoc, leaving me with the impression that they were only added to improve your model relative to the baselines; I am specifically referring to the mode-pooling and the post-processing. It would be great to see ablations. Baselines in 3D would make the paper even stronger; is there really nothing out there that you could have compared yourselves with?

  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Novel method with very good results, presented clearly. More datasets would move this towards a clearer accept.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain



Review #3

  • Please describe the contribution of the paper

    This paper introduces an unsupervised anomaly detection method based on implicit field learning for 3D medical images, evaluated against VAE-based baselines. Experiments are conducted on brain MR images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is mostly well-written and clear.
    2. The proposed methods explore the use of implicit field learning in 3D medical images.
    3. The method achieves good performance in terms of AUC, although it lacks some baselines (see comments below).
    4. Most of the aspects are described in sufficient detail to enable the reproduction of results.
    5. Anomaly detection is performed at the voxel level.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. An ablation comparing the original implicit field learning formulation with the proposed one could better demonstrate the contributions of the paper.
    2. The paper shows SOTA results when compared with VAEs. However, the table lacks the most important baselines, such as AnoGAN [ref1], f-AnoGAN [ref2], ADGAN [ref3] or Zimmerer et al. [ref4].
    3. More detailed ablation studies would better clarify the contributions of the different components.

    References:

    ref1: Schlegl, T., Seeböck, P., Waldstein, S.M., Schmidt-Erfurth, U. and Langs, G., 2017. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging. Springer, Cham.

    ref2: Schlegl, T., Seeböck, P., Waldstein, S.M., Langs, G. and Schmidt-Erfurth, U., 2019. f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks. Medical Image Analysis, 54, pp.30-44.

    ref3: Liu, Y., Tian, Y., Maicas, G., Pu, L.Z.C.T., Singh, R., Verjans, J.W. and Carneiro, G., 2020. Photoshopping colonoscopy video frames. In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI) (pp. 1-5). IEEE.

    ref4: Zimmerer, D., Kohl, S.A., Petersen, J., Isensee, F. and Maier-Hein, K.H., 2018. Context-encoding variational autoencoder for unsupervised anomaly detection. arXiv preprint arXiv:1812.05941.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors claim to ensure the reproducibility of this work.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    The paper is well written and clear to understand. However, I have some concerns about the experiments.

  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the technical novelty of the proposed method can be considered as sufficient. At the moment, there are several issues that can be addressed during the rebuttal.

  • What is the ranking of this paper in your review stack?

    3

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The article received very positive reviews from all reviewers. There is clear technical novelty, and the experimental results back up the novelty as well. I would like to raise two concerns for the authors. First, there is a large discrepancy between the results provided in the comparative study of Baur et al. (cited as [2]) and those reported here; can the authors please explain this discrepancy? Second, could an explicit dependence on coordinates create adverse effects for other anatomies?

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    3




Author Feedback

Regarding hyper-parameters, pre-processing and AS post-processing: we used the validation set described in our submission to tune hyper-parameters, including the number of intensity bins (we evaluated 7, 10 and 15), image smoothing and the post-processing of Anomaly Scores (AS). In 3D-IF, mode-pool kernels of size 3 produced slightly better results on the validation set (a kernel of size 3 increased DICE by ~0.015 vs size 2 and by ~0.035 vs no mode-pooling). We used the same AS post-processing pipeline for the baselines and for our method (min-pooling + average pooling, as proposed by the authors of [6]), tuning the kernel size of the average pooling layer as a hyper-parameter.
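
For illustration, a minimal sketch of the mode-pooling and AS post-processing described above, assuming a PyTorch implementation; the mode-pool kernel size follows the feedback, while the function names and the avg_k default are assumptions.

```python
import torch
import torch.nn.functional as F

def postprocess_as(scores, min_k=3, avg_k=5):
    # scores: (1, 1, D, H, W) voxel-wise anomaly scores.
    # Min-pooling (negated max-pooling) suppresses isolated high responses...
    s = -F.max_pool3d(-scores, kernel_size=min_k, stride=1, padding=min_k // 2)
    # ...then average pooling smooths the map (kernel size tuned on validation).
    return F.avg_pool3d(s, kernel_size=avg_k, stride=1, padding=avg_k // 2)

def mode_pool(bins, k=3):
    # bins: (D, H, W) integer intensity-bin labels; majority vote per k^3 window.
    p = k // 2
    padded = F.pad(bins[None, None].float(), (p,) * 6, mode='replicate')[0, 0]
    windows = padded.unfold(0, k, 1).unfold(1, k, 1).unfold(2, k, 1)
    return windows.reshape(*bins.shape, -1).mode(dim=-1).values.long()
```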

With regard to the encoding function, we do not expect performance to decrease significantly without it, although we did not run the ablation. We included the encoding function based on the empirical results in https://arxiv.org/abs/2003.08934, whose authors justify it as follows: “This is consistent with recent work by Rahaman et al. (2018), which shows that deep networks are biased towards learning lower frequency functions. They additionally show that mapping the inputs to a higher dimensional space using high frequency functions before passing them to the network enables better fitting of data that contains high frequency variation”.
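
A minimal sketch of that encoding (the Fourier-feature mapping of arXiv:2003.08934), where each coordinate is lifted to sines and cosines of increasing frequency; the number of frequency bands is an assumed hyper-parameter.

```python
import math
import torch

def encode_coords(p, n_freqs=6):
    # p: (..., 3) coordinates normalised to [-1, 1]; returns (..., 3 * 2 * n_freqs).
    # gamma(p) = (sin(2^0 pi p), cos(2^0 pi p), ..., sin(2^(L-1) pi p), cos(2^(L-1) pi p))
    feats = []
    for i in range(n_freqs):
        feats += [torch.sin(2 ** i * math.pi * p), torch.cos(2 ** i * math.pi * p)]
    return torch.cat(feats, dim=-1)
```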

Regarding the baselines selected: based on the findings of the comparative study of Baur et al. (cited as [2]), Table IV (GB, glioma dataset), VAE-based approaches outperform GAN-based ones on our task, which is why we included only the former in our experiments. Using DICE as the metric, the results in [2] were as follows: VAE-restoration 0.435 (top-performing), VAE 0.374, f-AnoGAN 0.379 and Context VAE 0.333. Additionally, we included the baseline method [6], which proposed a different restoration method but was not included in the comparative study [2].

Baseline AS are computed as proposed by the original authors and consequently do not use intensity bins. In future work, we intend to expand our comparisons by including experiments with an IF network trained to reconstruct continuous intensity values and a VAE trained to reconstruct intensity bins. We expect that intensity bins with cross-entropy as the Anomaly Score will improve the performance of VAEs compared to the standard L1/L2 distance, especially for anomalies whose intensities are close to those of normal tissues. We assume that, when using the L1/L2 distance between image and reconstruction as the Anomaly Score, only anomalies with intensities very different from normal tissues are identified; using intensity bins + cross-entropy as the AS should alleviate this issue.

In our implementation of the VAE baselines, we adapted the VQ-VAE architecture that the authors of [6] made publicly available. As a result, our VAE is deeper than the one used in [2] and uses residual connections, batch normalization and different activation functions. We also performed hyper-parameter tuning for the VAE baseline methods. The best-performing latent size was only 10 ([2] used 128), and performance quickly degraded when increasing the latent size (VAEs with larger latent sizes reconstructed the anomalies). Our VAE, with a more expressive architecture and a smaller latent space, achieved 0.447 on our dataset (vs 0.374 in [2]). Additionally, when using the restoration approach to generate Anomaly Scores (AS), we could not replicate the findings of [2]: VAE-restoration underperformed the VAE reconstruction loss as AS (0.390 in our experiments vs 0.435 in [2]). As indicated in the last paragraph of Section 3, the architectural differences could be limiting the effectiveness of the restoration method (we hypothesize that residual connections or batch normalization could be affecting the gradients w.r.t. pixel intensities).
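
For contrast with the VAE restoration discussed above, a minimal sketch of test-time restoration by latent optimisation in an auto-decoder setting: a latent code is fitted to the test image under the healthy model, and the model's voxel-wise probabilities at the optimum yield the anomaly score. The model interface, optimiser, step count and learning rate are assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn.functional as F

def restore(model, coords, voxel_bins, latent_dim=128, steps=200, lr=1e-2):
    # coords: (1, N, coord_dim) voxel coordinates; voxel_bins: (N,) observed bins.
    z = torch.zeros(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = model(z, coords).squeeze(0)            # (N, n_bins)
        F.cross_entropy(logits, voxel_bins).backward()  # fit z to the test image
        opt.step()
    with torch.no_grad():
        log_probs = F.log_softmax(model(z, coords).squeeze(0), dim=-1)
    # NLL of the observed bin under the restored (healthy) prediction = AS.
    return -log_probs.gather(1, voxel_bins[:, None]).squeeze(1)
```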

We agree with the Reviewers on the need for additional experiments. In the future, we intend to test our approach on MS anomalies, as proposed in [2], and on different image modalities.


