
Authors

James Browning, Micha Kornreich, Aubrey Chow, Jayashri Pawar, Li Zhang, Richard Herzog, Benjamin L. Odry

Abstract

Deep reinforcement learning (DRL) is a promising technique for anatomical landmark detection in 3D medical images and a useful first step in automated medical imaging pathology detection. However, deployment of landmark detection in a pathology detection pipeline requires a self-assessment process to identify out-of-distribution images for manual review. We therefore propose a novel method derived from the full-width-half-maxima of q-value probability distributions for estimating the uncertainty of a distributional deep q-learning (dist-DQN) landmark detection agent. We trained two dist-DQN models targeting the locations of knee fibular styloid and intercondylar eminence of the tibia, using 1552 MR sequences (Sagittal PD, PDFS and T2FS) with an approximate 75%, 5%, 20% training, validation, and test split. Error for the two landmarks was 3.25 ± 0.12 mm and 3.06 ± 0.10 mm respectively (mean ± standard error). Mean error for the two landmarks was 28% lower than a non-distributional DQN baseline (3.16 ± 0.11 mm vs 4.36 ± 0.27 mm). Additionally, we demonstrate that the dist-DQN derived uncertainty metric has an AUC of 0.91 for predicting out-of-distribution images with a specificity of 0.77 at sensitivity 0.90, illustrating the double benefit of improved error rate and the ability to defer reviews to experts.
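
A minimal sketch of the FWHM-based uncertainty idea described in the abstract (an illustrative reconstruction in Python, not the authors' released code; the function name and the toy Gaussian distributions are assumptions, and the paper's exact aggregation across actions and agents is not reproduced):

```python
import numpy as np

def fwhm_uncertainty(q_dist, support):
    """Width of a discrete q-value distribution at half its maximum."""
    # q_dist:  (num_atoms,) probabilities over the q-value support
    # support: (num_atoms,) q-value grid, e.g. np.linspace(v_min, v_max, num_atoms)
    above = np.where(q_dist >= q_dist.max() / 2.0)[0]
    return float(support[above[-1]] - support[above[0]])

support = np.linspace(-10.0, 20.0, 64)  # v_min, v_max, num_atoms as reported in the rebuttal
for sigma in (1.0, 4.0):                # toy distributions: wider -> more uncertain
    p = np.exp(-0.5 * ((support - 5.0) / sigma) ** 2)
    p /= p.sum()
    print(f"sigma={sigma}: FWHM = {fwhm_uncertainty(p, support):.2f}")
```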

Note: For research only, not available as a commercial product. Due to regulatory reasons its future availability cannot be guaranteed.



Link to paper

DOI: https://doi.org/10.1007/978-3-030-87199-4_60

SharedIt: https://rdcu.be/cyl4U

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a method to evaluate uncertainty in a landmark localisation scenario using reinforcement learning. The authors present two metrics to assess the uncertainty of the method: Shannon entropy and FWHM of the discrete q-value distributions.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Clear and adequately written paper.
    • Experimental results seem promising, performing better on the given dataset than non-uncertainty-aware methods.
    • In-depth experimental argumentation supports the claims made in the paper; this was quite good and well thought out.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Some structural issues, for example: there is no dedicated related-works section, and Section 2.4 should be merged with the Section 3 experimentation, as it does not provide any methodological information but rather details the experimental training setup.

    • Some citations are mixed up. In the introduction, Alansary et al. 2019 does not tackle multi-agent RL and hence should be grouped together with Ghesu et al. 2019. The MARL approach mentioned, part of which the authors also borrow for their method, is [1]. The authors are kindly asked to include the proper citation and move the Alansary et al. citation above, together with Ghesu et al.

    • It is a bit unclear to me what the actual methodological contribution is. The underlying RL method for medical images was introduced by Ghesu et al. and then expanded by Alansary et al. and Vlontzos et al. to also cover MARL scenarios. The authors of this paper propose the use of two metrics that have been previously proposed in the literature, as the authors correctly cite, to evaluate the uncertainty of the trained models. Hence the contribution of the paper is this post-hoc analysis during inference. As such, the paper should be considered not as a methodologically heavy paper but as an analysis paper.

    To that effect, the paper does indeed analyse the certainty of such methods to a satisfactory level. However, the results would be far easier to analyse if presented in a table. In addition, the authors do not mention or compare with other Bayesian techniques for uncertainty estimation, some of which are [2,3,4]. Note that this list is far from exhaustive and is intended more as a starting point. That is a major issue for the analysis merit of this paper, since in this respect it is incomplete. In other words, the authors have adequately explored the capabilities of the two metrics they propose on their own and against non-uncertainty-aware methods, but there is no comparison to other uncertainty estimation and out-of-distribution flagging methods used in RL. To be clear, the point is not to compare against every published method, as this would be unfeasible, but at least one or two methods would be beneficial to the paper.

    [1] Vlontzos, A., et al.: Multiple landmark detection using multi-agent reinforcement learning. MICCAI 2019
    [2] Gal, Y., et al.: Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. ICML 2016
    [3] Khan, et al.: Uncertainty-Aware Reinforcement Learning for Collision Avoidance. 2017
    [4] Clements, et al.: Estimating Risk and Uncertainty in Deep Reinforcement Learning. ICML 2020

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • Code not provided.
    • Given the heavy dependence of these methods on open-sourced projects, it should not be impossible to replicate the results of this paper.
    • The majority of the models' hyper-parameters are quoted.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Most of my comments can be found above in the strengths and weaknesses fields. As an overall comment, I will be considering this paper as an analysis/probe paper rather than a methodologically novel one.

  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    As mentioned above, this paper can be described as a probe paper into the uncertainty-estimation abilities of an RL landmark detection agent at inference. The analysis is satisfactory for the two metrics mentioned in the paper, both on their own and against other medical imaging RL methods that are not uncertainty aware, but it lacks a comparison with standard uncertainty estimation methods in RL that have been proposed in recent years.

  • What is the ranking of this paper in your review stack?

    3

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain



Review #2

  • Please describe the contribution of the paper

    The paper presents a distributional reinforcement learning-based method for landmark detection from 3D MRIs and introduces an uncertainty measure to estimate the uncertainty of landmark predictions as well as to identify out-of-distribution images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Most of the sections of the paper are well written. It’s clear, easy to follow, and contains all the necessary information for understanding and reproducibility.

    2. It clearly mentions the data processing steps and the model implementations used.

    3. The MDP formulation is clear. Components such as state space, action space, reward function, the notion of episodes, task description, etc. have been explained clearly.

    4. The paper contains almost all the parameters necessary for reproducibility.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Major weakness and reason for rejection:

    In section 2.4, the authors mention that they used the Ape-X distributed prioritized experience replay DQN algorithm from the ray/rllib library (https://github.com/ray-project/ray/blob/master/rllib/agents/dqn/apex.py). But this is the ‘distributed’ version of DQN (https://arxiv.org/pdf/1803.00933.pdf), which is very different from the ‘distributional’ DQN (equation 1), the main model of this paper. Distributed DQN disentangles acting and learning and thus gets a performance boost, but it still performs the regular Bellman update (not the distributional one). In that case, unfortunately, the methods proposed by the authors do not match their experiments, and hence none of the claims remain valid.

    2. Other weaknesses: It is unclear from the paper what the non-distributional baselines are. Section 2.4 mentions that “Double DQN and Noisy Nets were employed for training”. However, it is unclear whether they were used in the baselines, because the first paragraph of section 2.4 seems to be dedicated to the main model. Also, the next paragraph says, “As a baseline for landmark detection accuracy comparison, an additional agent was trained for each of the two landmarks without distributional DQN but with other training and inference parameters held the same.” This raises some confusion about what the baseline models are.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper contains almost all the necessary information (the CNN architecture details are unclear) for reproducibility. The authors have done a great job at mentioning the minute details necessary for reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    1. If the authors have mistakenly used the distributed DQN instead of the distributional DQN, they either need to rewrite the paper as ‘distributed DQN’ with the same experiments or redo the experiments with the correct implementation without changing their hypothesis.

    2. The paper says the authors have trained two separate agents for the knee fibular styloid and the intercondylar eminence of the tibia. It is unclear why a single agent cannot handle both landmarks. An explanation of that intuition would benefit the understanding of readers from various backgrounds.

    3. Annotating the CNN block in figure 1 with the architecture, filter sizes, strides, nature of convolution (2D/3D), maxpool sizes, etc. would make the figure more informative and the paper more reproducible without any additional text (and hence without increasing the paper length). Something like this tool/visualization (https://github.com/ashishpatel26/Tools-to-Design-or-Visualize-Architecture-of-Neural-Network) might help.

  • Please state your overall opinion of the paper

    probably reject (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well written and the idea is good. However, the authors seem to have conflated ‘distributed DQN’ and ‘distributional DQN’, which are very different (pointed out in the weakness section). The implementation the authors say they used is the former, whereas the hypothesis and methods presented in the paper are based on the latter. This makes the presented results invalid and the presented hypothesis untested, leading to rejection.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    3

  • Reviewer confidence

    Very confident



Review #3

  • Please describe the contribution of the paper

    This paper presents a novel method to estimate uncertainty for deep reinforcement learning in landmark detection in medical imaging. This is a topic of interest given that a pathology detection pipeline requires a self-assessment process to identify out-of-distribution images or failure cases. To achieve this, the authors propose the full-width-half-maxima of q-value probability distributions for estimating the uncertainty of a distributional deep q-learning (dist-DQN) agent for landmark detection. They show that distributional deep q-learning results in better performance compared to non-distributional q-learning. Furthermore, the authors show that the uncertainty metric derived from dist-DQN can be used to detect failures and out-of-distribution samples.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • Their experiments clearly demonstrate higher accuracy of distributional q-learning for landmark detection compared to a non-distributional baseline, which is an interesting finding.
    • They propose a novel uncertainty measure for landmark location, Û_FWHM, derived from the full-width-half-maxima (FWHM) of q-value probability distributions, that can be used to identify out-of-distribution images and landmark prediction errors.
    • This work presents an interesting analysis of uncertainty estimation for previous landmark detection frameworks from Alansary et al. and Ghesu et al. (Alansary et al. 2019, Ghesu et al. 2019). The contribution is mainly the ability of the system to detect errors and out-of-distribution samples rather than its superiority for landmark detection, although it is superior.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • The reference style seems to be the wrong style for MICCAI. The authors should change this.
    • Without being familiar with distributional deep reinforcement learning, the paper may be difficult to follow. The authors are encouraged to improve this part of the paper.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors do not include statements regarding the reproducibility or open access to code or data.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    • It is hard to track the numbers for each experiment since they are mixed with the text. The manuscript can easily be improved by adding a table of results where all 3 experiments are summarized and compared to the non-distributional baseline.
    • For experiment number 2: what are the insights on why uncertainty estimation works better with Û_FWHM vs. Û_H for out-of-distribution detection and error detection?
    • Single-agent experiments could be included to better understand the contribution of all components: 1) dist-q-learning vs. non-distributional single agent, 2) the uncertainty measure from a single agent, and the combination of these sets. This would allow us to know which part of the paper is contributing.

  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents an interesting and needed analysis of uncertainty and out-of-distribution detection in medical imaging and deep reinforcement learning, in particular for landmark detection. I believe that this paper has scientific value and interest for the MICCAI community, in particular for medical applications utilizing reinforcement learning.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    3

  • Reviewer confidence

    Very confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    After carefully considering the reviewers’ comments and going through the paper, I would like to invite the authors to respond to the comments made by the reviewers. Specifically, each of the reviewers has brought up important comments regarding aspects of the paper. In particular, we point the authors to the point made by R2 regarding the difference between ‘distributed DQN’ and ‘distributional DQN’, and ask them to address this point explicitly.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    6




Author Feedback

Dear MICCAI 2021 Area Chairs, Program Chairs, and Reviewers,

Thank you for considering our submitted manuscript entitled “Uncertainty Aware Deep Reinforcement Learning for Anatomical Landmark Detection in Medical Images”. We would like to take this opportunity to respond to some minor criticisms of our paper, as well as provide clarification regarding a significant criticism which we strongly believe stems from a semantic misunderstanding of our work rather than a flaw in it.

Reviewer 2 has recommended not accepting the paper due to a perceived ambiguity in the terminology “distributed” vs. “distributional” as used in the current DRL literature as well as in our paper. We thank the reviewer for highlighting this issue and consider it the most critical to address. To clarify, we use the Ape-X algorithm to train the agents; Ape-X is a distributed prioritized experience replay algorithm and, as the reviewer correctly notes, asynchronously performs agent rollouts and model back-propagation across multiple workers, hence the term “distributed”. While using Ape-X for training, we are training distributional DQN agent(s) in which the last layer of the dense policy network encodes a discrete distribution of Q-values for each allowable action, as opposed to vanilla DQN in which a single node encodes the Q-value for each action. Therefore, equation 1 and the consequent claims hold for the distributional DQN we use.
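
To make the distinction concrete, below is a minimal distributional output head (sketched in PyTorch purely for illustration; this is not the authors' rllib implementation, and the class and variable names are hypothetical):

```python
import torch
import torch.nn as nn

class DistributionalHead(nn.Module):
    """Dist-DQN head: one categorical distribution over a fixed q-value
    support per action, vs. a single scalar per action in vanilla DQN."""

    def __init__(self, in_features, num_actions, num_atoms=64, v_min=-10.0, v_max=20.0):
        super().__init__()
        self.num_actions, self.num_atoms = num_actions, num_atoms
        self.register_buffer("support", torch.linspace(v_min, v_max, num_atoms))
        self.logits = nn.Linear(in_features, num_actions * num_atoms)

    def forward(self, x):
        logits = self.logits(x).view(-1, self.num_actions, self.num_atoms)
        dist = torch.softmax(logits, dim=-1)   # P(q | state, action) over the support
        q = (dist * self.support).sum(-1)      # scalar Q-value as the expectation
        return dist, q
```

The per-action distribution `dist` is what uncertainty metrics such as FWHM or Shannon entropy are computed from; taking its expectation recovers the scalar Q-value a vanilla DQN head would emit.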

Our implementation uses the ray/rllib ApexTrainer module from rllib.agents.dqn.dqn with num_atoms = 64, v_min = -10, and v_max = 20 as described in the paper, which simultaneously enables both Ape-X distributed training and distributional DQN. For purposes of clarity in the paper, and to avoid reader confusion, we have removed the phrase “Ape-X distributed prioritized experience replay” in favor of the simpler “Ape-X”, the details of which can be found in the referenced ray/rllib documentation and Horgan et al., and which is neither a novel aspect of our work nor the focus of our paper.
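
As a rough sketch of how those pieces fit together in the ray/rllib 1.x-era API (the environment name is a placeholder, and exact config keys may vary across ray versions):

```python
from ray.rllib.agents.dqn import ApexTrainer

config = {
    "env": "KneeLandmarkEnv-v0",  # hypothetical registered gym environment
    # num_atoms > 1 switches the DQN head from scalar Q-values to a
    # C51-style categorical distribution over [v_min, v_max].
    "num_atoms": 64,
    "v_min": -10.0,
    "v_max": 20.0,
    "double_q": True,  # Double DQN, per Sec. 2.4 of the paper
    "noisy": True,     # Noisy Nets exploration, per Sec. 2.4 of the paper
}
trainer = ApexTrainer(config=config)  # Ape-X supplies the *distributed* training
for _ in range(100):
    trainer.train()
```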

Reviewer 1 gives a generally positive review, but notes a few structural issues, including a potential improvement to the citation for the originator of multi-agent landmark detection, additional citations for related work, and moving a block of text from the methods section to the experiment and results section. We thank the reviewer for these excellent suggestions and have improved the draft accordingly. Reviewer 1 also notes our lack of comparisons to other uncertainty methods, including stochastic dropout with bootstrapping and Bayesian networks. Despite this limitation, we feel that a significant strength of this paper is the description and analysis of a novel method for uncertainty estimation, for the critical purpose of flagging out-of-distribution data or inaccurate predictions for manual review, that has been validated on real-world clinical data and can be easily implemented in current DRL frameworks (e.g. ray/rllib in combination with OpenAI gym). We thank the reviewer for this good suggestion and may explore it in future work.

Reviewer 3 brings up a great point regarding a theoretical explanation for the superior performance of FWHM versus Shannon entropy of the distributions for flagging out-of-distribution data and inaccurate landmark predictions. Although we do not have a strong theoretical explanation to include in the paper, we posit a few ideas here: 1) Heuristically, a wider distribution should correlate with more uncertainty in q-values. Although entropy is maximal for a uniform (i.e. maximally “wide”) distribution, it is also a measure of randomness, which is not necessarily the aspect of the q-value distribution we are trying to measure. 2) Shannon entropy is often derived in the context of probabilities of discrete events. However, the q-value distribution is not a probability distribution over discrete events, but rather a learned discrete approximation to an underlying continuous probability function.
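
A toy numerical illustration of the first point (an assumption-laden example, not from the paper; the naive FWHM definition below, the span between the outermost support points above half maximum, may differ from the paper's exact formulation): when probability mass splits into two sharp, widely separated modes, entropy grows only by about ln 2, while FWHM grows with the separation of the modes.

```python
import numpy as np

atoms = np.linspace(-10.0, 20.0, 64)  # q-value support

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def fwhm(p, support):
    above = np.where(p >= p.max() / 2.0)[0]
    return float(support[above[-1]] - support[above[0]])

def gaussian(mu, sigma):
    p = np.exp(-0.5 * ((atoms - mu) / sigma) ** 2)
    return p / p.sum()

narrow = gaussian(5.0, 1.0)                                  # one sharp mode
bimodal = 0.5 * (gaussian(-5.0, 1.0) + gaussian(15.0, 1.0))  # two distant sharp modes

for name, p in [("narrow", narrow), ("bimodal", bimodal)]:
    print(f"{name:8s} entropy={entropy(p):.2f}  fwhm={fwhm(p, atoms):.1f}")
# Entropy rises by only ~0.69 nats for the bimodal case, while FWHM jumps
# from the width of a single mode to roughly the distance between the modes.
```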




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have made important clarifying remarks in their rebuttal, and I believe this work would now be suitable for publication at MICCAI. I would highly recommend the authors clarify the important elements raised here in their final revision.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    7



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Showing that the probability distributions over q-values can improve localization accuracy and be used to identify out-of-distribution samples is a valuable insight of interest to the MICCAI community. However, I think that the manuscript cannot be brought into the form required for publication at MICCAI with the minor changes proposed by the authors in the rebuttal. First, the authors should clarify what the contribution of their manuscript is, especially compared to Bellemare 2017. Since I also agree with reviewer 1 that the analysis of the q-value probability distributions during inference is their main contribution, this part of the manuscript needs significant rewriting to make it more readable. Finally, I also think there is a lack of comparison with other networks that can provide uncertainty estimation.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    7



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Reviewers generally agree that there is merit in this work. The rebuttal addresses the main concern of reviewer 2, clarifying that distributional DQN agents are trained within their implementation, thus keeping their hypothesis and validation unscathed. Overall, while there are a number of minor improvements required in the final manuscript, the analysis of uncertainty in RL-based landmark localization is timely and important, and thus should be presented to the MICCAI community.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    2


