
Authors

Bowen Li, Xinping Ren, Ke Yan, Le Lu, Lingyun Huang, Guotong Xie, Jing Xiao, Dar‐In Tai, Adam P. Harrison

Abstract

Depending on the application, radiological diagnoses can be associated with high inter- and intra-rater variabilities. Most computer-aided diagnosis (CAD) solutions treat such data as incontrovertible, exposing learning algorithms to considerable and possibly contradictory label noise and biases. Thus, managing subjectivity in labels is a fundamental problem in medical imaging analysis. To address this challenge, we introduce auto-decoded deep latent embeddings (ADDLE), which explicitly models the tendencies of each rater using an auto-decoder framework. After a simple linear transformation, the latent variables can be injected into any backbone at any and multiple points, allowing the model to account for rater-specific effects on the diagnosis. Importantly, ADDLE does not expect multiple raters per image in training, meaning it can readily learn from data mined from hospital archives. Moreover, the complexity of training ADDLE does not increase as more raters are added. During inference each rater can be simulated and a “mean” or “greedy” virtual rating can be produced. We test ADDLE on the problem of liver steatosis diagnosis from 2D ultrasound (US) by collecting 46084 studies along with clinical US diagnoses originating from 65 different raters. We evaluated diagnostic performance using a separate dataset with gold-standard biopsy diagnoses. ADDLE can improve the partial areas under the curve (AUCs) for diagnosing severe steatosis by 10.5% over standard classifiers while outperforming other annotator-noise approaches, including those requiring 65 times the parameters.
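
A minimal sketch (not the authors' code) of the core ADDLE idea described in the abstract, written in PyTorch: each rater is assigned a freely optimized latent vector (the auto-decoder component), which is passed through a linear transformation and injected into the backbone's features. All names, dimensions, and the additive injection point are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ADDLEClassifier(nn.Module):
    """Hypothetical rater-conditioned classifier in the spirit of ADDLE."""

    def __init__(self, backbone, feat_dim, num_raters, latent_dim=16, num_classes=4):
        super().__init__()
        self.backbone = backbone  # any feature extractor returning (B, feat_dim)
        # Auto-decoder: one free latent vector per rater, optimized directly by
        # backpropagation together with the network weights (no encoder).
        self.rater_latents = nn.Embedding(num_raters, latent_dim)
        # Simple linear transformation mapping the latent into the feature
        # space so it can be injected into the backbone (here, by addition).
        self.inject = nn.Linear(latent_dim, feat_dim)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, image, rater_id):
        feats = self.backbone(image)          # (B, feat_dim)
        z = self.rater_latents(rater_id)      # (B, latent_dim)
        feats = feats + self.inject(z)        # rater-specific conditioning
        return self.head(feats)               # per-class logits
```

Under this reading, adding more raters only adds one latent vector each, consistent with the claim that training complexity does not grow with the number of raters.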

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87240-3_26

SharedIt: https://rdcu.be/cyl50

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This work proposes an interesting solution that combines multi-rater subjectivity information, encoded as latent embeddings, with the image input to assist the classification task. Each rater is initially represented by an embedding vector that is fed through an auto-decoder network and a linear transformation to obtain latent features that can be injected into the backbone classification network. During training, the embedding vectors are first updated jointly with the weights of the auto-decoder and the classification backbone. Then, to deal with the unbalanced numbers of samples labeled by specific raters, an asynchronous updating strategy is applied to fine-tune each latent embedding. During inference, the final output is obtained by averaging multiple outputs of the network, using either the top-k best raters (greedy) or all the raters. An experiment on a large private ultrasound dataset for liver steatosis diagnosis is conducted to show that the proposed ADDLE solution outperforms baseline methods.
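
    A rough sketch of the "mean" versus "greedy" inference described above, assuming a rater-conditioned model that takes (image, rater_id) as input, such as the sketch shown after the abstract; function and variable names are hypothetical.

    ```python
    import torch

    @torch.no_grad()
    def virtual_rating(model, image, num_raters, top_k_ids=None):
        """Average softmax predictions over simulated raters.

        top_k_ids=None  -> "mean" rating over all raters;
        top_k_ids=[...] -> "greedy" rating over the best-performing raters.
        """
        rater_ids = top_k_ids if top_k_ids is not None else range(num_raters)
        probs = []
        for r in rater_ids:
            # Simulate rater r for every image in the batch.
            rid = torch.full((image.shape[0],), r, dtype=torch.long)
            probs.append(model(image, rid).softmax(dim=-1))
        return torch.stack(probs).mean(dim=0)
    ```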

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed method targets an important but under-investigated problem: how to utilize raters' individual ratings, rather than a final consensus, as supervision to improve classification performance. The solution is theoretically sound and, in practice, achieves better performance than baseline methods.
    The paper is easy to follow and understand; the method is not hard to reproduce and can be plugged into different backbones.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There is previous work that also embeds multi-rater information, albeit with a different solution; it should be mentioned in the paper and compared against, e.g., [1]. Public datasets are a better source for benchmarking different algorithms. The REFUGE and DRISHT datasets were also collected for a classification task with multi-rater label information; the authors are recommended to test their method on these datasets. The technical novelty is limited, as the core auto-decoder used to learn the embeddings is based on existing works.
    [1] Difficulty-aware Glaucoma Classification with Multi-Rater Consensus Modeling. MICCAI 2020

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Though the code is not offered, it is straightforward to reproduce the method.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Section 4 includes my suggestions.

  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A good solution to a challenging task that achieves good results, though more experimental results on other datasets are needed to thoroughly analyze the proposed work.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    The paper presents a method to tackle the inter-reader variability of radiological interpretations by proposing auto-decoded deep latent representations learned from the labels of several readers. The latent representations derived from the subjective labels can be incorporated into any training paradigm to account for inter-rater variability.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The study is conducted on a large dataset of 3790 patients with labels from 65 different clinicians, allowing the development of models that capture inter-rater variability.
    2. The proposed approach to learn latent representations from subjective labels can be incorporated into different kinds of automated tasks, including but not limited to classification.
    3. The proposed approach can tackle inter-reader variability and will be an important step towards removal of bias due to noisy/subjective ground truth labels.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Some of the design choices for the approach need better explanation/clarity, e.g., equations 1 and 2. A reference to the supplementary material is not provided in the main paper, and supplementary Fig. 1 is not self-explanatory. It is unclear whether the proposed approach can be used for a different problem domain without retraining. Are the latent representations domain-independent? This is important to understand because it is not always easy to collect a dataset with ratings from 65 different clinicians. It is also not clear how many raters labeled one particular scan on average; did all 65 raters read each and every patient scan? Are these latent representations dependent on the raters and their experience? How does the classification model perform if one reader is chosen at random for ground-truth labels? It is not clear how the multi-class classification problem is tackled using the AUCs in the section on evaluation protocols.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    From the reproducibility statement, it seems that the dataset will not be made publicly available at this moment. However, it would be of immense value if the dataset were made public. No link to code or a trained model is provided in the manuscript.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    The paper is novel and interesting and would be useful for accounting for inter-rater variability. However, some discussion about how the model and approach can be translated to other domains/diseases would be useful. It is difficult to acquire similar-quality data for training such networks for other diseases, so a discussion on how to translate this approach to other diseases would be helpful. Some more clarification about the decisions on the model and approach would make the paper stronger. The supplementary material figure is not self-explanatory. Statistical tests showing that ADDLE significantly improves performance would also strengthen the paper.

  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents an interesting approach to account for the inter-reader variability for radiological scans. It is an important problem while training deep learning models, which become heavily dependent on the quality of labels. Being able to account/deal with the subjectivity of human reader labels in a quantitative and automated way will help in developing more robust models.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    5

  • Reviewer confidence

    Somewhat confident



Review #3

  • Please describe the contribution of the paper

    This paper addresses the question of learning with uncertain and subjective labels. The proposed idea is to model the tendency of each rater with a deep latent embedding that is then reinjected into the backbone classification architecture. This allows each rater's latent embedding to be learned during training. During testing, the model can output an aggregated rating (mean or majority voting) based on the performance of each rater. The model is evaluated on a series of about 50,000 clinical US exams for the problem of liver steatosis diagnosis.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    - A novel formulation to learn a latent representation of each rater's tendency and account for rater-specific effects on diagnosis performance.
    - This framework opens the path to including large-scale, subjectively rated data.
    - The evaluation is performed on a large-scale set (> 45,000) of US exams of liver steatosis patients.
    - The authors attempt to analyse the embedding-space properties, based on a PCA analysis.
    - The method is shown to compare favourably to other SOTA methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    - The description of the SOTA methods is too short, especially for references 3 and 8. The paper would gain soundness if some more details could be provided (in the supplementary material, for instance).
    - The paper lacks a statistical analysis to confirm that the proposed method outperforms the SOTA methods.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors mention that they will release all available code, which is not stated in the core paper. They answered ‘yes’ to almost all questions, which is not always accurate: the range of hyper-parameters considered and details on how baseline methods were implemented and tuned are not always provided, and neither a statistical analysis of the results nor an analysis of situations in which the method failed is performed.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    The paper is overall well written and clear. The authors perform comparisons with other SOTA methods. The experimental section is well conducted on a large-scale series of US exams. My only suggestion to improve the soundness of the paper is to add some details on the SOTA methods and a statistical analysis (see my comments above).

  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The reason for not selecting a ‘strong accept’ is the comparison of the proposed method with SOTA methods (see detailed comment above). The reproducibility checklist does not fully reflect the content of the paper.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Both the approach and the way the problem is treated in this paper are clear, well justified, and rigorous. The limitations noted by the reviewers refer mostly to the choice of datasets and the extent of the overall validation.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    1




Author Feedback

Thanks to all reviewers for their valuable comments.

Reviewer 1 mentioned that we should also compare with ‘Difficulty-aware Glaucoma Classification with Multi-Rater Consensus Modeling. MICCAI 2020’ (MRCM). We regret missing this reference. However, the MRCM paper solves a different problem from ours. In the MRCM problem setting, multiple raters annotate the same image, and their solution includes a consensus loss that is only meaningful in this scenario; otherwise, this loss will always be 0. In our problem setting, and in most clinical scenarios, only one rater annotates each image. Our solution is able to work with MRCM data, but their algorithm cannot work with our data. Generally speaking, our algorithm solves a broader range of problems. We will include these discussions in the camera-ready version.

Reviewer 3 mentioned that we need to provide more details about our comparison methods. Our introduction does discuss the approaches of both [3] and [8], including outlining their designs, advantages, and disadvantages. In short, [3] trains a separate model for each rater, whereas [8] trains only a separate classification head. Because we test on the same backbone, their implementation details are identical to ADDLE’s apart from the above critical differences. R3 suggests including more details in the supplementary material, but because the supplementary material can only include figures and tables, we cannot do this. Nonetheless, we will add as many details as we can, space permitting, in the camera-ready version.
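
A rough, hypothetical illustration (not from the paper) of the per-rater-head design the rebuttal refers to for [8]: a shared backbone with one independent classification head per rater, so head parameters grow linearly with the number of raters, in contrast to ADDLE's single shared latent table. Names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class PerRaterHeads(nn.Module):
    def __init__(self, backbone, feat_dim, num_raters, num_classes=4):
        super().__init__()
        self.backbone = backbone
        # One independent linear head per rater (65 heads for this paper's data).
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(num_raters)]
        )

    def forward(self, image, rater_id):
        feats = self.backbone(image)  # (B, feat_dim)
        # Route each sample through the head of the rater who labeled it.
        return torch.stack(
            [self.heads[r](f) for r, f in zip(rater_id.tolist(), feats)]
        )
```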

Thanks to all reviewers again for their feedback and ideas!


