
Authors

Khaled Saab, Sarah M. Hooper, Nimit S. Sohoni, Jupinder Parmar, Brian Pogatchnik, Sen Wu, Jared A. Dunnmon, Hongyang R. Zhang, Daniel Rubin, Christopher Ré

Abstract

Deep learning models have demonstrated favorable performance on many medical image classification tasks. However, they rely on expensive hand-labeled datasets that are time-consuming to create. In this work, we explore a new supervision source for training deep learning models: gaze data that is passively and cheaply collected during a clinician’s workflow. We focus on three medical imaging tasks, including classifying chest X-ray scans for pneumothorax and brain MRI slices for metastasis, and we curated gaze data for two of these tasks. The gaze data consists of a sequence of fixation locations on the image from an expert trying to identify an abnormality; hence, it contains rich information about the image that can be used as a powerful supervision source. We first identify a set of gaze features and show that they indeed contain class-discriminative information. Then, we propose two methods for incorporating gaze features into deep learning pipelines. When no task labels are available, we combine multiple gaze features to extract weak labels and use them as the sole source of supervision (Gaze-WS). When task labels are available, we propose to use the gaze features as auxiliary task labels in a multi-task learning framework (Gaze-MTL). On three medical image classification tasks, our Gaze-WS method without task labels comes within 5 AUROC points (1.7 precision points) of models trained with task labels. With task labels, our Gaze-MTL method can improve performance by 2.4 AUROC points (4 precision points) over multiple baselines.
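As a rough illustration of the Gaze-WS idea described above, the sketch below turns per-image gaze features into weak labels by simple thresholding and voting. This is only a minimal sketch: the feature set follows the paper's feature list (time on the maximally fixated patch, diffusivity, unique visits, time spent), but the median thresholds, the majority-vote aggregation, and all names are our own illustrative assumptions, not the authors' actual weak-supervision model.

```python
# Minimal Gaze-WS sketch (illustrative only): combine several per-image gaze
# features into a weak label, then use it in place of a hand-made task label.
import numpy as np

def weak_labels_from_gaze(gaze_features: np.ndarray) -> np.ndarray:
    """gaze_features: array of shape (n_images, n_features), with columns such
    as time on the maximally fixated patch, diffusivity, number of unique
    patch visits, and total time spent (per the paper's feature list)."""
    # Assumption: each feature casts an "abnormal" vote when it exceeds its
    # population median (a stand-in for tuned thresholds or a learned model).
    votes = gaze_features > np.median(gaze_features, axis=0, keepdims=True)
    # Majority vote across features yields a soft weak label in [0, 1].
    return votes.mean(axis=1)

# Usage: the soft weak labels replace missing task labels for CNN training.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 4))        # placeholder gaze features
y_weak = weak_labels_from_gaze(feats)    # e.g., threshold at 0.5 for hard labels
print(y_weak[:5])
```

In the paper itself the gaze features are combined in a principled way; this voting scheme only conveys the shape of the pipeline.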

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87196-3_56

SharedIt: https://rdcu.be/cyl27

Link to the code repository

https://github.com/HazyResearch/observational

Link to the dataset(s)

https://github.com/HazyResearch/observational


Reviews

Review #1

  • Please describe the contribution of the paper

    The work reports using eye-gaze data for observational supervision, in both supervised and unsupervised settings.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This is, in my opinion, a good bit of work focusing on a modality that should be taken seriously (eye-gaze data). The methodology described is fresh, although it could be a bit simplistic, especially with the statistical features.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    My main concern is that the actual coordinate information in the eye-gaze data is somewhat underused. As the authors of [13] show, one can use eye-gaze data in U-Net-like architectures. The authors here use helper tasks instead. At the least, a discussion of other possibilities is warranted.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Two datasets are claimed to be slated for public release. A third-party dataset is already in the public domain. I am optimistic about the reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Please define the helper tasks a bit more clearly. Currently this is vague in the paper.

  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think that despite the limited computational contributions, this is a fresh kind of work that promises the possibility of training interpretable classifiers without the need for localized labeling.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain



Review #2

  • Please describe the contribution of the paper

    The paper “Observational Supervision for Medical Image Classification using Gaze Data” is about the use of visual attention as an annotation source for X-ray images. The authors claim that fixation duration is a strong indicator of a valid label.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Positive:

    • Interesting approach
    • Interesting findings
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Negative:

    • Used three experts for one dataset and only one expert for another.
    • Other public datasets with gaze information could have been used for evaluation.
    • Claims in “Gaze data statistics” are not shown empirically.
    • Evaluated only one model.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Not possible due to the held-back data.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Required improvements:

    • Use more experts for the recording
    • Evaluate your claims in “Gaze data statistics” empirically
  • Please state your overall opinion of the paper

    reject (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, too few experts for the recording, and the claims are not evaluated empirically.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    4

  • Reviewer confidence

    Very confident



Review #3

  • Please describe the contribution of the paper

    The paper focuses on the use of gaze data from medical imaging review as support for image classification. The authors propose to extract several gaze features (namely “Time on maximum patch”, “Diffusivity”, “Unique visits”, “Time spent”) from the gaze data and to train a neural network to predict these features, which are later converted to a binary classification probability. Moreover, the authors propose to train the neural network to predict both the binary class as the main task and the gaze features as secondary tasks, aiming for improved performance on the main task. The authors evaluate on several medical imaging classification tasks to verify their intuition, in particular showing that training a network on multiple tasks yields higher performance than a baseline binary classifier.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper provides a respectable number of experiments to verify the intuition behind the proposed method. Several datasets have been used, with different protocols for collecting the gaze data. Several ablation studies are described, in particular evaluating the contribution of the proposed gaze features; the results are given in the supplementary material.

    Overall the paper is well written and the descriptions of the method are clear.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Unfortunately, the clinical feasibility of the proposed method remains the main concern. The collection of gaze data in the clinical environment may be burdensome, and the obtained data may be too operator-specific. Clinical review of radiological imaging varies substantially, with hanging protocols usually also operator-specific (e.g., one or several screens in use, one or several images displayed at a time, various orderings of the displayed images, etc.). This makes the implementation of the proposed method doubtful and potentially inferior to NLP methods based on clinical reports.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors describe well the datasets being used

    The authors give details of the neural network being used in the experiments

    The authors state the training process settings and any pre-training (i.e., 15 epochs on ImageNet)

    The results are given over several runs with the respective standard deviations

    The code is provided in the supplementary material

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    1. In Table 1 the authors list the gaze features used in the method. It would help the reader if the authors could give the types and ranges of the features, sparing the reader from looking in the supplementary material.

    2. In 3.3 the authors give the equation of the loss function being minimized, with l_j, j = 1..m, being the losses for the secondary tasks. Could the authors clarify, in Section 3.3 or later in Section 4, which losses have been used for each task (e.g., MAE, MSE, etc.)?

    3. Similarly, in the equation in Section 3.3, there are no weights on any of the losses. It would be helpful if the authors clearly stated whether weights are present or absent, and the argument behind the adopted approach.

    4. In 4.2 the authors discuss the usefulness of the features, further detailed in the supplementary material. I wonder whether the authors could discuss or add numerical results for experiments with a different number of secondary tasks (i.e., using a different number of features). That is, from Table S.2 one can see that time plays little role in the performance. What would be the effect of removing time from the scope of the secondary tasks?

    5. Both models, namely Gaze-WS and Gaze-MTL, are built to predict gaze features. Could the authors discuss or give numerical results on the performance on these secondary tasks?

    6. In Related work, the authors mention NLP-based methods that could be used conjointly with the proposed method. Such methods generally serve the same purpose, potentially allowing for improved performance. It would be useful for a better understanding of the results of the proposed method if the authors could discuss (e.g., list the results reported in other works) the performance gains of NLP-based methods on similar tasks, to illustrate whether the gains are comparable.

    Minor observation:

    1. The equation in 3.3 is not numbered. Could the authors put a reference number on it (it should be (2))?
  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is generally well organized, clear, and pleasant to read. The authors describe the proposed method well and do a decent job of evaluating on several datasets. However, the uncertain clinical feasibility remains, from my point of view, the main disadvantage of the method: the collection of gaze data amongst clinicians (e.g., radiologists) may be a considerably challenging task, more burdensome than, for example, the collection of the clinical reports mentioned in the paper. This prevents me from firmly accepting the paper.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #4

  • Please describe the contribution of the paper

    The main challenge for deep learning methods in medical imaging is the lack of high-quality labelled data for a variety of tasks. This paper hypothesizes that gaze data contains latent information regarding the actual task at hand and proposes to use gaze data to weakly supervise deep learning models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The idea of incorporating gaze-tracking data to weakly supervise trained models is novel and interesting.

    2) Strong evaluation with clearly defined baselines and experiments. The authors compare a CNN model trained with gaze-tracking data against a CNN model trained using high-quality labels. Additionally, the authors present a clear comparison of the Gaze-MTL model with several gaze data incorporation methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Gaze tracking requires specialized hardware and software, which may be expensive to scale and sometimes impractical in clinical settings.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have provided the modeling code with clear documentation. They also plan to release their gaze-tracking datasets after the review process. Based on this, I believe this work will be reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    It’s a very interesting idea to use gaze-tracking data as weakly supervised labels. From a research standpoint, it would be interesting to see how much the gaze-tracking hotspots actually overlap with abnormalities in the image (whenever labeled data is available). This could inform the noisiness/quality of the gaze-tracking labels.

  • Please state your overall opinion of the paper

    strong accept (9)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The idea of using observational information such as gaze to weakly supervise medical imaging models is interesting and novel. The paper is well written and the authors present a clear evaluation of their methodology.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    1

  • Reviewer confidence

    Confident but not absolutely certain




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The reviewers have favorably assessed the work in question. We ask the authors to go through the comments made by all the reviewers to improve the overall quality of the paper.

    In particular, some of the claims made should be re-evaluated given that data was collected by only a single expert. Similarly, the authors should make note of the practical implications of their solution and how it compares to the NLP-based approaches.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    3




Author Feedback

We thank the reviewers for their helpful feedback. We are pleased they found our approach novel, our paper well written, and our experimental results compelling, and that they appreciated our contribution of two novel datasets that will be made publicly available. Below, we address the main points brought up by the reviewers.

Practicality of gaze collection (R3, R5, MR): As medical AI tools continue to advance and wearable technology (e.g., AR with eye-tracking capability) improves, the hardware needed to collect gaze data is expected to become more ubiquitous, affordable, and standardized [1*] (R5). In this context, straightforward strategies could be developed to mitigate the impact of clinical protocol variations on the utility of gaze data; for instance, a wearable's frontal camera could be used to keep track of which gaze data belongs to which image (R3). Indeed, we hope that by demonstrating the usefulness of gaze data for medical ML supervision, our work will also inspire further development of efficient approaches to large-scale gaze data collection.

NLP annotators (R1, R3, MR): Since our datasets do not contain medical reports, we could not perform a direct comparison with NLP annotators (R3, MR). However, we would like to emphasize that we view observational supervision as a complement to NLP annotators. First, in situations where text reports or trained NLP models are not available, we show that gaze data is a viable alternative source of supervision. Second, if NLP-based labels are available, combining information from the gaze signals may improve label accuracy (since gaze may contain information beyond the reports, e.g., expert confidence), or the gaze data may be used alongside the NLP-derived labels to further improve performance, e.g., via the Gaze-MTL method we propose (R3, MR). While we mentioned these points on Page 3 of our original manuscript, we will emphasize them in the revised manuscript. As suggested by R1, we will expand our related work section to include the recently published techniques of Karargyris et al. [13] (R1).

Helper tasks in Gaze-MTL (R1, R3): A helper task is defined as the task of predicting the value of a gaze feature for each training image (R1). The loss function used for all helper tasks is the standard soft cross-entropy loss [23]. The equation in Section 3.3 should include different weights on the helper-task losses, as these hyperparameters may impact positive task transfer in MTL (R3). In our experiments, we chose each weight from the set {0, 0.5, 1, 2}, selecting the combination that achieved the highest validation accuracy (R3). Regarding experiments with a different number of helper tasks: in three of our medical tasks, the tuned weights corresponded to choosing a single helper task (i.e., a weight of zero was given to all but one helper task). The chosen helper tasks are those that are bolded in Table S.2 (we will include our hyperparameters in the code and revised manuscript) (R3). We will include the helper-task performances in a new column of Table S.2 (R3). We will add a column to Table 1 summarizing the ranges of the features shown in Figure S.1 (R3). We will integrate these clarifications into the revised manuscript.
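To make this description concrete, here is a minimal PyTorch-style sketch of the weighted Gaze-MTL objective: a shared trunk, a main classification head, and one head per gaze-feature helper task, with helper losses as soft cross-entropy and per-helper weights drawn from the {0, 0.5, 1, 2} grid mentioned above. The class and function names and the toy trunk are our own assumptions; the released code is the authoritative implementation.

```python
# Sketch of the weighted Gaze-MTL loss (illustrative; not the authors' exact code).
import torch
import torch.nn as nn

class GazeMTL(nn.Module):
    def __init__(self, trunk: nn.Module, feat_dim: int, n_helpers: int):
        super().__init__()
        self.trunk = trunk                        # e.g., a ResNet-50 backbone
        self.main_head = nn.Linear(feat_dim, 2)   # main binary task
        self.helper_heads = nn.ModuleList(
            [nn.Linear(feat_dim, 2) for _ in range(n_helpers)]
        )

    def forward(self, x):
        z = self.trunk(x)
        return self.main_head(z), [head(z) for head in self.helper_heads]

def soft_ce(logits, soft_targets):
    # Soft cross-entropy, named in the rebuttal as the helper-task loss.
    return -(soft_targets * logits.log_softmax(dim=-1)).sum(dim=-1).mean()

def gaze_mtl_loss(main_logits, helper_logits, y, helper_targets, weights):
    loss = nn.functional.cross_entropy(main_logits, y)    # main-task loss
    for w, logits, targets in zip(weights, helper_logits, helper_targets):
        loss = loss + w * soft_ce(logits, targets)        # weighted helper losses
    return loss

# Toy usage with a stand-in trunk; helper weights come from the validation
# grid search over {0, 0.5, 1, 2} described above.
trunk = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
model = GazeMTL(trunk, feat_dim=128, n_helpers=2)
x = torch.randn(4, 3, 32, 32)
y = torch.randint(0, 2, (4,))
helper_targets = [torch.rand(4, 2).softmax(dim=-1) for _ in range(2)]
main_logits, helper_logits = model(x)
loss = gaze_mtl_loss(main_logits, helper_logits, y, helper_targets, [0.5, 1.0])
```

Note that setting a helper weight to zero recovers the single-helper-task configurations the rebuttal reports as optimal.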

Reproducibility, datasets, and claims (R2, MR): We emphasize that both our code and datasets will be publicly released, making it possible to reproduce our results (as pointed out by R1, R3, R5) (R2). We also emphasize that we collected gaze data from three highly trained radiologists for CXR-P, an expensive effort (MR). While having one expert for METS is a current limitation, we hope our work inspires the community to contribute more gaze datasets from multiple experts (R2). Our claims in the “gaze data statistics” section are empirically supported by Figure S.1 (R2). Lastly, we used a standard CNN architecture (ResNet-50) commonly used for our tasks in the literature [4, 31] (R2).

[1*] M.R. Desselle et al. “Augmented and virtual reality in surgery.” Computing in Science & Engineering. 2020


