
Authors

Indu Ilanchezian, Dmitry Kobak, Hanna Faber, Focke Ziemssen, Philipp Berens, Murat Seçkin Ayhan

Abstract

Deep neural networks (DNNs) are able to predict a person’s gender from retinal fundus images with high accuracy, even though this task is usually considered hardly possible by ophthalmologists. Therefore, it has been an open question which features allow reliable discrimination between male and female fundus images. To study this question, we used a particular DNN architecture called BagNet, which extracts local features from small image patches and then averages the class evidence across all patches. The BagNet performed on par with the more sophisticated Inception-v3 model, showing that the gender information can be read out from local features alone. BagNets also naturally provide saliency maps, which we used to highlight the most informative patches in fundus images. We found that most evidence was provided by patches from the optic disc and the macula, with patches from the optic disc providing mostly male and patches from the macula providing mostly female evidence. Although further research is needed to clarify the exact nature of this evidence, our results suggest that there are localized structural differences in fundus images between genders. Overall, we believe that BagNets may provide a compelling alternative to the standard DNN architectures also in other medical image analysis tasks, as they do not require post-hoc explainability methods.
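
To make the architectural idea above concrete, here is a minimal illustrative sketch (assumed PyTorch, not the authors' released genderBagNets code): a small CNN with a limited receptive field scores each local patch, and the per-patch class logits are averaged into the image-level prediction.

```python
# Minimal sketch of the BagNet principle described in the abstract (assumed
# PyTorch; not the authors' released implementation). A CNN with a small
# receptive field scores each local patch, and per-patch class logits are
# averaged to obtain the image-level prediction.
import torch
import torch.nn as nn

class TinyBagNet(nn.Module):
    def __init__(self, n_classes: int = 2, patch_features: int = 64):
        super().__init__()
        # Few small convolutions keep the receptive field roughly patch-sized,
        # so each spatial position only "sees" a local patch of the fundus.
        self.local_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, patch_features, kernel_size=3, stride=2), nn.ReLU(),
        )
        # Linear class read-out applied independently at every spatial position.
        self.patch_classifier = nn.Conv2d(patch_features, n_classes, kernel_size=1)

    def forward(self, x):
        feats = self.local_encoder(x)                 # (B, F, H', W')
        patch_logits = self.patch_classifier(feats)   # (B, C, H', W') = local class evidence
        image_logits = patch_logits.mean(dim=(2, 3))  # average evidence over all patches
        return image_logits, patch_logits             # patch_logits doubles as a saliency map

# Example: a batch of two 224x224 "fundus images"
model = TinyBagNet()
images = torch.randn(2, 3, 224, 224)
logits, evidence_map = model(images)
print(logits.shape, evidence_map.shape)  # torch.Size([2, 2]) torch.Size([2, 2, 55, 55])
```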

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87199-4_45

SharedIt: https://rdcu.be/cyl4F

Link to the code repository

https://github.com/berenslab/genderBagNets

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper
    • This paper used a BagNet for the task of gender classification from retinal fundus images. Exploiting the interpretable nature of BagNets, the authors analyzed the evidence underlying the gender classification.
    • In the experiments, the authors presented an interesting analysis of the evidence a DNN uses for gender classification.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Overall, the paper is well written.
    • This paper shows interesting analyses on the gender classification from retinal fundus images by using BagNets.
    • The application of BagNet for gender classification looks new.
    • The authors showed interesting results and analysis in the experiments.
    • The results might be interesting for the community.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The BagNet is simply borrowed from the machine learning community; the technical contribution looks marginal.
    • The authors used previously developed techniques.
    • Using BagNets is not a new strategy in the medical domain [22].
    • Comparison with other approaches (other interpretation approaches) is limited.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The code is not shared, but the details of the dataset and experiments are well described in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • It would be great to clarify what the new findings are compared with [23], where the optic disc and vessels were also reported as major evidence for gender classification from retinal fundus images.
  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the technical contribution is limited (BagNet is applied to the medical domain without any modification, and this is not even the first use of BagNets in medical applications), I think the comprehensive analysis of the evidence for gender classification is interesting. This paper shows a well-designed application of BagNets in the medical domain for understanding the evidence behind a new task, which may give new insight to the community.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    3

  • Reviewer confidence

    Confident but not absolutely certain



Review #2

  • Please describe the contribution of the paper

    The authors applied a DNN architecture termed BagNet, which extracts local features from small image patches and then averages the class evidence across all patches. The results showed significant differences between optic disc and macula patches with respect to gender. The dataset comprised over 84,000 subjects with 174,465 fundus images from both eyes and multiple visits per participant. In addition, 29 fundus images from patients (11 male, 18 female, all older than 47 years) at the University Eye Hospital were used with permission of the Institutional Ethics Board. This DNN naturally provides saliency maps, which were used to highlight the most informative patches in fundus images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The use of a freely available public dataset is the main strength of this paper.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The results found in [23] with an Inception-v3 trained on the UK Biobank dataset showed an AUC of 0.97 for predicting the patient’s gender, and the most relevant areas were the optic disc, vessels, and other zones.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The experimental setup is not clear, making it difficult to reproduce the results. Why not use a dataset from the related-works section instead of generating a new one?

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    The use of the data from the University Eye Hospital is not clear… Why not use it as a different test set? The final distribution of male and female images after applying the EyeQual networks is not reported. The authors cite several works in the related-works subsection, but no baseline method from that section is used for comparison against the proposed method. On the other hand, the work in [23] used an Inception-v3 trained on 284,335 patients and validated on two independent datasets of 12,026 and 999 patients, reporting an AUC of 0.97 with a much larger number of images. Why not compare the proposed method against the works from this section? The results are not clear. The use of t-SNE is not a good option; I strongly recommend UMAP and other techniques to represent the obtained results.

  • Please state your overall opinion of the paper

    borderline reject (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • The lack of novelty and the unclear relevance of the problem.
    • The lack of baseline methods from the literature against which to compare the results.
    • The experimental setup is not clear, making it difficult to reproduce the results.
  • What is the ranking of this paper in your review stack?

    5

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #3

  • Please describe the contribution of the paper

    This work focuses on the open question of which features allow reliable discrimination between male and female fundus images. In this paper the authors used a particular DNN architecture called BagNet, which extracts local features from small image patches and then averages the class evidence across all patches. The experimental results showed that BagNet performed on par with the more sophisticated Inception-v3 model, showing that the gender information can be read out from local features alone.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    In this work, the experimental results showed that the BagNet performed on par with the more sophisticated Inception-v3 model, demonstrating that gender information can be read out from local features alone. Also, BagNets naturally provide saliency maps, which can be used to highlight the most informative patches in fundus images. Overall, the authors’ aim with this work is to open the path to making BagNets a compelling alternative to standard DNN architectures in other medical image analysis tasks as well, since they do not require post-hoc explainability methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The authors claim that, although further research is needed to clarify the exact nature of this evidence, the presented results suggest that there are localized structural differences in fundus images between genders. One major flaw of the paper, in my opinion, is the lack of motivation for the task of gender classification from fundus images. It is reasonable to classify gender from face images, silhouettes, gait or other measures mostly in forensic or surveillance applications. However, I do not see the applicability of this particular one. I would have liked to see that motivated by the authors.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Regarding the experiments, the method is well explained for reproducibility purposes. However, in my view reproducibility will be highly compromised by the availability of the data.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    The paper is well written, scientifically sound and well organised. The experiments are well conducted and detailed. However, I do not see the applicability of this particular task of gender classification from fundus images, I recommend the authors to motivate their work more convincingly.

  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well written, scientifically sound and well organised. The experiments are well conducted and detailed. In my view a major flaw of the paper is the lack of motivation for the task of gender classification from fundus images. Thus, my recommendation is borderline accept.

  • What is the ranking of this paper in your review stack?

    3

  • Number of papers in your stack

    4

  • Reviewer confidence

    Confident but not absolutely certain




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    All the reviewers appreciated the novelty of the interpretability approach of the classification network. The work could find use outside the retinal imaging domain and be of broad interest to the MICCAI community. Nevertheless, there are a few points that should be addressed in a rebuttal:

    • Argue the contribution and the novelty with respect to the alternative interpretability schemes that were omitted as baselines.
    • Report the final distribution of male and female subjects after the filtering.
    • Clarify the exact intended role of the University Eye Hospital dataset.
  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    1




Author Feedback

We thank all reviewers and the area chair for their feedback. We appreciate their constructive comments and the chance to clarify some points.

Datasets (R2): The UK Biobank dataset was used both by us and by [23]; unlike [23], we additionally performed quality filtering, as image quality was often poor. Korot et al. (2021, Sci Rep) also used only the UK Biobank data and obtained ROC performance comparable to ours. The EyePacs dataset used in [23] is not publicly accessible; its public version is the DR Kaggle challenge dataset, which does not contain gender information. We used the images from our hospital as an independent test set to assess generalization. We will clarify this in the text.

Open code and data (R1, R3): We are committed to sharing code and data openly. We did not include a GitHub link in the manuscript to preserve anonymity; of course, all code will be made publicly available on GitHub. Due to the MTA of the UK Biobank, we cannot share the data, but researchers can apply for access, and subsidized fees are available. We therefore disagree with the notion that data access significantly compromises our work.

Motivation of the task (R2, R3): Beyond assisting physicians, one hope for AI in retinal imaging is to uncover biomarkers that are not easily found by humans. In that sense, this is a proof-of-principle study for the open scientific problem of gender differences in eye anatomy, which has recently attracted considerable interest in the retinal imaging community and does not yet have a decisive answer. The BagNet can serve as a hypothesis generator for differences between the two genders that will need to be tested further using other imaging modalities. The framework will also be applicable to disease diagnostics, where it will improve interpretability and may help to discover biomarkers. We will emphasize both points in the revised version.

Novelty & contribution (R1, R2): Our contribution is twofold: (1) we introduce BagNets as an interpretable-by-design architecture for image analysis in ophthalmology, and (2) we use them to narrow down the hypothesis space for the important question of sex differences in eye anatomy. Regarding this question, [23] reports that optic discs, vessels and non-specific features are relevant to DNN decisions. This appears to be the case for almost all the dependent variables in [23], so it is very hard to derive testable hypotheses for gender-specific differences. Our study adds to that substantially: we show that (1) local features are sufficient to classify gender, and (2) the optic disc and macular region are the most informative about gender across the population (through average saliency maps). Also, (3) the “most male” and “most female” patches allow us to identify potential regions of interest. In contrast, [22] is a preliminary study in histopathology that also uses BagNets, but for different purposes.

Benchmarks & comparison (R1, R2): As an accuracy baseline for the BagNet, we used Inception-v3, which was also used by [23]. We achieved ROC performance comparable to Korot et al. (2021), which exploited Neural Architecture Search. Benchmarking saliency maps is harder, but we will show examples derived with classical saliency methods and discuss more explicitly the advantages of BagNet-derived saliency maps. Most importantly, the logit maps returned by BagNets are interpretable as is, whereas standard saliency maps require fine-tuning and post-processing.
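
As an illustration of the point that BagNet logit maps are interpretable as is, the sketch below shows how per-patch logits could be turned into a gender-evidence map. It reuses the hypothetical TinyBagNet interface from the earlier sketch and is not the authors' implementation.

```python
# Sketch of reading out BagNet patch logits as a saliency ("logit") map without
# post-hoc attribution methods (illustrative only; reuses the hypothetical
# TinyBagNet from the earlier sketch).
import torch
import torch.nn.functional as F

def gender_evidence_map(model, image, male_idx=0, female_idx=1):
    """Per-patch evidence for 'male' minus 'female', upsampled to image size."""
    model.eval()
    with torch.no_grad():
        _, patch_logits = model(image.unsqueeze(0))                         # (1, 2, H', W')
    evidence = patch_logits[:, male_idx] - patch_logits[:, female_idx]      # (1, H', W')
    evidence = F.interpolate(evidence.unsqueeze(1), size=image.shape[-2:],
                             mode="bilinear", align_corners=False)
    return evidence[0, 0]  # positive = male evidence, negative = female evidence

# A population-level map (as in point (2) of the novelty paragraph above) could
# simply average such per-image maps over a test set, e.g.:
# mean_map = torch.stack([gender_evidence_map(model, img) for img in test_images]).mean(0)
```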

Class priors before/after filtering: (46% male, 54% female) vs. (47% male, 53% female).

Use of t-SNE instead of UMAP (R2): It has recently been shown that the most frequently touted difference between the two algorithms, the supposedly better global layout in UMAP (Becht et al., Nat Biotech, 2019), can be accounted for by differences in the initialization of the default implementations (Kobak & Linderman, Nat Biotech, 2019). For emphasizing cluster structure, t-SNE with a heavy-tailed kernel is in fact the method of choice (Kobak et al., ECML, 2019; Böhm et al., arXiv, 2020).
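
For completeness, here is a minimal sketch of the embedding setup argued for above, assuming the openTSNE package; the dof argument (values below 1 give a heavier-tailed kernel than standard t-SNE) is an assumption about how that package exposes the heavy-tailed kernel, and the feature matrix is a placeholder.

```python
# Sketch of the embedding choice argued for above: t-SNE with an informative
# (PCA) initialization for global layout and a heavy-tailed kernel to emphasize
# cluster structure. Assumes the openTSNE package; dof < 1 is taken here as the
# heavy-tailed-kernel setting.
import numpy as np
from openTSNE import TSNE

features = np.random.randn(1000, 128)  # placeholder for DNN feature vectors

embedding = TSNE(
    perplexity=30,
    initialization="pca",  # informative initialization helps preserve global layout
    dof=0.6,               # heavier-tailed kernel than standard t-SNE (dof=1)
    random_state=42,
).fit(features)
```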




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors clarified in the rebuttal the roles of the datasets used and the code-release issue. It is also clear that the task of gender detection is used as an example of a subclinical biomarker discovery task and is not an end in itself. The contribution of using BagNets to reduce the feature-contribution hypothesis space is meaningful.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    5



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    According to the reviews and AC comments, this is a typical borderline submission. The reviews are largely consistent, although they lean toward opposite sides of the borderline. The topic is interesting and new to the community; however, the technical novelty is very limited. The experiments are well conducted, especially the explanation of the findings; however, clinical/biological evidence is absent. With the strong support of the primary AC, I agree to accept this paper for wider discussion at MICCAI.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    10


