
Authors

Nicolás Gaggion, Lucas Mansilla, Diego Milone, Enzo Ferrante

Abstract

In this work we address the problem of landmark-based segmentation for anatomical structures. We propose HybridGNet, an encoder-decoder neural architecture which combines standard convolutions for image feature encoding, with graph convolutional neural networks to decode plausible representations of anatomical structures. We benchmark the proposed architecture considering other standard landmark and pixel-based models for anatomical segmentation in chest x-ray images, and found that HybridGNet is more robust to image occlusions. We also show that it can be used to construct landmark-based segmentations from pixel level annotations. Our experimental results suggest that HybridGNet produces accurate and anatomically plausible landmark-based segmentations, by naturally incorporating shape constraints within the decoding process via spectral convolutions.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87193-2_57

SharedIt: https://rdcu.be/cyhMA

Link to the code repository

https://github.com/ngaggion/HybridGNet

Link to the dataset(s)

http://db.jsrt.or.jp/eng.php

https://www.isi.uu.nl/Research/Databases/SCR/data.php


Reviews

Review #1

  • Please describe the contribution of the paper

    A landmark-based hybrid segmentation network is proposed in this work. In the primary contribution, HybridGNet, two variational autoencoders are trained to reconstruct input images and the landmark-based segmentation masks, respectively. Once trained in the VAE setting, the encoder of the image VAE and the decoder of the graph VAE are combined into the HybridGNet. This hybrid model is then fine-tuned to obtain a landmark-based segmentation for a given image. Experimental evaluation against simple baselines shows an improvement. Additionally, the work demonstrates the usefulness of this method in the case of occlusions, showing large improvements compared to dense segmentations obtained from a UNet.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Use of spectral graphs to represent landmark-based anatomical segmentations, instead of point distributions or dense segmentations, is a useful contribution. As demonstrated in the experiments with occlusions, this can be helpful in overcoming missing data.

    • Graph convolution networks (GCNs) have been shown to be useful in learning relationships in graph-structured data which are not easily captured using CNNs; this can be particularly advantageous when enforcing shape constraints.

    • Combining CNNs and GCNs using pretrained components from two VAEs can help in learning meaningful features in an unsupervised manner.

    • Experimental evaluation is thorough and the results are competitive compared to the graph encoding baselines used (PCA, VAE) and the multi-atlas segmentation method.

    • The robustness study using occlusions further highlights the usefulness of a landmark-based segmentation method.

    • The paper is largely well written with a reasonably complete literature review.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Obtaining graphs to represent landmark-based segmentations is a challenging task, and the formulation in this work relies on one strong simplifying assumption: the same number of nodes for all images. While this is a reasonable assumption for reducing the complexity of working with graphs of varying numbers of nodes, what are the implications of this on different images? Is there a justification for assuming an equal number of nodes?

    • The baseline models are well thought out. A form of ablation-study baseline that would have been interesting to see is the HybridGNet without the pretrained VAEs. How much would the encoder/decoder suffer, if at all, without this stage of pretraining?

    • The formulation of the dual models is ambiguous. Are there two decoders in the HybridGNet? That is, is the same latent variable reconstructed into both a graph and a dense segmentation? How are the labels for this stage of training obtained? So, in some sense, are both dense segmentations and landmark-based contours available as labels?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Authors have committed to providing training code and models.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    See comments above.

  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Combining CNNs with GCNs is performed in a useful manner to obtain landmark based segmentation. However, some aspects (see weaknesses) are not well justified.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain



Review #2

  • Please describe the contribution of the paper

    The paper proposes a hybrid CNN and GCN network for landmark-based segmentation. The work uses an encoder-decoder architecture to extract latent features from the image space and uses a similar GCN encoder-decoder network to reconstruct and obtain graph landmarks. The paper leverages the advantages of having a fixed number of landmark pixels and fixed connectivity to use spectral graph convolutions. The experiments on different network architectures, dense segmentation methods, landmark-based methods, and robustness validate the contributions.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Use of CNN and GCN as a hybrid network for landmark-based segmentation is a novel formulation.
    • The authors nicely leverage the fixed number of landmark points and connectivity to use the spectral graph convolutions across the dataset.
    • The comparisons across different baselines and experiments highlighting robustness by adding occlusion are useful in clinical settings.
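The fixed landmark count and connectivity mentioned above are exactly what make spectral graph convolutions applicable: the graph Laplacian is shared across the whole dataset. As a rough illustration (a sketch only, not the authors' implementation; the first-order Kipf-Welling approximation and all sizes here are assumptions):

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A+I) D^-1/2."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_layer(H, A_norm, W):
    """One first-order spectral graph convolution with ReLU activation."""
    return np.maximum(A_norm @ H @ W, 0.0)

# Toy example: 4 landmarks on a closed contour (cycle graph), 2-D coordinates.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.standard_normal((4, 2))   # node features: (x, y) landmark positions
W = rng.standard_normal((2, 8))   # learnable weights, 2 -> 8 features per node
out = gcn_layer(H, normalized_adjacency(A), W)
print(out.shape)  # (4, 8): one 8-dim feature vector per landmark
```

Because the same adjacency `A` is valid for every image, the learned filters transfer across the whole dataset, which is the property the reviewer highlights.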
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Is there a pooling strategy used for the GCN encoder-decoder network? If not, how is the latent feature dimension kept constant for the CNN and GCN networks? Please discuss this part in the paper.
    • Can the model learn if the hybrid network is trained from scratch end-to-end, rather than trained independently and later fine-tuned?
    • The results of the paper are mostly reported as plots rather than tables, which makes them more difficult to read.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have reported making the code and the trained model public for reproducing the results upon acceptance of the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • Why is the dual-path hybrid network with the added dense masks not helping segmentation?
    • From the plots in Fig. 3 (a & b), is there a specific reason for the Dice performance not correlating with the Hausdorff distance? I.e., HybridGNet Dual SC has the better Dice but not the best HD, and similarly lower performance in the robustness experiments?
    • Though the HybridGNet performs better in Exp. 1, why is the model's performance not shown for Exps. 2 and 3?
    • In the robustness experiments, is the occlusion size on the x-axis the number of pixels?
    • In Contributions (4) that -> than
  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • The use of a CNN and a spectral GCN together for landmark-based segmentation is a novel application in medical image analysis.
    • The experiments validate the claims made in the contributions.
    • The paper is well written and easy to follow.
  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #3

  • Please describe the contribution of the paper

    The manuscript proposes to combine a convolutional neural network with graph convolutional neural networks to extract graphs directly from images and perform landmark detection and dense segmentation. A number of approaches are proposed and compared with other methods: landmark (contour) detection based on a VAE, PCA, and multi-atlas registration, and dense segmentation based on a UNet. A dataset of 247 X-ray images with annotations of lungs, heart, and clavicles is used for this.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The approach is an interesting variation on landmark-based segmentation/detection and is relevant to the MICCAI community.
    • Good performance is seen on finding landmarks.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The exact novelty is not concisely specified and not clearly put in the context of prior work. In particular, I would have liked a sentence or two on the state of the art in landmark detection for statistical shape analysis, as this appears to be the main purpose of the work.
    • I find it a bit unclear exactly what is evaluated (see detailed comments).
    • The dense segmentation evaluation is weak in the sense that performance is only really tested in an unrealistic occlusion scenario.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • I am unsure what exactly is evaluated in some parts of the manuscript. Are dense segmentations produced by the convolutional decoder, or derived from the landmark contours?
    • It seems like the authors intend to publish code on acceptance, as an anonymized link is provided.
    • The dataset used appears to be well described in prior publications.
    • The dataset reference [28] does not appear to describe the expert annotations used in this work, and it is unclear how easy they are to obtain.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • Regarding the training procedure of the hybrid network: first, two networks are trained (an image-to-image network and a graph-to-graph network, each consisting of encoder and decoder parts), and then a final network going from image to graph is assembled from the pretrained image-to-image encoder and the graph-to-graph decoder. This seems like a recipe for catastrophic forgetting if there is little overlap in the encoding between the models. Did you consider training a network with two decoders instead, such that the translation goes from image to both image and graph? This would more explicitly force the two tasks to share an encoding.

    • “We can see that not only the UNet, but also the dual models models which incorporate dense segmentation masks during training, degrade the performance faster than the HybridUNet as we increase the size of the occlusion block.” I do not understand this sentence. Figure 3 does not appear to show results of the HybridUnet, only results of the Dual and Dual SC variants and the UNet, all of which include dense segmentations in training. Even if the HybridUnet were included in the figure, the wording should perhaps still be changed, as the HybridUnet also includes dense segmentations in its training, in the pre-training phase. The word “models” is also repeated here.

    • “Since we cannot compute the landmark MSE for the dense segmentations, we computed the Dice coefficient converting considering the convex hull of the landmarks.” Aside from the minor grammatical errors, please also clarify: do you actually compute the convex hull and use it as the segmentation? Because the shapes are likely not convex.

    • Regarding the output used in testing/evaluation: HybridGNet predicts graphs, while HybridGNet Dual (SC) predicts both segmentations and graphs. Since segmentations can be produced from graphs simply as what is inside the shape bounded by the connected landmarks, it might be good to clarify what is actually used in the evaluation of these methods.

    • Consider extending Fig. 1 with illustrations of the Dual architectures.

    • “HybridUnet” or “HybridGNet”? Possibly just a typo.

    • The paper includes a long list of references for a MICCAI paper. I do not think all of these are needed and maybe it could be shortened to provide room for improvements in other aspects.

  • Please state your overall opinion of the paper

    borderline reject (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I don’t think the paper has any major weaknesses so my recommendation is based on an overall view of the work. As a whole I find it interesting and relevant, but maybe a little below what I would expect of a MICCAI paper.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    4

  • Reviewer confidence

    Confident but not absolutely certain




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes a segmentation method that aims at locating boundary landmarks using a graph convolution network to capture the global representation of anatomical structures of interest.

    One reviewer appreciates the use of GCNs in the context of landmark-based segmentation but has concerns about the graph construction and the dual segmentation models.

    A second reviewer finds the hybrid GCN+CNN original, and has minor questions on the decoding branches of the GCNs and the evaluation metrics.

    A third reviewer likewise appreciates the GCN approach for landmark-based segmentation, and has a minor concern about possible catastrophic forgetting.

    All reviewers agree in appreciating the GCN+CNN integration for landmark-based segmentation, but have raised valid concerns, notably on the graph construction, the decoding branches, and possible catastrophic forgetting. These should be addressed to further improve the contribution.

    In summary: there is general appreciation for the work, but the authors should address these specific points of clarification on the GCN. The recommendation is an invitation to rebuttal.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    10




Author Feedback

All reviewers value the idea of integrating GCN+CNN for landmark-based segmentation. In fact, R1 and R2 ranked our work as the best paper in their stack of 5, and R3 ranked it second out of 4. We will now address the major concerns identified by the AC, and clarify the minor points in the camera ready.

= Graph construction = R1 points out that graph construction based on a fixed number of nodes is a strong assumption, and asks for a justification of this premise. There are several reasons that justify it. First, we would like to highlight that we are dealing with landmark-based segmentation of organs. In such structures (e.g., lungs or heart), manually annotated landmarks tend to correspond to unique and distinguishable points, indicating anatomical or other characteristic landmarks (e.g., the ventricular apex of the LV in a cardiac image) which can be annotated in all images. At the same time, as discussed in the manuscript, landmark-based anatomical segmentation has historically been dominated by statistical shape models, which are normally constructed under the same assumption of a fixed number of nodes. But more importantly, having a fixed number of landmarks (nodes in our graphs) makes it straightforward to establish correspondences among the images, enabling population-level studies of morphological variations, which is central to computational anatomy.
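The correspondence argument above can be made concrete with a small sketch (node counts and the cycle topology here are illustrative assumptions, not the paper's actual graph): every image contributes the same K landmarks per organ contour, connected as a closed cycle, so landmark i in one image corresponds to landmark i in every other image.

```python
import numpy as np

def cycle_adjacency(k):
    """Adjacency matrix of a closed contour with k landmarks (a cycle graph)."""
    A = np.zeros((k, k))
    idx = np.arange(k)
    A[idx, (idx + 1) % k] = 1   # edge to the next landmark along the contour
    A[(idx + 1) % k, idx] = 1   # symmetric edge back
    return A

K = 44                 # illustrative landmark count for one organ contour
A = cycle_adjacency(K)

# Because every image has the same K nodes in the same order, landmark
# matrices stack into a population tensor with point-to-point correspondence,
# so population statistics (e.g. a mean shape) are well defined:
n_images = 10
rng = np.random.default_rng(1)
landmarks = rng.standard_normal((n_images, K, 2))  # (image, node, x/y)
mean_shape = landmarks.mean(axis=0)
print(mean_shape.shape)  # (44, 2)
```

With variable node counts, neither the shared adjacency nor the per-node averaging above would be defined without first solving a correspondence problem.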

= Dual segmentation models and decoding branches of the GCNs = R1 calls for better clarification of how the dual models are constructed and trained, while R2 wonders how the latent feature dimension is kept constant for the CNN and GCN networks. For the latter, we simply employ a fully connected layer as a bottleneck, which reduces the dimensionality of the encoder output to a fixed size. The same strategy is used for the CNN and GCN models, in both the single and dual variants. Regarding the dual models, as suggested by R1, they have two decoder branches, which are fed with the same latent variable (the output of the encoder). The difference between them is that one reconstructs a graph, while the other reconstructs a dense segmentation. Since the dataset only contains landmark annotations, we derive the corresponding dense masks for training by filling the organ contours, assigning a different label to every organ. As noted by R3, this is not exactly the convex hull, so we will correct this in the camera ready.
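A minimal sketch of this bottleneck-and-dual-decoder idea (all layer sizes are invented for illustration, and plain matrices stand in for the actual convolutional and spectral graph layers):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

# Encoder output (e.g. a flattened CNN feature map) is reduced to a
# fixed-size latent code by a single fully connected layer, as described.
feat_dim, latent_dim = 512, 64
W_fc = rng.standard_normal((feat_dim, latent_dim)) * 0.01

# Two decoder heads are fed with the SAME latent variable:
n_nodes, mask_side = 120, 32
W_graph = rng.standard_normal((latent_dim, n_nodes * 2)) * 0.01        # landmark head
W_dense = rng.standard_normal((latent_dim, mask_side * mask_side)) * 0.01  # dense-mask head

features = rng.standard_normal(feat_dim)
z = relu(features @ W_fc)                             # shared latent code

landmarks = (z @ W_graph).reshape(n_nodes, 2)         # graph branch: (x, y) per node
mask_logits = (z @ W_dense).reshape(mask_side, mask_side)  # dense branch: logits
print(landmarks.shape, mask_logits.shape)  # (120, 2) (32, 32)
```

The key point is that both branches read the same `z`, so the latent dimension is decoupled from both the image resolution and the graph size by the fully connected bottleneck.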

= Evaluation metrics = R2 wonders why the Dice metric for the HybridGNet Dual SC does not correlate with the Hausdorff distance (HD) in the occlusion experiments of Figure 3. This is due to the fact that this model incorporates skip connections (SC), which propagate fine-grained pixel-level information from the encoder into the decoder. Thus, in the presence of occlusions, the model tends to assign incorrect classes in the occluded areas, and such errors have a heavier influence on the HD than on the Dice metric.

= Catastrophic forgetting = R3 asks whether the fusion of a CNN encoder with a GCN decoder may incur catastrophic forgetting (CF). CF refers to the fact that networks cannot be trained on new tasks without forgetting previous ones. This is not the case in our model, since we use the pre-trained weights as initialization and then fine-tune on the task of interest, akin to self-supervised learning, where pretext tasks (reconstruction in our case) are used for pre-training. Thus, the fine-tuning does not result in catastrophic forgetting, since the aim of the pre-training was mainly to provide a good initialization for the subsequent optimization.

Minor points:
• R3 wonders if we considered training a network with two decoders instead, so that the translation goes from image to both image and graph. This is in fact what we do in the dual models.
• R1 and R2 wonder about the impact of pre-training vs. training from scratch. We found that pre-training certainly helps to achieve much faster convergence and reduce the training time.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have clarified their methodology, notably on the graph construction, dual models, and catastrophic forgetting. The clarifications on graph construction and dual models are helpful, as indeed, from a clinical and practical point of view, using a fixed number of landmarks is a valid scenario. The dual model is a methodological choice, and the added explanation improves its understanding. However, I would disagree on whether initializing from pre-trained weights would necessarily prevent catastrophic forgetting. There is no guarantee with such an initialization, unless I am mistaken. Nonetheless, this does not jeopardize the potential impact of this contribution on landmark-based segmentation. This paper makes advancements in a useful segmentation approach.

    For these reasons, Recommendation is toward Acceptance.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    14



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The present paper received mixed reviews: R1 and R2 are supportive and R3 is negative. R3's main concern is the paper's unclear presentation, which made the contributions difficult to understand.

    I feel that the rebuttal does not fully answer this concern; however, the authors do clarify some doubts and thereby help improve the understanding of the paper.

    Overall, I support the acceptance of the paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    5



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This work proposes an approach to incorporate a priori statistical knowledge about shapes into an image segmentation method, through the use of a graph convolutional network inspired by statistical shape models (Cootes et al.). In principle, this approach is interesting, and the rebuttal clarified issues raised by the reviewers. However, I disagree with R1 and R2 in that the experimental evaluation, in my opinion, is not thorough enough, even considering the fact that a new methodology is proposed. The use of the JSRT dataset on its own allows only limited conclusions to be drawn, since it is regarded as a comparatively simple dataset for segmentation. While worthy of a proof of concept, the proposed approach has to be studied on other, more complex segmentation datasets as well, with more significant variations; these should be abundant given the large number of segmentation challenges and the data from statistical shape and appearance model based segmentation over recent decades. Moreover, besides a UNet segmentation result as a comparison (which is not described in any detail regarding the experimental setup), the evaluation does not include the best results that others have produced on the JSRT dataset. Performance on this dataset in terms of Dice coefficients is very high, making it a somewhat saturated baseline. Moreover, the robustness experiment is very unrealistic, motivated only by stating “anonymization” as a potential reason for its design without further reasoning, and it also disfavors the UNet approach very strongly. If a UNet were required to be robust to such artificial occlusions, one could incorporate them into its training in the form of data augmentation, which would allow a much fairer comparison.

    Overall, my assessment of this work tends towards rejection in its current form, with the encouragement to continue down the interesting road the authors have started and to revise and improve the experimental validation of the proposed approach.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    13


