
Authors

Gijs van Tulder, Yao Tong, Elena Marchiori

Abstract

Multi-view medical image analysis often depends on the combination of information from multiple views. However, differences in perspective or other forms of misalignment can make it difficult to combine views effectively, as registration is not always possible. Without registration, views can only be combined at a global feature level, by joining feature vectors after global pooling. We present a novel cross-view transformer method to transfer information between unregistered views at the level of spatial feature maps. We demonstrate this method on multi-view mammography and chest X-ray datasets. On both datasets, we find that a cross-view transformer that links spatial feature maps can outperform a baseline model that joins feature vectors after global pooling.
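
To make the contrast with late fusion concrete, the sketch below illustrates the core idea of cross-view attention over spatial feature maps: every location in one view's feature map attends to all locations in the other view's map, so information is exchanged before any global pooling. This is an illustrative PyTorch sketch under assumed names and shapes, not the authors' implementation; the published model adds further components such as tokenization.

```python
# Minimal sketch of cross-view attention between two unregistered views.
# Illustrative only: class and parameter names are hypothetical and the
# authors' architecture differs in detail.
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        # Queries come from the target view; keys and values from the source view.
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (batch, channels, H, W); source: (batch, channels, H', W')
        b, c, h, w = target.shape
        q = target.flatten(2).transpose(1, 2)    # (batch, H*W, channels)
        kv = source.flatten(2).transpose(1, 2)   # (batch, H'*W', channels)
        # Each target location attends to all source locations.
        out, _ = self.attn(q, kv, kv)
        # Fold the attended features back into the target's spatial layout.
        return out.transpose(1, 2).reshape(b, c, h, w)
```

A late-join baseline, by contrast, concatenates the two views' feature vectors only after global pooling, so no spatial correspondence between the views can be exploited.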

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87199-4_10

SharedIt: https://rdcu.be/cyl3M

Link to the code repository

https://vantulder.net/code/2021/miccai-transformers/

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a transformer model that links different views at the level of spatial feature maps to obtain a richer feature representation for downstream classification tasks. The model unifies two tasks (image registration and deep-learning-based classification) within a single deep learning model.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    - Novel application that combines different views of unregistered medical images; it is useful to be able to combine features from both views without aligning the views through image registration techniques.
    - Detailed ablation study covering single-view, late-join, and cross-view models.
    - Uses transformer multi-head attention to link relevant areas between views at the feature-map level, instead of using image registration to combine the views at the input level.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    - The authors should have compared the proposed method against existing state-of-the-art methods to show its effectiveness.
    - It is not clear which classification tasks are considered in the results analysis.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    - Uses public datasets.
    - It is indicated that the source code will be made public.
    - The supplementary file details the model architecture.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    - Compare against existing methods on a fixed task to demonstrate the effectiveness of the proposed method.

  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, this looks like an interesting approach to linking different views using a multi-head attention module inspired by the transformer model. It should help extract richer features.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    The authors propose a cross-view transformer-based model to process multi-view unregistered images. The multi-head cross-attention enables the network to accumulate features from correlated locations in both views. The authors also propose to tokenize features to reduce computational overhead, which in turn improves performance.
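
    For context, the tokenization mentioned here reduces the source view's H×W spatial positions to a small number of learned tokens before attention, so the attention cost scales with the token count K rather than with H·W. The sketch below is a hypothetical illustration of this idea; the 1×1-convolution weighting and all names are assumptions, not the authors' exact formulation.

    ```python
    # Hypothetical sketch: summarize a (batch, channels, H, W) feature map
    # into K tokens that can serve as keys/values for cross-attention.
    import torch
    import torch.nn as nn

    class Tokenizer(nn.Module):
        def __init__(self, channels: int, num_tokens: int = 16):
            super().__init__()
            # One learned spatial attention map per token.
            self.weights = nn.Conv2d(channels, num_tokens, kernel_size=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, c, h, w = x.shape
            a = self.weights(x).flatten(2).softmax(dim=-1)  # (batch, K, H*W)
            v = x.flatten(2).transpose(1, 2)                # (batch, H*W, channels)
            return torch.bmm(a, v)                          # (batch, K, channels)
    ```

    Attending over K tokens instead of H'×W' positions reduces the attention cost from O(HW · H'W') to O(HW · K).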

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Novel architecture: The cross-view transformer is an interesting idea and can be beneficial for other medical image problems as well.
    2. Moderately strong evaluation: Authors validate on multiple datasets with an improvement over baseline models.
    3. Well-written paper: the problem statement and the method are well explained.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Comparison with the state of the art is missing for both datasets. The authors only compare with their own baseline models. Comparison with SOTA is also important to appreciate the contribution.
    2. An analysis of the computational overhead of the proposed cross-view transformer is missing. Transformers are usually computationally expensive; for a fair comparison, one would compare the baseline and proposed methods with similar numbers of learnable parameters.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors used publicly available datasets and the methods are described with sufficient details to be able to reproduce the results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    1. It is not clear why the cross-view transformer is applied after block two, nor why the transformer is used only once. More experiments to support these choices are desired.
    2. Why is a second transformer needed in the symmetric network formulation? Ideally, the two should share weights, because the task at hand is to compute cross-attention between two views, which should be independent of the choice of source and target.
    3. Visual inspection of the cross-attention would be of significant clinical interest, to see what the network learns for unregistered views.
    4. The argument for why positional embedding is not required for cross-attention is not clear. The authors should explain the reason behind it, either experimentally or through a clearer argument.
  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The technical novelty of this paper is adequate, and the experimental evidence supports it. There are a few drawbacks, such as the missing comparison with SOTA and the missing analysis of computational overhead. However, the contribution outweighs the drawbacks, and hence my recommendation is “probably accept”.

  • What is the ranking of this paper in your review stack?

    3

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain



Review #3

  • Please describe the contribution of the paper

    The paper presents a cross-view transformer method to enhance the feature representations across views without image registration.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The method achieves better performance than single-view methods and naive late-join methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The paper only considers semantic relations between cross-view features. However, the geometric information [1] [2] of lesions is not considered in the methodology, which implies the method may suffer from the lesion-mismatch problem.

    2. More ablation studies and analysis are needed: for example, why does each proposed component work?

    [1] Liu, Y., Zhang, F., Zhang, Q., Wang, S., & Yu, Y. (2020). Cross-View Correspondence Reasoning Based on Bipartite Graph Convolutional Network for Mammogram Mass Detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.

    [2] Ma, J., Liang, S., Li, X., Li, H., Menze, B. H., Zhang, R., et al. (2019). Cross-view relation networks for mammogram mass detection.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper requires more specific implementation details. It seems hard to reproduce.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    The formulas in the paper need to be improved; the mathematical symbols are confusing. For example, Section 3.2 states that “f is the number of feature maps”: does f represent the number of channels of the feature maps?

  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The model enhances the cross-view feature representations without image registration.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    2

  • Reviewer confidence

    Confident but not absolutely certain




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The authors propose a cross-view transformer-based model to process multi-view unregistered images. The model unifies two tasks (image registration and classification) within a single deep learning model. The cross-view transformer is an interesting idea, and the application to combining different views of unregistered medical images is novel. However, the authors should have compared the proposed method against existing state-of-the-art methods to demonstrate its effectiveness.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    4




Author Feedback

We would like to thank the reviewers for their comments and suggestions. We are happy with the positive reception of our novel cross-view transformer method and experiments on two datasets with unregistered images.

The reviewers rightfully highlight that it is important to compare with existing state-of-the-art methods, although they do not mention which ones. As discussed in our Related Work, other methods either combine features after global pooling, or use an ROI-based approach to match regions. As far as we are aware, there is little work that considers cross-view analysis of unregistered images, especially in an end-to-end model.

For mammography classification, the paper by Wu et al. [15] achieves state-of-the-art results. However, this is based on a large, private dataset, which makes it difficult to compare the results with our method directly. As an imperfect solution, we offer our ablation study with a late-join method that is similar to the architecture used by Wu et al. [15].

Our aim in this paper was to evaluate the cross-view transformer as an architecture for cross-view image analysis. Although the experiments do not allow us to show a performance improvement over the state-of-the-art in mammography classification, we do believe that our ablation study allows a comparison between our cross-view transformer model and the late-join architectures that are commonly used in existing state-of-the-art works.

Reviewer #2 mentions the computational overhead of the method. In our experiments the additional computational cost was not prohibitive, but it does depend on where the transformer is applied. We have updated the paper to include some additional information.

Reviewer #2 also thinks the argument against positional embedding is not sufficiently clear, and suggests additional experiments. We will keep this in mind for future work. We do believe that relative positional embedding (as used in some transformer models) is problematic across views, because the relative distance between unregistered views is not clearly defined: a position in one view has no well-defined displacement with respect to any position in the other view. However, some other form of position encoding, or the geometric information suggested by Reviewer #4, might be helpful.

For future work, we will also remember Reviewer #2’s suggestion for a visual inspection of the cross-attention scores, as well as Reviewer #2 and #4’s suggestions for additional ablation studies considering choices like the position of the transformer and the symmetry of the attention weights.

[15] Wu et al. (2019). Deep Neural Networks Improve Radiologists’ Performance in Breast Cancer Screening. IEEE Transactions on Medical Imaging.


