
Authors

Rong Tao, Guoyan Zheng

Abstract

In this paper, we address the problem of automatic detection and localization of vertebrae in arbitrary Field-Of-View (FOV) Spine CT. We propose a novel transformers-based 3D object detection method that views automatic detection of vertebrae in arbitrary FOV CT scans as a one-to-one set prediction problem. The main components of the new framework, called Spine-Transformers, are a one-to-one set-based global loss that forces unique predictions and a lightweight transformer architecture equipped with skip connections and learnable positional embeddings for encoder and decoder, respectively. It reasons about the relations of different levels of vertebrae and the global volume context to directly output all vertebrae in parallel. We additionally propose an inscribed sphere-based object detector to replace the regular box-based object detector for better handling of volume orientation variation. Comprehensive experiments are conducted on two public datasets and one in-house dataset. The experimental results demonstrate the efficacy of the present approach. A reference implementation of our method can be found at: https://github.com/gloriatao/Spine-Transformers.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87199-4_9

SharedIt: https://rdcu.be/cyl3L

Link to the code repository

https://github.com/gloriatao/Spine-Transformers

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    Applying transformers to 3D medical image detection; an inscribed sphere-based object detector.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Using the latest technique (transformer) to solve the existing challenge (vertebra detection).

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The application of the transformer seems straightforward. The InSphere detector seems interesting, but it is itself still a common image-processing technique.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Results seem reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    1. The dependency on the backbone network is not clear. What if the transformer relied on a different backbone network? What if we performed detection using ResNet50 only?

    2. Is the network trained from scratch independently for VerSe2019 and MICCAI-CSI 2014? 80 scans for training (VerSe2019) seems too small for a transformer network.

    3. How is the U-Net trained? Which data is fed to the U-Net for training?

  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Credit is given for the use of the latest technique (transformers) and the corresponding modifications (InSphere, refinement regressor) that make vertebra detection possible.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    2

  • Reviewer confidence

    Confident but not absolutely certain



Review #2

  • Please describe the contribution of the paper

    (1) The authors are the first to apply a transformers-based model to a 3D medical image detection task by designing a lightweight transformer architecture equipped with skip connections and learnable positional embeddings for encoder and decoder, respectively; (2) The authors formulated automatic vertebra detection in arbitrary field-of-view spine CT as a direct one-to-one set prediction problem and introduced a new one-to-one set-based global loss to force unique predictions and to preserve the sequential order of different levels of vertebrae. The method can reason about the relations of different levels of vertebrae and the global volume context to directly output all vertebrae in parallel; (3) The authors introduced a novel inscribed sphere-based object detector.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This is a novel application of transformer in 3D medical image detection.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The title is not appropriate, since detection includes localization and identification.
    2. The paper only evaluated the identification rate and localization error of the keypoints (i.e., the centers of the vertebral bodies), yet it utilized an object detection network to do this. The proposed method outputs the coordinates of the keypoints and the radius of the predicted sphere; the predicted radius is unnecessary for the keypoint detection purpose.
    3. The reason why the InSphere detector outperformed the classic box detector is not clearly presented in the paper. Why is the classic box detector sensitive to object orientation?
    4. The authors stated that the GIoU loss can increase localization accuracy but did not give an explanation. Moreover, the ablation study results in Table 3 do not cover the influence of the GIoU loss. Therefore, no evidence supports the effectiveness of the GIoU loss.
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper didn’t clearly describe what the learnable positional embeddings look like. Thus the reproducibility of the paper is poor.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Explaining why each design choice was made is very important for the paper.

  • Please state your overall opinion of the paper

    probably reject (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The motivation and experimental results are insufficient.

  • What is the ranking of this paper in your review stack?

    4

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #3

  • Please describe the contribution of the paper

    The authors propose a novel vertebra identification and vertebral center detection method based on transformers and a 3D U-Net for refinement. The proposed method has several significant original parts, such as a detector invariant to rotation and a skip connection from the backbone CNN to the output of the transformers for better computational efficiency. The method is evaluated on two public datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is very well written and very clear (the explanation of the different loss functions is remarkably clear). The artwork is also very nice.

    The InSphere detector is a very clever and simple idea to solve the problem of box detectors’ non-invariance to rotation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    An explanation of the cases in which the method fails is lacking.

    The ablation study does not seem to be complete.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The statement made by the authors is consistent with the manuscript. The code was not available in the supplementary materials; I assume the authors will release it later.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    1. The L-Error is quite large when compared to landmark detection, for instance in the skull. One could guess that this is because of vertebra misidentification rather than localization error. Is that correct? If so, I would find it interesting and useful to have, in the text, an idea of the L-Error on well-identified vertebrae.

    2. The ablation study does not seem to be complete: why didn’t the authors evaluate the impact of L_GIoU? Also, was it not possible to test the box detector with the edge loss and refinement?

    3. In Fig. 1, aren’t the images for the prediction and the ground truth inverted?

  • Please state your overall opinion of the paper

    strong accept (9)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is of remarkable clarity and brings several contributions to the field.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper is a pioneer in applying transformers to medical image detection, specifically vertebra detection. The paper is overall well written. The use of a transformer is also very natural, as the spine forms a sequence; this is similar to NLP, where transformers have achieved great success. The use of sphere coordinates is also very interesting. The reviewers raised some concerns about the missing ablation study on the GIoU loss. Some analysis of the failure cases would also be very helpful. The meta-reviewer feels the other weaknesses raised by the reviewers are minor. A further weakness: the authors highlight the one-to-one set prediction loss, which is very important for the vertebra detection task, but that section fails to generate any excitement; the techniques are either old or of minor novelty. Second, the authors have done their best to ensure a fair comparison by engaging a radiologist for annotation, which is appreciated. However, the meta-reviewer wants to point out that this still makes the experiments an unfair comparison, especially when the improvement over the state of the art is incremental.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    10




Author Feedback

We thank Meta-reviewer (MR) & all reviewers.

MR, R1, R2, R3: missing ablation study on the GIoU loss. The efficacy of the GIoU loss was previously investigated in [13] and was not counted as our contribution; that is why we did not conduct an ablation study on it. The InSphere detector and the edge loss are newly proposed by us. We conducted an ablation study to investigate the influence of these losses, and the results are shown in Table 3.

MR, R3: analysis of the failure cases. Rare anatomical variations, heavy metal artefacts, and shapes altered by a wide range of surgical implants may cause our method to fail. For example, in the VerSe2019 test data, the worst labelling result happened to be the only case that did not have L5 attached to the sacrum. As there is no such case in the training data, Spine-Transformers failed to learn this rare anatomical variation, incorrectly labelling L4 as L5. Nevertheless, as shown in Tables 1 and 2, evaluated on two public datasets, our method achieved equivalent or better results than the SOTA methods.

MR: minor novelty of the problem formulation. We do think that our formulation of vertebra detection as a one-to-one set prediction problem is novel, as it allows our method to directly output all vertebrae in a patch in parallel. Such a formulation is essential for training Spine-Transformers. Otherwise, the input patches during training contain only a local view of the data and cannot reason about the relations of the different vertebral classes and the global volume context, leading to ambiguous detection.

MR: fair comparison with other SOTA methods. We recently noticed that the VerSe2019 organizers released the annotations of the test and hidden datasets. We thus re-evaluated our method, taking the official annotations as the reference. Our method achieved average Id-Rates of 97.22% and 96.74% when evaluated on these two datasets, respectively. In comparison, the best two methods of VerSe2019 reported in [6] achieved average Id-Rates of 96.94% and 94.25%, respectively, when evaluated on the same test and hidden datasets.

R1: dependency on the backbone network? Our method is backbone-network agnostic, which is a clear advantage. The reason why we chose ResNet50 was a tradeoff between performance and GPU memory requirements.

R1: detection using only ResNet50? As pointed out in [5], due to the lack of global context in patch-based methods, using only ResNet50 or a U-Net will lead to ambiguous detection and large identification errors.

R1: is the network trained from scratch? Why are 80 scans enough? Yes, it is trained from scratch. To improve diversity and to enlarge the training data, we adopted patch-based training. Moreover, the skip connection also facilitates the training process. The results demonstrate the efficacy of our strategy.

R1: data for training the U-Net? For each detected vertebra, we generate a contextual heatmap from the predicted InSphere using a Gaussian kernel whose standard deviation equals 1.5 times the predicted radius. The contextual heatmap is concatenated to the original image. We then crop a sub-volume of fixed size around the detected center, which is used as the input to train the 3D U-Net for refinement.
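The pipeline described above can be sketched as follows (a minimal NumPy sketch, not the authors' implementation; the function names, default patch size, and zero-padding at borders are our assumptions):

```python
import numpy as np

def make_contextual_heatmap(volume, center, radius, sigma_scale=1.5):
    # Gaussian heatmap centered on the detected vertebral center;
    # sigma = sigma_scale * predicted InSphere radius (1.5x per the rebuttal).
    zz, yy, xx = np.meshgrid(*(np.arange(s) for s in volume.shape), indexing="ij")
    d2 = (zz - center[0]) ** 2 + (yy - center[1]) ** 2 + (xx - center[2]) ** 2
    sigma = sigma_scale * radius
    return np.exp(-d2 / (2.0 * sigma**2))

def crop_refinement_input(volume, heatmap, center, size=(64, 64, 64)):
    # Concatenate image + heatmap as channels, then crop a fixed-size
    # sub-volume around the detected center (zero-padded at the borders).
    stacked = np.stack([volume, heatmap], axis=0)  # (2, D, H, W)
    out = np.zeros((2,) + tuple(size), dtype=stacked.dtype)
    src, dst = [], []
    for c, s, dim in zip(center, size, volume.shape):
        start = int(round(c)) - s // 2
        lo, hi = max(start, 0), min(start + s, dim)
        src.append(slice(lo, hi))
        dst.append(slice(lo - start, hi - start))
    out[(slice(None),) + tuple(dst)] = stacked[(slice(None),) + tuple(src)]
    return out
```

The resulting two-channel sub-volume would then serve as the input to the refinement 3D U-Net.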

R2, R3: advantage of the InSphere detector over the box detector. Due to orientation variation, a vertebral body cannot be well characterized by horizontally/vertically aligned bounding boxes. The advantage of the InSphere detector over the box detector is shown in Fig. 2 and is demonstrated by the ablation study results.
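This orientation argument can be illustrated with a toy 2D example (ours, not from the paper): the axis-aligned bounding box of an elongated object grows as the object rotates, so its IoU with the unrotated ground-truth box drops, whereas the largest inscribed circle (sphere in 3D) is unchanged by rotation:

```python
import numpy as np

def aabb_of_rotated_rect(w, h, theta):
    # Closed-form axis-aligned bounding-box extents of a w x h rectangle
    # rotated by theta around its center.
    c, s = abs(np.cos(theta)), abs(np.sin(theta))
    return w * c + h * s, w * s + h * c

def centered_box_iou(w1, h1, w2, h2):
    # IoU of two axis-aligned boxes sharing the same center.
    inter = min(w1, w2) * min(h1, h2)
    return inter / (w1 * h1 + w2 * h2 - inter)

# Ground truth: a 40 x 20 box. As the object rotates, the box IoU drops,
# while the inscribed circle (radius min(40, 20) / 2) is rotation invariant,
# so its IoU with the ground-truth inscribed circle stays 1.
for deg in (0, 15, 30, 45):
    bw, bh = aabb_of_rotated_rect(40, 20, np.deg2rad(deg))
    print(f"{deg:2d} deg: box IoU = {centered_box_iou(40, 20, bw, bh):.2f}")
```

The sphere target is therefore insensitive to the patient/scan orientation, while an axis-aligned box target is not.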

R2: what do the learned positional embeddings (PEs) look like? Reproducibility? Encoder: the PEs show higher correlation for nearby locations than for faraway ones. Decoder: no such correlation is found, indicating that unique and distinguishable PEs are learned for each vertebral class. We will release the code soon.
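One way to inspect such structure, assuming access to the embedding matrix, is to compute its cosine-similarity matrix. Since the learned weights are not published, the sketch below uses standard sinusoidal positional encodings as a stand-in; they exhibit the same nearby-positions-correlate-more pattern the rebuttal describes for the encoder:

```python
import numpy as np

def sinusoidal_pe(num_pos, dim):
    # Standard transformer sinusoidal positional encodings, used here only
    # as a stand-in for the learned embeddings (which are not published).
    pos = np.arange(num_pos)[:, None]
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
    pe = np.zeros((num_pos, dim))
    pe[:, 0::2] = np.sin(pos * freqs)
    pe[:, 1::2] = np.cos(pos * freqs)
    return pe

def cosine_similarity_matrix(pe):
    # Entry (i, j) is the cosine similarity between the embeddings of
    # positions i and j; a visual plot of this matrix reveals whether
    # nearby positions are more correlated than faraway ones.
    normed = pe / np.linalg.norm(pe, axis=1, keepdims=True)
    return normed @ normed.T

sim = cosine_similarity_matrix(sinusoidal_pe(32, 64))
```

Plotting `sim` (e.g., as a heatmap) for the encoder's learned PEs would show a bright band near the diagonal, while the decoder's per-class PEs would show no such band.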

R3: reason for the large L-Error. Yes, it is because of misidentification. For example, evaluated on a private dataset with 1845 vertebrae from T11 to L5, our method achieved an average L-Error of 1.12 mm.

R3: Fig. 1 images inverted? Yes, we will fix it.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper is a pioneer in applying transformers to medical image detection, specifically vertebra detection. The paper is overall well written. The use of a transformer is also very natural, as the spine forms a sequence; this is similar to NLP, where transformers have achieved great success. The use of sphere coordinates is also very interesting. The authors addressed most of the reviewers’ concerns. I recommend acceptance, and it would be good if the authors could add the results on the VerSe2019 test data to the paper if it is finally accepted.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    6



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper received a quite negative review from reviewer R2, while the other two reviewers are quite positive. The rebuttal explains why there is no ablation study on the GIoU loss. Regarding clarity, I agree with R3 that the paper is quite readable. Overall, I recommend acceptance.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    3



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The work has addressed most of the concerns raised by the reviewers. In the final camera-ready version the authors should: 1) include the new L-Error results reported in the rebuttal; 2) include the new results obtained from the updated VerSe2019 challenge; 3) provide information on how many patches were used during training; and 4) discuss the L-Error results of Table 1, especially why [6] achieves better localization than the proposed method. Looking forward to seeing the complete version of this work at the MICCAI meeting.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    1


