
# Authors

Jiancheng Yang, Yi He, Kaiming Kuang, Zudi Lin, Hanspeter Pfister, Bingbing Ni

# Abstract

Modeling 3D context is essential for high-performance 3D medical image analysis. Although 2D networks benefit from large-scale 2D supervised pretraining, they are weak in capturing 3D context. 3D networks are strong in 3D context yet lack supervised pretraining. As an emerging technique, the *3D context fusion operator*, which enables conversion from 2D pretrained networks, leverages the advantages of both and has achieved great success. Existing 3D context fusion operators are designed to be spatially symmetric, i.e., performing identical operations on each 2D slice like convolutions. However, these operators are not truly equivariant to translation, especially when only a few 3D slices are used as inputs. In this paper, we propose a novel asymmetric 3D context fusion operator (A3D), which uses different weights to fuse 3D context from different 2D slices. Notably, A3D is NOT translation-equivariant while it significantly outperforms existing symmetric context fusion operators without introducing large computational overhead. We validate the effectiveness of the proposed method by extensive experiments on the DeepLesion benchmark, a large-scale public dataset for universal lesion detection from computed tomography (CT). The proposed A3D consistently outperforms symmetric context fusion operators by considerable margins, and establishes a new *state of the art* on DeepLesion. To facilitate open research, our code and model in PyTorch are available at https://github.com/M3DV/AlignShift.

SharedIt: https://rdcu.be/cyl6v

# Reviews

### Review #1

• Please describe the contribution of the paper
1. Focusing on the “meaningless translation-equivariance” in 3D medical image applications, a new asymmetric 3D context fusion operator, which gives different weights to different slices of a 3D input, is proposed to compensate for the weak 3D-context representation of 2D networks.

2. A3D establishes a new state of the art on DeepLesion benchmark with negligible computational overhead.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. Existing 3D context fusion operators are designed to be spatially symmetric. Because only a few 3D slices (3-7) are used as inputs, and because convolution-like operations are not truly translation-equivariant, it is not necessary to use spatially symmetric operators in pursuit of translation-equivariance for 3D context fusion. In this paper, the authors propose an asymmetric 3D context fusion operator (A3D), which gives different weights to different slices (depth-wise weights) of a 3D input to fuse 3D context. It’s a simple ‘plug and play’ tool and it works well, with negligible computational overhead.

2. The experiments in this paper are sufficient. The authors compared A3D not only with different 3D context fusion operators, but also with previous SOTA methods. It outperforms existing context fusion operators by considerable margins, establishing a new state of the art on DeepLesion. The description of the experimental setting is clear.

3. The writing of the article is very clear and easy to understand. The graph and table can support the argument of the authors.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

– The innovation described by the authors (weights along the depth dimension) is similar to an attention operation over feature channels or over the time dimension of video data, although it is efficient for 3D context fusion.

– The statement of “Note that AlignShift-based model is adaptive to imaging thickness, which is an orthogonal contribution to this study. The asymmetric operation-based methods can be further improved by adapting imaging thickness.” in Sect.3.2 Performance Analysis lacks evidence.

• Please rate the clarity and organization of this paper

Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors agree to share the (new) datasets and the code; I think this will be helpful for the reproducibility of the paper.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

– Unlike a simple attention operation over the time dimension of video data or over feature channels, the middle slice of the 3-7 input slices should be of paramount importance. This is prior knowledge. If you use it, I think it may make learning easier.

– Add an experiment for the statement of “Note that AlignShift-based model is adaptive to imaging thickness, which is an orthogonal contribution to this study. The asymmetric operation-based methods can be further improved by adapting imaging thickness.” in Sect.3.2 Performance Analysis.

borderline accept (6)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The method looks simple to use and effective. The writing is clear and easy to understand.

• What is the ranking of this paper in your review stack?

1

• Number of papers in your stack

4

• Reviewer confidence

Very confident

### Review #2

• Please describe the contribution of the paper

This work proposes an asymmetric 3D context fusion operator with learnable parameters for universal lesion detection. It is compared with other context fusion methods on the DeepLesion dataset. Ablations over different numbers of slices are also given to show its effectiveness.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

1 The idea to fuse asymmetric 3D context from adjacent slices using learnable parameters is extensively studied and summarized in this manuscript. Extensive experiments and comparisons with prior state-of-the-arts are done to show the effectiveness of the proposed method. Besides, the code and model are claimed to be made public after publication, which can be a plus.

2 This manuscript is clearly written, well organized, and easy to follow.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

1 More intuition and reasoning about why A3D obtains better results should have been given. I can imagine that more computational resources could potentially yield better results. Are there any other reasons or intuitions?

2 Besides the sensitivities at each FP level, the exact numeric FLOPs should be given in Table 2 and Table 3 to give a complete picture of the comparison.

• Please rate the clarity and organization of this paper

Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The code and model are claimed to be made public after publication, which is good. A few questions and suggestions are given in the weakness section.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

Suggestions are given in the weakness section.

Probably accept (7)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Clear description of the details and improvements of the results.

• What is the ranking of this paper in your review stack?

1

• Number of papers in your stack

5

• Reviewer confidence

Very confident

### Review #3

• Please describe the contribution of the paper

This paper proposed a context fusion method to leverage the 2D context from pre-trained networks and 3D context from the backbone networks. The asymmetric 3D context fusion operator (A3D) adjusts weights for feature fusion for different cases. The DeepLesion dataset is used to evaluate the proposed method, which outperforms the state-of-the-art methods by a certain margin.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

This paper proposed a new asymmetric 3D context fusion operator (A3D) with a learnable fusion matrix to translate the 3D context into 2D space. Extensive experiments and discussions are conducted to demonstrate the effectiveness of A3D.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The novelty of this paper is limited for MICCAI. The universal lesion detection model is similar to AlignShift [19].

• Please rate the clarity and organization of this paper

Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The code and model will be made available online. The implementation details are sufficient.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
1. How do the authors ensure that the 3D context features are preserved when translated to 2D space?
2. How is the ImageNet pre-trained model obtained for the current 2D network structures?
3. What do 2.5D and 3D represent in Fig. 3? Are they inferences from the model without the 3D fusion?
4. It would be clearer for Table 3 to categorize the state-of-the-art methods into groups, such as 2D vs. 3D vs. Fusion.

borderline accept (6)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

A simple yet effective context fusion method is proposed: the asymmetric 3D context fusion operator (A3D), which adjusts weights for feature fusion for different cases. The experimental results outperform the state-of-the-art methods for both the fusion methods and lesion detection methods. But the universal lesion detection model follows AlignShift [19], so the overall novelty is limited. Therefore, a borderline accept is suggested.

• What is the ranking of this paper in your review stack?

3

• Number of papers in your stack

3

• Reviewer confidence

Confident but not absolutely certain

# Primary Meta-Review

• Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The authors introduced a method to perform 3D context fusion. The authors assume that only a subset of slices (D) of a volume will be inputted into a model. They then apply a dense DxD FC layer to all slices in this subset, prior to a typical 2D convolution. This is where they break translational equivariance. Reviewers generally felt that the idea was interesting and experiments were sound, but felt that some claims were not sufficiently explained and linkages to other approaches (e.g., attention) not sufficiently explored.
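The fusion step summarized above can be illustrated with a minimal pure-Python sketch. It is not the authors' PyTorch implementation: the function names are hypothetical, the per-channel DxD weights follow the description in the author feedback, the spatial plane is flattened to N = H*W positions, and a circular shift along the slice axis is assumed purely to make the equivariance check easy to state. An unconstrained weight matrix does not commute with a slice shift, while a circulant (convolution-like) one does.

```python
def a3d_fuse(x, weights):
    """Fuse slices with a per-channel DxD matrix.

    x:       nested lists of shape [C][D][N] (N = H*W flattened).
    weights: nested lists of shape [C][D][D].
    Returns y with y[c][d][n] = sum_k weights[c][d][k] * x[c][k][n].
    """
    num_slices = len(x[0])
    num_pos = len(x[0][0])
    return [
        [
            [sum(weights[c][d][k] * x[c][k][n] for k in range(num_slices))
             for n in range(num_pos)]
            for d in range(num_slices)
        ]
        for c in range(len(x))
    ]


def shift_slices(x, s):
    """Circularly shift every channel by s positions along the slice axis."""
    return [[chan[(d - s) % len(chan)] for d in range(len(chan))] for chan in x]


def circulant(g):
    """Build a DxD circulant matrix from g (a symmetric, conv-like operator)."""
    n = len(g)
    return [[g[(d - k) % n] for k in range(n)] for d in range(n)]


# Hypothetical toy input: C=1 channel, D=3 slices, N=2 spatial positions.
x = [[[1, 0], [0, 1], [2, 2]]]
asym_w = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # unconstrained (asymmetric)
sym_w = [circulant([5, 1, 2])]                 # shared kernel (symmetric)

# Asymmetric fusion: shifting then fusing differs from fusing then shifting.
asym_equivariant = a3d_fuse(shift_slices(x, 1), asym_w) == shift_slices(a3d_fuse(x, asym_w), 1)
# Circulant fusion: the two orders agree under a circular shift.
sym_equivariant = a3d_fuse(shift_slices(x, 1), sym_w) == shift_slices(a3d_fuse(x, sym_w), 1)
print(asym_equivariant, sym_equivariant)
```

The contrast only holds for the assumed circular shift; with zero padding, as in practical convolutions, even the symmetric operator loses exact equivariance near the borders, which is the point the paper makes about small D.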

After reading the paper, I agree with the reviewers on the interesting nature of the idea and the comprehensive nature of the experiments. Yet, after reading the paper twice, important questions as to claims and justifications still remain. I feel these concerns are important enough to warrant a rebuttal - I look forward to the authors’ response.

The approach seems similar to window-based learning for 3D segmentation (where only a window is inputted to the model at a time), except here the window is only applied in the axial dimension. This brings up a few important comments that I detail below:

1. The authors' claim of breaking translational equivariance may be stretched in practice. Indeed, in inference one would wish to apply a detection model to the entire volume. In their scheme, this would require a scanning window approach to fully cover the volume. Applying a DxD FC layer in a scanning window fashion is analogous to a convolutional filter - so in the context of the window the operation may not be translationally equivariant, but in the context of the entire volume there seems to be translational equivariance. Can the authors comment on this and if necessary make their claims more precise?
2. Authors also state that they avoid border/padding effects. But, in inference, if you want to apply their method to the top of the volume (for example), you would still run into border effects. Can the model not be applied to the top or bottom of the volume? This should be made more clear, or the claims about avoiding border effects need to be tempered.
3. The experiments use window sizes in the D dimension of {3,7} on DeepLesion, but it is not made clear how the center slice was chosen in inference and in training to make the axial windows. *This is critical information*. In inference, do you apply a scanning window approach along the axial dimension to cover the entire volume with size 3 windows? Or did experiments assume the center slice was chosen already? (this would make it an incomplete experiment, since in deployment we can't assume the center slice is chosen already). If you use scanning windows, is there overlap in the scanning windows? These details need to be made very clear.
4. By the way, if overlapping scanning windows are needed, this increases computational complexity, and the claims about FLOPs in Table 2 need to have that caveat made clear. Also, multiplying C DxD matrices consumes C x D^3 FLOPs. Can the authors make clear how they produced the FLOPs numbers in Table 2?
5. What is the sensitivity to the choice of center slice to make the axial window? Would results differ greatly if there was variation in the center slice? If so, this seems a downside. If not, then it seems this is an argument for translational equivariance.
6. Relatedly, in inference, if the scanning windows overlap, how do you merge the results?
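The analogy in point 1 can be checked on a toy case. The sketch below is an illustration under stated assumptions, not part of the paper: it treats a single fusion layer acting on one voxel column along the axial axis, uses stride-1 windows, and keeps only the center-slice output of each window. The function names are hypothetical. Under those assumptions, the sliding DxD layer produces the same values as a 1D valid correlation whose kernel is the center row of the matrix.

```python
def center_output_sliding(column, w):
    """Slide a DxD layer over an axial column; keep each window's center output."""
    d = len(w)
    center = d // 2
    outputs = []
    for start in range(len(column) - d + 1):
        window = column[start:start + d]
        # Full DxD fusion inside the window (all D outputs are computed).
        fused = [sum(w[row][k] * window[k] for k in range(d)) for row in range(d)]
        outputs.append(fused[center])
    return outputs


def correlate_valid(column, kernel):
    """Plain 1D valid cross-correlation of the column with a length-D kernel."""
    d = len(kernel)
    return [
        sum(kernel[k] * column[i + k] for k in range(d))
        for i in range(len(column) - d + 1)
    ]


# Hypothetical values: a 5-slice column and an arbitrary 3x3 fusion matrix.
column = [1, 0, 2, 3, 1]
w = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

sliding = center_output_sliding(column, w)
conv_like = correlate_valid(column, w[len(w) // 2])
print(sliding, conv_like)  # both lists are identical
```

This covers only one linear layer; a network that stacks several fusion layers with nonlinearities in between is not reduced to a single filter by this argument, although the sliding-window model as a whole still responds identically to shifted windows.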

• What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

6

# Author Feedback

Dear meta-reviewers,

We appreciate all reviewers (R1, R2, R3) and meta-reviewers (MR1) for the high-quality reviews and positive feedback. All discussions in this rebuttal and typos will be addressed in the revision.

A. Whole-volume inference (MR1). A1. The DeepLesion dataset. There seems to be a misunderstanding about the DeepLesion dataset used in our experiment. Unlike standard medical imaging datasets with 3D volumes, DeepLesion provides 3D images but 2D annotations. Specifically, only one key slice is labeled with bounding boxes and long/short diameters. Axial slices within a 30mm range centered around the key slice are provided without annotations, which construct a 3D volume and serve as “3D context” in our models. Since all our experiments are conducted on DeepLesion, we do not need to sample key (center) slices or use overlapping windows. The model outputs are 2D bounding boxes and will be evaluated against the key-slice annotations.

For other 3D datasets that require whole-volume inference for all slices, all 3D context fusion models mentioned in our paper could be applied slice-by-slice in a sliding-window fashion. It is possible for A3D to directly output multiple-slice prediction since it preserves spatial features along the axial axis. However, investigating the application of A3D to those cases is out of the scope of this paper, and we will leave it for future work.

A2. Translation-equivariance. The overall framework could be easily run in a sliding-window fashion to work on the whole volume, which can be regarded as translation-equivariant. Within each sliding window, A3D is not translation-equivariant compared to other operators.

A3. Padding effects. The border effect in whole volumes is inevitable, as it is caused by the sliding windows rather than the model itself. However, the axial padding effect by convolutions could be undesirable even within a single window [10], especially for a limited number of slices. A3D avoids the axial padding effect by design. We will further clarify this point in the revision.

A4. Overlapping window. As mentioned in A1, sliding windows are not needed for DeepLesion since its annotation is in 2D. To apply whole-volume inference with A3D, we can conduct weighted average/blending over multiple predictions with overlap similar to existing methods. The size of overlaps may vary depending on the applications.
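The weighted average over overlapping windows mentioned in A4 could look roughly like the sketch below. It is a simplification under stated assumptions: each window is taken to produce one scalar score per slice (real detection outputs are bounding boxes, which need a box-level merging step not shown here), and the function name, the triangular position weights, and the example numbers are all hypothetical.

```python
def blend_windows(num_slices, window_preds, position_weights):
    """Weighted average of per-slice scores from overlapping windows.

    window_preds:     list of (start_index, scores), scores has one value per
                      slice in the window.
    position_weights: one weight per position inside a window, e.g. larger
                      for the center slice.
    Returns one blended score per slice, or None where no window covers it.
    """
    total = [0.0] * num_slices
    norm = [0.0] * num_slices
    for start, scores in window_preds:
        for offset, score in enumerate(scores):
            weight = position_weights[offset]
            total[start + offset] += weight * score
            norm[start + offset] += weight
    return [t / n if n > 0 else None for t, n in zip(total, norm)]


# Two hypothetical 3-slice windows overlapping on slices 1 and 2 of 5 slices.
preds = [(0, [0.2, 0.4, 0.6]), (1, [0.8, 0.4, 0.0])]
blended = blend_windows(5, preds, position_weights=[1.0, 2.0, 1.0])
print(blended)
```

Weighting the center position more heavily reflects the intuition that a window is most reliable for its middle slice; the rebuttal does not specify a weighting scheme, so this choice is only one option.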

We will refine the description of the DeepLesion dataset in the main text and explain how to apply A3D to the whole-volume inference settings in the revision.

B. FLOPs (R2 & MR1). The FLOPs in Table 2 are considered only inside a single module, e.g., HW C_i D^2 (einsum) + DHW C_i C_o K^2 (Conv) for A3D. At each (x,y) in the HW-plane, as the fusion weight is applied to each channel separately, it is equivalent to multiplying a DxD matrix by a length-D vector for each of the C_i channels (FLOPs = C_i D^2). As suggested by R2, we also provide the numeric FLOPs of the 3D backbone for 3/7-slice inputs. The additional FLOPs introduced by A3D are marginal given a two-decimal precision.

| GFLOPs | No Fusion | I3D | P3D | ACS | Shift | A3D |
|---|---|---|---|---|---|---|
| Slice=3 | 40.64 | 78.69 | 67.79 | 40.64 | 40.64 | 40.64 |
| Slice=7 | 94.83 | 183.61 | 158.18 | 94.83 | 94.83 | 94.83 |
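To see why the fusion term is small next to the convolution term, the per-module count quoted in B, HW C_i D^2 + DHW C_i C_o K^2, can be evaluated for a concrete layer. The sizes below are hypothetical and chosen only for illustration; they are not taken from the paper, and the function name is an assumption. The ratio of the two terms simplifies to D / (C_o K^2), independent of H, W, and C_i.

```python
def a3d_module_flops(h, w, c_in, c_out, k, d):
    """Per-module operation counts using the formula from the rebuttal.

    Returns (fusion_term, conv_term):
      fusion_term = H*W*C_i*D^2        (per-channel DxD fusion, einsum)
      conv_term   = D*H*W*C_i*C_o*K^2  (KxK 2D convolution on each of D slices)
    """
    fusion_term = h * w * c_in * d * d
    conv_term = d * h * w * c_in * c_out * k * k
    return fusion_term, conv_term


# Hypothetical layer: 64x64 feature map, 256 -> 256 channels, 3x3 kernel, D=3.
fusion, conv = a3d_module_flops(h=64, w=64, c_in=256, c_out=256, k=3, d=3)
ratio = fusion / conv  # equals D / (C_o * K^2)
print(fusion, conv, ratio)
```

For this assumed layer the fusion term is roughly 0.13% of the convolution term, which is consistent with the added cost disappearing at two-decimal GFLOPs precision. As the primary meta-reviewer notes, this counts a single window only and says nothing about whole-volume sliding-window cost.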

C. Comparison with self-attention (R1&MR1). The main difference between A3D and self-attention is that A3D uses static weights to fuse 3D context while self-attention uses dynamic weights, which is more computationally expensive. This is especially the case in terms of memory usage: compared with O(D^2) for self-attention, A3D only consumes O(D) and is thus more memory-efficient. Nevertheless, self-attention-based 3D context fusion is out of the scope of this paper, and is yet to be explored in medical images.

D. Comparison with AlignShift (R3). This paper focuses on the 3D context fusion, and we use the detection pipeline in AlignShift. The proposed plug-and-play A3D outperforms existing 3D context fusion operators (including AlignShift) with marginal computational overhead.

# Post-rebuttal Meta-Reviews

## Meta-review # 1 (Primary)

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

I very much appreciate the authors’ response. I do feel that there are interesting ideas here that I would like to see properly explored, but I unfortunately believe that there are serious shortcomings in the analysis that do not support the authors’ claims.

RE: DeepLesion, it is true that DeepLesion only provides 2D annotations for a selected key slice along with 3D context, but that is purely a weakness of the dataset and its benchmark, and is not meant to reflect an actual use-case. Indeed, typical detection methods, even though they are evaluated on DeepLesion, can still operate on a whole volume (even though DeepLesion doesn’t reveal how well these methods would perform on whole volumes). The authors’ method, as presented, can only operate if the key slices for all lesions are already known. That, in and of itself, is a challenging problem and is already half the “battle” in detecting lesions. I would also add that there are works that try to address the DeepLesion shortcomings by fully annotating all lesions in the subvolumes.

As the authors state, their method can be applied to a whole volume using a sliding window setup, but this is not explored. But, once a sliding window setup is implemented, the authors’ solution becomes equivalent to just a very large convolutional filter across the axial slices. It is true that within each sliding window, A3D is not translation-equivariant, but the same can be said for any convolutional filter within its “window”. The idea of a large convolutional filter across axial slices is not disqualifying and is potentially interesting; it’s just that the authors’ claims and explanations are not fully accurate. Finally, practicality could become a major issue, since the FLOPs could quite well be exorbitantly high when applying a very large convolutional filter across an entire volume. The authors only calculate the FLOPs within one window, which does not give a picture of how practical their approach is in a real use-case.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Reject

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

12

## Meta-review #2

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The paper addresses an important problem. The authors clarified most important aspects raised by the reviewers and meta-reviewer, which can be easily integrated in the camera ready version. I therefore recommend acceptance of the paper.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

5

## Meta-review #3

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This is a clear accept among the rebuttal papers (6,6,7). Positive contributions are all acknowledged. A deep analysis of the 3D network architecture for DeepLesion detection is necessary work, and the authors did a good job. The work is well motivated and clearly explained; the proposed method makes sense and has been clearly validated against several recent works, with improved performance reported.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

4