Authors

Junjun He, Jin Ye, Cheng Li, Diping Song, Wanli Chen, Shanshan Wang, Lixu Gu, Yu Qiao

Abstract

Recent studies have witnessed the effectiveness of 3D convolutions on segmenting volumetric medical images. Compared with the 2D counterparts, 3D convolutions can capture the spatial context in three dimensions. Nevertheless, models employing 3D convolutions introduce more trainable parameters and are more computationally complex, which may lead easily to model overfitting especially for medical applications with limited available training data. This paper aims to improve the effectiveness and efficiency of 3D convolutions by introducing a novel Group Shift Pointwise Convolution (GSP-Conv). GSP-Conv simplifies 3D convolutions into pointwise ones with 1x1x1 kernels, which dramatically reduces the number of model parameters and FLOPs (e.g. 27x fewer than 3D convolutions with 3x3x3 kernels). Naïve pointwise convolutions with limited receptive fields cannot make full use of the spatial image context. To address this problem, we propose a parameter-free operation, Group Shift (GS), which shifts the feature maps along different spatial directions in an elegant way. With GS, pointwise convolutions can access features from different spatial locations, and the limited receptive fields of pointwise convolutions can be compensated. We evaluate the proposed methods on two datasets, PROMISE12 and BraTS18. Results show that our method, with substantially decreased model complexity, achieves comparable or even better performance than models employing 3D convolutions.
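The 27x figure quoted in the abstract follows directly from the kernel volume (3 x 3 x 3 = 27 weights per input/output channel pair versus 1). A quick arithmetic sketch, assuming bias-free layers; the helper name and the channel counts are illustrative and not taken from the paper:

```python
def conv3d_params(c_in, c_out, k):
    """Weight count of a bias-free 3D convolution with a k x k x k kernel."""
    return c_in * c_out * k ** 3

c_in, c_out = 32, 64  # hypothetical channel counts, chosen only for illustration
full = conv3d_params(c_in, c_out, 3)       # standard 3x3x3 convolution
pointwise = conv3d_params(c_in, c_out, 1)  # 1x1x1 (pointwise) convolution
ratio = full // pointwise
print(full, pointwise, ratio)
```

The ratio is independent of the channel counts. The same factor applies to multiply-accumulate operations per output voxel, since each weight is used once per voxel.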

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87199-4_5

SharedIt: https://rdcu.be/cyl3H

Link to the code repository

N/A

Link to the dataset(s)

N/A

Reviews

Review #1

Please describe the contribution of the paper

The paper proposes a Group Shift operator for 3D medical image segmentation. With 3D data, the number of parameters grows rapidly when a CNN framework is applied. The paper tries to reduce the number of FLOPs by taking a more granular approach to convolution. Results are presented on two different datasets for experimental verification.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The novel Group Shift operator is proposed in the paper.
2. Hence the number of parameters goes down drastically.
3. The experiments are performed on two different datasets to validate the approach.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. A more detailed analysis would help in understanding the operator at large.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
1. The paper appears reproducible as far as the framework is concerned.
2. It seems that the BraTS 2018 validation and test data are not used for additional comparison. More detail is needed on what proportion of the data is used for training/testing.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
1. The proposed framework reduces the computational burden when working with 3D imaging. However, a brief analysis of computational complexity would give more insight and help validate the approach.
2. The tensorization approach is more common in image and video processing. The paper should address how the proposed approach differs, or whether it can also be applied to video data.
3. The paper could be extended to video data processing, where the challenges would be different. I hope the authors can extend the work in another paper or look into it.
4. It would be beneficial if the proposed operator could also take anatomical structure or information, such as non-local self-similarity, into account.
Please state your overall opinion of the paper

Probably accept (7)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
1. The reduction in the amount of computation for 3D data is a big advantage.
What is the ranking of this paper in your review stack?

3
Number of papers in your stack

5
Reviewer confidence

Confident but not absolutely certain

Review #2

Please describe the contribution of the paper

In this paper, the authors tackle the issue that 3D Convolutional Neural Networks (CNNs) have many more parameters and FLOPs than their 2D counterparts. The number of parameters makes it hard to train these networks with smaller amounts of data, while the FLOPs pose computational issues. Pointwise convolutions alleviate both problems, but they cannot gather context information properly. The authors propose a Group Shift Pointwise convolution operation. The feature maps are split into blocks in the spatial dimensions and these blocks in some channels are shifted. In this way, pointwise convolutions can have access to features from other locations. The operation adds no extra parameters. The proposed method is evaluated on public MRI datasets for prostate and brain tumor segmentation tasks with good results. The proposed method can be on par with or surpass fully 3D models.
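The mechanism summarized above (split the feature maps into spatial blocks, shift the blocks in some channels, then apply a pointwise convolution) can be illustrated with a rough NumPy sketch. This is not the authors' implementation. The function name, the choice to split the shifted channels into three chunks (one per axis), and the cyclic one-block roll are all assumptions made purely for illustration:

```python
import numpy as np

def group_shift(x, groups=(2, 2, 2), c_shift=None):
    """Illustrative shift: roll some channels by one spatial block per axis.

    x has shape (C, D, H, W). Assumption, not the paper's exact rule: the
    first c_shift channels are split into three chunks, and each chunk is
    cyclically rolled by one block along D, H, or W respectively.
    """
    c = x.shape[0]
    if c_shift is None:
        c_shift = c // 2  # hypothetical default: shift half of the channels
    out = x.copy()
    bounds = np.linspace(0, c_shift, 4).astype(int)  # three channel chunks
    for axis, n_groups in enumerate(groups, start=1):
        lo, hi = bounds[axis - 1], bounds[axis]
        block = x.shape[axis] // n_groups  # block length along this axis
        out[lo:hi] = np.roll(x[lo:hi], shift=block, axis=axis)
    return out

x = np.arange(8 * 4 * 4 * 4, dtype=float).reshape(8, 4, 4, 4)
shifted = group_shift(x, groups=(2, 2, 2), c_shift=6)
w = np.ones((3, 8))  # stand-in 1x1x1 kernel: 8 input channels -> 3 outputs
y = np.einsum("oc,cdhw->odhw", w, shifted)  # pointwise conv = channel mixing
```

After the shift, each output voxel of the pointwise convolution mixes unshifted channels from its own location with shifted channels that originated one block away along D, H, or W. The shift itself has no trainable weights.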
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The text is well written and organized. The authors motivated well the proposed work in terms of current issues of 3D CNNs and naive replacement of convolutions by pointwise ones. However, please, double-check the reference format, which does not seem aligned with the official guidelines.
- To the best of the Reviewer’s knowledge, the proposed Group Shift Pointwise convolution is novel.
- Results are surprisingly good. In some cases, the proposed method is on par with 3D approaches, or can even improve.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- The authors utilize a small FCN for this work. However, it would be interesting if the proposed method allows for training deeper networks that are otherwise hard or impossible to train in a fully 3D architecture due to the limits on the amount of data. If it does allow for it, it would be a strong point in favor of the proposed work.
- The authors report all results of the different combinations of parameters in prostate segmentation, but not in brain tumor segmentation. Also, those results are obtained in the validation set, which may be overly optimistic due to a potential validation set overfitting. In the case of brain tumor segmentation, the authors show in the end a comparison with SotA in the official validation set, which the Reviewer appreciated.
- The method seems sensitive to the hyper-parameters, such as the number of spatial groups and in which stage of the network. Since we have 3 spatial dimensions and 5 stages in the network, this suggests that there is a large number of combinations. It would be more appealing if the authors could come up with some kind of rule.
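For a sense of scale on the last point, here is a back-of-the-envelope count. It assumes that each of the 3 spatial dimensions in each of the 5 stages independently picks its number of groups from a small candidate set. The candidate set {1, 2, 4} is hypothetical and is not taken from the paper:

```python
from itertools import product

candidates = (1, 2, 4)   # hypothetical group counts per spatial dimension
n_dims, n_stages = 3, 5  # from the review: 3 spatial dimensions, 5 stages

# All per-stage settings, written as tuples such as (2, 2, 2).
per_stage = list(product(candidates, repeat=n_dims))
total = len(per_stage) ** n_stages  # independent choice in every stage
print(len(per_stage), total)
```

Even before considering which stages receive the operation or how many channels are shifted, an exhaustive manual search over such a space is impractical.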
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors use publicly available datasets only, which is appreciated, but due to the training-validation split, it may not be possible to reproduce the results. However, in the Reproducibility Response, the authors state that the code will be released. The main hyper-parameters are reported, such as learning rate, but the batch size is not. The Reproducibility Response contains answers that are not accurate. There is no mention of the computing infrastructure in terms of software and hardware, memory footprint, or average runtime of the experiments.

Reproducibility may be possible to some extent, but some information is missing.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

General

The authors propose a method for dealing with the huge increase in parameters and FLOPs in a 3D CNN as compared with their 2D counterparts. The motivation is valid and it is indeed a serious issue in 3D medical imaging modalities, such as MRI or CT. Moreover, these datasets tend to be small, which raises problems due to the increased number of parameters and training these networks. The authors propose a scheme to allow for pointwise convolutional layers to gather information from other regions of the images, such that the context loss of these layers is mitigated or bypassed. The proposed method is novel and the results are encouraging.

Comments/questions to the authors (not in order of importance)

1) The authors mention that large filters can be decomposed into smaller ones, and cite [1] for that. But, the actual work that first proposed such decomposition is [2]. Please, adapt accordingly.

2) The text is mostly well written and organized. This Reviewer would like to acknowledge the authors’ effort in doing so. Perhaps something that can be improved is the section with equations 2-8, which could have some more explanation that guides the readers. As it is, the equations are dropped there without much more explanation. The math can be followed, but it is a bit dense. The Reviewer also appreciated the research question and a clear enumeration of the problems of pointwise convolutions. There is a typo in the caption of Table 2 and section 3.2: PROMISER12 -> PROMISE12.

3) The authors mention that “When all the input feature channels are allowed to shift, the segmentation results (mDice=85.6%) are much worse”. The Reviewer believes that the written mDice is not correct, as it is the best of Table 2. Is the Reviewer’s reading right? If so, please correct the value.

4) The authors state that shifting happens in the same direction, as observed in Fig. 1-c. This means that the context information that is being shifted will be always from the same location to a given voxel. Would it be beneficial to somehow randomize some of the shifting operations?

5) The authors use a tiny 3D FCN in the context of this work. The Reviewer believes that it is done as a proof of concept, which is ok. But, why didn’t the authors follow up an experiment with a higher capacity network? This would be interesting especially to compare with SotA in Table 3. This Reviewer believes that it could show the benefits of the proposed method further. If we believe that the number of parameters is a real issue when using 3D data, we must see some performance decrease from deeper models due to a large number of parameters and difficulty in training. In principle, the proposed method should be able to allow training of deeper network compared with fully 3D, or, at least, allow for a more graceful performance decrease with the number of layers and depth. Could the authors comment on this?

6) The authors split the data into training and validation, as 80% and 20%, respectively. This means that the results of Table 1. and Table 2. are from the validation set, where the hyper-parameters are tuned. In this sense, some of the results may be overly optimistic due to overfitting to the validation set. At least in the case of BraTS, it seems possible to define a test set due to the higher number of training images; e.g. a 70%-15%-15% split. Could the authors comment on this, please? a) Nonetheless, in the case of BraTS, the authors indeed use the official validation set with blind annotations for comparison with SotA, which the Reviewer appreciated. Could the authors show the results of the fully pointwise model and 3D model in Table 3, too, please?

7) The authors report all the results for all combinations of parameters in the prostate dataset, but in BraTS, all results are reported in the text, as mDice. So, we cannot see how the different setups affect each of the classes of the brain tumor. It would be interesting to see those results, at least as supplementary material.

8) The authors check how many spatial groups are better in each stage of the network, to get the best performance. This is shown in Table 2 and the text of section 3.2. a) The authors report the spatial groups in the form of tuples, e.g., (2, 2, 2). It is fair to assume that it refers to (D, H, W), as defined for feature maps. But, the authors should define it clearly. Also, the Reviewer believes that the numbers refer to the number of groups, not to the number of pixels in each group. Is it correct? Please, make it clearer. b) The authors tune the number of groups in each dimension for each stage of the network. The best configuration for prostate segmentation is quite different from brain tumor. The Reviewer believes this is due to the different sizes of the images. Is there any rule of thumb regarding how these parameters can be set? At the moment, it seems that the search space is quite large, due to the multiple dimensions and stages. So, the method may not be so appealing. It would be very interesting if there were some observations regarding this problem. For example, when one does filter decomposition [2], a very nice property is that we can just repeat the defined block and it usually works well.

9) The authors observed that in BraTS, adding the GS operation to the decoder only is better than adding to both the decoder and encoder. It would be interesting to see further investigations on this. The current hypothesis does not seem too convincing, without any citation to support the claim that the encoder extracts low-level localization features. Actually, by the end of the encoder, the features should be quite high-level, and that is why long skip connections exist for segmentation, i.e., to bring back the low-level details while up-sampling the feature maps [3]. Can the authors comment on this, please?

10) It seems that the References do not follow the usual guidelines for Springer. The Reviewer believes that references should be sorted alphabetically by the family name of the first author. The Reviewer kindly asks the authors to double-check this and adapt if needed.
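Regarding comment 6, the suggested three-way split can be set up in a few lines. The sketch below is a minimal example assuming case-level splitting with a fixed seed. The function name, the case identifiers, and the case count are hypothetical:

```python
import random

def split_cases(case_ids, fractions=(0.70, 0.15, 0.15), seed=0):
    """Shuffle case IDs reproducibly and cut them into train/val/test."""
    ids = sorted(case_ids)             # fixed order before shuffling
    random.Random(seed).shuffle(ids)   # seeded, so the split is repeatable
    n_train = round(fractions[0] * len(ids))
    n_val = round(fractions[1] * len(ids))
    # Whatever remains after train and validation goes to the test set.
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

cases = [f"case_{i:03d}" for i in range(100)]  # hypothetical identifiers
train, val, test = split_cases(cases)
print(len(train), len(val), len(test))
```

Hyper-parameters such as the spatial group setting would then be tuned on the validation portion only, with the test portion evaluated once at the end.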

Further comments (suggestions/extra comments on future work) - NOT intended to be addressed during rebuttal

1) As mentioned, the introduced hyper-parameters may not be so appealing in the proposed method, as they can differ with the dimensions and stages within the network. In future work, it would be interesting to see if a sensible soft rule exists. For example, if the number of groups can be defined in relation to the size of the dimensions.

References

[1] Szegedy, Christian, et al. “Rethinking the inception architecture for computer vision.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[2] Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).

[3] Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. “U-net: Convolutional networks for biomedical image segmentation.” International Conference on Medical image computing and computer-assisted intervention. Springer, Cham, 2015.
Please state your overall opinion of the paper

accept (8)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

3D CNNs tend to be harder to train, from both data and computational point of views. This paper proposes to tackle such issue by simplifying the convolutions into pointwise ones, but still providing contextual information. The Reviewer believes that the method is novel, and while more experimental results could have been provided, the community can benefit from knowing about the proposed work.
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

5
Reviewer confidence

Very confident

Review #3

Please describe the contribution of the paper

The paper proposes group shift (GS), an operation that basically rearranges the feature maps within a CNN. With GS, pointwise convs (1x1x1 in 3D) can be used in a U-Net without significant performance loss due to, e.g., a limited receptive field, but with considerably fewer parameters and FLOPs. The method is evaluated on PROMISE12 and BraTS 2018 and compared to several baselines.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper is well-written. The method is well described, and the figures are supportive and nicely drawn. The methodology is an interesting concept
- Experiments on two public datasets
- Extensive experiments on the positioning and type of GS
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
The method is not superior to other methods but requires a lot of manual tuning (spatial group setting, position and type of GS, C_S, ….). This might really limit the applicability of the method, and one might just choose to use a standard 3D U-Net like the nnU-Net by Isensee et al.

Further weaknesses:
- Evaluation: no variability of the results is reported (boxplots, standard deviations).
- The discussion is weak. At least the aspect that the fewer params and FLOPs come at the cost of heavy additional tuning (C_S, position of GS, encoder/decoder differently) without considerable performance gains is worth discussing.
- Related work: I think there could be more references to related work, such as maybe Wu et al. CVPR 2018 1711.08141 and others?
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The method is not reproducible as no public code seems to be provided (no (blinded) link or similar in paper). A lot of questions in the checklist are answered with “Yes” but actually not present in the paper.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

Please add the baselines of Table 1 to Table 2/3. Otherwise, the reader has to go back to Table 1 to see the baseline result.

As one of the benefits of using pointwise convs is the reduced number of parameters, the proposed method could maybe work on the entire images (not patches through random cropping)? This might benefit the performance. I think it would be valid to make use of this advantage in the comparison.

Good that the extreme case C_S = C is shown. A more detailed sensitivity analysis on C_S would be nice (maybe supplementary files). Further, there is no clear trend visible for the spatial group settings, which makes it difficult for other researchers to start using the method. Maybe the discussion could elaborate on best practices and recommendations?

Fig 1: the horizontal space between a), b), and c) could be enlarged slightly for better visual separability. Fig 2: I don’t know if this figure is really necessary, as it is just a generic U-Net. Could also be in supplementary material. Eq. 6: comma missing. Typos: FCNsd, PROMISER12.
Please state your overall opinion of the paper

borderline accept (6)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper proposes an interesting methodology to use pointwise convs (1x1x1) by rearranging feature maps through grouping and shifting in a U-Net. The benefits are fewer learnable parameters and FLOPs compared to standard 3D CNNs with comparable, although not superior, performance. My main concern is the heavy tuning and number of parameter choices that come with the utilization of the method.
What is the ranking of this paper in your review stack?

3
Number of papers in your stack

5
Reviewer confidence

Confident but not absolutely certain

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

This paper proposes to address the computational complexity of 3D convolution kernels by introducing a group-shift operator.

One reviewer highlights the computational advantage of the proposed group-shift operator.

A second reviewer also appreciates the computational advantage, but would like to further test the limits of the proposed pointwise convolution operator.

A third reviewer further appreciates the computational benefit of the operator but has concerns about the “heavy tuning and number of parameter choices”.

All three reviewers have a clear consensus on the technical contribution of the group-shift operator in enabling fast 3D convolutions, which is crucial in medical imaging. The reported results indicate performance on par with standard 3D convolutions, but additional experiments testing the limits of the proposed group-shift operator would be beneficial for understanding when to use it, as would a discussion of the increased hand tuning and number of parameters.

For these reasons, the work has potentially high impact, but a discussion of the operator’s limitations in 3D, as well as of the changes in the number of parameters and their tuning, is missing and would further help a decision. The recommendation is towards an invitation for a rebuttal.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

9

Author Feedback

Dear Area Chair,

Thank you very much for organizing the review process and sending us the reviewers’ comments. We appreciate a lot that you gave us the opportunity to do a rebuttal. Thanks also to the three reviewers for their valuable comments and suggestions. Two major concerns are addressed in this rebuttal. The corresponding responses are summarized as follows.

Comment 1: The limits of the proposed group shift operation should be investigated with additional experiments utilizing deeper networks. Response: Thanks to the reviewers for giving this important recommendation. We want to clarify that reducing the model parameters of the baseline (a tiny 3D U-Net) is not our ultimate goal. Instead, our main target is to build a model that can achieve fast inference with satisfactory performance when compared to state-of-the-art methods rather than the baseline. In our submission, we instantiated our idea (pointwise FCN with group shifting) with a tiny 3D U-Net architecture for three major reasons. Firstly, the tiny 3D U-Net on its own is already a lightweight model. We were eager to see if we could further reduce the model complexity substantially without compromising the segmentation accuracy. Secondly, the tiny 3D U-Net model has been utilized in many related studies (e.g. S3D-UNet, 3D-ESPNet, NVDLMED, etc.). Therefore, we believe that it can serve as a very good starting point for our study. Besides, utilizing the same baseline can ensure fair comparisons between the different methods and can help generalize the abundant methods in the field. Finally, with deeper networks, it would be more difficult to investigate the effects of different configurations (more combinations of hyper-parameters occur). On the other hand, we agree with the reviewers that adopting more complex networks is a way to further validate the benefits of the proposed method. We have also experimented with the classical 3D U-Net (2x channel numbers compared to the tiny 3D U-Net) and the results confirmed the superiority of the proposed method. Currently, we are still trying to automate the process of finding the optimized configurations for our proposed method (more details in the response to Comment 2), with which deeper networks can be readily employed and thoroughly evaluated.

Comment 2: The issue of the heavy tuning of the hyper-parameters (number of spatial groups, inserting locations, etc.). Response: Sorry for missing the discussion on this important issue. We will make the corresponding modifications in our final paper. We must admit that we are still not making the most of the capability of the proposed group shift operation. As noticed by Reviewer 2, the best configuration of our method depends slightly on the dataset. To this end, in our following work, we will investigate the effects of the imaging modality, spacing, volume size, and the target object size on the choice of the best model configurations. We will also try to design dedicated network architectures according to these properties of the dataset. Particularly, the number of stages, the number of channels in each stage, the number of convolution operations in each “Conv Block”, and the number of “Conv Blocks” in both the encoder and the decoder will be optimized accordingly. Together with the different settings of the proposed group shift operation, all these factors will build a large search space. We are considering introducing the neural architecture search (NAS) method to automate the optimization process, so that the best model can be automatically generated for each task. However, these experiments cannot be included in the current paper; instead, we will present all of this as a discussion of the need for follow-up studies. Hopefully, we can publish all these results in another paper soon. Nevertheless, the manual tuning experiments in the paper are sufficient to validate the effectiveness of the proposed group shift operation as a proof of concept.

Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The authors have clarified the limits of their experiments, notably on the exploration of deeper networks and its limited environment for comparisons, and the necessary heavy tuning of hyper-parameters, which could be automated in future work. These clarifications confirm the dependency on hyper-parameters, but, as highlighted by the authors, the submission in itself is sufficient to validate the benefits of the proposed group-shift operation. This paper may serve as a foundation for future work, potentially impacting a broader range of applications.

For these reasons, the recommendation is toward acceptance.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

13

Meta-review #2

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The paper itself is of good quality, with novel ideas and good experimental results. The reviewers’ concerns were minor. Although experiments with deeper networks and an analysis of hyper-parameters were not shown in the paper, the authors promised to investigate them in a following work.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

2

Meta-review #3

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The authors investigate the feasibility of replacing classical convolutions with pointwise conv. and group shifting. This is an interesting approach to tackle the very important problem of complex and parameter-greedy 3D convolutional networks in medical image analysis. It is indeed a pity that the authors did not test their approach with deep and parameter-intensive networks, which would be the regime where this approach is most useful. However, I still find the paper very valuable in its current form and can therefore recommend the acceptance of the paper. It would be important to include the discussion on hyper param. tuning in the camera ready version. Enlarging Fig. 1 would be helpful too. Finally, a qualitative discussion/comparison with other parameter reduction techniques based on geometric symmetries would be beneficial: Bekkers, E., Veta, M.L.M., Eppenhof, K., Pluim, J., Duits, R.: Roto-translation covariant convolutional networks for medical image analysis; and Andrearczyk, V., Fageot, J., Oreiller, V., Montet, X., Depeursinge, A.: Local rotation invariance in 3D CNNs.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

2

Group Shift Pointwise Convolution for Volumetric Medical Image Segmentation