
# Authors

Zhicheng Zhang, Lequan Yu, Xiaokun Liang, Wei Zhao, Lei Xing

# Abstract

Low-dose computed tomography (LDCT) has attracted growing attention in routine clinical diagnosis, therapy planning, and related applications because it reduces the X-ray radiation dose delivered to patients. However, the noise caused by low X-ray exposure degrades CT image quality and, in turn, affects the accuracy of clinical diagnosis. In this paper, we train a transformer-based neural network to enhance the final CT image quality. Specifically, we first decompose the noisy LDCT image into two parts: high-frequency (HF) and low-frequency (LF) components. Then, we extract content features (X_{L_c}) and latent texture features (X_{L_t}) from the LF part, as well as HF embeddings (X_{H_f}) from the HF part. Next, we feed X_{L_t} and X_{H_f} into a modified transformer with three encoders and decoders to obtain well-refined HF texture features. After that, we combine these well-refined HF texture features with the pre-extracted X_{L_c} to encourage the restoration of high-quality LDCT images with the assistance of piecewise reconstruction. Extensive experiments on the Mayo LDCT dataset show that our method produces superior results and outperforms other methods.

SharedIt: https://rdcu.be/cyhUx


# Reviews

### Review #1

• Please describe the contribution of the paper

In this paper, the authors propose a dual-path transformer-based neural network (TransCT) to enhance LDCT image quality. The transformer architecture effectively utilizes long-range dependencies between CT pixels. The dual-path structure can not only remove the image noise but also preserve the image content.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The authors creatively applied a transformer-based neural network to CT image denoising. Moreover, the authors designed an innovative dual-path network that removes image noise by combining the information in the high-frequency (HF) and low-frequency (LF) parts.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The validation of the model's performance is insufficient. The authors validated the model only on a simulated dataset, without cross-validation, and did not validate it on a real clinical dataset.

• Please rate the clarity and organization of this paper

Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors provided a lot of information about the structure and training of the network. However, some small details, such as the input size of the model, should be added to ensure the reproducibility of the paper.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

The paper creatively proposes a dual-path transformer-based neural network to denoise LDCT images. Overall, the idea is interesting and the results are promising. However, there are some concerns, as follows: 1) The major limitation of the paper lies in the insufficient validation results. The improvements in the quantitative results are limited compared with other CNN-based methods, especially since cross-validation was not conducted. Moreover, further validation on real clinical datasets should be carried out. 2) The ablation studies discuss only the influence of piecewise reconstruction and model size, not the effectiveness of the most critical components of the framework (the transformer and the dual-path network). The corresponding ablation experiments should be added. 3) In the Quantitative Analysis section, the PSNR metric mentioned is not among the results shown in Table 1; it seems that this metric should be replaced by RMSE. 4) In Fig. 2, the meaning of each subfigure should be explained. Also, the NDCT image should be added for a clearer comparison. In Fig. 4, the legend does not match the content of the figure.
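Point 3 turns on two closely related metrics. As a minimal sketch (not the authors' evaluation code), RMSE and PSNR can both be derived from the same squared error, which is why a result reported under the wrong name is easy to spot. The function names, the `data_range` parameter, and the pixel values are illustrative assumptions.

```python
import math

def rmse(reference, estimate):
    """Root-mean-square error between two equally sized flat pixel sequences."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    return math.sqrt(mse)

def psnr(reference, estimate, data_range):
    """Peak signal-to-noise ratio in dB; data_range is the assumed max - min intensity."""
    err = rmse(reference, estimate)
    if err == 0:
        return math.inf
    return 20.0 * math.log10(data_range / err)

ndct = [0.0, 100.0, 200.0, 300.0]   # hypothetical normal-dose pixel values
ldct = [10.0, 90.0, 210.0, 290.0]   # hypothetical denoised output
print(rmse(ndct, ldct))
print(psnr(ndct, ldct, data_range=400.0))
```

Lower RMSE is better while higher PSNR is better, so the two cannot be swapped in a table without also flipping the direction of comparison.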

borderline accept (6)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

In this paper, the idea is interesting and the results are promising. However, in the experiments, the validations are insufficient.

• What is the ranking of this paper in your review stack?

2

• Number of papers in your stack

5

• Reviewer confidence

Confident but not absolutely certain

### Review #2

• Please describe the contribution of the paper

In clinical diagnosis, reducing the dose of X-ray radiation leads to more noise, which degrades CT image quality and affects clinical diagnosis accuracy. To tackle this problem, the authors introduce a transformer-based neural network (TransCT) to enhance the final CT image quality. Specifically, the noisy low-dose computed tomography (LDCT) image is decomposed into high-frequency (HF) and low-frequency (LF) parts. The latent texture features (XLt) and content features (XLc) are extracted from the LF part, while the corresponding embeddings (XHf) are extracted from the HF part. The noise-free XLt can be exploited for noise removal in XHf. XLt and XHf are used as the inputs of the transformer encoder and decoder, respectively, to remove noise in XHf. XLc is then integrated with the output of the transformer decoder to reconstruct high-quality, high-resolution LDCT images piecewise, stage by stage. Extensive experiments show that the proposed method produces superior results.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

a) Unlike other CNN-based methods, which have a limited receptive field that perceives only local areas, the authors introduce the transformer module to explore large-range dependencies between LDCT pixels. b) The authors designed a dual-path model to process the high-frequency and low-frequency sub-bands and used the weakened texture information in the low-frequency sub-band to remove noise in the high-frequency part. c) The paper applies two ResNet blocks and two sub-pixel layers to reconstruct the high-quality, high-resolution LDCT image piecewise, restoring image detail more finely. d) Experimental results show that the proposed method can outperform other denoising methods for LDCT images. e) The network architecture and the objective functions are clearly written, and the writing is detailed throughout.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

a) In the “Introduction” section, the review of transformers is insufficient. Since the transformer architecture was proposed in 2017 (Vaswani et al. Attention is all you need. 2017.), much transformer-based research has been conducted in the image processing field. The proposed method is mainly based on the transformer, so the authors should give a more holistic overview rather than only three lines. b) It seems unreasonable to integrate the weakened texture features of HL and HF in a deep layer of the network rather than a shallow one, because in convolutional neural networks texture features are extracted in shallower layers, while content information is extracted in comparatively deeper layers. c) The reason why the transformer can explore large-range dependencies between LDCT pixels is not explained clearly. d) Unlike the “Conv+lrelu” used to extract the features of XL, the authors use sub-pixel layers to extract the features of XH. Why do the authors use sub-pixel layers? In addition, if the sub-pixel layer is superior, why not use it to obtain XLt as well? e) The “Experiments” section: There are only 10 patients in the dataset, so I wonder what validation method was applied and whether the experimental results are statistically meaningful. In the comparison experiments, the improvement of the proposed method is not obvious, especially compared with RED-CNN (2017); in addition, are there other methods that could be compared? In the ablation study, why not validate the effectiveness of the transformer module?

• Please rate the clarity and organization of this paper

Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

From the reproducibility checklist filled out by the authors, we can see that the authors would like to release all code related to this work, and the relevant dataset description and experimental settings are included in the submitted manuscript. Based on the above, this paper has good reproducibility.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

a) The authors should give a more detailed overview of transformers in the “Introduction” section. b) As described by the authors, the main purpose of using the transformer is that it can explore large-range dependencies, so why the transformer has this advantage should be explained clearly. c) The validation method applied should be described clearly, and the experiments should be shown to be statistically meaningful. As for the comparison experiments, more methods should be compared to show persuasively that the proposed method is superior, and the ablation experiments need to be redesigned to prove the effectiveness of the transformer. d) In Fig. 1, the “XLc2” is in the wrong position. e) In the “Loss Function” part, in “IND is the LDCT image”, the “IND” should be “ILD”. f) In the “Comparison with other methods” section, in “Fig 1 shows the results randomly selected from the testing dataset”, the “Fig 1” should be “Fig 2”. g) In the “Quantitative Analysis” part of the experiments, the “PSNR” in line 4 should be “RMSE”.

probably reject (4)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This paper has a reasonable structure and clear logic. The main contribution is that the authors propose a transformer-based neural network (TransCT) to enhance the final CT image quality, in which the dual-path model uses the weakened texture information in the low-frequency sub-band to remove noise in the high-frequency part. In the “Methods” section, the authors clearly describe the workflow of the whole network. The authors also explain a large number of formulas in the paper, which greatly enhances its persuasiveness. However, its disadvantages are also obvious. As mentioned before, the motivation for applying the transformer is not stated clearly, and the details of the transformer need to be better described. Another serious problem is that the comparison experiments are not sufficient to prove the superiority of the proposed method. In addition, the ablation experiments do not prove the effectiveness of the transformer. Therefore, I cannot give this paper a higher score.

• What is the ranking of this paper in your review stack?

5

• Number of papers in your stack

5

• Reviewer confidence

Very confident

### Review #3

• Please describe the contribution of the paper

The authors designed a dual-path Transformer framework for low-dose CT images. An input noisy image is decomposed into high-frequency (HF) and low-frequency (LF) components, and the features from the LF image are used to restore finer structures in the HF image.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

1. Apply the Transformer to reduce noise for low-dose CT.
2. Design a Dual-path architecture, which decomposes an input noisy image into high-frequency (HF) and low-frequency (LF) parts.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Some equations are not correctly formulated.

• Please rate the clarity and organization of this paper

Very Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The architecture of the model is complicated. I suggest making the code publicly available.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

1. Equation (4) does not make sense. It compares the normal-dose image with the uncorrected low-dose image, whereas it is the output image that should be evaluated. Besides, there is a typo: “I_ND is the LDCT image” -> “I_LD is the LDCT image”.

2. This work applied a Gaussian filter with a standard deviation of 1.5 to decompose a noisy CT image into HF and LF parts. I’m curious how SD=1.5 was chosen for the Gaussian filter. How does this parameter affect the final results?

3. Some descriptions are not rigorous. The first paragraph of Section 2 states that “there are also weakened latent textures in the LF part, which are noise-free.” In fact, noise contributes to every frequency, so the LF part also contains noise, albeit reduced.

4. It is better to present the method for each subfigure in the captions of Figs. 2 and 3.

5. The display windows are [-160, 240] HU for Fig.2 and [0, 200] HU for Fig.3. But it seems that Fig. 3 has a wider window. Please check the display window setting.

6. The caption for Fig. 4 is wrong.

7. “resnet” -> “ResNet”, give the full name “Leaky-ReLU” for the first “lrelu”.
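Item 1 can be made concrete. Equation (4) itself is not reproduced on this page, so the following is only a sketch under two assumptions: that the loss is a mean-squared error, and that the network is written as $F$ (a hypothetical symbol). Under those assumptions, the form the reviewer asks for compares the network output, not the raw input, with the normal-dose image:

```latex
\mathcal{L} = \left\| I_{ND} - F(I_{LD}) \right\|_2^2
```

Here $I_{ND}$ is the NDCT image and $I_{LD}$ is the LDCT input, matching the corrected naming in the same item.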

accept (8)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The novelty of this work is strong.

• What is the ranking of this paper in your review stack?

1

• Number of papers in your stack

3

• Reviewer confidence

Confident but not absolutely certain

# Primary Meta-Review

• Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The paper has novelty in using a Transformer-based NN for low-dose CT reconstruction and has several weaknesses and limitations pointed out by reviewers. The major weaknesses include the lack of sufficient validation and lack of ablation study on the transformer module. Reviewer 2 has pointed out important constructive feedback that needs to be addressed.

• What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

4

# Author Feedback

We appreciate the favorable comments on our interesting and novel method design (R1, R3). Below, we address the main issues about method evaluation. We will address other minor comments/constructive suggestions and release our code in the final version.

Q1: More validation (R1, R2) As per suggestions, we conducted 5-fold cross-validation for all methods to show the superiority of our method. The results (RMSE/SSIM/VIF) are:

| Method | RMSE | SSIM | VIF |
| --- | --- | --- | --- |
| LDCT | 37.167±7.245 | 0.822±0.053 | 0.079±0.032 |
| NLM | 25.115±4.54 | 0.908±.0.31 | 0.133±0.037 |
| RED-CNN | 22.204±3.89 | 0.922±0.025 | 0.152±0.037 |
| MAP-NN | 22.492±3.897 | 0.921±0.025 | 0.150±0.038 |
| Ours | 22.123±3.784 | 0.923±0.024 | 0.153±0.039 |

Our method achieves better results than other methods.
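The rebuttal does not say how the folds were formed. One common choice, sketched here as an assumption rather than as the authors' protocol, is to split at the patient level so that slices from one patient never appear in both the training and test sets. The patient IDs and the `patient_kfold` helper are hypothetical.

```python
import random

def patient_kfold(patient_ids, k=5, seed=0):
    """Yield (train_ids, test_ids) pairs, holding out whole patients per fold."""
    ids = sorted(patient_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]  # round-robin assignment to k folds
    for i in range(k):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test

# Hypothetical IDs for a 10-patient dataset, the size mentioned in Review #2.
patients = [f"P{n:02d}" for n in range(10)]
splits = list(patient_kfold(patients, k=5))
for fold, (train, test) in enumerate(splits):
    print(fold, "test:", test)
```

With 10 patients and 5 folds, each fold holds out 2 patients, and per-fold metrics can then be averaged to give mean ± standard deviation figures of the kind reported above.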

Limited by clinical ethics, we evaluated our method on existing clinical CBCT images from a real pig head for clinical validation. The tube current was 80 mA for NDCT and 20 mA for LDCT. The results are:

| Method | RMSE | SSIM | VIF |
| --- | --- | --- | --- |
| LDCT | 50.776±3.7 | 0.701±0.02 | 0.023±0.002 |
| NLM | 42.952±5.971 | 0.799±0.043 | 0.040±0.004 |
| RED-CNN | 37.551±5.334 | 0.861±0.03 | 0.066±0.006 |
| MAP-NN | 37.744±4.883 | 0.86±0.027 | 0.063±0.006 |
| Ours | 36.999±5.25 | 0.87±0.029 | 0.069±0.007 |

Our method outperforms others with superior robustness.

Q2: Ablation study on transformer module and dual-path module (R1, R2) We conducted more ablation studies to show the effectiveness of each module design. For the transformer module, we replaced it with a revised module (“Conv+3xResNet blocks”). We concatenated $X_{H_f}$ with the output of the fourth Conv layer (n128s2, before $X_{L_t}$) and fed the result into the revised module. The results on the validation dataset are 22.62±2.068/0.927±0.013/0.13±0.023, which is worse than our full method: 21.199±2.054/0.933±0.012/0.144±0.025. As for the dual-path module, we discarded the HF path and fed $X_{L_c2}$ into 3 transformer encoders, whose output is combined with $X_{L_c1}$ and $X_{L_c2}$ in the piecewise reconstruction stage. The results on the validation dataset are 21.711±1.997/0.931±0012/0.14±0.025, which is also worse than our full method.

Q3: Motivation of using transformer and the benefit (R2) Different regions within an image have similarities, which is beneficial to feature extraction. CNNs rely on cascaded Conv layers to expand the receptive field and extract high-level features, while it is not easy for CNNs to fully utilize the image similarity across regions on a large scale. By contrast, the self-attention-based transformer can model all pairwise interactions between image regions and capture long-range dependencies by computing interactions between any two positions, regardless of their positional distance.
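The pairwise-interaction argument can be illustrated with a minimal, dependency-free sketch of scaled dot-product self-attention. This is not the authors' module: it assumes identity query/key/value projections and omits multi-head attention and positional encodings, and the region embeddings are made up for illustration.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    tokens: list of n feature vectors (one per image region), each of length d.
    Returns (outputs, weights); weights[i][j] is how much region i attends to region j.
    """
    d = len(tokens[0])
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in tokens]
              for qi in tokens]
    weights = [softmax(row) for row in scores]
    outputs = [[sum(w * v[c] for w, v in zip(row, tokens)) for c in range(d)]
               for row in weights]
    return outputs, weights

# Four hypothetical region embeddings; regions 0 and 3 are similar but far apart.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens)
print(weights[0])
```

Region 0 weights region 3 as heavily as itself even though they sit at opposite ends of the sequence, because the score depends only on feature similarity and not on distance. A convolution would need several stacked layers before those two positions could interact.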

Q4: The integration of the weakened texture feature of HL and HF in the deep layer (R2) The main focus of our method is to extract the weakened texture features of HL from XL, which are used to restore finer structures in the HF part. Therefore, the extraction of weakened texture features of HL is essential. For XL, LF content features are used only to restore clean CT images in the piecewise reconstruction. Highly abstract LF content features from deeper layers may contribute to further improvement of image quality, which is one of our future research directions.

Q5: The reason to use sub-pixel layers for XH while not for XL (R2) Combining sub-pixel and Conv layer can expand the receptive field with fewer network parameters. It is suitable to obtain fixed-size $X_{H_f}$. For XL, we acquire not only $X_{L_t}$ but also the intermediate results ($X_{L_c1}$, $X_{L_c2}$) with different sizes for piecewise reconstruction. Since the sub-pixel layer can only generate fixed-size patches, it is not suitable for XL.
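For readers unfamiliar with sub-pixel layers, the rearrangement at their core can be sketched without any learned weights. The sketch assumes the usual pixel-shuffle convention with a single input channel and shows both directions. Which direction the authors apply at each point in the network is not stated on this page, and the accompanying Conv layer is omitted.

```python
def space_to_depth(image, r):
    """Rearrange an H x W image into r*r channels of size (H/r) x (W/r)."""
    h, w = len(image), len(image[0])
    if h % r or w % r:
        raise ValueError("height and width must be divisible by r")
    return [[[image[y * r + dy][x * r + dx] for x in range(w // r)]
             for y in range(h // r)]
            for dy in range(r) for dx in range(r)]

def depth_to_space(channels, r):
    """Inverse rearrangement: r*r channels of size h x w -> one (h*r) x (w*r) image."""
    if len(channels) != r * r:
        raise ValueError("expected r*r channels")
    h, w = len(channels[0]), len(channels[0][0])
    return [[channels[(y % r) * r + (x % r)][y // r][x // r] for x in range(w * r)]
            for y in range(h * r)]

image = [[row * 4 + col for col in range(4)] for row in range(4)]  # toy 4x4 "image"
channels = space_to_depth(image, 2)
restored = depth_to_space(channels, 2)
print(len(channels), len(channels[0]), len(channels[0][0]))
```

The rearrangement is lossless and has no parameters. After it, a small convolution sees pixels that were `r` apart in the original grid, which is consistent with the receptive-field argument above. The output size is tied to `r`, which is one way to read the remark that the layer yields fixed-size patches.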

Q6: Ablation study on the standard deviation (SD) of Gaussian filter (R3)

| SD | RMSE | SSIM | VIF |
| --- | --- | --- | --- |
| 1.0 | 21.297±2.037 | 0.933±0.012 | 0.144±0.025 |
| 1.5 | 21.199±2.054 | 0.933±0.012 | 0.144±0.025 |
| 2.0 | 21.201±2.025 | 0.933±0.012 | 0.144±0.025 |
| 2.5 | 21.229±2.024 | 0.933±0.012 | 0.143±0.025 |

As SD changes, performance does not change much.
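The decomposition whose SD is varied here can be sketched as a Gaussian low-pass filter plus a residual. The sketch is dependency-free and makes several assumptions not confirmed on this page: the HF part is taken as the image minus its blurred copy, the kernel is truncated at three standard deviations, and edge pixels are replicated. In practice an array library would be used instead of nested lists.

```python
import math

def gaussian_kernel(sigma, radius=None):
    """Normalized 1-D Gaussian kernel; radius defaults to ceil(3 * sigma)."""
    if radius is None:
        radius = math.ceil(3 * sigma)
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(image, kernel):
    """Convolve each row with the kernel, replicating edge pixels."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [[sum(k * row[min(max(x + i - radius, 0), width - 1)]
                 for i, k in enumerate(kernel))
             for x in range(width)]
            for row in image]

def transpose(image):
    return [list(col) for col in zip(*image)]

def decompose(image, sigma=1.5):
    """Split an image into a low-frequency part and a high-frequency residual."""
    kernel = gaussian_kernel(sigma)
    # Separable filter: blur rows, then blur columns via a transpose.
    low = transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))
    high = [[p - l for p, l in zip(prow, lrow)] for prow, lrow in zip(image, low)]
    return low, high

# Toy 8x8 image with a vertical edge; real inputs would be CT slices.
image = [[0.0] * 4 + [100.0] * 4 for _ in range(8)]
low, high = decompose(image, sigma=1.5)
print([round(v, 1) for v in low[0]])
```

By construction the two parts sum back to the input, so changing `sigma` only moves content between the LF and HF paths rather than discarding it.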

# Post-rebuttal Meta-Reviews

## Meta-review #1 (Primary)

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The authors have addressed the major concerns from the reviewers: validation, clinical evaluation, ablation study, and motivation for using the Transformer. The paper is deemed sufficient for publication in MICCAI, provided that the reviewers’ comments are addressed in the revised version for final submission.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

5

## Meta-review #2

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This paper addresses CT reconstruction and proposes a dual-path transformer network. The reviewers regard the network design as novel but have concerns about the experimental justification of the design, the explanation of why a transformer is used, the application to real data, etc. The authors presented new results in the response and answered most of these concerns. From the results, it seems that the proposed approach is only marginally better than RED-CNN (21.428±3.517 vs. 21.603±3.608 2). Considering that the variance is large, this improvement is not significant. This raises my concern about the effectiveness/improvement of the dual-transformer network (at least in the current version) compared with previous work. I stand at borderline reject for this work.
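The size of the gap relative to the spread can be put in numbers. The sketch below reads the two quoted figures as mean ± standard deviation (an assumption) and computes a Cohen's d-style standardized difference. It ignores the fact that both methods are scored on the same images. A paired test on per-image differences would be the more appropriate check, but the per-image data are not available on this page.

```python
import math

def standardized_difference(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d-style effect size using the root-mean-square of the two SDs."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)
    return (mean_b - mean_a) / pooled_sd

# Values quoted in the meta-review above, read here as mean ± standard deviation.
d = standardized_difference(21.428, 3.517, 21.603, 3.608)
print(round(d, 3))
```

The difference in means is 0.175 against a pooled spread of roughly 3.56, so the standardized gap is about 0.05, which is very small by conventional effect-size thresholds.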

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Reject

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

16

## Meta-review #3

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The authors have provided more experimental results as well as an ablation study, which have sufficiently improved the method evaluation.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

10