
Authors

Dazhou Guo, Xianghua Ye, Jia Ge, Xing Di, Le Lu, Lingyun Huang, Guotong Xie, Jing Xiao, Zhongjie Lu, Ling Peng, Senxiang Yan, Dakai Jin

Abstract

Lymph node station (LNS) delineation from computed tomography (CT) scans is an indispensable step in the radiation oncology workflow. High inter-user variability across oncologists and the prohibitive cost of manual labeling motivate our automated approach. Previous works exploit anatomical priors to infer LNS based on predefined ad-hoc margins; however, without voxel-level supervision, their performance is severely limited. Because LNS is highly context-dependent, with boundaries constrained by anatomical organs, we formulate delineation as a deep spatial and contextual parsing problem via encoded anatomical organs. This permits the deep network to better learn from both CT appearance and organ context. We develop a stratified referencing organ segmentation protocol that divides the organs into anchor and non-anchor categories and uses the former's predictions to guide the segmentation of the latter. We further develop an auto-search module to identify the key organs that yield the optimal LNS parsing performance. Extensive four-fold cross-validation experiments on a dataset of 98 esophageal cancer patients (with the most comprehensive set of 12 LNSs + 22 organs in the thoracic region to date) are conducted. Our LNS parsing method produces significant performance improvements, with an average Dice score of 81.1%±6.1%, which is 5.0% and 19.2% higher than the pure CT-based deep model and the previous representative approach, respectively.
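
For readers unfamiliar with the encoded-organ formulation, the following is a minimal sketch (not the authors' implementation; the tensor shapes, the 6-organ count, and the stub network are illustrative assumptions) of how predicted organ masks can be stacked with the CT volume as extra input channels for an LNS parsing network:

    import torch
    import torch.nn as nn

    # Hypothetical tensors: one CT channel plus 6 predicted binary organ masks, stacked as
    # extra input channels (small patch sizes chosen for illustration only).
    ct = torch.randn(1, 32, 128, 128)                            # CT patch: (1, D, H, W)
    organ_masks = (torch.rand(6, 32, 128, 128) > 0.5).float()    # key-organ masks: (6, D, H, W)

    lns_input = torch.cat([ct, organ_masks], dim=0).unsqueeze(0) # (1, 7, D, H, W): CT + organ context

    # Any 3D segmentation backbone with in_channels=7 could consume this tensor;
    # a single 1x1x1 convolution stands in for the full network here.
    stub_head = nn.Conv3d(in_channels=7, out_channels=13, kernel_size=1)  # 12 LNS classes + background
    logits = stub_head(lns_input)
    print(logits.shape)                                          # torch.Size([1, 13, 32, 128, 128])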

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87240-3_1

SharedIt: https://rdcu.be/cyl5s

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents a method for lymph node station segmentation from CT scans. The method consists of three steps. First, 22 organs are annotated and a deep learning scheme learns their segmentation maps using two different networks, separating the organs into anchor and non-anchor ones. Then, a key-organ auto-search identifies the most informative organs, and lastly a deep learning network produces the final segmentations of the different lymph node stations. The method is evaluated on a private dataset including 12 LNS.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper deals with a very important and challenging problem.
    • The authors base their methodology on state of the art methods.
    • The produced segmentations are evaluated with 3 different metrics.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors propose a pipeline which is composed of different modules trained one after the other.
    • Some of the authors' decisions are not really justified, making it difficult to judge the soundness of the method. For example, training two different architectures for the anchor and non-anchor organs is not empirically evaluated, while it increases the complexity of the algorithm.
    • One very important point that is not clear in the paper is how the evaluation is performed. The authors mention 5-fold cross-validation, but they do not really describe their testing cohort. Reporting the performance of the proposed method only through cross-validation is not good practice, as it does not clearly establish the performance of the method on an unseen test set.
    • The authors include comparisons only with traditional methods.
    • The computational complexity of the method is not really discussed in the paper.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The evaluation of the paper is based on a private dataset which is not available. Code is not included in the submitted version; however, the implementation details are clearly presented in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • The authors should include more experiments to highlight the need of each component of the method.
    • Computational times (training, inference) and parameters of the proposed method should be presented and discussed in the paper.
    • It is not very clear whether the stratified organ segmentation component is trained end-to-end or sequentially. An experiment training a single nnU-Net on all 22 organs needs to be added to justify the use of the two different nnU-Nets.
    • The authors should include and discuss the selected components of the nnU-Net (losses, architecture, etc.).
    • Validation on an external cohort could better highlight the performance of the method. I would suggest that the authors check whether they can use [1] for additional validation by downloading the publicly available Lymph Node dataset from TCIA.

    [1] A Seff, L Lu, A Barbu, H Roth, HC Shin, RM Summers. Leveraging Mid-Level Semantic Boundary Cues for Automated Lymph Node Detection. Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, 53-61

  • Please state your overall opinion of the paper

    borderline reject (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • The paper presents a pipeline for LNS segmentation composed of different steps but their justification is not clear.
    • The evaluation protocol is not really clear, introducing some questions about the performance of the proposed method.
    • The computational complexity of the method is not discussed in the current version.
  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    The manuscript introduces a deep learning workflow for the segmentation of thoracic lymph node stations (LNS) from CT images. The proposed approach, namely DeepStationing, segments 22 reference organs and uses them as additional input channels for the LNS segmentation network. Both the reference-organ and LNS segmentations are performed with the help of nnU-Net, with automated hyperparameter searching and key reference-organ selection.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Clear motivation with clinical impact.
    2. High segmentation accuracy.
    3. Effectiveness in proposed auto-search.
    4. The manuscript is presented very clearly.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Ablation study could be more complete.
    2. More baseline solutions could be compared.
    3. Dataset is on the small side.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Although it is not claimed that the code will be shared, the manuscript gives a clear description of the implementation, which would make it feasible to reproduce the work even without exactly the same dataset.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    The major strengths of the manuscript include the following:

    1. The automatic segmentation of thoracic lymph node stations is an important yet under-studied problem in clinical practice.
    2. The proposed method showed its effectiveness by achieving a high level of accuracy in LNS segmentation, even for LNS that are difficult to delineate.
    3. Comparison with approaches using all reference organs demonstrates the effectiveness of auto-search.

    The manuscript could improve by addressing the following questions:

    1. The manuscript would need more complete ablation studies to determine which elements of the workflow are the key contributors to the performance. If results obtained using selected organs are better than those obtained using all ground truth organ segmentations, does it mean the selection of references is more important than segmenting the organs with high accuracy?
    2. Would a rough segmentation or localization of the reference organs be sufficient to assist the LNS segmentation step?
    3. To better understand the leading reason for the performance enhancement, additional comparisons could be made, including: randomly select a subset of N organs, with N being a parameter to assess; select the top N organs by segmentation accuracy; select only the anchor organs; select the top N organs by size ranking; select the top N organs by relative distance to each LNS.
    4. More hyperparameters output by the nnU-Net could be reported, e.g. the top 6 anchor organs selected and the patch size.
    5. Could the approach be assessed by turning the organ segmentation masks into distance transformations?
    6. From the descriptions in Section 2.2, the selection of reference organs seems to be a hard (not soft) decision out of the auto-ML. Would there be ways to apply heavier learnable penalization on the channel weights than using softmax, e.g. a mechanism like channel-based attention? (A minimal sketch of such a gate follows this list.)
    7. It would be useful to assess the performance of the approach on patients with lymph node metastasis in the thoracic region, as well as metastatic solid tumors.
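
    Regarding point 6, the following is a minimal sketch of the kind of soft, learnable channel re-weighting being suggested, assuming an SE-style gate applied to 22 organ-context channels. It illustrates the suggestion only; it is not the mechanism used in the paper, which performs a hard auto-search over organs.

        import torch
        import torch.nn as nn

        class OrganChannelGate(nn.Module):
            """SE-style channel attention over organ-context channels (illustrative only)."""
            def __init__(self, num_channels: int, reduction: int = 2):
                super().__init__()
                self.pool = nn.AdaptiveAvgPool3d(1)            # squeeze: one value per channel
                self.fc = nn.Sequential(
                    nn.Linear(num_channels, num_channels // reduction),
                    nn.ReLU(inplace=True),
                    nn.Linear(num_channels // reduction, num_channels),
                    nn.Sigmoid(),                              # soft, learnable per-channel weights
                )

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                b, c = x.shape[:2]
                w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1, 1)
                return x * w                                   # re-weight organ channels softly

        # Hypothetical usage: gate the 22 organ-mask channels before the LNS segmentor.
        gate = OrganChannelGate(num_channels=22)
        organ_context = torch.rand(1, 22, 32, 128, 128)
        print(gate(organ_context).shape)                       # torch.Size([1, 22, 32, 128, 128])
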
  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The technical novelties are not strong enough to put the manuscript in the highest buckets but the performance of the approach is strong and the presentation is very clear. Experiments could be more complete.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #3

  • Please describe the contribution of the paper

    The paper presents an automated method for LNS detection and automated organ search. The method achieves significant improvement in DSC score.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Anchor and non-anchor organs are used in a way that improves performance. The model reports significant improvements in DSC.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Three instances of the baseline network are used in the pipeline; what would this mean in terms of training time and actual clinical utilization?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The annotations should be made public if possible for reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Three instances of the baseline network are used in the pipeline; what would this mean in terms of training time and actual clinical utilization?

    In general the paper presents a comprehensive set of experiments. However, since the annotations are performed by a radiologist, did the authors consider handling bias that could be introduced during the annotation process?

  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The methodology and experimental results are convincing and improve upon the current state-of-the-art.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper introduces a deep learning workflow for the segmentation of thoracic lymph node stations (LNS) from CT images. The strengths of the paper include: 1) high performance; 2) the efficient auto-search scheme; 3) clear presentation. The points that should be addressed in the rebuttal are: 1) the small dataset; 2) justification of the adopted method; 3) baseline comparison.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    5




Author Feedback

We thank all reviewers and the AC for their constructive reviews, and for acknowledging that our paper addresses a challenging and important problem, presents an effective method with high performance, and is clearly written.

Q1: Small dataset: Our dataset has 98 esophageal cancer patients, half of whom have metastatic lymph nodes (LNs). We agree that our dataset may be relatively small, but to the best of our knowledge it is the largest LN stationing dataset reported to date (the previous largest dataset has 70 patients [7]). Considering the ~4 hours of manual labeling per patient, we plan to gradually include more patients in the future. Meanwhile, 98 patients under a 4-fold cross-validation (CV) setting should be quite reasonable and acceptable. The TCIA dataset (suggested by R1) only contains LN labels and has no LN station (LNS) annotations, which makes it ineligible for use in our work.

Q2: Justification of the proposed method: We supplement the experimental results by segmenting all organs using a single nnU-Net. The average DSCs of the anchor, non-anchor, and all organs are 86.4%, 72.7%, and 80.8%, which are 3.6%, 9.4%, and 5.7% lower than the stratified version, respectively. Our stratified organ segmentation is trained in an end-to-end fashion and demonstrates good novelty. Following R2's suggestions, we further include 6 ablation studies that segment LNS using: (1) 6 randomly selected organs; (2) the top-6 organs by organ segmentation accuracy; (3) the anchor organs; (4) 6 organs recommended by senior oncologists; (5) predictions of the 6 searched organs from the less accurate non-stratified organ segmentor; (6) ground truth of the 6 searched organs.

- The randomly selected 6 organs are: V.BCV (L), V.pulmonary, V.IJV (R), heart, spine, trachea.
- The 6 organs with the best segmentation accuracy are: lungs (R+L), descending aorta, heart, trachea, spine.
- The 6 organs recommended by the oncologists are: trachea, aortic arch, spine, lungs (R+L), descending aorta.

The DSCs for setups (1-6) are 77.2%, 78.2%, 78.6%, 79.0%, 80.2%, and 81.7%; the HDs are 19.3 mm, 11.8 mm, 12.4 mm, 11.0 mm, 10.1 mm, and 8.6 mm, respectively.
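
For context, the DSC and HD figures above can be computed per structure roughly as follows. This is a sketch assuming binary 3D masks and known voxel spacing; the exact HD variant and spacing handling used in the paper are not specified here.

    import numpy as np
    from scipy.spatial.distance import directed_hausdorff

    def dice_score(pred: np.ndarray, gt: np.ndarray) -> float:
        """Dice similarity coefficient between two binary 3D masks."""
        inter = np.logical_and(pred, gt).sum()
        denom = pred.sum() + gt.sum()
        return 2.0 * inter / denom if denom > 0 else 1.0

    def hausdorff_mm(pred: np.ndarray, gt: np.ndarray, spacing=(1.0, 1.0, 1.0)) -> float:
        """Symmetric Hausdorff distance between the foreground voxel sets, in millimetres
        (exact HD on full point sets; slow for large masks, illustration only)."""
        p = np.argwhere(pred) * np.asarray(spacing)
        g = np.argwhere(gt) * np.asarray(spacing)
        return max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])

    # Tiny usage example with random binary masks.
    pred = np.random.rand(8, 32, 32) > 0.5
    gt = np.random.rand(8, 32, 32) > 0.5
    print(dice_score(pred, gt), hausdorff_mm(pred, gt, spacing=(3.0, 1.0, 1.0)))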

In comparison to the LNS predictions using only CT images (DSC=76.1%, HD=22.6mm), the ablation studies demonstrate that using supporting organs for LNS segmentation is the key contributor to the performance gain, and that the selection and quality of the supporting organs are the main factors in the performance boost; e.g., the results of setups (5) and (6) show that better searched-organ delineation helps achieve superior LNS segmentation performance. We will elaborate on this fully in the final version.

Q3: Baseline comparison: We reimplemented and tested the most recent leading approach to LNS segmentation discussed in [7]. The methods reported in [2, 8, 14] are not publicly available, and their station-wise parameter tuning makes them very hard to reimplement. We additionally tested 2 backbone networks, 3D PHNN (a 3D UNet with a light-weight decoding path) and 2D UNet, as new baseline comparisons. The DSCs of 3D PHNN and 2D UNet are 79.5% and 78.8%, respectively. The presumed reason for the performance drop is the loss of boundary precision and 3D information. R2 suggested using channel-based attention for the organ auto-selection. We agree that attention could be a good direction (out of scope here), and we leave it for future work.

Q4: R1 had concerns about the unseen testing data: In k-fold CV, the original sample set is randomly partitioned into k equal-sized subsets. Of the k subsets, a single subset is retained as the unseen data for testing, and the remaining non-overlapping k-1 subsets are used for training and validation. The testing data of each fold is therefore unseen during training.
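
To make this protocol concrete, below is a minimal sketch of patient-level 4-fold CV; the patient IDs are hypothetical and scikit-learn's KFold is one standard way to produce the non-overlapping splits.

    from sklearn.model_selection import KFold

    patient_ids = [f"case_{i:03d}" for i in range(98)]   # hypothetical IDs for the 98 patients

    kfold = KFold(n_splits=4, shuffle=True, random_state=0)
    for fold, (train_val_idx, test_idx) in enumerate(kfold.split(patient_ids)):
        test_patients = [patient_ids[i] for i in test_idx]            # unseen in this fold
        train_val_patients = [patient_ids[i] for i in train_val_idx]  # training + validation
        # Train on train_val_patients, evaluate once on test_patients;
        # across the 4 folds every patient is tested exactly once.
        print(f"fold {fold}: {len(train_val_patients)} train/val, {len(test_patients)} test")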

Q5: nnU-Net details: The patch size is 256x256x64 and the batch size is 8. The architecture is the 3D full-resolution configuration with DSC+CE losses. The average training time is 2.5 GPU days, and the average inference time is 3 minutes.
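
As an illustration of the "DSC+CE" objective mentioned above, a generic soft-Dice plus cross-entropy loss is sketched below; this is a common formulation, not necessarily nnU-Net's exact implementation.

    import torch
    import torch.nn.functional as F

    def dice_ce_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
        """Soft-Dice + cross-entropy for multi-class 3D segmentation.
        logits: (B, C, D, H, W) float scores; target: (B, D, H, W) integer class labels."""
        ce = F.cross_entropy(logits, target)

        probs = torch.softmax(logits, dim=1)
        one_hot = F.one_hot(target, num_classes=logits.shape[1])      # (B, D, H, W, C)
        one_hot = one_hot.permute(0, 4, 1, 2, 3).float()              # (B, C, D, H, W)

        dims = (0, 2, 3, 4)                                           # sum over batch and space
        inter = (probs * one_hot).sum(dims)
        denom = probs.sum(dims) + one_hot.sum(dims)
        soft_dice = (2.0 * inter + eps) / (denom + eps)               # per-class soft Dice
        return ce + (1.0 - soft_dice.mean())

    # Tiny usage example with random data (13 classes = 12 LNS + background).
    logits = torch.randn(2, 13, 8, 32, 32)
    target = torch.randint(0, 13, (2, 8, 32, 32))
    print(dice_ce_loss(logits, target).item())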




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper presents a deep learning workflow for the segmentation of thoracic lymph node stations (LNS) from CT images, together with an auto-search method to identify the key organs. The authors conducted extensive comparison studies and demonstrated high performance. The rebuttal sufficiently addresses the small-dataset concern, although a bigger dataset would still be desirable.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    1



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The reviews highlight that this paper proposes a very efficient pipeline. With additional experiments, the authors are able to experimentally justify their approach, which was one of the main weaknesses noted by the reviewers. I found that the authors' rebuttal reasonably responds to the concerns raised by the reviewers, and hence I recommend acceptance of this paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    10



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Reviewers recognized the clinical motivation and the reasonable method proposed. Major concerns included details of the whole pipeline, justification of the different components (ablation), and that it would be better if the method could be evaluated on open datasets. In the rebuttal, the authors provided more details regarding the dataset and added more experiments (which is not appropriate for a rebuttal). I would say the rebuttal is not a good one. Nevertheless, considering the overall quality and contribution, I still side with the judgement of acceptance.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    5


