
# Authors

Atefeh Shahroudnejad, Xuebin Qin, Sharanya Balachandran, Masood Dehghan, Dornoosh Zonoobi, Jacob Jaremko, Jeevesh Kapur, Martin Jagersand, Michelle Noga, Kumaradevan Punithakumar

# Abstract

This paper presents a novel one-stage detection model, TUN-Det, for thyroid nodule detection from ultrasound scans. The main contributions are (i) introducing Residual U-blocks (RSU) to build the backbone of our TUN-Det, and (ii) a newly designed multi-head architecture comprised of three parallel RSU variants to replace the plain convolution layers of both the classification and regression heads. Residual blocks enable each stage of the backbone to extract both local and global features, which plays an important role in detection of nodules with different sizes and appearances. The multi-head design embeds the ensemble strategy into one end-to-end module to improve the accuracy and robustness by fusing multiple outputs generated by diversified sub-modules. Experimental results conducted on 1268 thyroid nodules from 700 patients, show that our newly proposed RSU backbone and the multi-head architecture for classification and regression heads greatly improve the detection accuracy against the baseline model. Our TUN-Det also achieves very competitive results against the state-of-the-art models on overall Average Precision ($AP$) metric and outperforms them in terms of $AP_{35}$ and $AP_{50}$, which indicates its promising performance in clinical applications. The code is available at: \url{https://github.com/Medo-ai/TUN-Det}.

SharedIt: https://rdcu.be/cyhMF


# Reviews

### Review #1

• Please describe the contribution of the paper

This paper proposes a one-stage detection network (TUN-Det) for nodule detection in thyroid ultrasound images. The authors use residual U-blocks in the backbone of the network to capture both local and global features. They also use three variants of residual U-blocks in the multi-head classification and regression modules, which act as a prediction ensemble. Validation of the proposed model on clinical thyroid ultrasound data shows better average precision at the 35% and 50% IoU thresholds, but worse at the 75% IoU threshold, than state-of-the-art approaches.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper proposes a one-stage object detection network using residual U-blocks, which shows promising nodule detection performance on challenging ultrasound images. It also uses three variants of the residual U-block in the classification and regression modules, which mimics an ensemble approach.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Since the proposed method shows worse average precision (AP) at the 75% IoU threshold as well as worse mean AP, a cross-validation performance check would better reflect the method's feasibility for detecting thyroid nodules. It is difficult to assess the model's usability from the current data splitting and validation scheme.

• Please rate the clarity and organization of this paper

Very Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

I believe that this method is reproducible if the authors followed the exact residual U-block architecture from ref. [19], as they do not discuss this architecture in detail in the paper.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

As mentioned earlier, the proposed method shows worse average precision (AP) at the 75% IoU threshold as well as worse mean AP; therefore, a cross-validation performance check would better reflect the method's feasibility for detecting thyroid nodules.

Probably accept (7)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
• The proposed method is technically sound and could be applied to other, similar problems.
• The paper is well written and provides good reasoning for the choices the authors made in the different components of their network.
• However, the validation performance is not significantly better than the state-of-the-art.
• What is the ranking of this paper in your review stack?

4

• Number of papers in your stack

5

• Reviewer confidence

Very confident

### Review #2

• Please describe the contribution of the paper

The paper proposes a novel one-stage detection model, applied to thyroid nodule detection in ultrasound images.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The use of ReSidual U-block (RSU) as a backbone of the model.
2. The development of a multi-head architecture for classification and regression tasks.
3. The paper is well written.
• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. The survey of the state-of-the-art in thyroid nodule detection is poor.
2. The proposed approach is tested on only one dataset, despite publicly available alternatives.
• Please rate the clarity and organization of this paper

Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The paper presents a clear description of the algorithm. However, a private dataset is used, and the parameter ranges explored during tuning are not mentioned.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
1. The main contributions of the paper are listed as two points in the abstract, while the introduction mentions three main objectives. Please clarify this.
2. If the main contribution of the paper is thyroid nodule detection, I suggest focusing more on this area in the Introduction. Currently, the main focus of the introduction is on general object detection methods, which is confusing.
3. The last paragraph of the Introduction states that "This strategy (WBF) is able to greatly improve the detection performance", but I cannot see any evidence for this statement in the Results section.
4. As the images in the dataset are collected from 12 different centers, it is important to mention the characteristics of the US machines. It is also important to know the main characteristics of the images (size, resolution, ...).
5. There is no information about the manual or automatic selection of images for the training, validation and testing sets.
6. It is unclear whether the images of each patient were confined to a single set. If this is not the case, the validation procedure is subject to overfitting.
7. Although the values of some hyperparameters are mentioned, the ranges of these parameters are missing in Section 3.2.
8. It is unclear whether the presented results are for validation or testing.
9. Regarding Table 2, are all the methods applied to the same dataset? Please clarify this.
10. The qualitative results (Figure 3) show the performance of 5 methods, while Table 2 presents the performance of 7 methods. I suggest adding examples of the missing methods to Figure 3.
11. A discussion of the main advantages of the proposed method over current approaches to thyroid nodule detection in US images is missing.

Probably reject (4)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The clinical contribution of the paper is poor. Because there is no comparison with the current state-of-the-art in thyroid nodule detection in US images, it is not clear how the proposed approach performs better or how it addresses current clinical needs. For example, the Qualitative Comparison section mentions that "TUN-Det can correctly detect the challenging case of a non-homogeneous large hypo-echoic nodule, while all other methods fail". This point is important from a clinical perspective, but it is not discussed in the paper before that section.

• What is the ranking of this paper in your review stack?

3

• Number of papers in your stack

5

• Reviewer confidence

Confident but not absolutely certain

### Review #3

• Please describe the contribution of the paper

The paper presents TUN-Det, a detection network for the task of Thyroid nodule detection in ultrasound scans. The novelty behind this network is that it uses Residual U-blocks as a backbone instead of using the traditional VGG or ResNet backbones. Additionally, the network includes multiple modules for bounding box classification and regression, each with a multi-head architecture that enables the creation of an ensemble strategy. The method is evaluated on a private dataset and reaches state-of-the-art performance in comparison with current detection methods.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper is well written and the clinical relevance of the problem is clearly stated. The use of Residual U-blocks as a backbone and as starting point for the multi-head modules is novel and could be translated to various medical detection problems. The dataset seems to be representative and thoroughly acquired and annotated. The results are comparable with other state-of-the-art detection methods and the qualitative results are impressive.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

One of the recurring themes throughout the paper is that two-stage methods and traditional ensemble strategies are inefficient and tend to have many parameters, making them unfeasible for practical application. However, the use of ten modules, each with three heads composed of Residual U-Blocks, suggests that the method presented in the paper is also relatively large in terms of parameters. Thus, the performance comparison should include the number of parameters of each model (including the ablations of each head in the modules) and the inference time.

The authors report the performance of each method on four metrics: AP, AP35, AP50 and AP75. In particular, AP35 is not one of the thresholds included in the average AP, which covers IoU thresholds from 50% to 95%. Thus, the claim that AP35 is a better metric for practical applications seems unjustified.

The ablation results in Table 1 suggest that the CoordConv and BiFPN heads do not give a significant increase in performance. It would be interesting to see whether they provide a good trade-off between performance and number of parameters.

• Please rate the clarity and organization of this paper

Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The method is evaluated on a private dataset that will not be made public upon acceptance, according to the reproducibility checklist. The authors state that their training and evaluation code will be public, while their pretrained models will remain private. The selected hyperparameters are mentioned, but no sensitivity analysis is included. Details of the model's implementation and clinical relevance are included.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

Clarity of the text:

• Typo on the third line of the abstract: build instead of built.
• At the end of the second paragraph of the introduction: "while requires less time costs" should be "while requiring less time."

Overall comments:
• It is not clear if the advantage of being a one-stage model is outweighed by using multiple modules with multiple heads each.
• One of the most important parts of the paper is how the outputs from the multiple blocks are combined. Thus, it would be relevant to include a parameter sensitivity analysis for the parameters mentioned in section 2.3.
• The description of the metrics that is in the caption of Table 1 should be in the Dataset and Evaluation Metrics section.

Probably accept (7)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper presents a problem of medical relevance and it is well written. I would advise acceptance if the authors provide a further comparison of model size and inference time and a clear justification for their preferred evaluation metrics.

• What is the ranking of this paper in your review stack?

2

• Number of papers in your stack

7

• Reviewer confidence

Very confident

# Primary Meta-Review

• Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

All reviewers agree on the relative novelty of the method and on its potential applicability to other problems, as well as on the quality of the writing. However, reviewers also express major concerns about the paper, which should be thoroughly addressed in the rebuttal, in particular: (1) all experiments are conducted on a single private dataset, while there are publicly available datasets for Thyroid nodule detection (R1, R2, R3), (2) literature review and comparison against state-of-the-art are poor (R2), (3) drop in performance for high overlaps (R1, R2, R3) and (4) lack of model details (R3).

• What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

5

# Author Feedback

We thank the reviewers for their constructive remarks and are happy to see that all agree our paper presents a novel method (R1,R2,R3) that reaches state-of-the-art performance (R1,R3) with impressive qualitative results (R3). Major comments are addressed below. Minor typos will be fixed directly.

*AC,R1,R2: Experiments not tested on a public dataset. To the best of our knowledge, none of the existing public datasets are suitable for the proposed task (i.e. automated nodule detection). One public dataset (DDTI, with 480 images) is designed for nodule classification, where most images are cropped to contain one nodule. Another dataset (the TN-SCUI2020 challenge at MICCAI) focuses on segmentation and contains only one nodule per image; TN-SCUI2020 also prohibits the use of its data outside the challenge. No pre-cropping is applied to our data, making our study suitable for a clinical application. Many other papers use private data, and the lack of public data is raised as one of the challenges [1,2]. We intend to make our dataset public upon receiving ethics approval from our institution, and to release the code on GitHub upon acceptance.

*AC,R2: Literature review and comparison with SOTA are not focused on NODULE detection models.
We reviewed the nodule detection literature and recent surveys [1,2]. Most of the papers, except for [3,4], used existing object detection models for nodule detection and do not propose a novel approach. Unfortunately, the code for [3,4] is not released. Thus, we compared our model with SOTA detection models. We will cite the missing references.

[1] A review of thyroid gland segmentation and thyroid nodule segmentation methods for medical ultrasound images, 2020
[2] Deep learning on ultrasound images of thyroid nodules, 2021
[3] Multitask cascade CNNs for automatic thyroid nodule detection and recognition, 2018
[4] Automated detection and classification of thyroid nodules in ultrasound images using clinical-knowledge-guided CNNs, MIA 2019

*AC,R1,R2,R3: Performance drop at high overlaps (better AP at the 35 and 50 IoU thresholds, worse at 75). While AP (avg) is a popular metric for benchmarking methods, in practice we seek a single threshold for producing final detection results in real-world clinical applications. According to the experiments, our model achieves the best performance under several IoU thresholds (e.g. 35, 50, 60), which means our model is more applicable to the clinical workflow.

In the context of thyroid nodule detection, the priority is to minimize false negatives (FN) and not miss any nodules. At an IoU threshold of ≥75, YoloV5 misses many nodules, yielding low Recall with high Precision. This is unacceptable, as it would miss many cancers. It is crucial to instead have high Recall with lower Precision, which is how our model behaves. An IoU threshold of 75 or higher is not practical, because it requires more than 75% overlap between the predicted and ground-truth bounding boxes for a prediction to count as a nodule. This overly strict criterion results in excessive FN. Instead, our interest is to detect and obtain approximate bounding boxes for most nodules, rather than tighter bounding boxes; this is helpful in post-processing for nodule segmentation. To further elaborate, we compare the #params (M), model size (MB), inference time (ms), and AP & Recall (at 75 IoU) of our model with YoloV5.

| Model    | #params (M) | Size (MB) | Time (ms) | AP   | Recall |
|----------|-------------|-----------|-----------|------|--------|
| YoloV5   | 7.3         | 15        | 12        | 50.9 | 40.3   |
| Baseline | 31.3        | 126       | 27        | 41.6 | 42.2   |
| Ours     | 39.2        | 157       | 94        | 45.5 | 45.5   |

At the 75 IoU threshold, although our model is inferior in terms of AP, it does a better job in terms of FN (our Recall is 45.5 vs 40.3 for YOLOv5). There is a trade-off between speed and accuracy in nodule detection applications, and reducing FN is our first priority.
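The IoU criterion discussed above can be made concrete with a minimal sketch (this is not the paper's code; the box coordinates are hypothetical, chosen only to show how a modest localization shift passes the 35 and 50 thresholds but fails at 75):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (clamped to zero width/height if the boxes are disjoint)
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A prediction shifted horizontally by a quarter of the box width still
# overlaps the ground truth substantially, yet its IoU is only 0.6:
# a true positive at the 35 and 50 thresholds, a false negative at 75.
gt   = (0, 0, 100, 100)
pred = (25, 0, 125, 100)
print(iou(gt, pred))  # → 0.6
```

Under a strict 75 threshold such roughly-correct detections are discarded, which is the mechanism behind the excessive false negatives described above.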

*AC: Lack of model details. Architecture details and #filters are provided in Figs. 1 and 2. All parameters in Secs. 2.3 and 3.2 are set to default values. The lambdas are all set to 1, as the focus of this work was not parameter tuning.

# Post-rebuttal Meta-Reviews

## Meta-review # 1 (Primary)

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The rebuttal adequately allays the main concerns expressed by the reviewers. The public availability of code and dataset significantly increases the impact of this work on the community. The final version should include all reviewer suggestions and comments.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

10

## Meta-review #2

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The authors have satisfactorily addressed the major concerns in the rebuttal. The camera-ready version should incorporate the justifications and updates provided in the rebuttal.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

8

## Meta-review #3

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

Overall, this paper received good reviews on novelty and on quantitative and qualitative performance. The private dataset is fine (not ideal, but understandable). The problem setup reflected by this dataset, together with the novelty and good performance, weighs more favorably. Comparing with (b) FCOS, (c) RetinaNet and (d) YOLOv5 is acceptable, and the 700-patient study is large enough. The rebuttal was nicely done, adequately addressing the original reviews.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

5