
# Authors

Kazuya Nishimura, Hyeonwoo Cho, Ryoma Bise

# Abstract

Cell detection is the task of detecting the approximate positions of cell centroids from microscopy images. Recently, convolutional neural network-based approaches have achieved promising performance. However, these methods require a certain amount of annotation for each imaging condition. This annotation is a time-consuming and labor-intensive task. To overcome this annotation problem, we propose a semi-supervised cell-detection method that effectively uses time-lapse images. Our method can improve detection performance using one sequence with one labeled image and the other images unlabeled. We select high-confidence positions from the detection results of the detection network that is trained with the one labeled image. Then, we generate pseudo-labels from the selected high-confidence positions. We evaluated our method for six conditions of public datasets, and we achieved the best results relative to other semi-supervised methods.

SharedIt: https://rdcu.be/cymar

# Reviews

### Review #1

• Please describe the contribution of the paper

A semi-supervised cell detection model for time-lapse microscopy image sequences is proposed in this work. First, a detection model is trained with a labeled frame; then, based on tracking the estimated cell positions on the unlabeled frames, pseudo labels are generated for retraining the detection model.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

First, this work explores a new application. This work proposes a semi-supervised cell detection model for a time-lapse microscopy image sequence in which only one frame is labeled while the others are unlabeled. In this scenario, the labeled image and the unlabeled ones have different cell distributions. Second, the idea of this work is interesting. After training a detection model with the labeled image and estimating the cell positions on the unlabeled frames with this detection model, a tracking method is used to help generate the pseudo labels. This is relatively novel.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

(1) The clarity of this paper is poor. Some details are missing, and some descriptions are not accurate. These make the paper hard to understand.

(a) Page 4, in the caption of Fig. 2: “we generate pseudo labels and masks from high-confidence detected results that is selected by the tracking”, and the authors use a one-by-one matching strategy to track cells. How is the tracking performed? E.g., how is the matching between two cells defined? Is the tracking conducted from frame t to frames t+1, t+2, t+3, ..., or from frame t to frame t+1, then from frame t+1 to frame t+2, from frame t+2 to frame t+3, ...?

(b) Page 4, in the “cell detection” subsection, what does the arrow above p_{t} mean?

(c) In the first paragraph of Page 5, what does the last sentence mean? What is the definition of “tracked position ratio”?

(d) In the second paragraph of Page 5, when defining the first type of unreliable region, why do the authors only consider frames after the labeled frame? This type of region can also appear in previous frames, e.g., when a cell goes out of the observation scope.

(e) In the third paragraph of Page 5: “The second is a region that is not detected but there is a cell (miss detection). If detected points in a certain region are continuously tracked from frame l until the previous frame t_{ut}”. If detected points in a certain region are continuously tracked, why are these points regarded as miss detections, since they are already detected?

(f) When defining the unreliable regions in Eq. (1), why use different radii for the two types of unreliable regions? And for the second type of unreliable region, why does the radius vary with the temporal information t?

(g) In the last paragraph of Page 5: “where i is a coordinate”. What is “i”?

(2) The evaluation of the work is not solid.

(a) In this paper, the F1 score is adopted to evaluate the cell detection model. Usually, AP or mAP is used to evaluate detection models.

(b) It seems that the design of the region mask is a main contribution of this work; however, there is no ablation study on this mask.

(c) In the last paragraph of Page 7: “Fig. 4 shows the average F1 score of the three datasets for each frame in the training data.” Why compute the F1 score on the training data instead of the testing data?

(d) Fig. 4 shows that from frame 20 to frame 1, the F1 score first decreases and then increases. Do the authors have any idea about the possible reason? Any analysis?

(e) In this paper, the proposed detection model is compared with a semi-supervised key-point localization method and a cell segmentation method. As a tracking strategy is adopted in this work, it would be better to also compare with some semi-supervised cell tracking methods.

(3) The writing of this paper needs to be improved. There are many grammatical errors and typos in the paper.
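To make question (1)(a) concrete, the following is a minimal sketch of one possible frame-to-frame scheme (match frame t to t+1, then t+1 to t+2, and so on). It is not taken from the paper: the greedy closest-pair rule, the `max_dist` threshold, and the function names `match_one_to_one` and `track_from_labeled` are all assumptions made for illustration.

```python
import math


def match_one_to_one(prev_points, next_points, max_dist):
    """Greedily pair points between two frames, closest pairs first.

    Each point is used at most once, and pairs farther apart than
    max_dist are rejected. Returns a list of (prev_index, next_index).
    """
    candidates = []
    for i, (px, py) in enumerate(prev_points):
        for j, (nx, ny) in enumerate(next_points):
            d = math.hypot(px - nx, py - ny)
            if d <= max_dist:
                candidates.append((d, i, j))
    candidates.sort()  # shortest distances are considered first

    used_prev, used_next, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used_prev and j not in used_next:
            used_prev.add(i)
            used_next.add(j)
            pairs.append((i, j))
    return pairs


def track_from_labeled(frames, max_dist):
    """Chain frame-to-frame matches, starting at frames[0] (the labeled frame).

    Returns, for every frame, the indices of the points that are still
    linked to some point of the labeled frame.
    """
    alive = list(range(len(frames[0])))
    tracked = [alive]
    for prev_points, next_points in zip(frames, frames[1:]):
        links = dict(match_one_to_one(prev_points, next_points, max_dist))
        alive = [links[i] for i in alive if i in links]
        tracked.append(alive)
    return tracked
```

Under this assumed scheme, a detection that cannot be linked back to the labeled frame (for example, a newly appearing false positive) never enters the tracked set, which is the kind of detail the reviewer asks the authors to spell out.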

• Please rate the clarity and organization of this paper

Poor

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Based on the reproducibility checklist, I think this work is reproducible.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

(1) The writing of this paper needs to be improved. More details should be included to clearly present this work. Please refer to the weaknesses for more details.

(2) More experiments are needed to better demonstrate the effectiveness of the proposed detection model, such as an ablation study on the designed region mask, an ablation study on some important hyper-parameters (e.g., the ratio of the tracked positions, alpha), and a comparison with semi-supervised cell tracking algorithms.

reject (3)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The idea of this work is interesting and relatively novel. However, considering the poor clarity and insufficient evaluation of this work, I think this paper is not good enough to be accepted.

• What is the ranking of this paper in your review stack?

5

• Number of papers in your stack

5

• Reviewer confidence

Very confident

### Review #2

• Please describe the contribution of the paper

The proposed framework interactively improves its detection network by including pseudo labels from adjacent frames in a time-lapsed sequence of images. The labels are generated by tracking cells across similar frames and using preliminary detection results from the network to mask out unreliable estimation. Results show that the proposed method outperforms a few baseline methods by a notable margin.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

See below. Will fill after rebuttal.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

See below. Will fill after rebuttal.

• Please rate the clarity and organization of this paper

Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The results seem reproducible. The checklist is consistent with the paper.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

I haven’t found major weaknesses of this paper but I have to admit that I’m not an expert on cell image processing.

Some minors:

[11] Li et al. seems one of the most similar works but it’s not a baseline method.

It’s not fair to compare this work with [13] Moskvyak et al. because [13] is not tuned for cell images and it’s not clear whether the network architectures between this paper and [13] are comparable.

Page 3, the authors’ argument that a bounding box-based detection model is not suitable for this study is not convincing. “The cost of annotating a bounding box of a cell is expensive since a cell has deformed shape with blurry boundaries.” [11] Li. et al. is an example.

Page 2 first paragraph last sentence, “Using consistency loss … shows that …”. This statement needs a reference because it’s not shown in this paper.

Page 3 last paragraph, “$\mathcal{X}=\{x_t\}^T_{t=0}$…T is the number of frames”. Either T-1 or t=1, otherwise there are T+1 images.

Page 6 last line, “We used one image at the 20th frame as labeled frames … on all datasets”. Why 20th? Does choosing another frame have an impact on the results?
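The indexing remark on $\mathcal{X}$ above can be written out explicitly. This is only a restatement of the reviewer's point, not notation taken from the paper: if $T$ is meant to be the number of frames, either of the first two forms is consistent, whereas the quoted form contains $T+1$ images.

```latex
\mathcal{X} = \{x_t\}_{t=0}^{T-1}
\quad\text{or}\quad
\mathcal{X} = \{x_t\}_{t=1}^{T}
\qquad\Longrightarrow\qquad |\mathcal{X}| = T,
\qquad\text{whereas}\qquad
\bigl|\{x_t\}_{t=0}^{T}\bigr| = T + 1 .
```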

borderline accept (6)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

– The paper is clear in general. The story is coherent. The design of the framework is reasonable and the results show its superiority.

– The paper needs further explanation in several places that I mentioned above.

– The results seem weak because the network architectures, which are not clear to the reader, may not be comparable. I’m not an expert on cell image processing but evaluation on C2C12 seems redundant because it does not offer much information in addition to the first experiment. The authors should instead validate and analyze their framework and visualize the results to convince the readers that their framework does work. Numerical results on real datasets alone may not be enough.

• What is the ranking of this paper in your review stack?

2

• Number of papers in your stack

5

• Reviewer confidence

Somewhat confident

### Review #3

• Please describe the contribution of the paper

This paper presents an interesting method for semi-supervised cell detection that effectively uses a time-lapse sequence, in which one image is labeled data and the other images are unlabeled. The model is iteratively trained on labeled and unlabeled samples, where pseudo-labels for unlabeled samples are obtained by tracking. The proposed method is evaluated on two public datasets, and it significantly outperforms other fully supervised and semi-supervised methods.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

– The idea of exploiting the temporal consistency in time-lapse microscopy images is very interesting, novel, and a smart trick.

– Thorough evaluation on two datasets with strong results.

– The paper is clear and well-written.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

There are no major weaknesses in the paper.

• Please rate the clarity and organization of this paper

Very Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

I carefully assessed the sensibility of the experiments, and the results seem convincing.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

– In pseudo-labeling methods, there is always a high risk of transferring inaccurate pseudo-labels to the retraining stage, which is harmful to the model [1]. Although the authors obtain high-confidence scores by tracking, the method is highly sensitive to the hyper-parameter ‘l’ and the selected frames [b,a]. It would be good to provide ablation studies on these hyper-parameters to show how the performance varies across datasets. [1] Arazo, Eric, et al. “Pseudo-labeling and confirmation bias in deep semi-supervised learning.” 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 2020.

– Self-training methods such as the proposed one are often computationally expensive, since they take longer to converge. I would suggest reporting the computational complexity of the method, which is a critical factor in this application.

– In Fig. 5, frames l+10 and l+20, closely neighboring cells are detected as one single cell. Is there a way to overcome this challenge in your proposed method? Also, discuss the limitations of the method, which are missing in the paper.

– The network is iteratively retrained until it reaches a certain number of iterations, which is set to ‘gamma=3’. It’s not quite clear to me how the network converged so quickly, within 3 iterations.

– In Tables 1 and 2, please mention in the caption that the evaluation is in terms of F1 score, to make it easier for the reader.
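For readers unfamiliar with how an F1 score is obtained for point detections, a minimal sketch follows. The evaluation protocol of the paper is not given on this page, so the one-to-one greedy matching, the distance threshold `max_dist`, and the function name `detection_f1` are assumptions for illustration only.

```python
import math


def detection_f1(detected, ground_truth, max_dist):
    """Return (precision, recall, F1) for detected vs. ground-truth points.

    A detection counts as a true positive if it can be paired one-to-one
    with a ground-truth point no farther than max_dist away.
    """
    candidates = sorted(
        (math.hypot(dx - gx, dy - gy), i, j)
        for i, (dx, dy) in enumerate(detected)
        for j, (gx, gy) in enumerate(ground_truth)
    )
    used_det, used_gt = set(), set()
    for dist, i, j in candidates:
        if dist > max_dist:
            break  # all remaining candidate pairs are even farther apart
        if i not in used_det and j not in used_gt:
            used_det.add(i)
            used_gt.add(j)

    true_positives = len(used_det)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Unlike AP or mAP, this score is computed at a single operating point, so it depends on whatever confidence threshold produced the list of detections.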

accept (8)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

– Exploiting the temporal consistency in the time-lapse microscopy images for semi-supervised learning is a smart trick.

– Use of just one single labeled data to label the rest of unlabeled images in a sequence is a challenging task.

– Evaluation on two publicly available datasets and comparison with other methods.

• What is the ranking of this paper in your review stack?

3

• Number of papers in your stack

5

• Reviewer confidence

Confident but not absolutely certain

### Review #4

• Please describe the contribution of the paper

The paper uses tracking as a semi-supervised method to provide samples from unlabelled images to train the cell detector.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The method only needs one labeled image of one sequence, which I believe can save a lot of time in labelling cells.

2. The authors have considered the possibility and impact of miss detection in tracking.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. Paper format. Fig. 4 is out of context.

• Please rate the clarity and organization of this paper

Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors have claimed the code and evaluation can be provided.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

Although the authors come up with a solution to handle miss detection in pseudo label generation, its impact is not well studied. Since miss detection can hardly be avoided, maybe the authors should provide more experiments to ensure that miss detection is not a big concern.

accept (8)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Overall I like the idea of using tracking to help provide pseudo labels and use the generated pseudo labels to train the network again. Considering the miss detection in pseudo labels is a plus.

• What is the ranking of this paper in your review stack?

4

• Number of papers in your stack

7

• Reviewer confidence

Very confident

# Primary Meta-Review

• Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The paper proposes to use cell tracking, starting from an annotated image as the initial frame, to generate pseudo labels on other video frames to train a cell detector.

The reviewers support this semi-supervised way to generate pseudo labels for training cell detection networks, but they also raised many detailed questions and comments to improve the paper. Some major ones include: (1) How sensitive is the algorithm to the initial labeled frame? The paper uses the 20th frame, but some frames may have one cell or many cells, as shown in Fig. 2. (2) How will the inaccurate pseudo labels affect the cell detection network training? Figure 4 also shows that the detection performance quickly drops within 60 frames. In-depth analysis of the effect of missed detections and false positives during tracking on the detection network training is needed. (3) The performance on some datasets is around 0.8 (F1); failure case studies would provide some insights.
There are many other detailed questions/comments raised by reviewers on the methodology and evaluation. Please clarify in the rebuttal.

In addition to the reviewers’ questions, the AC has a few more: how is a detection network trained with a single initially-labeled image without any overfitting? If the initial detector performs poorly, how does this affect the following iterative training? There are quite a few related works in the computer vision community on joint object detection and tracking, detection-based tracking, and tracking for detection. The proposed work shares very similar ideas with some previous works, for example “Semi-supervised learning for object detectors from video” CVPR2015 [12], where tracking is used for temporally consistent detection (or object detector training and updating). The difference between [12] and the proposed work is: [12] picks cars as the object of interest, with box annotations and an SVM as the detector, to demonstrate SSL for training object detectors using temporal consistency. The proposed work uses cells as the object of interest with a heatmap-based detector [14]. Performing a comprehensive survey with sufficient comparisons would highlight the fundamental contribution of the work.

• What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

4

# Author Feedback

We would like to thank all the reviewers for their insightful comments and positive evaluation. For instance, R2 and R4 commented that the idea of this work is interesting and novel. R3 commented that the paper has a clear and coherent story, and the result shows its superiority. We would like to address the concerns raised by the reviewers as follows.

Q1: Sensitivity of the initial labeled frame. The reason for using the 20th frame. In a preliminary study, we confirmed that our method works with other initial frames and achieves comparable performance if the ground truth positions are correct in the initial frame. However, if the initial frame has some annotation gap, the performance decreases. Unfortunately, we found that the ground truth in DIC-C2DH-HeLa contains incorrect coordinates for some cell positions after the 20th frame. Therefore, we selected the 20th frame as the initial frame for all datasets.

Q2: Effect of the inaccurate pseudo labels. The reason the performance drops at 60 frames. Inaccurate pseudo labels affect the performance. To avoid this issue, pseudo labels are added only for cells that were successfully tracked from the labeled frame, and ambiguous areas are ignored by a masked loss. Therefore, our method is relatively robust across various datasets. We note that the vertical axis in Fig. 4 ranges from 0.8 to 1.0, so the decrease at the 60th frame is not large. While the performance monotonically decreases with the time difference from the labeled frame, it sometimes drops for several frames due to mitosis or severe touching (the appearance changes drastically). Although our method improves the performance, the result depends on the baseline. We consider this to be the reason.

Q3: Investigation for insignificant cases. We investigated the results pointed out by the reviewer. The improvements on these datasets were small because tracking errors, such as switching to a false-positive detection, occurred in the early steps of the iteration. For example, in DIC-C2DH-HeLa, cells are prone to over-detection at cell boundaries, and this caused the switch errors. While this is one of the limitations of our method, it still improved the performance even in such cases. Further improvements are left for future work.

Q4: Overfitting of the network. We consider that the network trained using only one frame does overfit. For instance, the performance on frames close to the labeled frame is better than on those far from it, since cell appearances change with time. We designed our method to exploit this overfitting property (good performance on the closer frames): our method iteratively adds pseudo labels to the closer frames by tracking. As a result, enough variation in cell appearance is added through the pseudo labels (which avoids overfitting), and the performance improves.

Q5: Difference from [12]. Our method differs from the previous method [12] in many respects. Their method was designed for bounding boxes and assumes that objects are not densely distributed. Their method first trains an object detector that identifies well-detected bounding boxes using sparsely annotated positive samples and negative samples collected from a web search. This strategy could not be applied to cell detection for the following reasons: 1) in our target, cells tend to be densely distributed, and thus a bounding box contains several cell regions (e.g., https://motchallenge.net/vis/OK-run03/gt/), which affects their object detector; 2) it is difficult to automatically collect negative samples from the web, so additional annotation is required. To address these issues, we proposed the heat-map-based semi-supervised cell detection method. To perform good pseudo labeling with this heat-map method, we have to consider many techniques, such as the ignore mask and the rules for tracking and iteration; otherwise it could not work in the cell detection task. We believe our technical novelty will contribute to the MICCAI community.
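The rebuttal refers to a masked loss and an ignore mask without giving a formula on this page. As a rough sketch of the general idea only, the function below averages a squared error over the heat-map pixels that are not marked as unreliable. The choice of mean squared error, the nested-list heat-map representation, and the name `masked_mse` are assumptions, not the authors' implementation.

```python
def masked_mse(predicted, pseudo_label, ignore_mask):
    """Mean squared error over pixels whose ignore_mask entry is False.

    All three arguments are 2D lists of the same shape. Pixels flagged
    True in ignore_mask (unreliable regions) contribute nothing.
    """
    total, count = 0.0, 0
    for pred_row, label_row, mask_row in zip(predicted, pseudo_label, ignore_mask):
        for pred, label, ignored in zip(pred_row, label_row, mask_row):
            if not ignored:
                total += (pred - label) ** 2
                count += 1
    # If every pixel is ignored there is nothing to penalize.
    return total / count if count else 0.0
```

With such a loss, a region where the pseudo label may be wrong (for example, around a lost track) does not push the network toward either "cell" or "background".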

# Post-rebuttal Meta-Reviews

## Meta-review # 1 (Primary)

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The authors’ feedback addressed some issues raised by the reviewers and AC, but there could be better ways leading to improvements. For example, regarding the sensitivity to the initialization, the authors responded like “the dataset has incorrect coordinate after the 20th frame, so the 20th frame is used as the initial frame.” Some basic statistical analysis would be suitable for the sensitivity analysis on different initially-labeled frames. Though experiments are not needed during rebuttal, how to perform this sensitivity analysis could have been explained. Similarly, such analysis is also suitable for the sensitivity to inaccurate pseudo labels. Training a reliable network-based cell detector using a single image at the beginning is still questionable.
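One simple form the basic statistical analysis mentioned above could take is sketched below: repeat the whole pipeline with different initially labeled frames, record the test F1 score of each run, and report the spread. The function name `summarize_sensitivity` and its input format are hypothetical, and no scores from the paper are used.

```python
import statistics


def summarize_sensitivity(f1_by_initial_frame):
    """Summarize how the final F1 score varies with the labeled frame.

    f1_by_initial_frame maps the index of the initially labeled frame
    to the F1 score obtained when training starts from that frame.
    """
    scores = list(f1_by_initial_frame.values())
    return {
        "mean": statistics.mean(scores),
        # Sample standard deviation needs at least two runs.
        "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
        "worst_frame": min(f1_by_initial_frame, key=f1_by_initial_frame.get),
    }
```

A small standard deviation and a narrow min-to-max range across initial frames would support the claim that the choice of the 20th frame is not critical; the same summary could be applied to runs with deliberately corrupted pseudo labels.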

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

7

## Meta-review #2

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

A semi-supervised cell-detection method is proposed for time-lapse sequences that requires only one labeled frame and uses tracking to generate pseudo-labels for the other frames for further training. The evaluation results on public data are promising. The majority view among the reviewers is that this is interesting work and the paper can be accepted. Altogether many issues were raised by the reviewers and the authors have addressed the most critical, though it is not clear how the paper will be revised. The authors should improve their paper along the lines suggested by the reviewers.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

5

## Meta-review #3

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The proposed method is able to detect individual cells using only one labeled image in a semi-supervised learning manner. The rebuttal has addressed the major concerns from reviewers. Although the proposed method is not compared with other semi-supervised object tracking methods in the experiments, the proposed framework might be interesting and novel from the viewpoint of cell detection.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

3