
Authors

Jiawei Yang, Yao Zhang, Yuan Liang, Yang Zhang, Lei He, Zhiqiang He

Abstract

Deep learning models are notoriously data-hungry. Thus, there is an urgent need for data-efficient techniques in medical image analysis, where well-annotated data are costly and time-consuming to collect. Motivated by the recently revived “Copy-Paste” augmentation, we propose TumorCP, a simple but effective object-level data augmentation method tailored for tumor segmentation. TumorCP is online and stochastic, providing unlimited augmentation possibilities for tumors’ subjects, locations, appearances, as well as morphologies. Experiments on the kidney tumor segmentation task demonstrate that TumorCP surpasses the strong baseline by a remarkable margin of 7.12% on tumor Dice. Moreover, together with image-level data augmentation, it beats the current state-of-the-art by 2.32% on tumor Dice. Comprehensive ablation studies are performed to validate the effectiveness of TumorCP. Meanwhile, we show that TumorCP can lead to striking improvements in extremely low-data regimes. Evaluated with only 10% labeled data, TumorCP significantly boosts tumor Dice by 21.87%. To the best of our knowledge, this is the very first work exploring and extending the “Copy-Paste” design in the medical imaging domain. Code is available at: https://github.com/YaoZhang93/TumorCP.
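For readers unfamiliar with the operation, the core of the method is a single object-level paste step: copy the tumor voxels out of a source scan using their segmentation mask and write them into a target scan at a location inside the organ. Below is a minimal NumPy sketch of that step under assumed conventions (3D volumes, integer label maps, hypothetical function and label names); it is an illustration, not the released implementation, which additionally applies random spatial and appearance transforms to the pasted tumor:

    import numpy as np

    def copy_paste_tumor(src_img, src_seg, tgt_img, tgt_seg, rng,
                         tumor_label=2, organ_label=1):
        """Illustrative object-level Copy-Paste for 3D volumes (hypothetical).

        src_img/tgt_img are 3D CT volumes; src_seg/tgt_seg are integer
        label maps. Copies the source tumor voxels and pastes them,
        recentered, at a random voxel inside the target organ, updating
        both image and labels.
        """
        tumor_coords = np.argwhere(src_seg == tumor_label)
        if tumor_coords.size == 0:
            return tgt_img, tgt_seg
        tumor_center = tumor_coords.mean(axis=0).astype(int)

        # The pasting area is guided by the organ prior: sample a voxel
        # inside the target organ as the new tumor center.
        organ_coords = np.argwhere(tgt_seg == organ_label)
        paste_center = organ_coords[rng.integers(len(organ_coords))]

        # Translate all tumor voxels by the offset between the two centers,
        # discarding any that fall outside the target volume.
        shifted = tumor_coords + (paste_center - tumor_center)
        valid = np.all((shifted >= 0) & (shifted < np.array(tgt_img.shape)),
                       axis=1)
        src_idx = tuple(tumor_coords[valid].T)
        dst_idx = tuple(shifted[valid].T)
        tgt_img[dst_idx] = src_img[src_idx]
        tgt_seg[dst_idx] = tumor_label
        return tgt_img, tgt_seg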

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87193-2_55

SharedIt: https://rdcu.be/cyhMy

Link to the code repository

https://github.com/YaoZhang93/TumorCP

Link to the dataset(s)

https://kits19.grand-challenge.org/data/


Reviews

Review #1

  • Please describe the contribution of the paper

The paper presents a copy-paste-based data augmentation technique, which uses lesion masks to crop lesions from scans and paste them at an appropriate location in another scan, guided by the organ (kidney) masks of the target scan.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The proposed technique can be quite powerful for some medical imaging tasks (e.g. lesion segmentation), where it can reduce the training data requirement.
    • In medical images, where the content has far less variability, this technique can potentially be applied very successfully.
    • The motivation of the copy-paste augmentation is detailed and well argued.
    • Results are promising.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • This idea (copy-paste augmentation) has been used in computer vision for at least the last couple of years, and those proposed techniques are more advanced; e.g., InstaBoost uses patch similarity to determine where to paste the cropped content, and it uses matting to improve consistency with the background.
    • The claim about beating state-of-the-art is not strictly correct as KiTS 19 challenge website lists better performing methods [1] posted in 2019. It would be good to clarify why those methods are not considered state-of-the-art. [1] https://kits19.grand-challenge.org/evaluation/challenge/leaderboard/
    • The paper mentions that this technique can produce artifacts. It would be nice to include some figures and discuss if matting would reduce these artifacts.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • All results seem to be on the validation set. It would have been nice to report the result of the final method on the test set as well.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    • Page 7, line 4 seems to refer to the wrong table; it should be Table 2.
    • In Section 3.2, almost the whole last paragraph is duplicated.
  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The presented technique is interesting and probably should receive more attention in the medical imaging field, and the results are promising.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    5

  • Reviewer confidence

    Somewhat confident



Review #2

  • Please describe the contribution of the paper

    This paper applies runtime data augmentation to improve tumor segmentation in CT images (the KiTS dataset) using a state-of-the-art U-Net architecture (nnUNet) for biomedical image segmentation. Region-based tumor (kidney cancer) annotations are pasted onto different backgrounds, in addition to standard data augmentations, in order to enrich the base dataset and achieve better performance. Evaluation results indicate a performance improvement compared to no data augmentation, which is even larger when little data is used.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The idea of successfully applying copy-paste augmentation to medical images seems promising as annotating medical data is a tedious process usually requiring medical experts.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The text is very hard to understand, often leaving the reader confused. This is due to many grammatical errors as well as imprecise, ambiguous descriptions. Specifically, the very loosely described methodology leaves a reader seriously questioning the validity of the experiments conducted.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    If the code is shared after publication, as the Introduction states, the methodology can be used by other researchers. However, due to the heavily randomized augmentation process during model training, the evaluation results will not be reproducible unless the process is seeded.
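    (For reference, the usual remedy the reviewer alludes to is to seed every random number generator the training pipeline draws from; a minimal sketch for a PyTorch-based setup such as nnUNet's is shown below. The helper name and choice of flags are illustrative, not part of the paper:)

        import random
        import numpy as np
        import torch

        def seed_everything(seed: int = 42) -> None:
            """Seed all RNGs a typical PyTorch training pipeline draws from."""
            random.seed(seed)
            np.random.seed(seed)
            torch.manual_seed(seed)           # CPU and current-device CUDA RNGs
            torch.cuda.manual_seed_all(seed)  # all GPUs, if present
            # Optional, for stricter determinism at some speed cost:
            torch.backends.cudnn.deterministic = True
            torch.backends.cudnn.benchmark = False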

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Many grammatical errors throughout the text make it very hard to read and at times not comprehensible. It is advised to have a language-proficient person revise the paper. Additionally, the validity of the described approach remains unclear as crucial information is missing or written in a confusing manner. Below are detailed comments and suggestions.

    Suggestions *******

    Content

    • Though DL models act like de facto … -> It is not clear what is expressed here - Maybe “Although DL models do in fact work well …”?

    Dataset

    • The organizers keep 90 images unreleased for testing -> Does this mean those 90 images are not used for this study and essentially 210 are used? It would help to include a table listing exactly how many images are used for training, validation, and testing. Additionally, the text repeatedly describes a ‘random selection of data’ - for comparability, this selection should be performed only once for each comparable ablation study, which is not clearly stated.

    • For data splitting, we randomly divide training data into 5-fold. Due to limited computation resources, we report our major ablation study results on one validation fold if not specified. -> Why are 5 folds created when only one is used for reporting results? Also, the evaluations should report results on never-before-seen, unaugmented testing data. From the context of the description, it seems that evaluations are done on the validation fold? If so, is this fold augmented as well? It should not be, so as to reflect reality. Is this fold the same one for all reported results? Otherwise, their comparison would be unfair.

    Methodology

    • during runtime augmentation, given an image is chosen for augmentation - is the original image replaced, or is it additionally used for training? In case it is replaced, and assuming all described augmentation techniques are applied to that image, the dataset size stays the same in any case; hence, it is not really ‘augmented’ but rather ‘altered’.

    • intra-/inter-patient Copy-Paste -> the descriptions of these terms contradict each other: first (2.1), the text states that ‘intra-patient’ means using the identical image as source and target, whereas ‘inter-patient’ uses different images (regardless of the patient). Later (3.2), the text states that ‘intra-patient’ uses all images of a particular patient, whereas ‘inter-patient’ specifically uses images of other patients as targets. This should be made clear in the text because it is critical for understanding the validity of the approach.

    • … we use Dice-Sørensen Coefficient (Dice) score as our metric -> How exactly is this coefficient (specifically ‘Mean Dice’) applied to measure segmentation performance? Readers should be more thoroughly informed about the evaluation’s notion of true positives/negatives and type 1/2 errors.

    • It is assumed the methodology selects the images to augment with Intra-/Inter-CP at random. If this is the case, the approaches in Table 1 cannot be fairly compared, as the chosen lesions have a different impact on the results. This can be verified by simply applying this augmentation strategy multiple times with the same settings - the results will vary. For a fairer comparison, the exact same images should be chosen and then combined with the other instance augmentations such as Elastic, Rigid, Gamma, etc.

    Textual

    • Motivated by the recently [reviving -> revived] “Copy-Paste” augmentation …
    • Deep learning (DL) models [work -> worked] remarkably well over the past few years …
    • [Though -> Although]
    • … demanding [more so than ever large -> ever larger] and well-annotated datasets …
    • … integrating multiple proxy tasks to [exploiting -> exploit] unlabeled medical data …
    • … dramatically affects model performance [for the -> as it introduces the] risk of overfitting to fake data.
    • [Distinct from the tend -> Contrary to the trend] of using increasingly sophisticated methods
    • … for [add: the] tumor segmentation problem.
    • … set of stochastic data [transformation -> transformations], …
    • … for different objectives [as the followings -> , described in the following].
    • … [remaining -> leaving] the coupling between foreground and background.
    • … is enhanced by randomly [sample -> sampling the] gamma parameter;
    • … As mentioned [early -> earlier] …
    • … we randomly divide training data into [5-fold -> 5-folds]. …
  • Please state your overall opinion of the paper

    reject (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, I cannot recommend accepting this paper for the main track of this conference, mainly due to the following discrepancies that call the validity of the approach into question:

    • the training process is not clearly explained
    • it is not clear, but it is assumed, that augmentation is also applied to the validation set, which would introduce bias
    • it is not clear what exactly is evaluated - it is assumed that results on a validation set are reported, which would reflect not the model’s best performance but its state at the corresponding cut-off epoch
    • randomly picking images to be augmented for every compared approach is not fair
  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    3

  • Reviewer confidence

    Very confident



Review #3

  • Please describe the contribution of the paper

    The paper evaluated the copy-paste approach for image augmentation in the context of kidney tumor segmentation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors conducted a relatively thorough investigation of the strengths of the copy-paste approach for image augmentation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I think the novelty of the proposed method is questionable. First, it is a relatively trivial operation to simply copy-paste the tumor onto a normal organ. Second, the benefit of this method has already been reported in other fields (ref. [6]). Also, there is a lack of proper comparison with other augmentation methods.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It seems to be reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    First, I think it would be good to discuss how plausible the augmented tumors are, and whether implausible augmentations could have a negative impact on model performance. Second, the evaluation could benefit from including more segmentation accuracy metrics. Third, it would be good to compare this simple augmentation approach with other, more sophisticated ones.

  • Please state your overall opinion of the paper

    borderline reject (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think the novelty of the proposed method is questionable.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper presents a translation of an augmentation technique to the medical imaging world with good performance. However, the reviewers highlight the relatively limited novelty compared to the original proposition of the method and regret the lack of comparison with other augmentation strategies. Further, a lack of detail makes the paper hard to follow in places.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    8




Author Feedback

We would like to acknowledge all the reviewers for their careful reviews. Also, we would like to thank ACs for their scholarly handling of the submission.

Overall, all 3 reviewers agree on the promising results of TumorCP. Both R1 and R3 agree on its good organization and clarity. R1 highlights our detailed and solid motivation, and believes the technique should receive more attention in our community, given its under-exploration yet great effectiveness for the data-hungry medical imaging field.

We use abbreviations to save space: obj - object; img - image; aug - augmentation; CP - Copy-Paste; SOTA - state-of-the-art.

  1. On the novelty (R1 and R3): CP aug [6] was proposed in Dec. 2020 for natural images, but its success on medical images remained unknown. Derived from it, the steps TumorCP takes forward are as follows. 1) It is the first exploration of CP for a medical scenario: tumor segmentation. The exploration is two-fold: (a) different from img-level aug, obj-level TumorCP copies tumors and pastes them onto diverse locations, flexibly generating plausible images; (b) different from the original CP [6], we leverage the prior that a tumor should be attached to an organ, so the pasting area is guided. 2) We aim to explore the domain-specific question “Does context really matter in tumor segmentation?”. Surrounding visual context is argued to be vital, and we want to see if that is the case. For this purpose, TumorCP is superior to InstaBoost (mentioned by R1), which only does intra-CP within the same image (same context). In contrast, our Inter-CP can involve more diverse abdominal context across patients. 3) We ablate many design choices of TumorCP for 3D medical images (e.g. various 3D transformations and the Inter/Intra setting) and benchmark them on KiTS. We hope TumorCP will provide useful data points to our community and shed light on the importance of CP aug, which is powerful but has so far been absent from the medical imaging field.

  2. On the comparison with other methods (R1 and R3): 1) Data aug in ABDOMINAL tumor segmentation has remained conservative for decades, even until now. On one hand, recent methods (AutoAug/RandAug) are all img-level; they are designed for natural image classification and cannot be directly applied to medical segmentation. On the other hand, other sophisticated approaches, e.g. using GANs for img-level synthesis, lack an explicit way to manipulate obj-level instances. As we intend to introduce obj-level aug for abdominal tumor segmentation, TumorCP should be compared with the conventional img-level augs that have proven successful in this field. 2) In fact, we did compare TumorCP with SOTA img-level augs (Table 2). nnUNet is the leading method on KiTS, in no small part due to its carefully designed img-level augs. Our TumorCP not only obtains remarkable gains over vanilla nnUNet but is also complementary to img-level augs and further boosts nnUNet’s performance. We believe the solid gains over the SOTA model confirm TumorCP’s success.

  3. On the lack of details (R2): We follow common settings in related tracks, and here make them clear. We promise a careful revision of this part. 1) Dice, one of the most basic metrics in segmentation, measures the overlap of the model’s prediction A and the ground truth B, formulated as Dice = 2|A∩B|/(|A|+|B|). Mean Dice is the average of all patients’ Dice scores. 2) Data split: We use the published 210 images from KiTS19 and make a Train/Val split (168/42). The Val set is never seen, i.e. we use it neither to tune parameters nor to monitor the training process. The reported results are on this never-before-seen, unaugmented Val set. 3) Aug: We do ONLINE aug, so the data diversifies on the fly during training but stays unaltered on disk. We believe online aug over a long training run is enough to cover the randomness of tumor relocation, so the comparison of different approaches is fair. Besides, TumorCP is invoked with a probability (Fig. 1), so the model sees both original and augmented data during training.
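  (To make point 1 concrete, a minimal sketch of the per-case Dice and the Mean Dice described above, assuming binary masks and with hypothetical helper names:)

    import numpy as np

    def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
        """Dice-Sørensen coefficient of two binary masks: 2|A∩B| / (|A|+|B|)."""
        intersection = np.logical_and(pred, gt).sum()
        return 2.0 * intersection / (pred.sum() + gt.sum() + eps)

    def mean_dice(cases) -> float:
        """Mean Dice: the average of the per-patient Dice scores."""
        return float(np.mean([dice(pred, gt) for pred, gt in cases]))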




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper argues for the use of CP augmentation and, despite the known existence of such techniques in the computer vision world, justifies why such studies are warranted. The rebuttal clarifies the reviewers’ main points of contention.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    9



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This is a relatively straightforward application of a technique explored in the computer vision domain to medical image data. While by the authors’ own admission the technical novelty is small, I agree with R1 that this technique will be interesting to the MICCAI community. The evaluations are comprehensive and include standard image-level augmentations as a baseline. Although comparison to other commonly used techniques such as Mixup would have strengthened the paper, I believe the large improvements demonstrated are intriguing. Furthermore, the paper contains an attempt at explaining why this method works so well, which I find useful. The rebuttal addresses many of the major points and I am happy to recommend acceptance.

    There do appear to be some issues with clarity, which I hope the authors can fix for the final submission.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    7



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The lack of novelty is an important drawback of this work. All reviewers, including the MR, point this out, and this meta-reviewer agrees. The authors simply apply an existing technique to medical images. While this is fine when a thorough validation is provided, here the authors’ experimental setup lacks comparisons with existing augmentation techniques for medical imaging as well as with other datasets. I suggest the authors perform a more thorough evaluation of the copy-paste augmentation technique and then resubmit.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    16


