
Authors

Giacomo Nebbia, Saba Dadsetan, Dooman Arefan, Margarita L. Zuley, Jules H. Sumkin, Heng Huang, Shandong Wu

Abstract

Convolutional Neural Networks (CNNs) are traditionally trained solely on the given imaging dataset. Additional clinical information is often available along with imaging data but is mostly ignored in the current practice of data-driven deep learning modeling. In this work, we propose a novel deep curriculum learning method that utilizes radiomics information as a source of additional knowledge to guide training using customized curriculums. Specifically, we define a new measure, termed the radiomics score, to capture the difficulty of classifying a set of samples. We use the radiomics score to enable a newly designed curriculum-based training scheme. In this scheme, each sample's loss component is weighted and initialized by its corresponding radiomics score, and the weights are continuously updated over the course of training according to our customized curriculums to enable curriculum learning. We implement and evaluate our methods on a typical computer-aided diagnosis task for breast cancer. Our experimental results show the benefits of the proposed method compared to a direct use of a radiomics model, a baseline CNN without any additional knowledge, standard curriculum learning using data resampling, an existing difficulty score from self-teaching, and previous methods that use radiomics features as additional input to CNN models.
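A minimal sketch of the weighting scheme described above, assuming a PyTorch-style setup; all names are illustrative, and the linear "easy first" warm-up shown is only one plausible instantiation of a curriculum, not necessarily the authors' exact formulation:

    import torch
    import torch.nn as nn

    bce = nn.BCEWithLogitsLoss(reduction="none")  # keep per-sample losses

    def curriculum_weights(rad_scores, epoch, warmup_epochs):
        # rad_scores: 1-D tensor of radiomics scores (high score = easy sample).
        # Weights start at the radiomics scores, so easy samples dominate the
        # loss early on; they are then linearly interpolated toward uniform
        # weights and stay constant once the warm-up is over.
        progress = min(epoch / float(warmup_epochs), 1.0)
        return (1.0 - progress) * rad_scores + progress * torch.ones_like(rad_scores)

    def weighted_loss(logits, labels, weights):
        # Scale each sample's loss term by its current curriculum weight.
        per_sample = bce(logits.view(-1), labels.float())
        return (weights * per_sample).mean()

In a training loop, curriculum_weights would be recomputed once per epoch (for the whole training set or per batch) and passed to weighted_loss in place of a plain mean reduction.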

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87240-3_61

SharedIt: https://rdcu.be/cyl6B

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    Current deep learning models in breast imaging have difficulty incorporating clinical information. Here, the authors propose a novel curriculum learning method that uses radiomics as an additional source of information based on classification difficulty. The method was tested on the CBIS-DDSM database. Their approach showed improved performance over competing methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This is a clever approach to incorporate relevant clinical information into curriculum learning.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The difficulty of classification is evaluated with a radiomics score. However, there could be instances where easy cases are wrongly classified as difficult ones, or vice versa. How would you handle this?
    2. Radiomics scores could be built using various approaches. Do different radiomics scores differ significantly, and how do they affect the curriculum learning?
    3. There is other clinical information, such as age and sex, that should also play a part in the radiomics score model.
    4. Only two curriculums were tested. How can they be customized in a breast imaging context?
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Acceptable

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Summary: Current deep learning models in breast imaging have difficulty incorporating clinical information. Here, the authors propose a novel curriculum learning method that uses radiomics as an additional source of information based on classification difficulty. The method was tested on the CBIS-DDSM database. Their approach showed improved performance over competing methods.

    Strengths: This is a clever approach to incorporating relevant clinical information into curriculum learning.

    Weaknesses:

    1. The difficulty of classification is evaluated with a radiomics score. However, there could be instances where easy cases are wrongly classified as difficult ones, or vice versa. How would you handle this?
    2. Radiomics scores could be built using various approaches. Do different radiomics scores differ significantly, and how do they affect the curriculum learning?
    3. There is other clinical information, such as age and sex, that should also play a part in the radiomics score model.
    4. Only two curriculums were tested. How can they be customized in a breast imaging context?
  • Please state your overall opinion of the paper

    strong accept (9)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper provides an important thrust going forward.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    This paper defines a new radiomics-based difficulty score as a source of additional knowledge to augment deep learning training, and proposes a new training scheme guided by both the radiomics score and customized curriculum learning strategies. Within the proposed scheme, the radiomic score is used as the weight of the loss function and is continuously updated based on the customized curriculums to enable curriculum learning. The proposed method achieved better performance on a public breast cancer dataset when compared to other methods, including the traditional radiomics model, CNN-based methods, and standard curriculum learning methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • A good way of using a conventional machine learning algorithm to capture the difficulty of classifying a set of samples.
    • Proposes a different curriculum learning strategy guided by a dynamic radiomic score.
    • Defines a simple but effective customized curriculum to dynamically adjust the radiomic score-based weights.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • Some details about the radiomics baseline are lacking: if the same logistic regression classifier used for the radiomic score is also used as the baseline, the traditional radiomics model will of course perform worse than the deep learning ones. In addition, logistic regression itself is not a strong choice for this classification problem.
    • Only one evaluation metric is used for this classification problem; it would be better if other common and well-established evaluation metrics were used for performance evaluation.
    • Only four experiments show a statistically significant improvement over the baseline.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Basically, the proposed method is not difficult to reproduce, and enough implementation details have been provided. But it would be better if more details of the traditional radiomics part (e.g., extracted radiomics features, parameters of the logistic regression classifier) could be given, since the radiomic score is one of the key components of the proposed method. Also, more details about data pre-processing are needed to replicate the paper.
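    For illustration, a hedged sketch of how such a radiomics score could be computed (pyradiomics features fed to a logistic regression; the function names and the cross-validated scoring are my assumptions, not the authors' pipeline):

        import numpy as np
        from radiomics import featureextractor          # pyradiomics
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_predict
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        def extract_features(image_path, mask_path, extractor=None):
            # Extract the standard pyradiomics feature set for one lesion.
            extractor = extractor or featureextractor.RadiomicsFeatureExtractor()
            result = extractor.execute(image_path, mask_path)
            # keep only numeric feature values, drop the diagnostics entries
            return np.array([v for k, v in result.items()
                             if not k.startswith("diagnostics")], dtype=float)

        def radiomics_scores(X, y):
            # Score = probability the radiomics model assigns to each sample's
            # true class (high = easy). Cross-validated predictions are used so
            # that no sample is scored by a model fit on it; the paper may do
            # this differently.
            clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
            p_pos = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
            return np.where(y == 1, p_pos, 1.0 - p_pos)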

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Overall, this paper proposes a good training strategy combining traditional radiomics components with deep curriculum learning techniques. Here are some suggestions that may help improve the quality of this paper:

    • The details of the radiomics model baseline are needed for a comprehensive evaluation.
    • It would be better to include more evaluation metrics (e.g., accuracy, sensitivity, specificity, F1-score) for performance measurement (a generic example of computing these is sketched after this list).
    • Add some details about data pre-processing and the calculation of the newly defined radiomic score (e.g., extracted radiomics features, parameters of the logistic regression classifier) so that the method will be easier to reproduce.
    • Please try to explain why the p-value is relatively large for most of the experiments. Does this indicate that the proposed method does not improve over the baseline from a statistical perspective?
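    For example, such a metrics report could be assembled along these lines (a generic scikit-learn sketch; y_true, y_pred, and y_score are placeholders):

        from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                     recall_score, roc_auc_score)

        def report_metrics(y_true, y_pred, y_score):
            # y_score: predicted probability of the positive class; y_pred: 0/1 labels
            tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
            return {
                "auc": roc_auc_score(y_true, y_score),
                "accuracy": accuracy_score(y_true, y_pred),
                "sensitivity": recall_score(y_true, y_pred),  # tp / (tp + fn)
                "specificity": tn / (tn + fp),
                "f1": f1_score(y_true, y_pred),
            }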

  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    • A good use of traditional radiomics knowledge to facilitate deep curriculum learning; the definition of the radiomic score is reasonable and appropriate for such a diagnosis problem.
    • The dynamic customized curriculum learning strategy is simple but effective.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain



Review #3

  • Please describe the contribution of the paper

    The paper proposes a method for incorporating radiomics-based difficulty scoring for curriculum learning. It analyzes multiple architectures and finds superior results on the evaluated dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The idea of incorporating radiomics-based scoring into curriculum learning seems novel and interesting, as it might make it possible to leverage recent developments in the field of radiomics for deep learning.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    In my opinion, the paper lacks some clarity, especially with respect to the experimental section. Some of the methodological choices seemed arbitrary to me, and details I consider relevant are unfortunately missing from the description, as pointed out below.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Notably, the authors state that they will include their full training code, models, and description upon acceptance; however, they did not submit them together with the paper. Although not required, it would have been much easier for me to clarify some of the questions below if the relevant code had been provided, and this would have allayed some of the doubts I had about the submission.

    Several points in the experimental section remain unclear and would make it difficult for me to reproduce the results, e.g., the radiomics setup, the hyperparameter optimization, and the finally chosen hyperparameters, as pointed out below.

    In contrast to the authors' declaration, the submission does not include data showing variation or error bars, the statistical significance assessment is incomplete (missing test statistics and no precise p-values), no failure-case assessment has been conducted (only an overall assessment of the pros/cons of the schedules), no description of the hardware used is given, and only part of the software description (e.g., the pyradiomics version is missing) is included. I would therefore like to ask the authors to either include these in their submission or to change their responses in the reproducibility checklist.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    As pointed out above, the authors address an interesting and important topic with a novel idea, which could be especially useful for squeezing a few more percent of final performance out of cutting-edge algorithms. From my perspective, however, the submission unfortunately has major downsides which should be addressed before publication:

    • The state-of-the-art discussion felt rather thin to me. There is a variety of recent work in the field of curriculum learning, including on the choice and design of learning schedules, which was not addressed, such as [1,2,3]. The foundational work by Bengio [4] was not addressed at all, and while the work by Weinshall was referenced (Ref. 13), a notable publication [5] that also discusses several of the issues addressed in the submission could have been added. I would have appreciated seeing the submission set itself in contrast to this literature: what are the similarities, and what are the differences, relative to recent work?
    • Some of the claims felt rather bold to me and should at least be backed up with citations or additional experiments, e.g.:
      – "[…] or by weighing different features in the loss function [4]. These methods are challenging for deep learning models because the features are automatically learned in deep learning." (p. 1). Why so? Modifying the loss function, in particular, is standard practice in deep learning and widely applied.
      – "This poses a need to investigate new strategies to incorporate domain knowledge into deep learning" (pp. 1-2). There is a variety of approaches addressing this issue, some even cited by the authors, with information fusion even constituting a field of research on its own. Without a clear definition of "domain knowledge", this sentence should definitely be weakened.
      – "Table 1 indicate that (1) our methods of incorporating knowledge boosts the classification performance, (2) our proposed difficulty score-guided curriculum training makes better use of the score than the resampling strategy does, and thus the observed higher classification effects, and (3) our curriculum-guided training strategy uses the information captured by radiomics more effectively than simply concatenating the radiomics features into deep learning." In my opinion, all of these claims should be weakened, as the results were obtained with only a few architectures, on only one dataset, with only one repetition, and using only one specific split. I would furthermore recommend not mixing the results section with the results discussion, as this makes it difficult to separate hard facts from interpretation.

    • Additionally, I encountered a number of unclear points while reading the submission, which I list in the following:
      – On p. 2 it is stated that "Radiomics features are pre-defined and hand-crafted imaging features that can be extracted from segmented areas to capture micro-structural information". In fact, radiomics can be (and often is) applied without segmentations, too. Similarly, this weakens the limitation at the end of the submission, where it is stated that "It should be pointed out that our method requires availability of segmentations in order to compute radiomics".
      – Neither the used radiomics features nor their dimensionality are described within the submission. It is only stated that "Typical radiomics features include shape, textures, first order features, grey-level based features, etc.". The authors should definitely add a description of the used features as well as their dimensionality, and state whether they applied a preselection or similar, e.g., mRMR as suggested in [7].
      – On p. 4 it is stated that the weights of easy samples are decreased, and these are defined as samples with a RadScore over 0.5. Where does this definition stem from? What is the threshold for each of the categorizations "easy", "intermediate", and "hard", which are also used in the later evaluation, e.g., in Table 3?
      – On p. 4 the authors write "The upper and lower bounds for Curriculum 2 are chosen based on the maximum and minimum score values in the training set." What is the intuition behind that? What are these values across the chosen dataset?
      – On p. 5 it is stated that "Specifically, in every epoch, each image is sampled (without replacement) with probability proportional to its difficulty score, so that easier samples are more likely to be seen at the beginning of each epoch." This, however, seems to be a major difference from the authors' contribution and makes the results nearly non-comparable, as according to Fig. 2 not all samples are necessarily used in each epoch in the authors' method. Introducing difficult samples in the early epochs makes it very unlikely that the resampling baseline can benefit equally from any curriculum. (A minimal sketch of how I understand this baseline is given after this list.)
      – On p. 5 it is written that "Specifically, we concatenate the radiomics features to the second to last fully connected layer, so that the radiomics features and "deep" features are combined to generate a final classification." I am not quite sure whether the authors concatenated the features to the second-to-last, the last, or all fully connected layers in between. Additionally, I am not convinced that a concatenation in the last FC layer can be successfully utilized by the network. Did the authors do any evaluation on where to best add the additional information?
      – The authors state that they resize the images if needed. This in turn means that the network has to handle a variety of different physical representations of the input data. I would highly recommend either always resizing or simply cropping all images equally, as otherwise it is not unlikely that the task is made more difficult for the network. Furthermore, the order in the submission implies that radiomics features are extracted after the resizing step. If so, this would essentially corrupt the radiomics features.
      – On p. 6 the authors state: "we consider the model that achieves the lowest validation loss after all loss weights are constant". What does that mean?
      – On p. 7: Which radiomics features were important for the scoring and classification? How was the radiomics model trained?
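    For concreteness, this is how I understand the resampling baseline described in the quoted passage (a hedged PyTorch sketch; dataset and rad_scores are placeholders, not the authors' code):

        import torch
        from torch.utils.data import DataLoader, WeightedRandomSampler

        def make_resampling_loader(dataset, rad_scores, batch_size=32):
            # Draw every sample exactly once per epoch, without replacement, with
            # probability proportional to its score, so easier (higher-scored)
            # samples tend to appear earlier in the epoch.
            sampler = WeightedRandomSampler(
                weights=torch.as_tensor(rad_scores, dtype=torch.double),
                num_samples=len(dataset),
                replacement=False,
            )
            return DataLoader(dataset, batch_size=batch_size, sampler=sampler)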

    Minor comments:
      – Although the relevant software is cited, there is no citation of the original radiomics publications, such as the ones by Kumar and Aerts.
      – The end of the introductory section states comparisons 1)-4) and then contributions 1)-3) immediately after. I would recommend at least changing the enumeration of one of them, or restructuring slightly.
      – It seems somewhat complicated to introduce an indicator function when a label y would have served similarly in the RadScore function. Also, it is not the image I_n itself that is positive or negative, but rather the label assigned to it. I would recommend changing this accordingly.
      – On p. 4 it is stated that "The two proposed curriculums represent and also extend the intuitive implementations of the "easy first, hard second" curriculum learning." I generally found this kind of schedule referred to as an "easy-to-hard" schedule in the reference literature. I would recommend sticking to terms that are shared across the community, or to the one defined in a specific, explicitly mentioned reference.
      – Why did the authors limit the number of epochs rather than using early stopping, given that they have a validation dataset?

    References:
    [1] Graves, Alex, et al. "Automated curriculum learning for neural networks." International Conference on Machine Learning. PMLR, 2017.
    [2] Jiang, Lu, et al. "Self-paced curriculum learning." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 29, No. 1. 2015.
    [3] Matiisen, Tambet, et al. "Teacher-student curriculum learning." IEEE Transactions on Neural Networks and Learning Systems 31.9 (2019): 3732-3740.
    [4] Bengio, Yoshua, et al. "Curriculum learning." Proceedings of the 26th Annual International Conference on Machine Learning. 2009.
    [5] Weinshall, Daphna, Gad Cohen, and Dan Amir. "Curriculum learning by transfer learning: Theory and experiments with deep networks." International Conference on Machine Learning. PMLR, 2018.

  • Please state your overall opinion of the paper

    reject (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the idea is interesting, the methodology, the evaluation, and the paper's clarity are, in my opinion, not yet ready for publication, as pointed out above.

  • What is the ranking of this paper in your review stack?

    5

  • Number of papers in your stack

    7

  • Reviewer confidence

    Very confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes to use curriculum learning for breast cancer diagnosis, automatically ordering images using a radiomics model. All reviewers recognized the relevance of the proposed approach to the task at hand and the experimental results. Reviewer 3 raised some reservations concerning arbitrary methodological choices and requested clarifications concerning statements in the text. I strongly recommend that the authors address these in the final version of the paper.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    3




Author Feedback

N/A


