Authors
Amir Reza Sadri, Sepideh Azarianpour Esfahani, Prathyush Chirra, Jacob Antunes, Pavithran Pattiam Giriprakash, Patrick Leo, Anant Madabhushi, Satish E. Viswanath
Abstract
In order to ensure that a radiomics-based machine learning model will robustly generalize to new, unseen data (which may harbor significant variations compared to the discovery cohort), radiomic features are often screened for stability via test/retest or cross-site evaluation. However, as stability screening is often conducted independent of the feature selection process, the resulting feature set may not be simultaneously optimized for discriminability, stability, as well as sparsity. In this work, we present a novel radiomic feature selection approach termed SPARse sTable lAsso (SPARTA), uniquely developed to identify a highly discriminative and sparse set of features which are also stable to acquisition or institution variations. The primary contribution of this work is the integration of feature stability as a generalizable regularization term into a least absolute shrinkage and selection operator (LASSO)-based optimization function. Secondly, we utilize a unique non-convex sparse relaxation approach inspired by proximal algorithms to provide a computationally efficient convergence guarantee for our novel algorithm. SPARTA was evaluated on three different multi-institutional imaging cohorts to identify the most relevant radiomic features for distinguishing: (a) healthy from diseased lesions in 147 prostate cancer patients via T2-weighted MRI, (b) healthy subjects from Crohn’s disease patients via 170 CT enterography scans, and (c) responders and non-responders to chemoradiation in 82 rectal cancer patients via T2w MRI. When compared to 3 state-of-the-art feature selection schemes, features selected via SPARTA yielded significantly higher classifier performance on unseen data in multi-institutional validation (hold-out AUCs of 0.91, 0.91, and 0.93 in the 3 cohorts).
Link to paper
DOI: https://doi.org/10.1007/978-3-030-87199-4_42
SharedIt: https://rdcu.be/cyl4q
Link to the code repository
N/A
Link to the dataset(s)
N/A
Reviews
Review #1
- Please describe the contribution of the paper
This paper addresses the problem in radiomics protocols whereby discriminative image features that are marginally unstable can be filtered out because feature selection and stability assessment are performed independently. By integrating these steps, features that are still largely stable can be preserved and used to achieve improved model performance.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
An interesting formulation of the stability as a regularisation term in the learning process. A feature stability function is generated by bootstrap subsampling from samples across all institutions. This is then integrated into an optimisation based on discriminability, stability, and sparsity.
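As an illustration of what such bootstrap-based stability scoring might look like in practice, here is a minimal Python sketch. The scoring rule below (mean absolute difference of per-feature means between random half-samples drawn from two sites) is a generic assumption for illustration only, not the paper's IS definition (which follows Leo et al.):

import numpy as np

def bootstrap_instability(features, site_labels, n_boot=100, frac=0.5, seed=0):
    """Generic per-feature instability estimate (illustrative, NOT the paper's IS):
    average absolute difference in a feature's mean between random half-samples
    drawn from two different sites, over n_boot bootstrap iterations."""
    rng = np.random.default_rng(seed)
    sites = np.unique(site_labels)
    diffs = []
    for _ in range(n_boot):
        s1, s2 = rng.choice(sites, size=2, replace=False)  # pick two institutions
        idx1 = np.flatnonzero(site_labels == s1)
        idx2 = np.flatnonzero(site_labels == s2)
        sub1 = rng.choice(idx1, size=max(1, int(frac * idx1.size)), replace=False)
        sub2 = rng.choice(idx2, size=max(1, int(frac * idx2.size)), replace=False)
        diffs.append(np.abs(features[sub1].mean(axis=0) - features[sub2].mean(axis=0)))
    return np.mean(diffs, axis=0)  # one score per feature; higher = less stable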
There is a thorough evaluation process on 3 independent datasets. The process looks at the differences in feature sets selected by different algorithms as well as the classification performance of the selected features.
The evaluation process uses handcrafted features, but the process described is general, and future work could extend it to deep features, which are now also being used in radiomics.
There is also a second contribution regarding the optimisation of a non-smooth loss, where descent algorithms can be sub-optimal, in a (claimed) computationally efficient manner.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
Some further discussion of the results is needed. For example, Table 2 shows that the classifier performance was slightly higher in the validation set than in the discovery set. This finding is inconsistent with the other comparison algorithms and with the literature at large, where discovery usually shows higher classifier performance. This needs to be discussed.
Parts of the methodology are listed in the figures but not in the text.
There is no analysis of the computational efficiency of the optimisation.
- Please rate the clarity and organization of this paper
Excellent
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
Parameters and models used are listed. The paper does not state whether the 3 datasets are public, available via research agreement, or completely private.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
Overall the paper is well written and easy to read. The majority of the logic and flow are clear. The following are suggestions for improvement:
- The second contribution on the optimisation claims computational efficiency. Please provide experimental data to support this claim or remove the claim.
- Please explain the finding that classifier performance was slightly higher in the validation set than in the discovery set (Table 2). This is inconsistent with the other methods compared and with most of the literature.
- Please add a sentence on whether the formulation that includes marginally unstable features introduces a risk that the model is overall less stable compared to other models.
- The features used seem only to be listed in Fig. 1 and stated at the end of Section 4.2 (the results). It would also be good to add them to the text in the methodology or as supplementary materials.
- Please state your overall opinion of the paper
accept (8)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper addresses an interesting technical problem. The proposed solution seems robust and sound, and would be of interest to other radiomics researchers.
There are some missing details in the paper that should be included to improve its quality, especially with regards to the second contribution.
- What is the ranking of this paper in your review stack?
2
- Number of papers in your stack
3
- Reviewer confidence
Confident but not absolutely certain
Review #2
- Please describe the contribution of the paper
In this paper the authors introduce SPARTA, an optimization method that jointly optimizes discriminability, stability, and sparsity for feature selection. The authors propose to include the stability measure (IS) within the optimization process by adding to the optimization function (data fidelity + L1 penalty) an extra quadratic term \theta^T \beta \beta^T \theta responsible for the stability of the selected features.
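For reference, the objective as described in this review can be written as the following sketch (a reconstruction from the review text, not necessarily the paper's exact formulation; \beta is the coefficient vector, \theta the vector of per-feature instability scores, and \lambda, \mu the sparsity and stability weights):

\min_{\beta} \; \tfrac{1}{2}\,\|\mathbf{y} - \mathbf{X}\beta\|_2^2 \;+\; \lambda\,\|\beta\|_1 \;+\; \mu\,\theta^T \beta \beta^T \theta

Since \theta^T \beta \beta^T \theta = (\theta^T \beta)^2, the extra term penalizes placing large coefficients on features with high instability scores.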
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The authors have the nice idea of merging feature-selection stability into a LASSO-based optimization function. The mathematical framework selected to solve the resulting optimization problem is sound, and all the hypotheses are verified in the supplementary material. From a practical perspective, finding a model that can broadly generalize is a challenging question, and the results presented in the experimental section are promising. The model is tested on 3 different datasets and against 3 different classical approaches. An interesting aspect of SPARTA is that it was able to select and combine measures that would have been left out after a traditional stability screening.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
On computational efficiency, I think further discussion is missing:
- In Algorithm 1, how is K selected? Do the values selected for \mu and \lambda play a role in the difficulty of the optimization problem?
- Do you have a stopping criterion or a metric to decide whether the algorithm has converged? How long does this optimization take?
- What is the value of \gamma? Do you set it to 1/L, or did you take a more conservative value?
You cannot solve (3) with classical gradient descent, but could you use sub-gradient descent? This could be discussed: is it too slow?
What are the reasons for not using [22,23]? The optimization problem can be solved with these methods, but they are quickly dismissed.
The authors could have analyzed several implementations of \mathbf{R}, since it could be extended.
- Please rate the clarity and organization of this paper
Very Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
The paper is reproducible; the most important parameters, \mu and \lambda, are provided. The numerical convergence of Algorithm 1 could be further discussed.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
The paper reads well, and the presentation is concise and accurate. However, the authors could strengthen their results by comparing against more recent DL approaches. It would have been great to highlight the edge that SPARTA provides; for instance, the authors could have replaced the final RF classifier with an MLP (potentially using a sparsity penalty on the weights) and compared the performance of the MLP run on all the features versus on the features selected by the different models.
- Please state your overall opinion of the paper
Probably accept (7)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The message is clear, the solution proposed is interesting and the numerical experiments are convincing.
- What is the ranking of this paper in your review stack?
2
- Number of papers in your stack
3
- Reviewer confidence
Confident but not absolutely certain
Review #3
- Please describe the contribution of the paper
The authors proposed a new method for feature selection for image-based risk models. The developed method is able to identify a discriminative and sparse feature set that is also stable against variations, e.g. in acquisition. The authors compared the performance of this new method with other feature selection methods using three different cohorts.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
*) In general this is an interesting approach because it combines three important aspects into one method to find an optimal feature set.
*) The overall experimental design is clear and conclusive.
*) The authors compared the performance of the new method to different state-of-the-art methods using different cohorts.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
*) The authors used three different datasets, but why were those datasets used? Why was a dataset containing several hundred patients not used?
*) It is not clear on which data the regularisation parameters of the new method were determined.
*) It is not entirely clear on which data the stability was assessed. Also, how many bootstrap samples were used to assess the stability?
*) The authors wrote that they used another optimisation scheme for the new approach. Therefore, it would be important to know the effect of this scheme alone compared to the state-of-the-art approaches, because the differences could be coming from this new optimiser.
*) The authors performed a statistical test to compare the differences in model performance. However, I am not sure this test is suitable; I would prefer, e.g., a multi-level method for the statistical analysis to model the influence of the feature selection approach and the classifier separately. With such an approach, it would be possible to isolate the effect of the classifier in the statistics.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
Please see below.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
In addition to the comment above, here are some additional remarks:
*) Was hyper-parameter tuning of the RF performed? How were the hyper-parameters of the RF defined?
*) Given the many hyper-parameters of the RF, it would be interesting to additionally use another classifier, e.g. a logistic regression approach, to check whether the performance differences do not come only from the classifier.
*) The authors used the IS score to compute feature stability, but the definition of this score is missing from the manuscript.
*) Were the feature computation and extraction performed according to any guidelines, such as IBSI?
*) Is or will the source code of this new method be made publicly available?
- Please state your overall opinion of the paper
Probably accept (7)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Please see above.
- What is the ranking of this paper in your review stack?
2
- Number of papers in your stack
3
- Reviewer confidence
Very confident
Review #4
- Please describe the contribution of the paper
This paper introduces a radiomic feature selection approach with a novel optimization function to deal with multi-institutional datasets. The primary contribution of this work is the integration of feature stability as a generalizable regularization term into LASSO. The proposed method improves the classifier performance.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The proposed method integrates feature stability as a generalizable regularization term directly into the optimization function. A new class of proximal algorithms is introduced for optimization. The method is reasonable. The experimental results show that the proposed method (SPARTA) achieved higher classification performance on three multi-institutional datasets.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
In the experiments, the paper compares the performance of the proposed method with mRMR, LASSO, and WLCX. I think more state-of-the-art methods, such as deep learning, should be included to demonstrate the performance of the proposed method.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
The checklist helps the reproducibility of the proposed method.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
- In the experiments, the paper compares the performance of the proposed method with mRMR, LASSO, and WLCX. I think more state-of-the-art methods, such as deep learning, should be included to demonstrate the advantages of the proposed method.
- In Table 2, the std of the proposed method on C2 and C3 is much larger than that of the regular methods. It seems the feature stability term does not work for these two datasets.
- In Fig. 2, it seems that classifier performance could improve as μ increases. The best μ and λ for C2 and C3 are not 0.1. Is something wrong?
- Please state your overall opinion of the paper
borderline accept (6)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I think the discussion of the experiments should be improved.
- What is the ranking of this paper in your review stack?
1
- Number of papers in your stack
2
- Reviewer confidence
Confident but not absolutely certain
Review #5
- Please describe the contribution of the paper
This paper proposes a new approach for stable feature selection (“stable” refers to selecting features that remain stable across data collected from multiple sites). Conventional methods conduct stability screening and regression/prediction separately, so they are weak at prediction. The proposed model combines the two processes into one optimization problem, which can improve the prediction performance.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The model is well-motivated, and the optimization formulations are simple and easy to follow.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
I have some concerns in the experiment design and the results. Please see my comments for details.
- Please rate the clarity and organization of this paper
Very Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
Reproducible.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
This paper proposes a new approach for stable feature selection (“stable” refers to selecting features that remain stable across data collected from multiple sites). Conventional methods conduct stability screening and regression/prediction separately, so they are weak at prediction. The proposed model combines the two processes into one optimization problem, which can improve the prediction performance. The model is well-motivated, and the optimization formulations are simple and easy to follow.
I have some questions and concerns.
- For each experiment, you have discovery data and validation data. From my understanding, the discovery data is used for selecting stable features, and the validation data is used for testing classification performance using a Random Forest classifier. Your proposed model is designed for multi-site data. However, for both C2 and C3, the discovery data only contain single-site data. How did you implement your model on single-site data? This seems problematic, as Eqn. 2 needs multi-site data in order to compute the instability scores.
- The motivation for this work is very simple but solid. Incorporating the stability score into LASSO is a natural and useful idea. Has it been proposed by others? Or are there similar works? I guess there may be, as both LASSO and stable feature selection are not new.
- Features were selected using the discovery data and then used for classification on both the discovery data and the validation data. Ideally, the classification performance on the discovery data should be better than that on the validation data (because the discovery data is essentially training data). However, from your results (Table 2), all three cohorts show that validation performance is better than that on the discovery data. Can you explain this?
- Please state your overall opinion of the paper
borderline accept (6)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Please see my comments.
- What is the ranking of this paper in your review stack?
2
- Number of papers in your stack
2
- Reviewer confidence
Confident but not absolutely certain
Primary Meta-Review
- Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.
This paper introduces a new method to jointly optimize discriminability, stability, and sparsity during feature selection.
The key strengths include: 1) The idea of merging feature-selection stability into a LASSO-based optimization function is smart. 2) The mathematical framework is sound. 3) The results presented in the experimental section are promising. 4) All the hypotheses are verified in the supplementary material. 5) Evaluation was conducted on 3 independent datasets.
The key weaknesses include: 1) Some technical details (e.g., stability assessment) should be included. Since all five reviewers agree to accept, I would like to recommend “early accept” for this paper.
- What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).
1
Author Feedback
C1 [R1, R6]) Why are the validation results better than discovery? Re. SPARTA does yield better classifier AUCs in the validation cohorts compared to discovery in all of C1, C2, C3, albeit within the bounds of cross-validation performance. Note that the comparative strategies perform marginally worse in all validation sets compared to discovery sets. The performance improvement via SPARTA may be due to identifying more generalizable and discriminable radiomic features, better accounting for imaging variations.
C2 [R3, R4, R5]) Use other classifiers rather than RF. Re. SPARTA performance (AUC) via QDA in C1, C2, C3 was 0.89±0.13, 0.88±0.08, 0.86±0.05 in discovery and 0.89, 0.90, 0.88 in validation (aligning with the presented RF results). Similar performance was observed for LDA. Results were omitted due to space constraints.
C3 [R1, R4, R5]) Support the computational efficiency claim. Discuss the risk of including marginally unstable features. Add the feature list and the definition of IS. Clarify Fig 2. Re. Analysis via SPARTA took 273, 182, and 79 secs to process C1, C2, and C3 in the discovery set vs 589, 351, and 122 secs for the alternative mRMR strategy. This and the requested information will be added/updated in the final MICCAI paper.
C4 [R4]) Was the feature computation done per IBSI? Will the source code be public? Re. Yes.
C5 [R3, R4, R6]) How is K selected, and what is the impact of μ and λ? What is the value of γ? RF hyper-parameters? On which data were the regularisation parameters determined? How many bootstrap samples, and on which data? Can IS be computed within a single site? Re. μ and λ do not impact computational efficiency, only the algorithm's convergence rate (see Eq. (4) of the Supplementary Materials). K was empirically selected as 100 and can be increased in case of non-convergence. γ was set to the mid-point of the convergence interval (1/(2L); see Eq. (4) of the Supplementary Materials). RF hyper-parameters were empirically set to 50, 50, 100, based on the literature. Parameters and IS were computed on the discovery cohorts alone. Bootstrap subsets were based on randomly sampling 50% of the discovery cohort in each iteration. IS can be computed within a site or across sites; see Leo et al (https://doi.org/10.1117/12.2217053). Within a single site, instability quantifies the impact of batch effects and intra-site scanner variations. The relevant points will be clarified in the final MICCAI paper.
C6 [R3]) Is sub-gradient descent too slow? Re. Hastie et al (https://doi.org/10.1201/b18401) show that proximal algorithms (error decreasing as O(1/n) after n iterations) are significantly faster than sub-gradient methods (O(1/√n)).
C7 [R4]) Why not larger datasets? Re. The datasets were selected for diversity in imaging (CT and MRI), different sources of variation (batch effects, institutional differences), and to encompass a variety of clinical problems (healthy vs diseased, response vs non-response). In total, the evaluation covered 399 datasets, 3 diseases, and 7 institutions.
C8 [R5]) Comparison only against mRMR, LASSO, and WLCX, not DL. Re. The alternative feature selection approaches were chosen based on the radiomics literature. We will consider how to integrate with DL; this is non-trivial for an approach designed for ML/radiomics analysis.
C9 [R5]) The std AUC of SPARTA is much larger on C2 and C3 than the alternatives. Re. The std AUC of SPARTA on the discovery sets (0.07 for C2, 0.14 for C3) is within the range of the alternatives (0.05, 0.11, 0.06 for C2 and 0.01, 0.12, 0.07 for C3). This was likely due to the smaller cohort sizes for C2 and C3.
C10 [R4]) Model the influence of the feature selector and the classifier separately. What is the effect of the optimization scheme vs the feature selectors? Why not use [22,23] to solve the optimization? Re. As presented in Fig 2, Shapley values of the 5 top-ranked features selected by the different methods were presented to assess feature performance independent of the classifier. Additional analysis/comparisons are underway for the full journal paper.
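To make the optimization details above concrete (step size γ = 1/(2L), K iterations, soft-thresholding for the L1 term), here is a minimal Python sketch of a plain proximal-gradient (ISTA-style) iteration for an objective of the form described in Review #2. It is illustrative only: the function names, the convex objective, and the convergence check are assumptions, not the authors' non-convex relaxation algorithm or released code.

import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1 (element-wise soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_feature_selection(X, y, theta, lam=0.1, mu=0.1, K=100, tol=1e-6):
    # Sketch for: min_b 0.5*||y - X b||^2 + mu*(theta^T b)^2 + lam*||b||_1,
    # where theta holds per-feature instability scores.
    n, p = X.shape
    beta = np.zeros(p)
    # Gradient of the smooth part: X^T (X b - y) + 2*mu*(theta^T b)*theta.
    # Lipschitz constant L = largest eigenvalue of the Hessian X^T X + 2*mu*theta*theta^T.
    H = X.T @ X + 2.0 * mu * np.outer(theta, theta)
    L = np.linalg.eigvalsh(H).max()
    gamma = 1.0 / (2.0 * L)  # mid-point of the step-size interval (0, 1/L), as in the feedback
    for _ in range(K):
        grad = X.T @ (X @ beta - y) + 2.0 * mu * (theta @ beta) * theta
        beta_new = soft_threshold(beta - gamma * grad, gamma * lam)
        if np.linalg.norm(beta_new - beta) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta  # non-zero entries indicate selected features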