
Authors

Fernando Pérez-García, Catherine Scott, Rachel Sparks, Beate Diehl, Sébastien Ourselin

Abstract

Detailed analysis of seizure semiology, the symptoms and signs which occur during a seizure, is critical for management of epilepsy patients. Inter-rater reliability using qualitative visual analysis is often poor for semiological features. Therefore, automatic and quantitative analysis of video-recorded seizures is needed for objective assessment.

We present GESTURES, a novel architecture combining convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to learn deep representations of arbitrarily long videos of epileptic seizures. We use a spatiotemporal CNN (STCNN) pre-trained on large human action recognition (HAR) datasets to extract features from short snippets (≈ 0.5 s) sampled from seizure videos. We then train an RNN to learn seizure-level representations from the sequence of features.

We curated a dataset of seizure videos from 68 patients and evaluated GESTURES on its ability to classify seizures into focal onset seizures (FOSs) (N = 106) vs. focal to bilateral tonic-clonic seizures (TCSs) (N = 77), obtaining an accuracy of 98.9% using bidirectional long short-term memory (BLSTM) units.

We demonstrate that an STCNN trained on a HAR dataset can be used in combination with an RNN to accurately represent arbitrarily long videos of seizures. GESTURES can provide accurate seizure classification by modeling sequences of semiologies. The code, trained models and features dataset are available at https://github.com/fepegar/gestures-miccai-2021.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87240-3_32

SharedIt: https://rdcu.be/cyl56

Link to the code repository

https://github.com/fepegar/gestures-miccai-2021

Link to the dataset(s)

https://doi.org/10.5522/04/14781771

Reviews

Review #1

Please describe the contribution of the paper

This work collects a new seizure video dataset for seizure recognition. It combines a CNN and an RNN to model spatiotemporal features for snippet- and seizure-level classification. The final accuracy reaches 98.9%, satisfying the clinical application to some extent.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

- The newly collected dataset is well formulated, including the acquisition process and ground-truth definition. If the dataset can be made publicly available, this work will be of greater value.
- The code and extracted feature vectors will be available for future research.
- This work introduces a new and simple task for seizure recognition by distinguishing between FOSs and TCSs, which can be applied to infer epilepsy type.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- Limited method novelty: the framework is a simple combination of existing work [13] and RNN modules. The only difference from [13] is the RNN added behind the STCNN. The technical contribution does not meet the bar for MICCAI acceptance.
- Experiments: 1) This work shows no experimental results demonstrating the effectiveness of transfer learning from a large HAR dataset to the seizure dataset. What is the accuracy of a model trained with only the seizure dataset, i.e., training the STCNN with seizure videos rather than HAR videos? 2) I am not convinced by the use of R(2+1)D-34(8) as the extractor. As shown in Table 1, R(2+1)D-34(32) achieves much better performance across all metrics, and the computational cost is acceptable. Why not use R(2+1)D-34(32) for a better performance?
- Confused logic and organization: Fig. 1 should be presented in Section 2 rather than Section 3; many method details are omitted in Section 2, e.g., the description of the STCNN, the architecture of the RNN, etc.; some explanations (e.g., the reason for using R(2+1)D-34(8)) should first be given in Section 3 instead of Section 4.
Please rate the clarity and organization of this paper

Poor
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

It is reproducible since the code will be released soon.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
- As shown in Fig. 2, BLSTM achieves better performance than LSTM when n = 16. Any insight or explanation for this? For me, the use of BLSTM is not convincing, as LSTM shows better performance when n = 4 and n = 8.
- It is better, or necessary, to show the images and videos of the newly collected dataset. Otherwise, the readers cannot get a clear sense of the new dataset.
Please state your overall opinion of the paper

borderline reject (5)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The main weaknesses lead to the borderline reject of this paper. However, I appreciate the new task and dataset for seizure recognition, which is valuable for medical community.
What is the ranking of this paper in your review stack?

1
Number of papers in your stack

4
Reviewer confidence

Confident but not absolutely certain

Review #2

Please describe the contribution of the paper

This work is a solution to the tedious and error-prone procedures needed to develop supervised DL or other ML (ultimately) predictors of seizure. Within a long video acquisition, the evolution of a seizure manifests in varying ways. The goal for the authors is to build upon methods from human position tracking and apply them to seizure surveillance. Methods are well described and results are shown from meaningful test corpora.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

This is a well-written and clear paper that presents important work. The use of spatiotemporal models trained on HAR datasets is particularly novel and has utility in many medical applications with video elements (e.g., postural sway, gait analysis, stroke, etc.). The authors’ methods are designed to be robust against the realities of video capture and thus are a nice exemplar of analytical work overcoming real-life challenges.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Not many. I found some of the methods discussion less clear than optimal; specifically the sampling strategy and distribution modeling for fe
Please rate the clarity and organization of this paper

Excellent
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Reproducibility is good/fair. Assuming access to the GitHub repository and sample datasets, one could replicate the results.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

Page 4 - The annotators worked frame by frame.. lots of work! Is there some time sampling period that might make annotation easier (e.g., 1 second, 5 seconds), within which the features of the seizure won't change drastically?

Page 4 - I am confused by the duration of the video samples: is this the total duration of all the segments/videos, or is it at a patient level?
Please state your overall opinion of the paper

strong accept (9)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Transfer learning applied to difficult and important surveillance problem. Seizure monitoring and quantitation is expensive, time-consuming and inexact today. These methods hold great promise for many diagnostic and therapeutic innovations.
What is the ranking of this paper in your review stack?

3
Number of papers in your stack

5
Reviewer confidence

Confident but not absolutely certain

Review #3

Please describe the contribution of the paper

The paper proposes a way to classify focal onset seizures (FOSs) from focal to bilateral tonic-clonic seizures (TCSs) using a combination of spatiotemporal CNN (STCNN) and RNN.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

Proposes an effective way for seizure classification
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

No thorough testing
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

moderate.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

The paper proposes a way to classify focal onset seizures (FOSs) from focal to bilateral tonic-clonic seizures (TCSs) using a combination of spatiotemporal CNN (STCNN) and RNN. It is an interesting paper.

Below are my comments. It is mentioned that the SFCNNs were trained on the ImageNet dataset. Is that really needed, as in the medical data there won't be as many labels as in ImageNet? Or is any fine-tuning done to the learned models? What would have happened if these SFCNNs had been learned using the data in hand, or if a new, simpler model had been created exactly for this?

In the comparison between the three aggregation methods (mean, LSTM and BLSTM), why was the BLSTM performing better than the other two? Is there any case where the mean method was right but the BLSTM gave incorrect results?

Minor comment: Enlarge Fig. 1(a), as many portions are unclear.
Please state your overall opinion of the paper

Probably accept (7)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

It's an interesting paper with some good results.
What is the ranking of this paper in your review stack?

2
Number of papers in your stack

5
Reviewer confidence

Confident but not absolutely certain

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The paper was overall well written and organized. The authors collect new seizure video datasets for seizure recognition. It combines the CNN and RNN to model the spatial-temporal features for snippet- and seizure-level classifications. The final accuracy reaches 98.9%, satisfying the clinical application to some extent.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

7

Author Feedback

We are encouraged that reviewers find our work “holds great promise for many diagnostic and therapeutic innovations” (R2), is “interesting” (R3), and is “valuable for the medical community” (R1). R2 finds our approach “well described” and “a nice example of analytical work overcoming real-life challenges”. Reviewers appreciate our efforts on dataset collection and description (R1), and on reproducibility (R1, R2).

Novelty “The only difference from [13] is the RNN added behind the STCNN” (R1). We strongly disagree. In [13], a single brain lobe was predicted for each patient from 2-second snippets of infrared (IR) videos, after cropping with an ROI defined on the first frame (Section 1, p. 2). We predict seizure type (and assume a patient can present multiple seizure types) from videos of arbitrary length comprising one full seizure. We do this by developing a sequential model that takes a sequence of snippets sampled from the video. Furthermore, we do not crop the videos, and evaluate on both IR and RGB video data.
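As an illustration of how a sequence of snippets can be drawn from a video of arbitrary length, the following is a minimal sketch, not the authors' implementation. It assumes the video is split into equal-length segments and one snippet start is drawn uniformly at random inside each segment; the function name `sample_snippet_starts` and its parameters are hypothetical.

```python
import random


def sample_snippet_starts(num_frames, num_segments, snippet_len, rng=None):
    """Return one snippet start index per segment of a video.

    Assumption (not taken from the paper): segments have equal length and
    the start is drawn uniformly at random within each segment.
    """
    if num_frames < snippet_len:
        raise ValueError("video is shorter than one snippet")
    if rng is None:
        rng = random.Random()
    # A snippet starting later than this would run past the end of the video.
    last_valid_start = num_frames - snippet_len
    starts = []
    for i in range(num_segments):
        seg_begin = i * num_frames // num_segments
        seg_end = (i + 1) * num_frames // num_segments  # exclusive
        lo = min(seg_begin, last_valid_start)
        hi = min(max(seg_end - 1, seg_begin), last_valid_start)
        starts.append(rng.randint(lo, hi))
    return starts


# Same number of snippets regardless of video length.
short_video = sample_snippet_starts(200, 8, 8, random.Random(0))
long_video = sample_snippet_starts(20000, 8, 8, random.Random(0))
```

Because the number of segments is fixed, a short and a long seizure video both yield a sequence of the same length, which is what lets a downstream sequential model consume videos of different durations.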

Validation “No thorough testing” (R3). We tested four different feature extractors on their ability to represent snippets of three different durations. We matched the numbers of parameters and layers between the different models and evaluated complexity vs. performance. We evaluated three aggregation methods, four different numbers of segments and five different sampling distributions. As we performed 5-fold cross-validation, we tested 300 models.
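The figure of 300 models is consistent with multiplying the three experimental factors named above by the number of folds (3 × 4 × 5 × 5 = 300). The sketch below only reproduces that arithmetic; the specific segment counts and sampling distributions are not listed here, so they are represented by placeholder indices.

```python
from itertools import product

aggregations = ["mean", "LSTM", "BLSTM"]  # three aggregation methods
segment_options = range(4)                # placeholders for the four numbers of segments
sampling_options = range(5)               # placeholders for the five sampling distributions
num_folds = 5                             # 5-fold cross-validation

configurations = list(product(aggregations, segment_options, sampling_options))
total_models = len(configurations) * num_folds
print(len(configurations), total_models)
```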

Pre-training R1, R3 highlighted that we did not compare feature extractors (CNNs) trained only on our dataset vs. trained on large-scale human action recognition (HAR) datasets. In this work, we did not train any CNNs (Sections 2.3 and 3.1). We purposely selected networks trained on over 65 million videos or 14 million images (trained by the Facebook Artificial Intelligence Research (FAIR) group) as they should yield better performance than training with under 500 seizure videos. This is supported by evidence in the literature (see [10]). Moreover, by training the RNNs only on feature vectors, which we are allowed to (and will) share, we make our work fully reproducible.

BLSTM versus LSTM R3 was interested in why the BLSTM gave the best performance, and R1 correctly highlighted the LSTM gives better performance under some conditions. We selected two of the most common aggregation techniques: mean and LSTM. Additionally, we tested BLSTM as it typically marginally improves results due to higher model complexity and incorporating information from the past and future at each timestep. In the manuscript, we mention the specific parameters that gave the highest performance for our dataset (Section 3.2) and in the discussion (p. 8) state “Using LSTM or BLSTM units to aggregate features from snippets improved accuracy compared to averaging”, emphasizing the importance of aggregating features using a sequential model as opposed to averaging.
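A toy example may help show why a sequential aggregator can capture information that averaging discards: the mean of a set of features is the same in any order, whereas a recurrent update depends on the order. The sketch below uses scalar "features" and a plain tanh recurrence with hand-picked weights, which are assumptions for illustration; it is not an LSTM or BLSTM and is not the model used in the paper.

```python
import math


def mean_aggregate(features):
    """Order-invariant aggregation: average of the snippet features."""
    return sum(features) / len(features)


def recurrent_aggregate(features, w_in=0.5, w_rec=0.9):
    """Order-sensitive aggregation with a simple tanh recurrence (toy weights)."""
    h = 0.0
    for x in features:
        h = math.tanh(w_in * x + w_rec * h)
    return h


sequence = [0.0, 0.0, 1.0, 1.0]    # activity appears late
reversed_sequence = sequence[::-1]  # activity appears early

same_mean = mean_aggregate(sequence) == mean_aggregate(reversed_sequence)
gap = abs(recurrent_aggregate(sequence) - recurrent_aggregate(reversed_sequence))
```

Here `same_mean` is true while `gap` is clearly non-zero, so only the recurrent summary distinguishes the two orderings.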

Feature extractor choice “R(2+1)D-34(32) achieves much better performance […] Why not use R(2+1)D-34(32) for a better performance?” (R1). Tran et al. 2018 (where the two CNNs were presented) showed that R(2+1)D-34(8) achieves similar performance with a much shorter training and inference time (5x). We also achieved similar results for both models (a difference of 3 percentage points) and therefore chose R(2+1)D-34(8), as this is the model that would be used in practice due to the reduced computational burden.

Data labeling “The annotators worked frame by frame.. lots of work!” (R2). We stated in Section 2.2 that we annotated “for each seizure the following times: clinical seizure onset t_0, onset of the clonic phase t_G (TCSs only) and clinical seizure offset t_1”. The ground-truth label for each frame t_k was computed from these times.
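To make the point concrete, per-frame labels can be derived from the three annotated times without inspecting frames individually. The sketch below is one possible rule; the label names and the boundary conventions (inclusive or exclusive) are assumptions, and `frame_label` is a hypothetical helper rather than the authors' code.

```python
from typing import Optional


def frame_label(t_k: float, t_0: float, t_1: float,
                t_g: Optional[float] = None) -> str:
    """Label the frame at time t_k from the annotated seizure times.

    t_0: clinical seizure onset, t_1: clinical seizure offset,
    t_g: onset of the clonic phase (None for seizures without one).
    """
    if t_k < t_0 or t_k > t_1:
        return "outside-seizure"
    if t_g is not None and t_k >= t_g:
        return "clonic"
    return "non-clonic"


# Labels for frames sampled once per second from a hypothetical TCS.
labels = [frame_label(t, t_0=2.0, t_1=6.0, t_g=4.0) for t in range(8)]
```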

Figures “It is better, or necessary, to show the images and videos” (R1). To protect patients’ privacy, we are not allowed to share the raw data except with explicit patient consent (as in Fig. 1).

Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The paper has strengths, including a sufficient methodological difference from previous work, as the authors explained in the rebuttal.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

6

Meta-review #2

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The authors have addressed most of the concerns raised in the review process. If the paper is accepted, they are encouraged to address the remaining concerns in the final paper.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

14

Meta-review #3

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

First, a summary of the reviews: 1) an RNN is added to process the feature maps of the STCNN to classify two types of seizure (FOS and TCS); 2) the STCNN was not trained, but a follow-up fully connected layer was trained to get the feature maps; 3) the mean, BLSTM and LSTM aggregation was not explained clearly (with some confusion about the results as well); 4) Reviewer #2 found some descriptions unclear. Based on the details of the reviews, I would give Reviewer 1 more weight. The rebuttal mostly clarified the contribution, but some validation, such as using R(2+1)D-34(8) without comparing with R(2+1)D-34(32), is missing.

My comments about the paper: 1) the FOS and TCS video clips were well preprocessed manually, so this paper is about how to classify such seizure videos into two types; “arbitrarily long video” actually means that the number of frames differs between such seizure videos; 2) it is in this context that the technique is very useful to distinguish two types of seizures, but it is not designed to automatically pick up seizure clips, which is fine; 3) I am a little confused about patient-level, snippet-level and seizure-level classification (isn’t this for snippet-level also?). I think patient-snippet-frame is a set of terms, so I am confused about Sections 2.3 and 2.4. I spent a lot of time trying to figure out what exactly the output of Section 2.3 is, but failed; if it is FOS and TCS classification also, please say so clearly; 4) the authors mentioned “two views” of the videos as inputs but never explained them: is it a small screen, or stereo?; 5) splitting the samples in cross-validation not based on patient level is not acceptable to me, because the same kind/style/people in videos appear in training and testing.
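The concern in point 5 is about grouped splitting: all seizures from one patient should land in the same fold so that no person appears in both training and testing. The following is a minimal sketch of how such a split could be built with a greedy assignment; it is not the procedure used in the paper, and `patient_level_folds` is a hypothetical helper.

```python
from collections import defaultdict


def patient_level_folds(patient_ids, num_folds):
    """Assign sample indices to folds so that no patient spans two folds."""
    by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)
    folds = [[] for _ in range(num_folds)]
    # Greedy balancing: place patients with the most samples first,
    # each into the fold that currently holds the fewest samples.
    ordered = sorted(by_patient, key=lambda p: (-len(by_patient[p]), str(p)))
    for pid in ordered:
        smallest = min(range(num_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_patient[pid])
    return folds


# One entry per seizure video, holding the (made-up) patient identifier.
patient_ids = ["a", "a", "b", "c", "c", "c", "d", "e"]
folds = patient_level_folds(patient_ids, num_folds=3)
```

Each fold can then serve once as the test set while the remaining folds are used for training.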
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

5


Transfer Learning of Deep Spatiotemporal Networks to Model Arbitrarily Long Videos of Seizures