Authors

Shawn S. Ahn, Kevinminh Ta, Stephanie Thorn, Jonathan Langdon, Albert J. Sinusas, James S. Duncan

Abstract

Echocardiography is one of the main imaging modalities used to assess the cardiovascular health of patients. Among the many analyses performed on echocardiography, segmentation of left ventricle is crucial to quantify the clinical measurements like ejection fraction. However, segmentation of left ventricle in 3D echocardiography remains a challenging and tedious task. In this paper, we propose a multi-frame attention network to improve the performance of segmentation of left ventricle in 3D echocardiography. The multi-frame attention mechanism allows highly correlated spatiotemporal features in a sequence of images that come after a target image to be used to augment the performance of segmentation. Experimental results shown on 51 in vivo porcine 3D+time echocardiography images show that utilizing correlated spatiotemporal features significantly improves the performance of left ventricle segmentation when compared to other standard deep learning-based medical image segmentation models.
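
As an illustration of why the segmentation matters for clinical measurements, the ejection fraction can be derived from the segmented left ventricle volumes at end-diastole (EDV) and end-systole (ESV) as EF = (EDV - ESV) / EDV x 100. The following is a minimal sketch, not taken from the paper: the function names, the voxel size, and the example volumes are all hypothetical.

```python
def lv_volume_ml(mask, voxel_volume_mm3):
    """Volume of a binary LV mask in millilitres (1 mL = 1000 mm^3).

    `mask` is a flat sequence of 0/1 voxel labels; `voxel_volume_mm3`
    is the physical volume of one voxel (an assumed, scanner-dependent value).
    """
    voxels = sum(1 for v in mask if v)
    return voxels * voxel_volume_mm3 / 1000.0


def ejection_fraction(edv_ml, esv_ml):
    """Ejection fraction in percent: (EDV - ESV) / EDV * 100."""
    if edv_ml <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return (edv_ml - esv_ml) / edv_ml * 100.0


# Hypothetical volumes, chosen only for illustration.
ef = ejection_fraction(edv_ml=80.0, esv_ml=40.0)
print(f"EF = {ef:.1f}%")
```

Any error in the end-diastolic or end-systolic segmentation propagates directly into the two volumes and therefore into the ejection fraction.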

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87193-2_33

SharedIt: https://rdcu.be/cyhMb

Link to the code repository

https://github.com/sa867/multiframe_attention_3d_echo

Link to the dataset(s)

N/A

Reviews

Review #1

Please describe the contribution of the paper

A multi-frame attention network is used for segmenting the left ventricle in echocardiography images. Authors show that the use of spatiotemporal information improves segmentation results.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The purpose of this work is well justified. The application of methods introduced for video segmentation in spatiotemporal medical image segmentation is interesting.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

There is some discrepancy and lack of clarity in the article around the methodology. Please see detailed comments.

Evaluation is not very thorough. The proposed method is not compared with the SOTA for left ventricle segmentation on echocardiography images. For example, why not compare with the results in ref 24 (the equivalent 2D version of the results)? Results are compared with segmentations from only one expert, and the dataset is fairly small and porcine.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Model architecture and hyperparameters are provided. However some sections in the methodology need clarification. See detailed comments. Cannot reproduce if the used dataset is not provided.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

Please provide quantitative results in the abstract.

In 2.1 page 3, by target image frame do you mean 3D data? It seems that I_t is one 2D image at time t and the Methods section is described on ‘image’ (2D I assume?) frame. However the authors mentioned 3D+time dataset.

It is not clear what the role of the target vs reference frame is. Why are multiple references needed? It would be better if the authors first explain the general approach rather than heading straight into the details of the network. It appears from Fig 4 that increasing the number of reference frames improves segmentation (first row). I wonder if using more than 4 reference frames would also improve segmentation in the second and third rows. The authors do state this as future work but I wonder why it is not done. Are there hardware/compute limitations? The multi-frame (5 frame) results are still poor (e.g. DSC 0.641). It would help to know how the results are compared to SOTA numbers.
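
For readers unfamiliar with the DSC values quoted above, the Dice similarity coefficient measures the overlap between a predicted mask and a reference mask as 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch on flattened binary masks; the function name and the toy masks are illustrative assumptions, not the authors' evaluation code.

```python
def dice_coefficient(pred, truth):
    """Dice similarity coefficient (DSC) between two binary masks.

    Both masks are flat sequences of 0/1 voxel labels of equal length.
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks, chosen only for illustration.
pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 0, 1, 1]
dsc = dice_coefficient(pred_mask, true_mask)
print(f"DSC = {dsc:.3f}")
```

A DSC of 1 means perfect overlap and 0 means no overlap, so a value such as 0.641 indicates that a substantial part of the predicted and reference masks disagree.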
Please state your overall opinion of the paper

borderline reject (5)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

While the transfer of temporal video segmentation to this medical domain is interesting, there are many sections that are not well defined or are unclear. Authors need to compare results with SOTA for echocardiography application.
What is the ranking of this paper in your review stack?

3
Number of papers in your stack

4
Reviewer confidence

Confident but not absolutely certain

Review #2

Please describe the contribution of the paper

Authors present a multi-frame attention architecture for segmentation of left ventricle in 3D echocardiography. The multi-frame attention network uses the spatiotemporal features of echocardiography images.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

This paper presents a novel architecture that extracts attention information from spatiotemporal features. Performance results show improvement over existing architectures.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

In my opinion the parallel architecture for attention extraction is over-dimensioned. This paper addresses an important issue, namely what the best way is to use redundant and non-redundant information in video sequences for image segmentation. The authors propose a redundant architecture to solve the problem, and the obvious question is whether a traditional optical flow method would not be equally good. There is no discussion on this topic.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

No precise reproduction is possible. There is missing information regarding hyperparameters, databases, etc.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

Although the idea and the results are very good, there is a need to discuss the fundamental questions on the use of temporal differences to help segmentation.
Please state your overall opinion of the paper

borderline accept (6)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

There is no fundamental analysis on the role of spatiotemporal analysis for image segmentation.
What is the ranking of this paper in your review stack?

3
Number of papers in your stack

5
Reviewer confidence

Very confident

Review #3

Please describe the contribution of the paper

This paper proposes a multi-frame attention network to segment the left ventricle in 3D echocardiography images. They extract spatio-temporal information from the multiple frames that belong to the studies. The database used consists of 51 porcine 3D+time echocardiography images.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

Their analysis is on 3D+time echocardiography images, to obtain the spatio-temporal features. There are too many networks in parallel to obtain the features and in . The problem could be solved with less need of computational power.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

It is not clear how the features are processed by the attention network. The theory developed in section 2.1 for the multi-frame attention network is not clear. From the figures shown, it was not clear whether the processing is executed on 2D+time data that will later compose the 3D segmentation, or whether the 3D refers to 2D+time.
Please rate the clarity and organization of this paper

Poor
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Hyper-parameters of the networks are not provided. The reproducibility of the paper is difficult because of the data set used; the images are not publicly available.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

The reproducibility of the paper is difficult because of the data set used; the images are not publicly available. My remaining question is: why do you think the proposed architecture works when taking into account spatio-temporal information? The answer could be elaborated. The computational complexity looks excessive because of the parallel networks. There is no discussion of how the same idea can be or has been solved with simpler methods.
Please state your overall opinion of the paper

borderline reject (5)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Reproducibility is not an easy step. It builds extensive deep learning architectures without looking for alternative algorithms.
What is the ranking of this paper in your review stack?

5
Number of papers in your stack

5
Reviewer confidence

Confident but not absolutely certain

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The authors propose a spatiotemporal attention model to segment 3D echocardiographic data, and show that the aggregation of multiple frames can improve segmentation results. The reviewers acknowledge the importance of the problem addressed, and that results are solid. However, it is the general opinion that the architecture design is not well justified and potentially over-dimensioned, introducing perhaps unnecessary complexity compared to other simpler methods, which the authors should include in the discussion. Reviewers also ask for more clarity in the methodology section.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

6

Author Feedback

We thank the three reviewers (R1-3) for taking the time to review our submission and provide us with constructive comments. We appreciate the positive feedback on the novelty of our work in the domain of echocardiography. The main criticisms from the reviewers were the lack of justification of our multi-frame attention design and the relative lack of clarity in the methods section. We would like to address those issues by: 1) elaborating on our rationale for using multi-frame attention (R1-3), 2) clarifying the seemingly complicated network design choice (R2-3), and 3) elaborating on our methodology section (R1, R3).

1) Our work proposes to take advantage of the high temporal resolution of echo, which allows many subsequent frames to inform where the left ventricle is located in the frame of interest by exploiting spatiotemporal consistencies. We utilize the spatiotemporal consistencies by computing attention maps, specifically at the feature level, that represent highly correlated regions of feature maps between the frame of interest and the frames that follow it as guides. We will elaborate on this rationale at the beginning of the method section before going into the details of the network. Furthermore, as part of our rationale for our architectural design choice, we would like to add that previous work using optical flow-based segmentation methods relies on the major assumption of brightness consistency. However, echocardiography has inherently low SNR and granular patterns, which make it difficult to obtain good motion estimates using optical flow, which in turn makes optical flow difficult to use in segmentation. Others have looked at non-DL approaches using similar spatiotemporal consistencies for LV echo segmentation (Huang et al., cited below), but they require manual tracings to initialize the sparse representations and appearance dictionaries, which makes them computationally expensive and slow. For comparison, inference takes ~20 sec for our model, ~10 sec for UNet3D, and ~60 sec in Huang et al. We will include this rationale and reference in our discussion section.

2) A major point of concern is the over-complexity of our network design, as shown in Figure 2 of our paper (R2, R3). We acknowledge the confusion the figure may have caused the reviewers. We would like to clarify that the encoder parts as well as the attention modules all share weights, rather than being a combination of five different parallel networks. Therefore, the number of input volumes is flexible and can incorporate multiple time frames as reference/guide frames, as shown in the results. We will modify our figure to make note of the shared weights and simplify our schematic.

3) Another concern was the lack of clarity in our methodology section (R1, R3). Our work is based on 3D+time echo sequences. The inputs to the network are 3D volumes within the 3D+time echo sequence. I_t is a 3D volume with dimensions width x height x depth. We will change “image” to “volume” in our wording to remove any source of confusion, and modify Figure 2 to show the 3D volume input instead of a cross section of the volume.

For minor comments, R1 points out the lack of evaluation using ref. 24. The work in ref. 24 is on 2D echo segmentation, while our work aims to segment the left ventricle in a 3D+time echo sequence. In addition, there is no publicly available code for ref. 24, which makes it difficult to replicate their SOTA results on our 3D+time dataset. However, we plan to compare our work on 2D+time echo as well and include the evaluation results from ref. 24 in the future. Lastly, R2 states that our paper does not discuss hyperparameters, which contradicts R1’s comments. We would like to point out that they are discussed on page 5, section 3.2.

Huang, Xiaojie, et al. “Contour tracking in echocardiographic sequences via sparse representation and dictionary learning.” Medical image analysis 18.2 (2014): 25
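
The shared-weight, feature-level attention described in the rebuttal can be sketched roughly as follows. This is a toy illustration under strong assumptions and not the authors' network: the "encoder" is a hypothetical per-voxel linear map standing in for a 3D CNN, attention is plain dot-product similarity with a softmax, and all names and values are invented for the example. The point it illustrates is that one set of encoder weights is reused for the target and every reference frame, so the number of reference frames can vary without adding parameters.

```python
import math


def encode(volume, weights):
    """Toy shared encoder: maps each voxel intensity to a feature vector.

    Hypothetical stand-in for a CNN encoder; `weights` is reused for all frames.
    """
    return [[w * v for w in weights] for v in volume]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(target_feats, reference_feats):
    """For each target location, average the reference features weighted by
    their dot-product similarity to the target feature."""
    attended = []
    for q in target_feats:
        scores = [sum(a * b for a, b in zip(q, k)) for k in reference_feats]
        attn = softmax(scores)
        attended.append([
            sum(w * k[d] for w, k in zip(attn, reference_feats))
            for d in range(len(q))
        ])
    return attended


def multi_frame_features(target, references, weights):
    """Concatenate target features with attended features from each reference."""
    target_feats = encode(target, weights)
    fused = [list(f) for f in target_feats]
    for ref in references:
        ref_feats = encode(ref, weights)  # same weights for every frame
        fused = [f + a for f, a in zip(fused, attend(target_feats, ref_feats))]
    return fused


# Flattened toy "volumes" (4 voxels each); values are arbitrary.
target = [0.1, 0.9, 0.8, 0.2]
references = [[0.2, 0.8, 0.9, 0.1], [0.1, 0.7, 0.9, 0.3]]
shared_weights = [1.0, -0.5]
fused = multi_frame_features(target, references, shared_weights)
print(len(fused), len(fused[0]))
```

In this sketch, adding a third reference frame only lengthens each fused feature vector; the encoder weights stay the same, which mirrors the authors' statement that the branches are not five separate networks.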

Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The authors present a method for 3D ventricle segmentation integrating information from all frames in a 3D+t sequence. All reviewers acknowledged value in the paper but were concerned about the method, which seemed exceedingly (and unnecessarily) complex. In their rebuttal, the authors have clarified that the model is indeed not that complex, because the weights of the 5 input branches are shared. I believe this would have sufficed to turn the “possibly reject” reviews into “possibly accept”, so I recommend accept.

As a side note to the authors, in their rebuttal they state that “echocardiography has inherently low SNR and granular patterns which make it difficult to obtain good motion estimates using optical flow”. This is not true: the “granular patterns” are in reality speckles that actually help in tracking motion because they are consistent across frames, and they are behind speckle tracking techniques, which have proven very useful in cardiac motion estimation from echo.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

8

Meta-review #2

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This work tackles the problem of segmenting 3D echocardiographic data, which is very important. However, the fundamental analysis and explanation of the method are not clearly discussed, a concern raised by Reviewers #1 and #3. Moreover, the comparison results are missing, as Reviewer #1 pointed out. The methodology should be further clarified.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Reject
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

13

Meta-review #3

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The authors present a multi-frame attention architecture for segmentation of left ventricle in 3D echocardiography, which showed good performance. The authors have properly addressed the concerns on the architecture design and complexity.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

7

Multi-frame Attention Network for Left Ventricle Segmentation in 3D Echocardiography