
# Authors

Hadrien Reynaud, Athanasios Vlontzos, Benjamin Hou, Arian Beqiri, Paul Leeson, Bernhard Kainz

# Abstract

Cardiac ultrasound imaging is used to diagnose various heart diseases. Common analysis pipelines involve manual processing of the video frames by expert clinicians. This suffers from intra- and inter-observer variability. We propose a novel approach to ultrasound video analysis using a transformer architecture based on a Residual Auto-Encoder Network and a BERT model adapted for token classification. This enables videos of any length to be processed. We apply our model to the task of End-Systolic (ES) and End-Diastolic (ED) frame detection and the automated computation of the left ventricular ejection fraction. We achieve an average frame distance of 3.36 frames for the ES and 7.17 frames for the ED on videos of arbitrary length. Our end-to-end learnable approach can estimate the ejection fraction with a MAE of 5.95 and R^2 of 0.52 in 0.15s per video, showing that segmentation is not the only way to predict ejection fraction. Code and models are available at https://github.com/HReynaud/UVT.
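The three figures quoted in the abstract (average frame distance, MAE and R^2) can be sketched as follows. This is a minimal illustration, assuming the frame distance is the mean absolute difference between predicted and labelled frame indices; the function names and the sample values are hypothetical and are not taken from the released code.

```python
import numpy as np

def mean_frame_distance(pred_idx, true_idx):
    """Mean absolute distance, in frames, between predicted and labelled indices."""
    pred = np.asarray(pred_idx, dtype=float)
    true = np.asarray(true_idx, dtype=float)
    return float(np.mean(np.abs(pred - true)))

def mae(pred, true):
    """Mean absolute error between predicted and reference ejection fractions."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    return float(np.mean(np.abs(pred - true)))

def r2_score(pred, true):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    ss_res = np.sum((true - pred) ** 2)
    ss_tot = np.sum((true - true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical values, for illustration only.
print(mean_frame_distance([10, 42], [12, 40]))
print(mae([50.0, 60.0], [55.0, 58.0]))
print(r2_score([50.0, 50.0, 50.0], [40.0, 50.0, 60.0]))
```

Under this definition a predictor that always outputs the mean reference value scores R^2 = 0, which gives a baseline for reading the reported 0.52.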

SharedIt: https://rdcu.be/cyhV5

# Reviews

### Review #1

• Please describe the contribution of the paper

The paper shows a methodological improvement in phase detection (ED/ES) in cardiac echo image sequences and in ejection fraction estimation by using a transformer network.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

It was a true pleasure for me to read this paper (!) and I congratulate the authors on this nice piece of work. The manuscript is extremely well written and has a high quality w.r.t. the evaluation. The proposed method follows recent trends in time-series processing (usage of transformer networks). The authors show the benefit of their novel approach in comparison to other existing work (Table 1).

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

No major weakness comes to mind.

• Please rate the clarity and organization of this paper

Excellent

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

I could not find a statement in the paper that the source code will be released (which is encouraged). However, an open data set was used and the clarity of the description is high, therefore, I would rate the reproducibility as high.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

Just minor/out of curiosity: From a reader’s perspective, I would like to hear more about the performance of the ED/ES frame detection on the portion of the sequences that was not labeled. For patients with a regular heartbeat (non-arrhythmia), is it possible to show that there is a regular pattern in the phase occurrence (e.g. the distance between the target frames)?

strong accept (9)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

  • methodological contribution
  • very good evaluation
  • clarity of the paper

• What is the ranking of this paper in your review stack?

1

• Number of papers in your stack

5

• Reviewer confidence

Very confident

### Review #2

• Please describe the contribution of the paper

The paper presents a method for the automatic interpretation of echocardiography videos, both the ED/ES frame selection and LVEF direct estimation. This method is based on a novel deep learning architecture incorporating both encoding for dimensionality reduction (ResNetAE) and a recurrent network used in NLP (BERT).

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The main strengths of the paper are the novel neural network construction, extensive apparent validation, and ablation study.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The main weaknesses of the paper are its neglect of relevant prior work and an unfair comparison to cited prior work.

-“To the best of our knowledge…paradigm of discrete frame processing with limited temporal support”: see Qin MICCAI 18 in MR (https://doi.org/10.1007/978-3-030-00934-2_53), Li MICCAI 19 in echo (https://arxiv.org/abs/1907.11292) and Wei MICCAI 2020 (https://link.springer.com/chapter/10.1007/978-3-030-59713-9_60), also in echo. Even reference [15] itself uses r2plus1d_18 for direct estimation, which does convolve across time.

-“We have discussed probably the first transformer architecture that is able to analyse US videos of arbitrary length”: other than [14/15], which you reference. It is unclear to this reader whether the comparisons to this prior work are fair.

-It’s unclear how your Video sampling process works in the context of test-time and full video. Maybe supplemental table 1 answers the issue.

• Please rate the clarity and organization of this paper

Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

This work may be difficult to reproduce. The hyperparameters for the ResNetAE are not specified. The Video sampling process is unclear for test time.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

Small issues:

-LVEF is the ratio between End-Systolic (ES) and End-Diastolic (ED) blood volumes in the left ventricle – technically the complement of that ratio, no?

-[14] and [15] are the same reference.

-In Table 1 caption, use two backticks (``) and two apostrophes ('') for open and close quotes.

-For Table 1 LVEF prediction you report [14/15]’s MAE as 7.35. I guess this comes from Extended Data Table 2, which is their r2plus1d_18 result with the whole video inferred. But your methods R and M seem to be clips centered around the labelled ED/ES frames, at least as explained in your video sampling section. The more appropriate comparison would be their “32 frame sample,” which is 4.22 and doesn’t require the labelled ED/ES frames, at least as I understand it.

-Aren’t the clips in EchoNet usually ED-ES?

probably reject (4)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The major factors are the lack of reference to highly relevant prior work, and a seemingly unfair comparison to cited work.

• What is the ranking of this paper in your review stack?

3

• Number of papers in your stack

5

• Reviewer confidence

Confident but not absolutely certain

### Review #3

• Please describe the contribution of the paper

The authors propose a combined CNN-transformer architecture for ED/ES frame detection and LVEF estimation in apical four-chamber echo videos. The per-frame encodings of the echo cine by a ResNet are fed into a BERT module to predict the targeted measurements. The public dataset of Echonet-Dynamic is used for the experiments.

• Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

  • The paper has a good organization and the method is easy to comprehend.
  • A public dataset is used for the experiments and the method has high reproducibility.

• Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

  • Technical novelty: Application of video transformers for video classification/regression may not be quite novel and has been explored in the community [1,2,3,4].
  • Technical soundness: There is a concern about the way LVEF is calculated. In the proposed architecture, the LVEF is obtained by averaging the LVEF predictions across the frames of the video (see Ejection Fraction Regressor block in Fig 1 and see the first 4 lines of page 5). This seems not to be clinically and technically correct. The LVEF correlates to the variations of the LV across the video. Calculating the video LVEF by averaging the LVEF predictions over the frames may not be justified.
  • Results: The proposed method noticeably underperforms the current literature (see table 1- LVEF estimation’s R2 score on the left side). The method has about 27% less R2 score (0.81 vs 0.64) compared to an existing method. The above-mentioned technical problem might be a reason for the low R2 score.
  • Experiments:
    – The results reported in table 1 for LVEF (right side) may not be an accurate comparison. The method R14 (reference [14] in the paper) uses the LV segmentation results throughout the video to find the beat-to-beat cardiac cycles. R14 runs the video LVEF calculator across multiple detected cycles and reports the average LVEF among automatically detected cycles. This way the method R14 is already compatible with variable length videos, as the cycles are automatically detected based on LV segmentation. In this setup, it seems the reported results in the paper (table 1 - LVEF prediction - right side) for full video processing are different from the full video results reported in R14.
    – The method R14, uses 32 frames in each segment of the beat-to-beat LVEF estimation network. Another valid comparison is running method R14 (Resnet 2D+1) with 128 input frames, the same input frame length used in this paper.
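As a quick check on the arithmetic behind the 27% figure quoted above: the review does not state which score serves as the baseline, so both readings are worked out here from the two R2 values given (0.81 and 0.64).

```latex
\frac{0.81 - 0.64}{0.64} = \frac{0.17}{0.64} \approx 0.27
\qquad\text{versus}\qquad
\frac{0.81 - 0.64}{0.81} = \frac{0.17}{0.81} \approx 0.21
```

The quoted 27% matches the first reading, where the gap is expressed relative to the lower score of 0.64; relative to the score of 0.81 the gap is about 21%.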

References: [1] Video Action Transformer Network, https://openaccess.thecvf.com/content_CVPR_2019/papers/Girdhar_Video_Action_Transformer_Network_CVPR_2019_paper.pdf [2] Video Transformer Network, https://arxiv.org/pdf/2102.00719.pdf [3] Late Temporal Modeling in 3D CNNArchitectures with BERT for Action Recognition, https://arxiv.org/pdf/2008.01232v3.pdf [4] EchoBERT: A Transformer-Based Approach for Behavior Detection in Echograms, https://ieeexplore.ieee.org/abstract/document/9281296

• Please rate the clarity and organization of this paper

Good

• Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The provided details seem to be adequate and the method has good reproducibility.

• Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

The adaptation of transformers for echo video processing seems to be an interesting path to explore; however, the paper in its current form seems to need technical improvement to properly address the tackled problem.

• The method for trimming the video (into 128 frames) could be improved. Mirroring or random sampling may be a naive approach for video trimming in echo and may not be meaningful when noticing the underlying cardiac cycle patterns in echo videos.
• The paper claims the method is compatible with different video lengths; however, in this method, the input needs to be capped to 128 frames.
• The method zero-pads the videos from 112×112 to 128×128. It is not clear why the frames are not up-sampled to the higher resolution. Zero padding may not use the full capacity of the network input.
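The padding step discussed in the last point can be sketched as below. This is only an illustration: it assumes grayscale videos stored as a (T, H, W) array and a symmetric border of zeros, neither of which is stated in the paper or the reviews, and `zero_pad_frames` is a hypothetical helper name.

```python
import numpy as np

def zero_pad_frames(video, target=128):
    """Zero-pad a (T, H, W) video symmetrically to (T, target, target)."""
    _, h, w = video.shape
    pad_h, pad_w = target - h, target - w
    if pad_h < 0 or pad_w < 0:
        raise ValueError("frames are larger than the target size")
    top, left = pad_h // 2, pad_w // 2
    # Pad only the two spatial axes; the temporal axis is left untouched.
    return np.pad(
        video,
        ((0, 0), (top, pad_h - top), (left, pad_w - left)),
        mode="constant",
        constant_values=0,
    )

video = np.ones((5, 112, 112), dtype=np.float32)  # hypothetical 5-frame clip
padded = zero_pad_frames(video)
print(padded.shape)
```

With these sizes, the original image occupies 112²/128² ≈ 77% of the padded input area, which is the unused capacity the point above refers to.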

reject (3)

• Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The major factors for my recommendation, as detailed in the “weaknesses” section, are the poor technical soundness, low performance, and inaccurate comparison with the prior art.

• What is the ranking of this paper in your review stack?

5

• Number of papers in your stack

5

• Reviewer confidence

Confident but not absolutely certain

# Primary Meta-Review

• Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The authors propose a combined CNN-transformer architecture for ED/ES frame detection and LVEF estimation from four-chamber echo videos. The public dataset of Echonet-Dynamic was used. Strengths: technical novelty of using a video transformer to address a classical problem of EF estimation, clarity of presentation, and well-conducted evaluation and ablation studies. Weaknesses: incomplete literature overview (reviewer 3 and 4), and the technical grounding of the LVEF prediction method used in this work (reviewer 4). The authors are expected to present a more comprehensive literature review of related work and to address the issues of LVEF estimation in response to the question raised by reviewer 4 and the issue of unfair comparison raised by reviewer 3. The authors are encouraged to release the code for the research community (reviewer 1).

• What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

7

# Author Feedback

We thank the reviewers and the AC for their feedback!

Results comparison: We compare our LVEF prediction results with Ouyang et al. [2] as we use their dataset. They present multiple modularized approaches with different preprocessing and LVEF accuracy. Our approach takes videos “as is” without preprocessing and predicts both LVEF and ES/ED frame indices in a single forward pass. Thus, we compare our results with the closest “All frames” evaluation from [2]. Indeed, we do not state clearly enough that their “beat-by-beat” method can be applied successively on all the heartbeats of a single US video. However, our method does not rely on segmentation performance and presents a novel approach to address LVEF prediction, which requires much less labelling work and shows excellent performance on the detection of the ES/ED frames. We also avoid all manual processing steps (e.g. resampling) which induce biases in the pipeline and prevent the generalization of the method to other domains. Our method is also much faster, with an average processing time of 0.15s per entire video compared to 1.6s per heartbeat for EchoNet (0.05s × 32 frames). The best results from [2] will be added to our results (Table 1) in the LVEF prediction section for “Full video”, and we will clarify the abstract to emphasize the specificity of our approach.

LVEF computation: Our method computes LVEF by averaging. As the outputs from the transformer are sent into fully connected layers, the outputs of the “EF Regressor, Dense 2” layer are not bound to their frame indices (because of the averaging step). Each neuron predicts an LVEF and the averaging takes all of them into account.
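The averaging step described above can be illustrated with a small sketch. It is not the authors' implementation: the layer sizes, the ReLU and sigmoid activations, and the random weights are all assumptions chosen only to show how per-token outputs of two dense layers collapse into one scalar for a sequence of any length.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM, HIDDEN_DIM = 16, 8  # hypothetical sizes, not taken from the paper

# Random stand-ins for the trained weights of two dense layers.
W1 = rng.normal(scale=0.1, size=(EMBED_DIM, HIDDEN_DIM))
b1 = np.zeros(HIDDEN_DIM)
W2 = rng.normal(scale=0.1, size=(HIDDEN_DIM, 1))
b2 = np.zeros(1)

def predict_lvef(tokens):
    """Map (T, EMBED_DIM) transformer outputs to one scalar by mean pooling."""
    hidden = np.maximum(tokens @ W1 + b1, 0.0)              # dense 1 + ReLU (assumed)
    per_token = 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))   # dense 2 + sigmoid (assumed), shape (T, 1)
    return float(per_token.mean())                          # average over all T tokens

short_video = rng.normal(size=(64, EMBED_DIM))   # stand-in for a 64-frame video
long_video = rng.normal(size=(200, EMBED_DIM))   # stand-in for a 200-frame video
print(predict_lvef(short_video), predict_lvef(long_video))
```

In this sketch the mean is unchanged if the tokens entering the dense layers are reordered, and it accepts any number of tokens. It says nothing about whether the transformer outputs themselves depend on frame order, nor about whether averaging is clinically justified, which is the point the reviewers raise.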

Literature Review: In the current version of our related works section we focus on LVEF and ES/ED applications. Thus, we do not elaborate on areas related in other ways to our work. We will integrate the references brought up by the reviewers and discuss their relation to our work. Of course, we will correct the duplication between references [14/15].

Code release: Our code and trained models are ready to be released as a GitHub repository and the link will be added to the final version of the paper. We did not provide the link in the submitted version to respect the double-blind reviewing process.

Test time inference: The network is trained on 128-frame videos to enable batch training. After training, the network is ready to handle videos of any length, shorter or longer. At test time, the videos are sent into the network with no modification other than scaling the pixel intensity. The number and the order of the frames are not modified. We show examples of arbitrary length videos in Figure 2 and in the supplementary figures.

R1: Regarding the assessment of the ES/ED detection performance on unlabelled portions, we can see from Figure 2 and the supplement that the distance between ES/ED frames is reasonably constant, indicating that the network is detecting sensible frames.
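One way to quantify "reasonably constant" is to look at the spread of the gaps between consecutive detections of the same phase. The sketch below is an assumption about how such a check could be done, not something described in the paper; `interval_stats` and the sample indices are hypothetical.

```python
import numpy as np

def interval_stats(frame_indices):
    """Mean, standard deviation and coefficient of variation of the gaps
    between consecutive detections of one phase (e.g. all ES frames)."""
    idx = np.sort(np.asarray(frame_indices, dtype=float))
    if idx.size < 3:
        raise ValueError("need at least three detections to assess regularity")
    gaps = np.diff(idx)
    mean, std = gaps.mean(), gaps.std()
    return {"mean": float(mean), "std": float(std), "cv": float(std / mean)}

# Hypothetical ES detections in one video, for illustration only.
print(interval_stats([10, 38, 72, 100]))
```

A coefficient of variation close to zero would indicate evenly spaced detections, as expected for a regular heartbeat; what threshold counts as regular is left open here.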

R3: Our description of the LVEF may be misleading. To make our statement more accurate, we will change it to “The LVEF is the ratio of the stroke volume and the end-diastolic volume of the left ventricle”.
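Written out, with the stroke volume SV taken as the difference between the end-diastolic volume (EDV) and the end-systolic volume (ESV), the corrected definition reads as follows. The symbols are introduced here for illustration and do not appear in the reviews.

```latex
\mathrm{LVEF} = \frac{\mathrm{SV}}{\mathrm{EDV}}
             = \frac{\mathrm{EDV} - \mathrm{ESV}}{\mathrm{EDV}}
             = 1 - \frac{\mathrm{ESV}}{\mathrm{EDV}}
```

The last form is the "complement of that ratio" that the reviewer pointed out. As a hypothetical example, EDV = 120 mL and ESV = 50 mL give LVEF = 70/120 ≈ 58%.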

To clarify the structure of the dataset, the videos have variable lengths and framerates. Most of them contain more than one full heartbeat and all of them have a single labelled ES and ED, which can come in any order (ES then ED or ED then ES).

R4: Our training-time sampling methods were shown to produce convincing results at test time. “Guided random sampling” is an improved version of the usual sampling method for ES/ED prediction. We came up with temporal mirroring to improve the network capabilities on longer videos.

We added padding to the frames to keep the optimal ResNetAE network configuration. Using any interpolation on such a small scale would cause the pixels to smear and would degrade the visual quality of the ultrasound images.

[1] Li 2019, https://bit.ly/3hLwPQZ [2] Ouyang 2020, https:

# Post-rebuttal Meta-Reviews

## Meta-review #1 (Primary)

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The study is well conducted and clearly written, and in the rebuttal the authors have addressed (1) the need for a more comprehensive literature review of related work, (2) the issue of unfair comparison raised by reviewer 3, and (3) code release (reviewer 1).

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

8

## Meta-review #2

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

As indicated by the primary AC, the manuscript introduces a new hybrid approach for ED/ES detection. The proposed method is novel and there are sufficient experiments to demonstrate its advantages. However, some critical concerns remain after the rebuttal: 1) the literature review is incomplete. The authors respond that the focus of the work is on LVEF and ES/ED applications and that they will integrate the additional literature cited by the reviewers. However, the response does not explain the key technical differences from these works; and 2) the rebuttal provides quite limited information about the calculation of LVEF. Therefore, it is still not clear that averaging all the predictions is the correct way to estimate the LVEF.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Reject

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

19

## Meta-review #3

• Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The reviews have highlighted the fact that this paper proposes a valuable contribution in terms of methodology design. There were some concerns regarding the way the comparison to other SotA methods was done, and the citation of prior works. In their rebuttal, the authors have provided explanations regarding both these issues; therefore, I recommend acceptance of this paper.

• After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept

• What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

3