
Authors

Yixiong Chen, Chunhui Zhang, Li Liu, Cheng Feng, Changfeng Dong, Yongfang Luo, Xiang Wan

Abstract

Most deep neural network (DNN)-based ultrasound (US) medical image analysis models use pretrained backbones (e.g., ImageNet) for better model generalization. However, the domain gap between natural and medical images causes an inevitable performance bottleneck. To alleviate this problem, a US dataset named US-4 is constructed for direct pretraining on the same domain. It contains over 23,000 images from four US video sub-datasets. To learn robust features from US-4, we propose a US semi-supervised contrastive learning method, named USCL, for pretraining. To avoid high similarities between negative pairs and to mine abundant visual features from limited US videos, USCL adopts a sample pair generation method that enriches the features involved in a single step of contrastive optimization. Extensive experiments on several downstream tasks show the superiority of USCL pretraining over ImageNet pretraining and other state-of-the-art (SOTA) pretraining approaches. In particular, the USCL-pretrained backbone achieves a fine-tuning accuracy of over 94% on the POCUS dataset, 10% higher than the 84% of the ImageNet-pretrained model. The source code of this work is available at https://github.com/983632847/USCL.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87237-3_60

SharedIt: https://rdcu.be/cymbn

Link to the code repository

https://github.com/983632847/USCL

Link to the dataset(s)

https://github.com/983632847/USCL


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a novel methodology for pretraining a deep neural network to solve ultrasound-related tasks such as classification, detection and segmentation, with the aim of better model generalisation. The proposed method is based on a semi-supervised contrastive learning approach, where frames from the same video are considered a positive pair (i.e., similar) and frames from different videos are considered a negative pair (i.e., different). The network is pretrained on ultrasound images of lung and liver, and the evaluation is performed on two ultrasound datasets, one composed of lung images and the other of breast images. The ablation study compares the proposed pretraining methodology with a network pretrained on natural images (ImageNet) and shows an improvement in all downstream tasks (classification, detection and segmentation).
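
    For readers unfamiliar with the setup, the pair construction described here can be sketched in a few lines of PyTorch. This is a minimal SimCLR-style illustration on video frames, not the authors' exact USCL implementation; the helper `sample_positive_pair`, the temperature, and all other hyperparameters are illustrative assumptions.

```python
import random
import torch
import torch.nn.functional as F

def sample_positive_pair(video_frames):
    """Hypothetical helper: two distinct frames of the same video form a
    positive pair; frames drawn from different videos act as negatives."""
    i, j = random.sample(range(len(video_frames)), 2)
    return video_frames[i], video_frames[j]

def nt_xent_loss(z1, z2, temperature=0.5):
    """Standard NT-Xent (InfoNCE) loss: z1[i] and z2[i] embed two frames
    from the SAME video; every cross-video combination is a negative."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)            # (2N, D)
    sim = z @ z.t() / temperature             # scaled cosine similarities
    mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float('-inf'))     # exclude self-similarity
    n = z1.size(0)                            # positives sit n rows apart
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```

    In USCL the positive pairs are additionally enriched by the proposed sample pair generation scheme; the loss above is only the generic contrastive objective such a scheme plugs into.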

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Novel idea for pretraining deep neural networks to solve ultrasound-related tasks
    • Simple but original methodology to solve the similarity conflict in a semi-supervised contrastive representation learning, including an enriched approach to create positive pairs (i.e., pairs of similar images).
    • Code and data will be made public, which could facilitate the use of this methodology and be of interest in the research community as currently there is a lack of pre-trained models for ultrasound.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Limited evaluation and discussion of which combination of datasets may work better. For instance, is it preferable to pretrain the method on a collection of datasets that are very different from each other (e.g., lung, liver)? Or is it preferable to create a dataset with characteristics similar to the task to be solved (e.g., lung centre 1, lung centre 2)? The ablation study provided in the Annex shows results on different dataset combinations; however, it is not clear whether the increase in accuracy is due to the increase in data size or to the increase in image variation.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Some details are missing; for instance, fine-tuning is used on two datasets, POCUS (lung) and UDIAT-B (breast), but the fine-tuning learning rate is not mentioned. The number of images used for training and validation from the POCUS dataset is not mentioned either. On a positive note, code and data will be made public.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Pretraining data variety: The pretraining data variety experiment is limited. From the results it is not clear whether the increase in accuracy is due to the addition of a dataset or to the addition of more data in general. One way to answer this question could be to select a dataset with X images from Butterfly + LUSMS and another with X images from CLUST + Liver Fibrosis, then compare the results on the POCUS and breast datasets. I believe this experiment would give the reader an idea of what types of datasets may work better in a different application.

    Data: Important information regarding the datasets is missing in the main paper. For instance:

    • The POCUS dataset is mentioned in the abstract but not defined/explained, please consider adding a short description.
    • The POCUS dataset is missing a reference in the main text
    • The UDIAT-B dataset is also not well defined: please state that this is a breast dataset to help the reader better understand the variability between the images used in the pretrained method and the fine-tuning experiments. Some information is included in the Annex. However, crucial information is still missing. For example:
    • A reference for the CLUST dataset is given, but the authors should mention that is a liver dataset. This information may help the reader understand the variability included in the pretrained model.
    • What do the different categories (F0, F1, F2, F3) in the Liver Fibrosis dataset mean? Also, it should be mentioned whether the videos come from different patients or whether there may be multiple videos per patient.

    Results:
    • Figure 4, please include the full definition of RS, RSOFV, etc. in the caption.
    • Figure 5: The strategy followed to select the cases shown in figure 5 is not explained. Are these a collection of the best examples, or were the examples randomly selected? Also, please include the original label and the predicted labels for ImageNet and USCL.
    • Classification: I assume the values reported in table 2 correspond to the accuracy? This should be stated in the caption of the table, and the accuracy metric should be defined.
    • Detection and segmentation: the COCO metric AP is referenced but not defined. Since this is not a very common metric in medical imaging, a short definition should be included in the paper (see the sketch after this list). Also, the results report values of 38~45 for detection and 42~52 for segmentation. Aren't these values too low? Is this expected?

    Minor: Figure 3 is referenced before figure 2.
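
    For context, COCO-style AP averages the average precision over ten IoU thresholds (0.50 to 0.95 in steps of 0.05) and is conventionally reported on a 0-100 scale, so values in the 38~52 range are in line with typical detection/segmentation results rather than anomalously low. Below is a minimal, illustrative sketch of the IoU computation underlying the metric; it is not the paper's evaluation code.

```python
import numpy as np

def box_iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# COCO AP: a detection counts as correct at a given threshold if its IoU
# with a ground-truth box exceeds that threshold; AP is then averaged
# over thresholds 0.50, 0.55, ..., 0.95 (hence the 0-100 scale values).
iou_thresholds = np.arange(0.50, 1.00, 0.05)
```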
  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Pre-trained models for ultrasound are scarce.

    This is a simple but effective way to provide pretrained models which seems to work better than models pretrained on natural images (such as ImageNet).

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    3

  • Reviewer confidence

    Confident but not absolutely certain



Review #2

  • Please describe the contribution of the paper

    In the paper “USCL: Pretraining Deep Ultrasound Image Diagnosis Model through Video Contrastive Representation Learning”, first a new dataset US-4 is constructed from four ultrasound video datasets. Then a semi-supervised contrastive learning method is used to pre-train a model, which is fine-tuned to perform downstream tasks such as image classification and segmentation/detection. For contrastive learning, the authors propose a sample pair generation method to define positive and negative image pairs from ultrasound videos. The method is compared with existing semi-supervised and self-supervised methods, and ImageNet pretraining.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The problem is well-formulated to use fewer annotations in a semi-supervised way. Models are pre-trained via contrastive learning which can be used for other downstream applications.
    2. The work provides strong evaluations with existing works and ImageNet pre-training baseline.
    3. The work shows promising results for ultrasound classification and detection downstream tasks. It can be useful when only a small number of labelled samples is available for training.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The authors combine 4 video datasets into one US-4 dataset. The 4 datasets have different image and video characteristics (as can be seen in Fig. 2 and Table 1). It is not clear whether there were any measures to normalize the data across datasets and prevent the model from learning dataset-specific features. For example, different frame rates, image sizes, and image qualities can lead to bias in the learnt model.
    2. Several frames are extracted from the raw videos. The authors could explain how only meaningful frames are selected for learning the model, as the ultrasound video may contain background frames, blurring, shadowing, and other artefacts.
    3. The domain used for pre-training consists of liver and lung ultrasound videos. The downstream tasks are performed on lung and breast datasets. The choice of datasets for pre-training and fine-tuning seems arbitrary, so the authors may want to discuss how close or far the domains should be for pretrained and finetuning tasks. This would be useful to adapt the method to other domains.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have stated that the constructed US-4 dataset and source code of this work will be made public.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    1. In the introduction, readers may not clearly understand the stated conflict with respect to the ultrasound tasks being addressed. For example, it is not clear from the text why samples from ‘different videos’ of the same structure would form different clusters in the feature space. If this refers to videos with different labels (of different structures), it should be clearly stated.
    2. The authors say that ‘We can see that the US-4 dataset is relatively balanced, where most videos contain tens of US images.’ Balanced with respect to which quantity? Neither the number of videos nor the number of images per dataset appears to be balanced.
    3. The authors pretrained the model on the US-4 dataset. They fine-tuned the model on the POCUS and UDIAT-B datasets, and a part of these two datasets was used for testing. What happens if the pre-trained models were used directly on the new datasets without fine-tuning?
    4. Please check typos such as ‘COIVD-19’
  • Please state your overall opinion of the paper

    borderline accept (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a novel method for semi-supervised contrastive learning using ultrasound videos. The idea of contrastive learning is well known. The paper proposes a method to generate positive and negative pairs from US videos in a semi-supervised way to pre-train models using a small number of labelled samples. The model is shown to perform well on downstream tasks such as detection/segmentation and classification, and outperforms existing semi-supervised and self-supervised methods. There are a few comments (see above) which need to be addressed for better clarity.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    4

  • Reviewer confidence

    Confident but not absolutely certain



Review #3

  • Please describe the contribution of the paper

    In this paper, the authors release a novel dataset, then explore a Contrastive Representation Learning for the pretraining of ultrasound image analysis models. The idea makes sense and sounds very interesting.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper has a clear organization and task definition;
    2. The authors also plan to release a dataset, which can be very helpful in promoting research in this field;
    3. This paper has a good technical contribution, and contrastive semi-supervised learning is under-explored in previous studies;
    4. Convincing results on different tasks
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The sub-datasets come from different domains; it would be interesting to know whether the domain gap decreases the performance of downstream tasks.
    2. Only ResNet18 is used in this study. It would be interesting to know whether similar performance improvements can be observed with different backbone architectures.
    3. The experimental results are as expected, which is not surprising.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    No code is provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    None

  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    None

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    4

  • Reviewer confidence

    Confident but not absolutely certain




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper introduces a novel ultrasound dataset and shows results of contrastive learning with this dataset. The new dataset would be a great contribution to the research community. The paper seems to be written very clearly.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    1




Author Feedback

We would like to thank the chairs and reviewers for their efforts in reviewing this paper. Key comments and responses are summarized as follows. Due to limited space, we promise to correct the other issues (i.e., typos and presentation problems) in the revised version.

Q1: Evaluation and discussion of data size and image variation (R1, R2 and R3)

Response: The size and domain variation of the US-4 dataset are both beneficial to our USCL pretraining. (1) For data size, we analyzed its effect in the current version; please refer to Supplementary Material Tab. 3. (2) For domain variation, we simply treat different organs as different domains and conducted new pretraining experiments. Results and a brief analysis follow (benchmarking on the POCUS dataset): 1) Single sub-dataset: Butterfly (85.0%), CLUST (89.7%), Liver Fibrosis (90.4%), COVID19-LUSMS (90.6%); larger datasets achieve better performance. 2) Two sub-datasets: Butterfly+CLUST (88.5%), Liver Fibrosis+COVID19-LUSMS (90.4%), CLUST+Liver Fibrosis (90.8%), Butterfly+COVID19-LUSMS (91.5%), CLUST+COVID19-LUSMS (92.3%), Butterfly+Liver Fibrosis (92.7%). Combining a lung sub-dataset (Butterfly) and a liver sub-dataset (Liver Fibrosis) achieved the best accuracy, 92.7%, higher than combinations within the same organ (Butterfly+COVID19-LUSMS or CLUST+Liver Fibrosis, which have data sizes similar to Butterfly+Liver Fibrosis).

The above results demonstrate that increasing both the size and the variety of the data promotes USCL pretraining. As for how close the domains should be for pretraining and fine-tuning, we used the POCUS lung dataset to clarify the effectiveness of our method because its domain is similar to that of the US-4 pretraining dataset (both contain convex-probe data of lungs), and used the UDIAT-B (linear-probe) breast dataset to further indicate the generalization of the method in the face of a larger domain gap. USCL worked well in both cases.

Q2: Reproducibility and details of the experimental settings (R1, R3)

Response: We fine-tuned the last 3 layers of our US-4 pretrained backbone (ResNet18) on the POCUS dataset and all layers on the UDIAT-B dataset. On POCUS and UDIAT-B, the learning rates were 0.01 and 0.005, respectively. The training and testing code and the US-4 dataset are available on GitHub. A rough sketch of this setup follows.
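
The sketch assumes a torchvision ResNet-18; the interpretation of "last 3 layers" as layer3, layer4, and the classifier head, the class count, and the optimizer choice are assumptions rather than confirmed details of the authors' code.

```python
import torch
import torchvision

model = torchvision.models.resnet18()
model.fc = torch.nn.Linear(model.fc.in_features, 3)  # e.g., 3 POCUS classes (assumed)
# model.load_state_dict(torch.load("us4_pretrained.pth"), strict=False)

# Freeze everything except the assumed "last 3 layers".
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(("layer3", "layer4", "fc"))

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=0.01,   # 0.01 for POCUS; 0.005 for UDIAT-B, per the response above
    momentum=0.9,
)
```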

Q3: It is not clear if there were any measures to normalize the data across datasets and prevent the model from learning dataset-specific features (R2)

Response: Because of the moderate domain differences between sub-datasets, we normalized the whole pretraining dataset w.r.t. the overall mean and variance instead of taking special measures. We also found that the risk of learning dataset-specific features was not pronounced, owing to the robust encoding ability of contrastive learning under data variation.
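
A minimal sketch of such whole-dataset normalization follows; the helper is hypothetical, not the authors' code, and assumes the dataset yields image/label pairs scaled to [0, 1].

```python
import torch
from torch.utils.data import DataLoader

def overall_mean_std(dataset):
    """Single mean/std over the whole pretraining set, as opposed to
    per-sub-dataset statistics."""
    loader = DataLoader(dataset, batch_size=64)
    n, mean, sq_mean = 0, 0.0, 0.0
    for images, _ in loader:                      # images: (B, C, H, W)
        b = images.size(0)
        mean = mean + images.mean(dim=(0, 2, 3)) * b
        sq_mean = sq_mean + (images ** 2).mean(dim=(0, 2, 3)) * b
        n += b
    mean, sq_mean = mean / n, sq_mean / n
    return mean, (sq_mean - mean ** 2).sqrt()     # std = sqrt(E[x^2] - E[x]^2)

# mean, std = overall_mean_std(us4_dataset)       # us4_dataset is hypothetical
# normalize = torchvision.transforms.Normalize(mean.tolist(), std.tolist())
```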

Q4: How only meaningful frames are selected for learning the model (R2)

Response: (1) In fact, we cannot guarantee that only meaningful frames are selected, but thanks to our careful data collection, few meaningless frames should exist in the US-4 dataset. (2) In addition, the frame mix-up scheme further reduces the chance that the final samples are meaningless (sketched below).
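
The mix-up mentioned in (2) can be sketched generically as below, in the spirit of Zhang et al.'s mixup; the exact sample pair generation scheme in USCL may differ.

```python
import torch

def frame_mixup(frame_a, frame_b, alpha=0.5):
    """Blend two frames of the same video into one training sample; a
    blurry or uninformative frame is diluted by its mixing partner."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    return lam * frame_a + (1.0 - lam) * frame_b
```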

Q5: Evaluation without fine-tuning (R2)

Response: We followed the prevailing contrastive learning setting: first pretraining on the US-4 dataset, then fine-tuning on downstream tasks (i.e., POCUS, UDIAT-B) to evaluate the performance of the pretrained backbones. Moreover, we did not apply the pretrained model directly to a new dataset without fine-tuning because, in our experimental setting, this would require the new dataset and the pretraining dataset to share the same categories.

Q6: Result using different backbone architectures (R3)

Response: In addition to ResNet18, we ran new experiments using ResNet34, reaching 94.0% accuracy on POCUS, which is comparable to the 94.2% achieved by ResNet18.


