Authors

Tzu-Ming Harry Hsu, Yin-Chih Chelsea Wang

Abstract

Clinical finding summaries from an orthopantomogram, or a dental panoramic radiograph, have significant potential to improve patient communication and speed up clinical judgments. While orthopantomogram is a first-line tool for dental examinations, no existing work has explored the summarization of findings from it. A finding summary has to find teeth in the imaging study and label the teeth with several types of past treatments. To tackle the problem, we develop DeepOPG that breaks the summarization process into functional segmentation and tooth localization, the latter of which is further refined by a novel dental coherence module. We also leverage weak supervision labels to improve detection results in a reinforcement learning scenario. Experiments show high efficacy of DeepOPG on finding summarization, achieving an overall AUC of 88.2% in detecting six types of findings. The proposed dental coherence and weak supervision both are shown to improve DeepOPG by adding 5.9% and 0.4% to AP@IoU=0.5. The dataset and code are made available online.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87240-3_35

SharedIt: https://rdcu.be/cyl59

Link to the code repository

https://github.com/stmharry/deepopg

Link to the dataset(s)

https://github.com/stmharry/deepopg


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper aims to generate a finding summary from an orthopantomogram study. The summary contains functional segmentation for artifacts and teeth, and teeth segmentation with class identification. The work is claimed to be the first tool for generating a dental finding summary from an orthopantomogram.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The work targets a novel task: generating a finding summary from dental imaging. However, the need for and helpfulness of such a system are not clear. To generate the summary, the work applies two models, one for functional segmentation and the other for teeth segmentation, both of which have been explored in previous works. The work claims to leverage weak supervision labels of teeth with reinforcement learning. Promising and reasonable as the idea is, the results in Table 2 show that the effectiveness of the current implementation is limited.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The work should be better motivated with a formative study or user study. Is there a need for dentists to access such a summary when dealing with an orthopantomogram? What is the time and labour cost for dentists with and without such information? To my knowledge, dentists usually spend very little time analyzing the orthopantomogram since it is very simple and straightforward.

    2. The methods used in generating the summary have been widely studied before. The reinforcement learning on weak signals is interesting; however, the results do not support its effectiveness.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    This work uses a public dataset with additional self-built labels from dentists, which is necessary because no public labeling is available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    See reproducibility and weaknesses for details.

  • Please state your overall opinion of the paper

    reject (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work targets the novel task of image summarization for orthopantomogram studies. However, the motivation is not clearly justified or proven. The methods applied in this work have mostly been studied in existing works. Although the RL part is interesting, the results do not show its effectiveness in the current implementation.

  • What is the ranking of this paper in your review stack?

    5

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain



Review #2

  • Please describe the contribution of the paper

    Previous work only obtained the location, segmentation, or number of teeth from the orthopantomogram, whereas this paper proposes to build a summary of findings on top of these. The authors developed the DeepOPG model and proposed functional segmentation and teeth localization modules, with a novel dental coherence module, to produce the final findings. Weakly supervised labels were also used to improve detection results under reinforcement learning. Experimental results demonstrate the effectiveness of DeepOPG, showing that the dental coherence module and weak supervision can improve performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    A findings summary that not only identifies teeth in imaging studies but also labels them with several types of treatments has great potential to improve patient communication and speed up clinical judgment. Previous work has only segmented, detected, and numbered teeth in the orthopantomogram, whereas this paper summarizes all the findings directly, which will greatly simplify the procedure for clinicians and save image-analysis time, making it useful and feasible in clinical application. To solve this problem, the authors use both a segmentation network and an object detection network to better identify each type of tooth. In dental detection, a single region can be detected as several different classes at the same time, with heavily overlapping masks. To address this, the authors propose maximizing the Dental Coherence Reward (DCR) to further refine the output of the detection module and obtain more accurate detection results. Adding a DC module after the localization module is a perceptive choice: it ensures consistency with dental knowledge and resembles the way dental experts analyze an OPG. To a certain extent, the reasoning process of the model is closer to that of professional doctors, with better interpretability. In addition, to make use of weakly labeled data, the authors propose a weakly supervised method using reinforcement learning to further improve the performance of tooth localization.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Although the authors present a new approach to obtaining a direct summary of findings from an orthopantomogram and elaborate on it, the analysis of the experimental results is still inadequate. The experimental analysis presents only quantitative and qualitative results for the overall evaluation and the ablation study; the qualitative visualization shows the result for a single orthopantomogram, and there seems to be no ground-truth label for comparison. In particular, the innovative modules, such as the Dental Coherence Reward and weakly supervised reinforcement learning, are supported only by ablation results, and the analysis of these results remains inadequate. Specifically, weak supervision did not significantly improve performance, and little explanation is given for this. In addition, the role of weak supervision in the method is not fully reflected in the paper, which gives the impression that adding it is not very important. The data used in the weakly supervised experiments are also not described in detail. More generally, the annotations added by the authors are not described in much detail.

    In addition, there is a lack of comparison with existing work. Although the final result is the finding summary, it is produced by combining the segmentation result and the instance segmentation result. Therefore, this work can be compared with and analyzed against previous work in some aspects. However, the authors barely mention any comparison with previous work.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of the proposed algorithm needs to be verified. First, there is no code available for the algorithm. Second, the paper does not list the experimental details and parameter settings, and the use of the dataset is not presented clearly or in detail.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    As for the algorithm, the weak supervision method needs to be improved. As proposed in this paper, it does not deliver competitive performance and does not demonstrate the benefit of using weak labels for dental detection. Newer weak supervision methods may therefore be needed to make better use of the weak annotations.

    As for the data, the dataset used in this paper has a variety of labels, including self-annotated data and some weakly annotated data, so its annotation forms are diverse. The authors should provide a detailed description of the data, especially of the weak annotation form, and clarify how this data differs from that used in other similar papers. For the experimental analysis, more thorough experiments should be set up and detailed conclusions should be given. For visualization of results, more than one sample should be presented, along with ground-truth labeled images for comparison. For the novel sub-modules, the authors should give more experimental results and analysis to fully demonstrate their effectiveness and reliability. For example, for the DCR module, only an ablation experiment and a simple analysis are given, without other experimental settings, so the authors should expand the experimental analysis of this part with more detailed experiments to demonstrate its validity. For the weak supervision part, the authors did not describe the form, quantity, and other properties of the weak annotation data in detail.

    At the same time, there is little experimental analysis of this part, which cannot fully demonstrate the performance of the weakly supervised component. Finally, all the experiments presented analyze the performance of the authors' own modules, and there is almost no comparative analysis with previous methods.

    Although the authors evaluated the finding summary for orthopantomograms, and previous work did not report this result, this does not mean that no method can be compared against. For example, the segmentation result and the instance detection result after DCR can be compared with previous methods to show that the method proposed in this paper is more effective.

    The authors are the first to use deep neural networks to predict the summary of findings for orthopantomograms. This is a relatively new study, but the authors do not discuss the limitations of the method or future research directions.

  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper has a fair chance of acceptance. The summary of the findings of the Orthopantomogram can output a lot of important diagnostic information, which has a certain influence on clinical diagnosis. The Orthopantomogram findings can be directly output by this method, which is not studied by previous methods. This is an interesting contribution. The overall performance of the model is good, although the proposed weak supervision part does not get a good return.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    3

  • Reviewer confidence

    Confident but not absolutely certain



Review #3

  • Please describe the contribution of the paper

    The authors propose a novel approach to summarizing orthopantomograms, allowing for an interpretable and clinically relevant output, i.e., a summary combining information per tooth and finding. The main gain in performance is achieved with the proposed Dental Coherence Model, which is designed to prevent the same tooth from being detected multiple times; it does so via the Coherence Reward, which combines the obtained mask candidates and class predictions (teeth and implant). Further improvements are achieved with reinforcement learning and segmentation mask concatenation at the MaskRCNN input. The validation is performed on a publicly available dataset that was further annotated to comply with the training objectives of the work. A few ablation studies are done to illustrate the contribution of each component of the proposed method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors propose a method to generate a findings summary for orthopantomograms, addressing a gap in current state-of-the-art methods by efficiently combining multi-class semantic segmentation and detection tasks by means of the introduced Dental Coherence Model.

    While the method requires several types of annotations (as described in Section 3.1.), they are all clinically feasible. Moreover, the proposed method allows for training under weak supervision, which makes the method more compelling.

    Overall, the paper is clear and pleasant to read. The methods are generally well described. An ablation study is done to illustrate the contribution of each of the components of the proposed method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The complexity of the method and the need for multi-step training, as described in Section 2.2, make the evaluation of each component harder, as they result in a plurality of hyper-parameters and training settings for each of the training steps.

    Moreover, the numerical comparison to the state-of-the-art methods is overlooked. While the authors explain how their work differs from the state of the art, there is still some overlap, and the lack of numerical comparison (e.g., referring to and discussing the results reported in other works) makes the proposed improvements harder to understand.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors state the network architectures being used (i.e., ResNet-50, ResNet-18, Mask R-CNN).

    The authors state the libraries being used (i.e., TensorFlow).

    The authors describe the type of the additional annotations collected for the study.

    The metrics are generally given with their standard error.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    1. On page 4, in Inference-Time Dental Coherence Decoding, the authors introduce the Dental Coherence Reward (DCR), which is applied at inference time. However, the implementation is not entirely clear. Could the authors discuss in more detail how the DCR enters the calculation of the finding summary and whether there are any thresholds involved?

    2. I appreciate the supplementary material provided by the authors. Figure 4 in supplementary material raises the question of the validation methodology. That is, the best performing class is Background, while other classes have significantly lower performances. To help the understanding of the metrics, could the authors explicitly state, which classes have been used for the average metrics (e.g., AP in Table 2), i.e., whether the averaging is done over 34 classes (see 2.1), or less.

    3. The ablation studies illustrate well the contribution of the components and are well discussed. While the comparison to other state-of-the-art methods might not be totally fair, it would be helpful if the authors could mention the metrics claimed in other works, to save the reader’s time.

    4. In the first paragraph of Methods (page 3) the authors mention working with images of “original” resolutions to keep “tiny” findings visible. To better illustrate the importance, it would be helpful if the authors support this statement with numerical data or ratios of finding’s area to the area of the image.

    5. In the last paragraph of 2.2, the authors state “freezing” of the layers when the DCR is introduced. Could the authors discuss the motivation for this choice, if any?

    6. In the last paragraph of 3.3 the authors state that “without RL the model can miss some teeth”. However, the numerical results w/ and w/o RL from the ablation studies are quite comparable. Could the authors clarify?

    7. In the Dataset description (3.1), the authors state that additional annotations were generated (68+39+144+47), which sums to 298 cases. Moreover, the authors mention excluding studies with mixed dentition. It would help the reader if the authors could detail the final number of cases used: that is, whether the 298 newly annotated cases overlap with the 267 originally annotated ones, and how many cases are excluded.

    Minor observations:

    1. The last paragraph of 2.2 (starting with To learn DeepOPG) is probably better to be separated with a subtitle to improve readability.

    2. In the first paragraph of Introduction, “experienced experts” could be rephrased.

    3. In 2.1: “The functional segmentation map and the original image is”, “is” should be replaced by “are”.

  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is generally clear and pleasant to read. While the proposed method is in some way a combination of several existing techniques (i.e., segmentation network + Mask R-CNN), the combination is effective and the proposed Coherence Model is definitely worth attention. On the downside, the lack of discussion of the numerical results of other state-of-the-art works prevents me from firmly accepting the paper.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    All reviewers acknowledge that the paper addresses an interesting and important problem. The presentation and the organization of the manuscript are well done. However, reviewers are concerned that there are insufficient comparisons to the state of the art. The significance of the results should also be better justified.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    6




Author Feedback

We thank all the reviewers for their feedback. Our paper proposes a novel pipeline to summarize dental orthopantomograms (OPGs) into finding summaries, and a method to leverage weak supervision data that is fast to annotate, saving expert man hours. Data and code will be released. We hereby address comments:

The use of finding summary: We agree that the motivation can be elaborated further, and we will add to it in the final version. While dentists assess OPGs quickly (0.5-3 min), they express a dire need to quickly communicate findings and treatment plans to patients, as R2 noted. An automated and visualized result like ours provides a solution. Moreover, the summaries we produce in the clinic can be systematically collected as a byproduct, which is an invaluable source for subsequent dental research that the current clinical workflow cannot provide.

<R2,R3> Data details: We agree that more details are needed on the data, and will add them to the manuscript. In the UESC dataset, 267 out of 1500 OPG images came with localization annotations. We enrich the data with 39 localization, 68 segmentation, 144 weak supervision, and 47 finding summary annotations. They do not overlap in study, so this amounts to 565 different annotations. The localization and segmentation annotations are per-pixel and take around 30 minutes each for dentists to label. The weak supervision is the binary-labeled “missing teeth” part of the finding summary, which takes only 30 seconds to label. The full summary contains six findings.

<R1,R2,M2> The effectiveness of weak supervision: We provide supporting arguments from a data preparation perspective and will highlight them in the manuscript. The most important clinical endpoint is AUC in Table 1. Using the numbers from the details above, the “w/o RL” model (86.6% AUC) trains with 273 per-pixel annotations, which take 135 expert hours to prepare. The “DeepOPG (full)” model adds 100 weak supervision labels, which take only 0.8 expert hours but yield a gain of 1.6% AUC. This demonstrates that weak supervision is effective in boosting AUC while requiring substantially less expert effort (<1% extra time) than per-pixel annotations.

<R2,R3> Comparison to existing work: We compare against existing SOTA following the reviewers’ suggestions. No prior work produces a finding summary as we do, so we resort to comparing individual components, and will add this to the manuscript. On teeth-only segmentation, evaluated with macro-F1%, [1] and [2] reached 74.4 and 88, while ours yields 91.0. On single-class teeth localization, evaluated with precision%/sensitivity%, [3] gave 99.4/99.4 while we have 100.0/99.8. Comparing sensitivity%/precision% on the “missing teeth” summary, [4] achieved 75.5/84.5 at a specificity% of 80.4. We show 94.3/96.4 at the same specificity.

[1] Wirtz (2018) doi: 10.1007/978-3-030-00937-3_81
[2] Jader (2018) doi: 10.1109/SIBGRAPI.2018.00058
[3] Tuzoff (2019) doi: 10.1259/dmfr.20180051
[4] Kim (2020) doi: 10.3390/app10165624

<R2,R3> Hyperparam/Training: We use the Adam optimizer for all models, and the hyperparameters are chosen with the validation set. Segmentation uses LR (learning_rate)=1e-5, WD (weight_decay)=1e-4, B (batch_size)=4, and S (max_step)=12000. Localization is first learned with LR=1e-4, WD=1e-4, B=1, and S=100000, then finetuned with DCR using LR=1e-5 until S=250000. Only the last layer is tuned to avoid overfitting.

<R2,R3> Other comments: Maximizing DCR is formulated as a GQAP problem, which is widely studied in optimization theory. Per-object metrics in Table 2 account for all 33 non-background classes using PASCAL VOC-style calculation. Findings by ascending area are root filling (0.09% of all OPG), implant (0.33%), restoration (0.42%), crown & bridge (0.77%), impaction (0.78%), and teeth (11.5%). The comparison in 3.3 between “full” and “w/o RL” is based on single-point qualitative analysis of Fig. 2, which we will retone. We will fix the visualization to include ground truth, incorporate the editorial suggestions, and discuss limitations and future directions.
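The annotation counts and labeling-effort figures quoted in the author feedback can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming exactly 30 minutes per per-pixel annotation and exactly 30 seconds per weak label (the feedback gives both only as approximate figures); all variable and function names are illustrative and are not taken from the paper or its repository.

```python
# Annotation counts quoted in the author feedback (names are illustrative).
ORIGINAL_LOCALIZATION = 267  # images that already came with localization labels
ADDED = {
    "localization": 39,
    "segmentation": 68,
    "weak_supervision": 144,
    "finding_summary": 47,
}

# Assumed exact labeling times; the feedback states them only approximately.
PER_PIXEL_MINUTES = 30.0
WEAK_LABEL_SECONDS = 30.0


def per_pixel_hours(n_annotations: int) -> float:
    """Expert hours needed for n per-pixel annotations."""
    return n_annotations * PER_PIXEL_MINUTES / 60.0


def weak_label_hours(n_labels: int) -> float:
    """Expert hours needed for n weak (binary missing-teeth) labels."""
    return n_labels * WEAK_LABEL_SECONDS / 3600.0


added_total = sum(ADDED.values())
all_annotations = ORIGINAL_LOCALIZATION + added_total

# Training-set sizes quoted for the "w/o RL" and "DeepOPG (full)" models.
base_hours = per_pixel_hours(273)
extra_hours = weak_label_hours(100)
extra_fraction = extra_hours / base_hours

print(f"newly added annotations: {added_total}")
print(f"all annotations:         {all_annotations}")
print(f"per-pixel effort:        {base_hours:.1f} h")
print(f"weak-label effort:       {extra_hours:.2f} h")
print(f"extra effort:            {extra_fraction:.2%} of per-pixel effort")
```

Under these assumptions the totals come out to 298 added and 565 overall annotations, and the per-pixel effort is 136.5 hours, close to the roughly 135 hours quoted; the small gap presumably reflects the "around 30 minutes" rounding. The weak-label effort stays below 1% of the per-pixel effort, in line with the claim in the feedback.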




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Overall, the manuscript presents several contributions to the field: (1) a novel method for producing finding summaries from an orthopantomogram; (2) coupled weak supervision and reinforcement learning to achieve good performance without an intensive manual labelling process. All the critical concerns have been satisfactorily addressed in the rebuttal. The additional comparison to the existing state of the art is also impressive. It is true that individual components, e.g., the functional segmentation module, seem to be based on existing implementations. However, the whole pipeline and the application are novel, and the proposed method will have a good clinical impact.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    11



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    After assessing the paper, the reviews, and the meta-reviews, my two main concerns with this work were the lack of comparison to related methods on segmentation and tooth localization, and the small, nearly negligible improvement brought by the (admittedly interesting) weakly supervised reinforcement learning step. The first aspect was convincingly addressed by the authors in the rebuttal, while the second remains questionable, especially since this step is the only methodological contribution (while practically useful, the use of standard architectures for segmentation and localization does not provide a lot of scientific impact). After weighing these two aspects, I slightly favor acceptance, in agreement with two out of three reviewers, because the application area of fast generation of OPG summaries for patient communication is, to my knowledge, potentially novel.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    12



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I agree with Reviewer 1 that the motivation of the manuscript should be better, as well as with the other reviewers that the effectiveness of the RL for semi-supervised learning is not fully justified. There should also be a few words about the limitations and future work. However, the main weakness of the manuscript is the lack of a strong comparison with SOTA methods. The authors proposed extending the manuscript with the results of four SOTA methods. Since this extension is doable, and the proposed method outperforms the results of the SOTA methods on the tasks of segmentation, localization, or detection of missing teeth, I consider this manuscript acceptable for publication.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    7


