
Authors

Yaqub Jonmohamadi, Shahnewaz Ali, Fengbei Liu, Jonathan Roberts, Ross Crawford, Gustavo Carneiro, Ajay K. Pandey

Abstract

Minimally invasive surgery (MIS) has many documented advantages, but the surgeon’s limited visual contact with the scene can be problematic. Hence, systems that can help surgeons navigate, such as a method that can produce a 3D semantic map, can compensate for the limitation above. In theory, we can borrow 3D semantic mapping techniques developed for robotics, but this requires finding solutions to the following challenges in MIS: 1) semantic segmentation, 2) depth estimation, and 3) pose estimation. In this paper, we propose the first 3D semantic mapping system from knee arthroscopy that solves the three challenges above. Using out-of-distribution non-human datasets, where pose could be labeled, we jointly train depth+pose estimators using self-supervised and supervised losses. Using an in-distribution human knee dataset, we train a fully-supervised semantic segmentation system to label arthroscopic image pixels into femur, ACL, and meniscus. Taking testing images from human knees, we combine the results from these two systems to automatically create 3D semantic maps of the human knee. The result of this work opens the pathway to the generation of intra-operative 3D semantic mapping, registration with pre-operative data, and robotic-assisted arthroscopy.

Link to paper

DOI: https://doi.org/10.1007/978-3-030-87196-3_36

SharedIt: https://rdcu.be/cyl2G

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents a deep-learning approach to predict the depth and the pose of an endoscopic camera, as well as a pixel-wise semantic label for the depth map, with application to arthroscopy. The authors train the networks on in-vivo and in-vitro images using both supervised and self-supervised strategies. Commonly used network architectures that solve the segmentation, depth mapping, and pose estimation problems are used. Experiments are conducted to demonstrate the efficacy of the system.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    [1] The proposed solution appears to be the first attempt to predict depth, semantic labels, and pose with application to arthroscopy. Rather than training a single network end-to-end, the solution consists of three networks, each optimized to solve one problem. Both quantitative and qualitative results show that the networks perform reasonably well.

    [2] The paper reads well with an adequate introduction to the problem and a comprehensive description of the methods. The results are well presented, with explanations where necessary. However, I feel that the organization needs further improvement.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    [1] The results section does not discuss the running time of the system, which is an important parameter in a system intended for surgical applications.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have not indicated willingness to share the source code for their implementation. Neither will the authors share training and validation data. However, training parameters are reported. Some important details, such as how the ground truth is computed, are not included in the paper. In such circumstances, it will be very difficult for an interested reader to reproduce the results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    [1] The introductory section is adequate for a conference publication. However, if the authors could allude to the clinical expectations of the system (e.g. accuracy bounds), a wider audience would benefit. In addition, define the acronyms ACL (in the Introduction) and Trl and Ang (in the Method section).

    [2] In Methods, references to a left image can be found. If a stereoscopic arthroscopic system is used, explicitly mention this for the benefit of the reader.

    [3] Section 4.1 describes how the ground-truth poses are computed by attaching a magnetic sensor to the camera. However, the accuracy of the ground truth depends on the accuracy of the calibration transforms (e.g. hand-eye calibration). Such details are missing.

    [4] ATE is used as the metric to compare predicted poses to the ground truth. This metric neglects orientation details (a reference sketch of the ATE computation is given after these comments). Is orientation less important in the target application? Describe.

    [5] Details of the network architectures are given in the Experiments section. I think the paper would read better with this information in the Methods section.

    [6] Some quantitative results regarding the segmentation network would be valuable.

    [7] Details of the running time are required.
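
    For context, a minimal numpy sketch of how ATE is typically computed (illustrative only; function and variable names are hypothetical, and the paper does not give its exact implementation). The trajectories are rigidly aligned first, and only the translation residuals enter the error, which is why orientation is not captured:

        import numpy as np

        def absolute_trajectory_error(pred_t, gt_t):
            """RMSE of camera positions after rigid alignment (Horn/Kabsch).

            pred_t, gt_t: (N, 3) arrays of camera positions. Camera rotation
            never enters the metric, which is why ATE ignores orientation.
            """
            mu_p, mu_g = pred_t.mean(0), gt_t.mean(0)
            P, G = pred_t - mu_p, gt_t - mu_g          # center both trajectories
            U, _, Vt = np.linalg.svd(P.T @ G)          # best-fit rotation via SVD
            if np.linalg.det((U @ Vt).T) < 0:          # guard against reflections
                Vt[-1] *= -1
            R = (U @ Vt).T
            aligned = (R @ P.T).T + mu_g               # predicted positions in gt frame
            return np.sqrt(((aligned - gt_t) ** 2).sum(1).mean())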

  • Please state your overall opinion of the paper

    accept (8)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    [1] The deep-learning framework proposed in this paper solves an important problem in image-guided surgery. This seems to be the first effort to propose an end-to-end solution. Therefore, the approach has adequate novelty.

    [2] The efficacy of the proposed method is demonstrated to some degree.

    [3] The paper reads well, with adequate references to prior art and a comprehensive description of the methods, while the results are well presented with discussion where necessary.

  • What is the ranking of this paper in your review stack?

    2

  • Number of papers in your stack

    5

  • Reviewer confidence

    Very confident



Review #2

  • Please describe the contribution of the paper

    The authors propose a method to create a 3D semantic volume of the region seen by an endoscope during arthroscopy. It is based solely on stereo endoscopic video sequences (plus pose tracking for training). At each frame, semantic segmentation is performed by a trained neural network, and the depth map and camera motion are estimated by combining two neural networks trained partly on phantom data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The strength of this paper is to propose a full pipeline that combines several state-of-the-art methods and adapts them to the particular problems of arthroscopy (little texture and edge information, over/under-illumination, no clear ground truth). The experimental setup has various sequences of data (sheep, human cadavers). Also, phantom data is used during neural network training to address the problem of limited and poor-quality data.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Parts of the method add some confusion (see detailed comments), and the results section is weak. Judging from the experimental setup, it seems there are enough data to train, evaluate, and test the method both qualitatively and quantitatively. But in the results there are no quantitative measures, such as pose distance error/ATE (only qualitative results). The visual reconstructions could, for example, be rated visually and medically. Only the loss functions are used to compare results. The semantic segmentation is barely evaluated as well. If this is done in another paper, it should be stated clearly and a short summary given.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducibility here rests solely on the method description, the experimental setup, and literature references.

    • The coefficients used in the experiments for alpha in Eq. 2 and lambda_{smoo} in Eq. 5 are not given.
    • Parts of the neural networks, such as the ‘view synthesis’ component, could be described in more detail.
    • The reconstruction part mentions that either the segmentation or the depth map should be given. Following the caption of Figure 3, it seems the depth map is used, but this is not completely clear to me.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    Page 1 Abstract

    • Please define ACL acronym before using it (anterior cruciate ligament)
    • Are quantitative results missing here?

    Page 2

    • “Hence, we advocate the use of pose annotation acquired from images from non-human environments to supervise the training of depth+pose using a self-supervised + supervised loss function.” -> I don’t see the link with the previous paragraph. Should this sentence actually be in the previous paragraph? You could perhaps explain why and how your proposed method solves the problems cited in the previous paragraph, and then start a new paragraph with the semantic segmentation part.

    Page 3

    • “the RGB arthroscopic frames are transformed into 36 spectral bands and then the spatial features of anatomical structures are used at wavelengths from 380-740 nm with 10 nm of intervals as a preprocessing step” -> Maybe you could state that this wavelength range corresponds to the visible spectrum of human vision.
    • “extracts spatial characteristics at these 36 spectral bands and subsequently learns the location along with its label.” -> What do you mean by “location”?
    • Should this whole paragraph perhaps be in the Methods section? Nothing about the segmentation part (except in Fig. 1) is mentioned in the Methods. What are the different labels, for example?
    1. Methods
      • What is ‘l’ in the ego-motion X? Later, you use ‘l’ for ‘left’.
      • “rotation and translation, in the Euler coordinates” -> a ‘.’ is missing at the end of the sentence
      • “We achieve this by training the depth+pose network on the self-supervised plus supervised objectives.” -> Replace ‘plus’ by ‘and’?
      • It seems you use a stereoscopic endoscope; it would be nice to mention this earlier, in the introduction.
      • Equation 3: is the min pixel-wise or image-wise? Should you write Ihat_t \in I_S below ‘min’?
      • Equation 4: I don’t understand this formula. It has the same problem as Eq. 3; there is confusion between I_S and Ihat_t.
      • Equation 5: shouldn’t it be e^{-|…|}, with a minus?

    Page 4

    • “Since most of the variation in the camera pose is in at the x and y axes of the translation” -> in at
    • “surrounding the incision wholes” -> hole?
    • “Without the L’_{Trl}, the network performs poorly for the frames with small changes in motion.” -> Could you add some intuition behind this? It seems that you add a loss to minimize the direction instead of the translation. What about the angles: what does it mean to normalize angles/rotations?
    • “Since most of the variation in the camera pose is in at the x and y axes of the translation, the weighting of [0.5, 0.5, 1] was applied to the LTrl” -> How is the weighting applied? Here it seems you actually give more weight to z; shouldn’t more weight be given to x and y? Also, you could remove this sentence as it already appears in the Experimental setup section.
    • Figure 1
      • Since there is a caption for every symbol, you should do it for M_t as well
      • Should the view synthesized with X_{l->r} be I^r_t and not I^l_t? and X_{l->r} should be X_{r->l}?
      • The two L_{self} should instead be L_{phot}, with a link added to form L_{self}?
      • How is the ‘view synthesis’ done and implemented neural-network-wise? (See the sketch below.)
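
    For reference, the ‘view synthesis’ in such self-supervised pipelines is usually not a learned module but a fixed differentiable warp (SfMLearner/Monodepth2 style), through which gradients flow back into the depth and pose networks. A minimal PyTorch sketch, assuming a standard pinhole model (names are hypothetical; the paper does not spell out its implementation):

        import torch
        import torch.nn.functional as F

        def synthesize_view(I_src, depth_tgt, T_tgt2src, K):
            # I_src: (B,3,H,W); depth_tgt: (B,1,H,W); T_tgt2src: (B,4,4); K: (B,3,3)
            B, _, H, W = depth_tgt.shape
            dev = depth_tgt.device
            ys, xs = torch.meshgrid(torch.arange(H, device=dev),
                                    torch.arange(W, device=dev), indexing="ij")
            pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float()
            pix = pix.view(1, 3, -1).expand(B, -1, -1)                 # homogeneous pixels
            cam = torch.linalg.inv(K) @ pix * depth_tgt.view(B, 1, -1) # back-project
            cam = torch.cat([cam, depth_tgt.new_ones(B, 1, H * W)], 1) # 4D points
            src = K @ (T_tgt2src @ cam)[:, :3]                         # into source camera
            uv = src[:, :2] / src[:, 2:].clamp(min=1e-6)               # perspective divide
            uv = uv.view(B, 2, H, W).permute(0, 2, 3, 1)
            scale = torch.tensor([W - 1, H - 1], device=dev).view(1, 1, 1, 2)
            grid = 2 * uv / scale - 1                                  # to [-1, 1] coords
            return F.grid_sample(I_src, grid, padding_mode="border",
                                 align_corners=True)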

    Page 5

    • “Stereo images were rectified and downsampled to 256×256.” -> Can you explain why?
    • “The groundtruth poses of the camera tip were recorded by attaching an NDI magnetic sensor.” -> What temporal calibration was done to synchronize the frames with the tracked positions?
    • “the groudtruth poses” -> ground truth
    • “For the pose, only the encoder was used” -> Were there any particular changes at the end of the network, since pose estimation is a regression problem and not a binary one?

      4.2 Segmentation training

    • How was the data annotated? Which labels?
    • Testing and evaluation of the 3D reconstruction part are missing from the experimental setup.

    Page 6: “Fig. 3 shows sample 3D maps obtained by fusing chunks of arthroscope frames from a human cadaver knee” -> Are the 5 sequences from only one human? In the experimental setup you mention at least 3 cadaver experiments for the test set.

    Page 7

    • Figure 2
      • There is a typo in the legend of (a)(b)(c)
      • I am not sure I understand what ‘self-supervised only’ means. Is it when L_{pose} is not used?
      • In the results section, you compare the losses instead of e.g. average distance error/ATE, but what is the unit here for L’’_{Trl}, for example? Is it meters? Millimeters? The orders of magnitude, e-3 and e-4, are very small; is it significant to compare these? The same applies to the angle. For L_{self}, it is also hard to get a sense of the comparison. What is the threshold between correct and incorrect reprojection?
      • (d) What is the meaning of the horizontal axis, “Distance m”? What is the link with the frames?
    1. Conclusion
      • “The proposed domain adaptive approach” -> What do you mean by domain adaptive approach?

    Page 8

    • Figure 3
      • You should define the TSDF acronym
      • I don’t understand how the knee model on the left side was drawn. Is it a reconstruction? A 2D projection of a specific view?
  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The full pipeline is really interesting and the dataset seems correct but the evaluation of the method is weak.

  • What is the ranking of this paper in your review stack?

    3

  • Number of papers in your stack

    5

  • Reviewer confidence

    Confident but not absolutely certain



Review #3

  • Please describe the contribution of the paper

    The authors proposed a deep learning based 3D mapping framework for assisting arthroscopy. Camera poses and depth maps are first estimated by neural networks trained using both self-supervised and supervised objectives. A semantic segmentation network is applied to highlight different tissue types on the 3D reconstructed map of the human knee. In general, this is an interesting application of multiple CNN models to arthroscopic surgery.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper leverages multiple neural networks for estimating pose, depth maps, and tissue segmentation to achieve novel 3D semantic reconstruction for the arthroscopy procedure. Although the depth+pose network architectures and loss terms are adapted from existing methods, the authors compared the performance of using only the self-supervised loss (mainly the photometric reprojection loss) with adding supervised pose losses. They also managed to train the depth+pose network using a combined dataset of phantom and sheep knees. Additionally, the semantic segmentation network (U-Net) uses multi-spectral information as input rather than RGB images. The quantitative and qualitative results look promising.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    This paper integrates Disp-Net [19], ResNet50 [9] (pose network), and U-Net [1] into a framework that estimates depth maps, poses, and tissue labels for 3D semantic mapping in arthroscopy. The novelty in terms of the network architectures and loss terms is limited. In the conclusion, the authors claimed that the depth+pose model is domain adaptive. More results or evidence should be provided to justify the generalizability of the model.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors provide adequate details for training the networks, including the values of the hyperparameters. Each step in the workflow is well described. References are also provided for existing methods in the literature. It would be nice to make the three datasets used in training and testing publicly available for better reproducibility of the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html
    1. In Fig. 1, I_(t+1) as an input should be linked to the view synthesis of I_t, since both the pose X_(t->t+1) and I_(t+1) are required for generating the predicted I_t. The same applies to I_r.
    2. According to section 4.1, data collected from both a 3D printed model and a sheep knee are used in training the depth+pose network. The trained model is directly applied to cadaver data and shows good 3D reconstruction results in Fig. 3. Do the video images of the sheep knee look very similar to the cadaver images? It would be good to include example images collected from the 3D printed model and the sheep knee to show the adaptability of the depth+pose network. Cross-validation could also be included to better justify the generalizability of the model.
    3. Please provide some quantitative analysis of the performance/accuracy of the semantic segmentation network.
  • Please state your overall opinion of the paper

    Probably accept (7)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall this paper proposed a novel and useful 3D semantic mapping system for knee Arthroscopy. It leverages multiple networks for estimating pose, depth map and semantic tissue segmentation. The paper reads well. Plots and figures demonstrate that the performance of the system is promising.

  • What is the ranking of this paper in your review stack?

    1

  • Number of papers in your stack

    1

  • Reviewer confidence

    Confident but not absolutely certain



Review #4

  • Please describe the contribution of the paper

    This paper presents a method that leverages training data from both non-human and human anatomy as well as both self-supervised and supervised training to learn camera pose, dense depth estimates, and semantic segmentation of the generated point cloud.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The combination of supervised and self-supervised errors is a nice idea and authors are able to combine these in a neat framework. Qualitative results look very nice.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    For the weighting of the translation loss, it seems counterintuitive to weight the x and y axes lower when most of the motion is along the x and y axes. The authors need to explain why this choice makes sense.
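
    To make the concern concrete, a per-axis weighted L2 loss is one plausible reading of “applying the weighting [0.5, 0.5, 1]” (hypothetical; the paper does not state the exact form), and under that reading errors along z are indeed penalized twice as much as those along x and y:

        import torch

        def weighted_translation_loss(t_pred, t_gt, w=(0.5, 0.5, 1.0)):
            """Per-axis weighted L2 translation loss (hypothetical form).

            With w = [0.5, 0.5, 1], the z residual contributes most, which is
            what makes the stated choice counterintuitive when most of the
            motion lies in the x-y plane.
            """
            w = torch.tensor(w, dtype=t_pred.dtype, device=t_pred.device)
            return (w * (t_pred - t_gt) ** 2).sum(-1).mean()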

    Qualitative errors shown in Fig. 2(d) are from animal data, which is the texture-rich phantom data, so we expect smaller errors. However, the errors in Fig. 3 seem fairly large (up to 3 cm). Is this acceptable for procedures in the knee region? Finally, there is no quantitative evaluation of the semantic segmentation. Can the authors report something like a Dice score?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Some key implementation details are missing, making the paper hard to reproduce. For instance, on page 6 the authors mention “more details are available on [1]”, which conflicts with the MICCAI guidelines stating that a paper should be self-contained. That is, reviewers should not have to refer to another paper to understand the methods presented in the current paper. Further, it is not explained what several of the weighting coefficients were set to (alpha, lambda, etc.).

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://miccai2021.org/en/REVIEWER-GUIDELINES.html

    It would be helpful to know up front that the human data the authors refer to is cadaver data. That is, the reader should not have to wait until the experimental setup to learn this when human data is mentioned earlier in the paper. It gives the impression that the authors are working with in-vivo data.

    It is hard to gauge how good the shown results are. The authors should establish what level of error would be acceptable in these procedures.

  • Please state your overall opinion of the paper

    borderline reject (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the methods presented seem well founded, implementation details (what are parameters set to in order to produce the results reported) and quantitative evaluation are lacking. If authors can address these, the evaluation on this paper could be changed to probably accept.

  • What is the ranking of this paper in your review stack?

    6

  • Number of papers in your stack

    3

  • Reviewer confidence

    Very confident




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper was well reviewed by four reviewers with expertise in the field. The first three reviewers generally recommended accepting this paper, and I join this recommendation. The reviews did raise many questions that should be addressed in the revision. Reviewer 5: maybe correct the answers to Q10 and Q11.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers).

    1




Author Feedback

The authors appreciate the comments made by the reviewers, particularly by reviewer 2.

Here are answers to some of the common queries made by the reviewers. Due to the size limitation, it was not possible to include more results on segmentation, pose estimation, and 3D maps. The first draft of the manuscript was 17 pages, and the authors did their best to reduce it to 10 pages. In this process, several results on the 3D mapping, pose estimation, and segmentation, as well as details of the experimental settings such as the temporal synchronization of the tracker with the camera frames or the calibration, were omitted. The spelling mistakes mentioned by the reviewers will be fixed.

Rev 1

The authors have not indicated willingness to share the source code for their implementation. Neither will the authors share training and validation data. However, training parameters are reported.

The authors will share the data for this work, namely the knee images with anatomical labels. Other datasets, such as those from the sheep joint experiment, or the source code could also be shared if readers ask for them.

Define the acronyms ACL
Will be corrected (ACL = anterior cruciate ligament)

ATE is used as the metric to compare predicted poses to the ground truth. This metric neglects orientation details. Is orientation less important in the target application? Describe. The rotation error estimated using ATE is provided in Fig. 2(d), right column, 3rd row.

Rev 2

  • The coefficients used in the experiments for alpha in Eq. 2 and lambda_{smoo} in Eq. 5 are not given. These will be added: 0.85 and 1e-3, respectively.

  • It seems you use a stereoscopic endoscope; it would be nice to mention this earlier, in the introduction. The use of the stereoscope will be mentioned.

  • Equation 3: is the min pixel-wise or image-wise? Should you write Ihat_t \in I_S below ‘min’?
  • Equation 4: I don’t understand this formula; it has the same problem as Eq. 3, with confusion between I_S and Ihat_t. L_phot(I_t, I_s) is the unwarped photometric loss, whereas L_phot(I_t, Ihat_t) is the warped one. Also, L_phot(I_t, I_S) will be changed to L_phot(I_t, I_s); I_s is a reference image from the S images. (A generic sketch of these losses is given after this list.)

  • Equation 5: shouldn’t it be e^{-|…|}, with a minus? This will be corrected.
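
For completeness, a minimal PyTorch sketch of the Monodepth2-style objectives this exchange refers to, using the stated alpha = 0.85 and lambda_{smoo} = 1e-3 (a sketch under those assumptions; the paper's exact formulation may differ). The per-pixel minimum over warped source views corresponds to the 'min' in Eq. 3, and the edge-aware e^{-|…|} term to Eq. 5:

    import torch
    import torch.nn.functional as F

    def photometric_loss(I_a, I_b, alpha=0.85):
        """L_phot: alpha * (1 - SSIM)/2 + (1 - alpha) * L1, per pixel."""
        l1 = (I_a - I_b).abs().mean(1, keepdim=True)
        mu_a, mu_b = [F.avg_pool2d(x, 3, 1, 1) for x in (I_a, I_b)]
        var_a = F.avg_pool2d(I_a ** 2, 3, 1, 1) - mu_a ** 2
        var_b = F.avg_pool2d(I_b ** 2, 3, 1, 1) - mu_b ** 2
        cov = F.avg_pool2d(I_a * I_b, 3, 1, 1) - mu_a * mu_b
        c1, c2 = 0.01 ** 2, 0.03 ** 2
        ssim = ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
            (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
        ssim = ssim.mean(1, keepdim=True).clamp(0, 1)
        return alpha * (1 - ssim) / 2 + (1 - alpha) * l1

    def self_supervised_loss(I_t, warped_srcs, disp=None, lambda_smoo=1e-3):
        """Pixel-wise minimum over warped source views, plus edge-aware
        smoothness on the disparity, weighted by e^{-|grad I|}."""
        per_src = torch.stack([photometric_loss(I_t, w) for w in warped_srcs])
        loss = per_src.min(dim=0).values.mean()  # min is per pixel, not per image
        if disp is not None:
            dx_d = (disp[..., :, 1:] - disp[..., :, :-1]).abs()
            dy_d = (disp[..., 1:, :] - disp[..., :-1, :]).abs()
            dx_i = (I_t[..., :, 1:] - I_t[..., :, :-1]).abs().mean(1, keepdim=True)
            dy_i = (I_t[..., 1:, :] - I_t[..., :-1, :]).abs().mean(1, keepdim=True)
            smooth = (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()
            loss = loss + lambda_smoo * smooth
        return loss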

Are the 5 sequences from only one human? In the experimental setup you mention at least 3 cadaver experiments for the test set. The experiments with more than 3 cadavers were related to the segmentation training (Section 4.2); Section 4.1 relates to the depth+pose training.

I am not sure I understand what ‘self-supervised only’ means. Is it when L_{pose} is not used? That is correct.

  1. Conclusion
    • “The proposed domain adaptive approach” -> What do you mean by domain adaptive approach?

Our model is trained on datasets containing non-human knees that have camera pose annotation, and then adapted to a dataset containing human knees that have semantic segmentation annotations of the femur, ACL, and meniscus. Hence, our method needs to adapt between non-human and human knee images, which form two separate data domains.

Rev 4

Do the video images of the sheep knee look very similar to the cadaver images? Not structurally, but texturally they do.

Rev 5

Qualitative errors shown in Fig. 2(d) are from animal data, which is the texture-rich phantom data. That is not correct: the sheep joint is also texture-poor, similar to the human knee, and that is the reason it was included in the training, for the sake of domain adaptation. The only texture-rich data was from the 3D-printed knee.

Finally, there is no quantitative evaluation for semantic segmentation. Can the authors report something like a Dice score? This will be mentioned. The mean Intersection over Union (IoU) over three cadaver knee arthroscopic datasets is 0.793, 0.560, and 0.581 for the tissue types bone (femur and tibia), ACL, and meniscus, respectively.
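
For reference, per-class IoU figures like these are typically computed as follows (a minimal numpy sketch; the label encoding shown is hypothetical, since the paper does not specify it):

    import numpy as np

    def per_class_iou(pred, gt, class_ids=(1, 2, 3)):
        """IoU per class over integer label maps of identical shape.

        class_ids might encode 1 = bone (femur/tibia), 2 = ACL, 3 = meniscus;
        this encoding is an assumption for illustration.
        """
        ious = {}
        for c in class_ids:
            p, g = pred == c, gt == c
            union = np.logical_or(p, g).sum()
            ious[c] = np.logical_and(p, g).sum() / union if union else np.nan
        return ious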


