Learning dissection trajectories from expert surgical videos via imitation learning with equivariant diffusion

Although deep learning-based approaches have demonstrated significant potential in surgical scene analysis (Maier-Hein et al., 2017, Loftus et al., 2020, Maier-Hein et al., 2022), improving tasks such as intelligent workflow recognition (Garrow et al., 2021, Jin et al., 2022) and scene understanding (Allan et al., 2020, Nwoye et al., 2022), research on precise, advanced assistance for surgical procedures remains limited. One of the most critical tasks is aiding decision-making on dissection trajectories (Wang et al., 2022a, Guo et al., 2020, Qin et al., 2020), which is essential for ensuring surgical safety. Endoscopic Submucosal Dissection (ESD), a procedure used to treat early-stage gastrointestinal cancers (Zhang et al., 2020, Lau et al., 2021), involves multiple dissection maneuvers that demand substantial experience to identify the optimal path and that impose significant stress on surgeons. Informative suggestions on dissection trajectories can greatly assist surgeons by reducing operative errors (Kim et al., 2011), minimizing the risk of complications, and providing feedback for surgical skills training (Laurence et al., 2012). However, predicting the optimal path from endoscopic video is difficult. Firstly, determining dissection trajectories is challenging even for expert surgeons with many years of experience, because numerous factors must be weighed, such as the safety margins around the tumor. Secondly, blurred scenes and poor visual conditions can further impede scene recognition (Wang et al., 2022b). To date, no data-driven solutions have been developed to predict dissection trajectories. We argue that it is feasible to learn this skill from expert video demonstrations.

Imitation learning has been extensively researched across various fields due to its strong ability to acquire complex skills (Hussein et al., 2017, Kläser et al., 2021, Le Mero et al., 2022). However, it requires adaptation and enhancement when applied to learning dissection trajectories from surgical data. One major challenge is the inherent uncertainty of future trajectories. Supervised methods, such as Behavior Cloning (BC) (Codevilla et al., 2019), tend to average over all plausible paths, resulting in inaccurate forecasts. Although advanced probabilistic models aim to capture the complexity and variability of dissection trajectories (Li et al., 2017, Ren et al., 2021, Ke et al., 2021), ensuring reliable predictions across different surgical scenarios remains a significant challenge. To address these issues, implicit models have been developed to represent the policy, leading us to adopt Implicit Behavior Cloning (iBC) (Florence et al., 2022). iBC can learn robust representations by capturing the shared features of visual inputs and trajectory predictions through a unified implicit function, providing superior expressivity and improved visual generalization. Nonetheless, these methods have limitations. For example, techniques based on energy-based models (EBMs) (Florence et al., 2022, Jarrett et al., 2020, Ganapathi et al., 2022, Du and Mordatch, 2019) are computationally intensive because they rely on Langevin dynamics, which slows training. Moreover, their performance can be sensitive to the data distribution, and noise in the training data can lead to unstable trajectory predictions. In addition, trajectory prediction tasks inherently exhibit geometric symmetries such as rotations, yet implicit policies are harder to optimize than explicit ones, which makes it more difficult to exploit the symmetries underlying the task.
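The averaging behavior of regression-style Behavior Cloning can be illustrated with a toy example. The sketch below is not taken from our experiments: it assumes hypothetical 2D demonstrations with two equally valid paths from the same starting point, and it uses the fact that the prediction minimizing mean squared error for a fixed input is the mean of the demonstrated targets. All function names and parameter values are illustrative assumptions.

```python
import numpy as np

def make_demonstrations(n_per_mode=100, n_points=8, noise=0.01, seed=0):
    """Hypothetical toy data: two equally valid 2D paths from one start point."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n_points)
    upper = np.stack([t, t], axis=-1)     # mode A: path veers upward
    lower = np.stack([t, -t], axis=-1)    # mode B: path veers downward
    modes = np.stack([upper, lower])      # shape (2, n_points, 2)
    demos = np.repeat(modes, n_per_mode, axis=0)
    return demos + noise * rng.standard_normal(demos.shape), modes

def mse_optimal_prediction(demos):
    """For a fixed input, the MSE-minimizing prediction is the sample mean."""
    return demos.mean(axis=0)

def distance_to_nearest_mode(trajectory, modes):
    """Mean point-wise distance to the closer of the two demonstrated modes."""
    per_mode = np.linalg.norm(modes - trajectory, axis=-1).mean(axis=-1)
    return per_mode.min()

demos, modes = make_demonstrations()
averaged = mse_optimal_prediction(demos)

# The averaged path runs between the two modes and matches neither of them.
bc_error = distance_to_nearest_mode(averaged, modes)
# A single demonstration (one sample from the data distribution) stays on a mode.
sample_error = distance_to_nearest_mode(demos[0], modes)

print(f"averaged prediction error: {bc_error:.3f}")
print(f"single demonstration error: {sample_error:.3f}")
```

In this toy setting the averaged path lies roughly midway between the two demonstrated paths, whereas any individual demonstration stays close to one of them. This is the motivation for policies that model a distribution over trajectories rather than a single point estimate.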

In this paper, we investigate the task of predicting dissection trajectories in endoscopic submucosal dissection surgery using imitation learning on expert video data. We present a novel method for this purpose, called Implicit Diffusion Policy with Equivariant Representations for Imitation Learning (iDPOE). The graphical abstract of our method is depicted in Fig. 1. Firstly, to effectively model the surgeon’s behaviors and capture the significant variation in surgical scenes, we employ implicit modeling to represent expert dissection skills. Secondly, to overcome the inefficient training and unstable performance of implicit policies based on energy-based models, we formulate the implicit policy using an unconditional diffusion model, which excels at representing complex high-dimensional data distributions such as images or videos. Additionally, we integrate rotational symmetry into the diffusion model, allowing it to learn the equivariance properties of dissection trajectories. Furthermore, we develop a conditional action inference strategy guided by forward diffusion to generate predictions from the implicit policy. To evaluate the effectiveness of our method, we curated a surgical video dataset of ESD procedures, comprising nearly two thousand annotated dissection trajectories. Our results demonstrate that our method outperforms state-of-the-art trajectory prediction methods across various surgical scenarios. Our main contributions are as follows: (1) we propose to use diffusion models as a powerful implicit policy learning method for surgical trajectory prediction. Our diffusion-based method enables efficient modeling of complex surgical trajectories directly from high-dimensional endoscopic videos. (2) We propose to explicitly embed geometric equivariance into a diffusion-based imitation learning framework. By explicitly modeling the geometric symmetries inherent in dissection trajectories, our method enhances trajectory prediction performance across varied surgical contexts. (3) We evaluate our method comprehensively on real-world endoscopic surgical video datasets. Our experimental results demonstrate the superior performance of our method in trajectory prediction, generalization ability, and robustness compared to prior methods.
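The rotational equivariance property referred to above can be stated as f(Rx) = R f(x): rotating the input by R rotates the predicted trajectory by the same R. The following sketch checks this property numerically for two hypothetical toy predictors. It is only an illustration of the property, not the iDPOE network, and the predictors, the offset, and all names are assumptions made for the example.

```python
import numpy as np

def rotation_matrix(theta):
    """2D rotation by angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def rotate(points, theta):
    """Rotate an (N, 2) array of points about the origin."""
    return points @ rotation_matrix(theta).T

def extrapolate(history, horizon=3):
    """Toy predictor: continue the last observed step for `horizon` waypoints."""
    step = history[-1] - history[-2]
    return history[-1] + step * np.arange(1, horizon + 1)[:, None]

def biased_extrapolate(history, horizon=3):
    """Same predictor plus a fixed offset, which breaks rotational equivariance."""
    return extrapolate(history, horizon) + np.array([0.1, 0.0])

def equivariance_gap(policy, history, theta):
    """Largest deviation between policy(R x) and R policy(x)."""
    rotated_then_predicted = policy(rotate(history, theta))
    predicted_then_rotated = rotate(policy(history), theta)
    return np.abs(rotated_then_predicted - predicted_then_rotated).max()

history = np.array([[0.0, 0.0], [0.2, 0.1], [0.4, 0.3]])
theta = np.pi / 2

gap_equivariant = equivariance_gap(extrapolate, history, theta)
gap_biased = equivariance_gap(biased_extrapolate, history, theta)

print(f"equivariant predictor gap: {gap_equivariant:.2e}")
print(f"biased predictor gap: {gap_biased:.2e}")
```

The linear extrapolation commutes with rotation, so its gap is at the level of floating-point error, while the predictor with a fixed offset does not. A model with equivariance built in satisfies the first behavior by construction rather than having to learn it from rotated examples.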

A preliminary version of this work was presented at MICCAI 2023 (Li et al., 2023). To further advance our study, we have substantially revised and extended the conference paper. This paper introduces the following extensions to our previous work: (1) we further improve our method by incorporating equivariance in the reverse process of the diffusion model for policy learning; (2) we add more ESD surgery cases and conduct comprehensive experiments to evaluate the effectiveness of our method on the extended dataset; (3) we show that our method outperforms competing methods on the dissection trajectory prediction task; and (4) we discuss our method in more detail.
