Next-generation surgical navigation: Marker-less multi-view 6DoF pose estimation of surgical instruments

Computer-assisted interventions have benefited significantly from advances in computer vision (Mascagni et al., 2022), which increase autonomy, accuracy, and usability in tasks such as navigation, surgical robotics, surgical phase recognition, and automated performance assessment (Farshad et al., 2021, Doughty and Ghugre, 2022, Haidegger et al., 2022, Garrow et al., 2021, Lam et al., 2022). While most methods are currently studied in isolation for specific use cases, the long-term goal is to integrate them holistically into a new generation of operating rooms optimized for computer vision (Feußner and Park, 2017, Maier-Hein et al., 2022, Özsoy et al., 2023). In such operating rooms, the available data streams support the surgical staff in all relevant aspects of a surgery, ranging from clinical process optimization to precision surgery (Özsoy et al., 2022).

In precision surgery, surgical navigation is of particular importance, as it improves the safety and efficiency of interactions between the surgeon, instruments, and the patient (Virk and Qureshi, 2019). Marker-based navigation systems have been available for more than two decades and have been shown to increase accuracy and reduce revision rates (Girardi et al., 1999, Luther et al., 2015, Perdomo-Pantoja et al., 2019). However, their limited applicability and inherent technical restrictions, such as line-of-sight issues, extensive calibration requirements, and the impracticality of large tracking markers, complicate their integration into existing workflows and limit their acceptance and dissemination (Härtl et al., 2013, Joskowicz and Hazan, 2016). In contrast, marker-less approaches have significant potential to integrate seamlessly into the surgical workflow and to considerably reduce logistics and calibration overhead.

As a fundamental computer vision problem, marker-less object pose estimation remains an active research focus with a continuously improving state of the art. Outside of the medical domain, most proposed methods operate on single RGB frames due to their broad applicability (Hinterstoisser et al., 2012, Xiang et al., 2018, Wang et al., 2021); however, their accuracy is constrained by depth ambiguities. Other works address this limitation by incorporating RGB-D sensors (Labbé et al., 2020, Haugaard and Buch, 2022) or multiple cameras (Labbé et al., 2020, Shugurov et al., 2021, Haugaard and Iversen, 2023). In particular, multi-view methods show potential for high pose accuracy and occlusion robustness due to the redundancy of multiple viewpoints and the robust triangulation enabled by wide-baseline camera setups. Such state-of-the-art object pose estimation methods have been successfully applied in fields such as robotic grasping (Wang et al., 2019), augmented reality (Liu et al., 2022), and space applications (Hu et al., 2021). However, a systematic evaluation of the feasibility and requirements of these methods in surgery is still lacking, primarily due to the absence of publicly available datasets for training and evaluation. This lack of suitable benchmarks has been recognized as a key challenge in translating state-of-the-art methods to the surgical domain (Bouget et al., 2017, Mascagni et al., 2022).
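
To illustrate why wide-baseline multi-view setups resolve the depth ambiguity of single-view methods, the following minimal sketch triangulates a 3D point from two calibrated views using the standard direct linear transform (DLT); the projection matrices and pixel correspondences are assumed to be given and are not part of any specific method discussed above.

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Triangulate one 3D point from two calibrated views via the
    direct linear transform (DLT). P1, P2 are 3x4 projection matrices;
    x1, x2 are the corresponding pixel coordinates (u, v)."""
    # Each view contributes two linear constraints on the homogeneous 3D point.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # Least-squares solution: right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenize

```

Wider baselines yield a better-conditioned linear system and hence smaller depth errors, which is the geometric argument behind the accuracy of multi-view setups.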

Several works have investigated marker-less approaches for pose estimation and tracking of surgical instruments; however, the proposed approaches often rest on strong assumptions about the instrument shape (Hasan et al., 2021, Chiu et al., 2022) or image appearance (Allan et al., 2015). These assumptions restrict their generalization and applicability to a broader range of instruments and use cases. Other works propose registration-based methods using depth sensors (Lee et al., 2017), or exploit correlations between the hand and the hand-held instrument for pose estimation (Hein et al., 2021, Doughty and Ghugre, 2022). Still, these monocular methods fail to achieve sufficient accuracy due to their limited robustness to occlusions and noisy depth measurements. Despite the evident potential of multi-view methods, no such approach has yet been proposed for surgical instrument pose estimation or tracking.

Dedicated multi-view datasets can support the development of multi-view approaches; however, such datasets remain scarce in both quantity and quality. In the surgical domain, most existing datasets provide 2D annotations such as bounding boxes, tool tip positions, or segmentation masks (Sarikaya et al., 2017, Allan et al., 2020), but lack 6DoF pose annotations due to the added complexity during data acquisition. To address this challenge, some datasets automatically annotate 6DoF instrument poses based on the surgeon’s hand pose and grasp information (Hein et al., 2021, Wang et al., 2023). However, the accuracy of the estimated instrument pose is often insufficient for clinical applications due to accumulating errors in the hand pose and grasp estimation. Notable exceptions are datasets collected on the Da Vinci robotic platform (Allan et al., 2015, Speidel et al., 2023). While these datasets include accurate 6DoF pose annotations, they are inherently limited to minimally invasive surgery and the specific robotic instruments used with the Da Vinci system. Complementary to real-world data collection, some works generate synthetic images of hand-held surgical instruments (Hein et al., 2021, Birlo et al., 2024) to support the training process. Nevertheless, real-world data remains essential for evaluating a method’s accuracy under realistic conditions. To the best of our knowledge, no publicly available benchmark exists that enables a systematic evaluation of state-of-the-art single-view and multi-view approaches, based on RGB or RGB-D data, for surgical instrument tracking.

In this work, we address the existing limitations in surgical instrument tracking through three key contributions. First, we introduce the first public and comprehensive multi-modal, multi-camera spine surgery dataset to overcome the lack of benchmarks. The dataset includes 23 recordings of surgical procedures on human ex-vivo anatomy performed by five operators using two distinct instruments. The capture setup comprises RGB-D video streams from seven cameras in static and head-mounted configurations, collected in both a surgical wet lab and a mock operating room (see Fig. 1). A marker-based tracking system with sub-millimeter accuracy provides precise pose annotations for the surgical instruments, the patient anatomy, and the head-mounted devices (HMDs). This dataset establishes a robust benchmark for advancing research on pose estimation and tracking of surgical instruments. Moreover, the rich annotations and modalities broaden the dataset’s applicability to several related tasks such as hand or joint hand-object pose estimation and tracking (Hein et al., 2021, Wang et al., 2023), reconstruction (Leng et al., 2023), or novel view synthesis (Mildenhall et al., 2021, Truong et al., 2023). In the clinical context, our dataset can serve as the basis for surgical behavior and interaction models built on the provided instrument, hand, and anatomy poses and the eye gaze information shown in Figs. 2 and 3. Finally, the instrument and anatomy information can be used to render digitally reconstructed radiographs (DRRs) of realistic instrument trajectories, enabling the training of pose estimation and phase detection models in the X-ray domain (Kügler et al., 2020, Killeen et al., 2023).
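
Ground-truth 6DoF annotations from a marker-based tracker are typically obtained by composing rigid transforms; the sketch below illustrates such a chain with hypothetical transform names, not the dataset’s actual calibration pipeline or API.

```python
import numpy as np

def rigid_transform(R, t):
    """Assemble a 4x4 homogeneous transform from rotation R (3x3)
    and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def instrument_pose_in_camera(T_cam_tracker, T_tracker_marker, T_marker_instrument):
    """Compose tracker measurements with calibrated offsets (names illustrative):
    T_cam_tracker       -- camera pose w.r.t. the tracking system (extrinsic calibration),
    T_tracker_marker    -- marker pose measured by the tracker,
    T_marker_instrument -- fixed marker-to-instrument offset (one-time calibration)."""
    return T_cam_tracker @ T_tracker_marker @ T_marker_instrument

```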

Second, we conduct an extensive evaluation of pose estimation methods to assess the feasibility of marker-less surgical instrument tracking. This evaluation benchmarks three state-of-the-art single-view and multi-view methods, examining the influence of camera quantity and placement, egocentric perspectives from HMDs, and varying camera configurations, including static, hybrid, and fully mobile setups. Furthermore, we analyze how different training strategies and limited real-world training data impact pose accuracy, occlusion robustness, and generalizability.
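
For context, pose accuracy in object pose benchmarks is commonly quantified with the average distance (ADD) metric of Hinterstoisser et al. (2012); the sketch below is a generic illustration of that metric, not necessarily our exact evaluation protocol.

```python
import numpy as np

def add_metric(R_est, t_est, R_gt, t_gt, model_points):
    """Average distance (ADD, Hinterstoisser et al., 2012): mean Euclidean
    distance between the instrument model points transformed by the
    estimated and the ground-truth 6DoF poses. model_points is (N, 3)."""
    pts_est = model_points @ R_est.T + t_est
    pts_gt = model_points @ R_gt.T + t_gt
    return np.linalg.norm(pts_est - pts_gt, axis=1).mean()

# A pose is commonly counted as correct if ADD falls below a fraction
# (e.g., 10%) of the object diameter; the exact threshold is illustrative.

```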

Third, we propose a 6DoF instrument tracking system and training strategy based on the results of our evaluation. The system integrates multiple off-the-shelf cameras with state-of-the-art pose estimation methods to address the challenges in the operating room. We demonstrate that marker-less tracking is becoming a viable alternative to existing marker-based navigation systems. The dataset is publicly available on our project page https://jonashein.github.io/mvpsp/.
