TransVFS: A spatio-temporal local–global transformer for vision-based force sensing during ultrasound-guided prostate biopsy

Prostate cancer (PCa) is the most prevalent malignant tumor of the male reproductive system and one of the leading causes of cancer-related death in men (Pinsky and Parnes, 2023). In clinical practice, the surgeon performs a digital rectal examination (DRE) and a prostate-specific antigen (PSA) test to screen patients potentially affected by PCa (Matuszczak et al., 2021). When distinct tactile differences are felt and elevated PSA levels are observed, the surgeon relies on further imaging examinations, such as magnetic resonance imaging (MRI) or ultrasound (US) scanning, to localize the cancer. If suspected lesions appear on MR images but are invisible on US images, the surgeon manually performs a transrectal ultrasound (TRUS)-guided prostate biopsy to reexamine these areas, which serves as the gold standard for diagnosing and evaluating the aggressiveness of PCa (Tucan et al., 2017). However, conventional manual prostate biopsy depends on the operator's experience, and its efficiency and accuracy are affected by factors such as hand tremor and fatigue, which result from non-intuitive hand–eye coordination and prolonged concentration (Wang et al., 2023). Compared with manual biopsy, robot-assisted biopsy offers a high degree of automation, millimeter-scale targeting error, a greatly reduced human workload, and shorter operation time (Tokuda et al., 2012; Zhang et al., 2023). Benefiting from these strengths, robot-assisted prostate biopsy systems have attracted extensive attention and opened up new research directions (Wang et al., 2021a).

Since US is low-cost, real-time and radiation-free, it has been widely used to guide robot-assisted prostate biopsy (Ho et al., 2011; Long et al., 2012; Poquet et al., 2015; Lim et al., 2019; Yan et al., 2022). For example, Prosper, developed by Long et al. (2012), could freely adjust the needle's rotation angle and track prostate movement intraoperatively. Apollo, designed by Poquet et al. (2015), adopted modular operation modes that could closely meet the surgeon's needs. Lim et al. (2019) proposed a US-guided 6-degree-of-freedom (DOF) biopsy robot with a belt-driven remote center of motion (RCM) structure, which was more compact and achieved a 1 mm placement error in a clinical trial. Meanwhile, some MRI/US fusion-guided prostate biopsy robots have also been investigated, e.g., BIO-PROS-1 and BIO-PROS-2 designed by Pisla et al. (2016, 2017) and PROST proposed by Maris et al. (2022), where MRI serves as an auxiliary modality to assist US in locating suspected lesions.

However, existing prostate biopsy robots lack sensing of the interaction force between the surgical instrument and soft tissue, and many surgical robots applied in other minimally invasive surgery (MIS) scenarios share the same drawback (Ehrampoosh et al., 2013; Bayle et al., 2014). Force sensing has been proven to be highly important for safe interventional procedures (Abdi et al., 2020). With holistic force sensing and feedback, surgeons are provided with richer information about the stiffness and shape of the contacted tissue, enabling the differentiation of abnormal tissues such as tumors (Okamura et al., 2010). Meanwhile, precise force feedback helps ensure that the applied force stays within a safe range, as excessive forces may cause unexpected trauma to delicate internal organs (Zareinia et al., 2016). Furthermore, exploiting the fed-back force and torque signals to assist motion planning, robot control and surgical instrument manipulation is of great significance for fast and accurate interventional surgery (Kuang et al., 2020; Gidde et al., 2023).

In clinical practice, fast and accurate force sensing is important for controlling biopsy robots, where real-time response and small estimated force and torque errors are the major criteria for evaluating the success of a force sensing method (Kueffer et al., 2023). To this end, we aim to equip biopsy robots with the ability to sense the interaction force quickly and accurately, thereby reducing surgeons' labor intensity and further improving the safety of biopsy operations.

To address the lack of force sensing and feedback in surgical robots, direct and indirect sensing approaches have been widely explored. In direct force sensing, a sensor is generally mounted at the distal or proximal end of the surgical instrument to measure the force. Although distal-sensor-based methods (Li et al., 2020, 2021) capture the most intuitive interaction between the surgical instrument and tissue, they are constrained by the biocompatibility, sterilization, size and cost of force sensors (Stephens et al., 2019). Meanwhile, proximal-sensor-based approaches (Lai et al., 2019; Fontanelli et al., 2020) may yield inaccurate force measurements when the surgical instrument bends under tissue compression (Ravali and Manivannan, 2017).

As an alternative to direct sensing, indirect force sensing can be realized through control-based or vision-based strategies. In control-based approaches, the interaction force is estimated by analyzing available physical parameters such as surgical instrument positions and motor currents (Lee et al., 2018, 2021; Guo et al., 2021). Generally, these approaches are limited to low-dimensional data representations, which leads to insufficient performance in complex force sensing tasks. In contrast, vision-based force sensing (VFS) methods estimate the interaction forces from the deformation of soft tissues observed in images. VFS methods require no additional hardware, are easier to deploy than sensor-based approaches, and can potentially provide more accurate force estimation than control-based ones, since images contain richer information than the above-mentioned physical parameters. Benefiting from these strengths, VFS has attracted extensive attention and numerous VFS-related approaches have been proposed (Marban et al., 2019; Gessert et al., 2019, 2020).

VFS methods aim to estimate the force from images acquired by devices such as RGB-Depth (RGB-D) cameras and optical coherence tomography (OCT) instruments. Early VFS methods were based on hand-crafted feature extraction algorithms or classical machine learning models. For instance, Greminger and Nelson (2004) proposed a deformable template matching approach that estimates the force distribution by inferring the displacement of the object's contour. Mozaffari et al. (2014) used a neuro-evolutionary fuzzy system to identify the tool–tissue interaction force during robotic laparoscopic surgery. Khoshnam et al. (2015) proposed a kinematic model to measure the interaction force between a catheter and cardiac tissue. However, these methods are mainly limited to simulation environments, which prohibits their application in clinical practice.

Recently, with the advancement of deep learning (DL), many neural-network-based VFS methods have been presented. Marban et al. (2019) utilized four RGB-D cameras to capture the deformation of a liver phantom and applied a recurrent convolutional neural network (CNN) to estimate the contact force; specifically, a long short-term memory (LSTM) network with four gating modes was used to model long-distance relationships. Fekri et al. (2022) proposed a simple but effective CNN to estimate the contact forces of intracardiac catheters. Gessert et al. (2020) confirmed the feasibility of realizing VFS from 4D OCT volume sequences and further investigated the effects of four different spatio-temporal schemes. However, these methods operate on surface or sub-surface images of the tissue acquired by optical imaging instruments, and they are not suitable for prostate biopsy, where the overall structure of the prostate must be imaged for force sensing. Besides, these DL-based methods are mainly combinations of a CNN and a recurrent neural network (RNN), which cannot capture long-range spatio-temporal relationships, leading to insufficient modeling of complicated prostate deformation and inaccurate force sensing.
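To make the CNN + RNN family of VFS models discussed above concrete, the following minimal PyTorch sketch encodes each frame with a small CNN, aggregates the frame features over time with an LSTM, and regresses a 6-DOF force/torque vector. The layer sizes and input shapes are illustrative assumptions, not the configuration of any cited work.

```python
import torch
import torch.nn as nn


class CnnLstmForceRegressor(nn.Module):
    """Illustrative CNN + LSTM force regressor (assumed, not paper-specific, sizes)."""

    def __init__(self, in_channels=1, feat_dim=128, hidden_dim=64, out_dim=6):
        super().__init__()
        # Per-frame spatial encoder, shared across time steps.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        # Temporal aggregation over the per-frame features.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, out_dim)

    def forward(self, frames):  # frames: (batch, time, channels, height, width)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])  # force/torque estimate at the last time step


# Usage: an 8-frame grayscale image sequence -> one 6-DOF force/torque prediction.
pred = CnnLstmForceRegressor()(torch.randn(2, 8, 1, 64, 64))  # shape: (2, 6)
```

As noted above, the recurrence only passes information step by step along the sequence, which is why such models struggle to capture long-range spatio-temporal dependencies.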

The Transformer, which models long-range dependencies through the attention mechanism, provides a potential solution to the above problem. It originated in the natural language processing domain (Vaswani et al., 2017) and has quickly flourished in the computer vision (CV) domain. The Vision Transformer (ViT) (Dosovitskiy et al., 2021) is an important milestone in the Transformer's development, and its variants have been presented for image processing (Wang et al., 2021b; Liu et al., 2021) and video understanding (Fan et al., 2021; Liang et al., 2022; Li et al., 2022; Weng et al., 2022; Ahn et al., 2023). Since most VFS methods involve spatio-temporal learning, Transformer networks developed specifically for video-based CV tasks are our major concern. For example, a multi-scale ViT was presented in Fan et al. (2021) to reduce complexity by pooling the lengths of the query, key and value features. A stratified Transformer was proposed in Liang et al. (2022) to relieve the extra computational burden in video action recognition by utilizing local window separation and global feature aggregation strategies. Li et al. (2022) proposed a unified Transformer with both a convolution-based local module and an attention-based global module. Weng et al. (2022) analyzed the necessity of applying the global attention module at all stages of a Transformer and rearranged the order of the local and global attention modules. However, these methods are mainly designed for 3D video processing, and very little research has been done on 4D data processing.
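The local–global pattern shared by the works above can be sketched as follows: a convolution-based local module captures short-range structure, and a multi-head self-attention module aggregates global context over the flattened token sequence. This PyTorch sketch is only a generic illustration of that design on a volumetric feature map, with assumed layer sizes; it is not the architecture of any cited method or of TransVFS itself.

```python
import torch
import torch.nn as nn


class LocalGlobalBlock(nn.Module):
    """Illustrative local-global block: depth-wise conv (local) + self-attention (global)."""

    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        # Local module: depth-wise 3D convolution over (depth, height, width).
        self.local = nn.Conv3d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm1 = nn.LayerNorm(dim)
        # Global module: multi-head self-attention over all voxel tokens.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):  # x: (batch, dim, depth, height, width)
        x = x + self.local(x)                       # local aggregation
        b, c, d, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)       # (batch, d*h*w, dim)
        t = self.norm1(tokens)
        tokens = tokens + self.attn(t, t, t, need_weights=False)[0]  # global aggregation
        tokens = tokens + self.mlp(self.norm2(tokens))
        return tokens.transpose(1, 2).view(b, c, d, h, w)


# Usage on a toy volumetric feature map; the output keeps the input shape.
out = LocalGlobalBlock()(torch.randn(1, 64, 8, 16, 16))
```

Extending such blocks from 3D video features to 4D volume sequences is precisely where little prior work exists, as discussed above.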

Besides, in the medical domain, even fewer video Transformer approaches have been explored. Gao et al. (2021) proposed a sandwich-shaped CNN–Transformer hybrid network for surgical phase recognition. Jin et al. (2022) introduced the Swin Transformer (Liu et al., 2021) into surgical semantic scene segmentation, where it outperformed the compared methods by exploiting the shifted-window mechanism. Inspired by ViT, Nwoye et al. (2022) proposed a novel attention module to effectively capture the relationships between different instruments within a surgical video. Lyu et al. (2023) combined a Transformer with a generative adversarial network to reconstruct cardiac cine MR images. Nevertheless, these video Transformer methods mostly focus on 2D surgical videos and seldom consider 4D volume sequences.
