Holistic OR domain modeling: a semantic scene graph approach

In this section, we delineate the methods employed in our study, focusing on the construction of semantic scene graphs, the development of our 4D-OR dataset, the implementation of our scene graph generation pipeline and the downstream tasks of clinical role prediction and surgical phase recognition.

Semantic scene graphs

Semantic scene graphs (SSGs) provide a structured representation of objects and their semantic relationships within an environment. They are defined as a tuple \(\mathcal{G} = (\mathcal{N}, \mathcal{E})\), with \(\mathcal{N} = \{n_1,\ldots ,n_N\}\) a set of nodes and \(\mathcal{E} \subseteq \mathcal{N} \times \mathcal{R} \times \mathcal{N}\) a set of directed edges labeled with relationships \(\mathcal{R} = \{r_1,\ldots ,r_M\}\) [13]. Within a 3D scene, the corresponding SSG captures the entire environment including the location of each node. In the specific case of an OR, nodes in the graph encompass medical staff and equipment, such as the anesthesia machine or operating table. The edges represent the semantic interactions between nodes, such as a human drilling (into the bone of) the patient, as visualized in Fig. 2.
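As an illustration of this definition, the following minimal sketch shows how such a graph could be stored as typed nodes and directed (subject, relation, object) triplets; the entity and relation names are examples for illustration, not the dataset's label set.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Semantic scene graph: nodes are OR entities, edges are directed
    (subject, relation, object) triplets between node ids."""
    nodes: dict[int, str] = field(default_factory=dict)            # node id -> entity class
    edges: set[tuple[int, str, int]] = field(default_factory=set)  # (subject id, relation, object id)

    def add_relation(self, subj: int, relation: str, obj: int) -> None:
        self.edges.add((subj, relation, obj))


# Illustrative OR scene: a human drilling into the patient,
# who is lying on the operating table.
g = SceneGraph(nodes={0: "human", 1: "patient", 2: "operating_table", 3: "instrument"})
g.add_relation(0, "drilling", 1)
g.add_relation(1, "lying_on", 2)
g.add_relation(0, "holding", 3)
```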

Fig. 3: We visualize five exemplary relations, as well as the number of occurrences of all relations, entities and surgical phases in the 4D-OR dataset.

4D-OR dataset

To facilitate the modeling of intricate interactions in an OR using SSGs, we introduce the novel 4D-OR dataset. 4D-OR consists of ten simulated total knee replacement surgeries, which were conducted at a medical simulation center with input from orthopedic surgeons, ensuring a reasonable simulation of the surgical workflow. The actors, three males and two females, were PhD students in biomedical engineering and were briefed by surgeons on the surgical procedure they were simulating. We chose total knee replacement as our intervention type, a representative orthopedic surgery, as it encompasses various steps and diverse interactions. 4D-OR contains a total of 6734 scenes, captured by six calibrated RGB-D Kinect sensors located at the OR ceiling. We empirically fixed the number of cameras to six as a good trade-off between comprehensive OR coverage and practicality of the hardware setup. Recording was done at one frame per second and was hardware-synchronized across cameras. The average recording duration is 11 min, and the workflow can be seen as a simplified version of the real surgery. The roles of the actors were switched regularly to create variety in the dataset. Some examples of activities present in the dataset can be seen in Fig. 3. Notably, 4D-OR is the only semantically annotated OR dataset. In addition to the images and fused 3D point cloud sequences, our dataset contains automatically annotated 6D human poses and 3D bounding boxes for medical equipment. Additionally, we annotate an SSG for every time point, accompanied by the clinical roles of all humans present in the scene and the surgical phases. For every frame, the authors created one annotation in collaboration with medical experts.

Scene graph generation

In the task of scene graph generation, the goal is to determine the objects and their semantic connections given a visual input such as an image or a point cloud. To this end, we present a novel end-to-end scene graph generation (SGG) pipeline, which is illustrated in Fig. 1. In our approach, we first identify humans and objects in the OR and extract their visual features. Then, we construct a semantic scene graph by predicting their pairwise relationships. We utilize state-of-the-art methods, VoxelPose [25] and Group-Free [26], to estimate the human and object poses, respectively. We design an instance label computation method that uses the predicted poses to assign an instance label to each point in the point cloud. Furthermore, small and transparent medical instruments can be hard to localize in the point cloud, yet they must still be represented in our scene graph; to ensure their detection, we introduce a virtual node, termed instrument, to represent interactions between humans and medical instruments. For predicting the pairwise relationships, we build upon 3DSSG [17].
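To make the instance-label step concrete, the sketch below assumes objects are given as axis-aligned 3D boxes (as produced by an object detector such as Group-Free) and humans as sets of 3D joints (as produced by VoxelPose); the function signature and the fixed joint radius are illustrative assumptions rather than the exact procedure used in our pipeline.

```python
import numpy as np


def compute_instance_labels(points, object_boxes, human_joints, radius=0.3):
    """Assign an instance label to every point from predicted object boxes
    and human poses. Label 0 means background.

    points:       (P, 3) fused point cloud
    object_boxes: list of (min_xyz, max_xyz) axis-aligned boxes
    human_joints: list of (J, 3) joint arrays, one per detected person
    """
    labels = np.zeros(len(points), dtype=np.int64)
    instance_id = 1
    # Points inside a predicted object box get that object's instance id.
    for box_min, box_max in object_boxes:
        inside = np.all((points >= box_min) & (points <= box_max), axis=1)
        labels[inside] = instance_id
        instance_id += 1
    # Points close to any joint of a detected person get that person's id.
    for joints in human_joints:
        dists = np.linalg.norm(points[:, None, :] - joints[None, :, :], axis=-1)
        labels[dists.min(axis=1) < radius] = instance_id
        instance_id += 1
    return labels
```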

3DSSG employs a neural network-based strategy to predict node relationships. It takes a point cloud and the corresponding instance labels as input. Two PointNet [27]-based neural networks compute latent features: ObjPointNet processes the point cloud extracted for each object, while RelPointNet processes object pairs, taking the union of the two objects' point clouds as input for each pair. A graph convolutional network is then applied to contextualize the node and edge features. Lastly, multilayer perceptrons process the updated representations and predict object and relation classes. We train our scene graph generation network end-to-end using the cross-entropy loss. For our SGG method, we design the following OR-specific modifications to 3DSSG:
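Before detailing these modifications, the following minimal sketch outlines the 3DSSG-style backbone we build on, assuming generic PointNet encoders and a graph network are supplied by the caller; the module interfaces and layer sizes are placeholders, not the exact implementation.

```python
import torch.nn as nn


class SceneGraphHead(nn.Module):
    """3DSSG-style relation prediction: per-object and per-pair PointNet
    features, contextualized by a graph network, classified by MLPs.
    obj_pointnet / rel_pointnet / gcn stand in for the PointNet [27]
    encoders and the graph convolutional network; sizes are placeholders."""

    def __init__(self, obj_pointnet, rel_pointnet, gcn, n_obj_classes, n_rel_classes, dim=256):
        super().__init__()
        self.obj_pointnet = obj_pointnet   # encodes one object's points -> (dim,)
        self.rel_pointnet = rel_pointnet   # encodes the union of a pair's points -> (dim,)
        self.gcn = gcn                     # message passing over node/edge features
        self.obj_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, n_obj_classes))
        self.rel_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, n_rel_classes))

    def forward(self, obj_points, pair_points, pair_index):
        node_feats = self.obj_pointnet(obj_points)    # (N, dim), one feature per object
        edge_feats = self.rel_pointnet(pair_points)   # (E, dim), one feature per object pair
        node_feats, edge_feats = self.gcn(node_feats, edge_feats, pair_index)
        return self.obj_head(node_feats), self.rel_head(edge_feats)


# Training uses cross-entropy on both outputs:
# loss = ce(obj_logits, obj_labels) + ce(rel_logits, rel_labels)
```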

Multimodality by incorporating images: The OR comprises numerous objects of varying sizes. Small, reflective or transparent instruments, such as scissors or lancets, are not always adequately captured by point clouds, even though their correct identification is crucial for many relationships. The vanilla 3DSSG often struggles with these relationships. We therefore incorporate images alongside point clouds into our pipeline by extracting global image features with EfficientNet-B5 [28] and aggregating them with the PointNet features, enabling multimodal input for scene graph generation.
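One plausible realization of this aggregation is to concatenate a global EfficientNet-B5 feature with the PointNet feature and project the result, as sketched below; the concatenation-plus-projection scheme and the feature dimensions are assumptions for illustration, not necessarily the exact fusion used in our pipeline.

```python
import torch
import torch.nn as nn
import torchvision


class ImagePointFusion(nn.Module):
    """Fuse a global image feature with a PointNet feature by concatenation
    followed by a linear projection."""

    def __init__(self, point_dim=256, out_dim=256):
        super().__init__()
        backbone = torchvision.models.efficientnet_b5(weights=None)  # or pretrained ImageNet weights
        self.image_encoder = nn.Sequential(backbone.features, nn.AdaptiveAvgPool2d(1), nn.Flatten())
        img_dim = backbone.classifier[1].in_features  # 2048 for EfficientNet-B5
        self.project = nn.Linear(point_dim + img_dim, out_dim)

    def forward(self, point_feat, image):
        img_feat = self.image_encoder(image)               # (B, 2048) global image feature
        fused = torch.cat([point_feat, img_feat], dim=-1)  # concatenate both modalities
        return self.project(fused)                         # (B, out_dim) multimodal feature
```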

Data augmentation: To simulate real-world variations such as different clothing shades, lighting or object sizes, we augment the point clouds during training by applying random scale, position, orientation, brightness and hue changes. For point clouds associated with relationships, we augment the points of the two objects separately, simulating variations in their relative size and position. Finally, we employ a crop-to-hand augmentation, where we randomly crop the point cloud to the vicinity of the hands. This implicitly trains the network to concentrate on medical instruments when learning relations such as cutting, drilling or sawing.
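A minimal sketch of such point cloud augmentations is given below; the parameter ranges, the single z-axis rotation and the crop radius are illustrative choices, not the values used for training.

```python
import numpy as np


def augment_point_cloud(points, colors, rng):
    """Random scale / position / orientation plus brightness and hue-like
    colour jitter. Ranges are illustrative."""
    scale = rng.uniform(0.9, 1.1)
    shift = rng.uniform(-0.1, 0.1, size=3)
    angle = rng.uniform(0, 2 * np.pi)
    rot = np.array([[np.cos(angle), -np.sin(angle), 0],
                    [np.sin(angle),  np.cos(angle), 0],
                    [0, 0, 1]])
    points = (points * scale) @ rot.T + shift
    colors = np.clip(colors * rng.uniform(0.8, 1.2) + rng.uniform(-0.05, 0.05, size=3), 0, 1)
    return points, colors


def crop_to_hands(points, colors, hand_positions, radius, rng):
    """Crop-to-hand augmentation: keep only points near a randomly chosen
    hand joint, so the network focuses on hand-held instruments."""
    hand = hand_positions[rng.integers(len(hand_positions))]
    keep = np.linalg.norm(points - hand, axis=1) < radius
    return points[keep], colors[keep]
```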

Downstream tasks

We demonstrate the capabilities of our semantic scene graphs in two downstream tasks: clinical role prediction and surgical phase recognition. The former aims to predict the role of each member of the medical staff in the OR, while the latter aims to determine the current phase of the surgery. Both tasks use only the SSG, without any additional visual input, and benefit from the rich structural information it provides.

Clinical role prediction: To identify each individual’s role in the surgical setting, we first compute a track \(T\) for each person using a Hungarian matching algorithm that leverages the detected poses at each time stamp. Each track \(T\) of duration \(K\) consists of a selection of generated scene graphs \(G_i\), with \(i = 1,\ldots ,K\), and the corresponding human node \(n_i\) in each graph. The process of assigning clinical roles involves two primary steps: computing role likelihoods and assigning unique roles. For each track \(T\), we compute a probability score indicating the likelihood of a specific role. We employ Graphormer [29] to process all the scene graphs \(G_T\) within the track. By designating the nodes \(n_i\) as targets in their respective graphs \(G_i\), the network discerns which node embedding corresponds to the role. We compute the mean target node embedding over all the scene graphs in \(G_T\) and predict clinical role scores using a linear layer trained with the cross-entropy loss. Additionally, we introduce a heuristic-based method as a non-learning alternative for comparison, which uses the frequency of relations associated with each human node. For instance, the score for the head surgeon role increases with each sawing relation, while the score for the patient role increases with each lying on relation. Once the clinical role likelihoods are computed, we deduce the clinical role of each human node by solving a matching problem. By retrieving the role probabilities for each track, we match roles to nodes bijectively based on their probabilities, ensuring that each human node in the scene receives a distinct role, with the following algorithm (sketched in code after the list):

1. For each human node, retrieve the associated role probabilities.

2. Identify the node with the highest probability for a specific role.

3. Assign that role to the node with the highest probability.

4. Remove the assigned role from the role probabilities of all other nodes.

5. Renormalize the role probabilities for the remaining nodes.

6. Repeat steps 2–5 until each node has a unique role assignment.
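The following minimal sketch implements this greedy bijective assignment; the data layout (a dictionary mapping human node ids to per-role probabilities) and the example role names are assumptions for illustration.

```python
def assign_unique_roles(role_probs):
    """Greedy bijective role assignment following steps 1-6 above.

    role_probs: dict mapping human node id -> {role name: probability}.
    Assumes at least as many candidate roles as human nodes.
    """
    role_probs = {n: dict(p) for n, p in role_probs.items()}  # work on a copy
    assignment = {}
    while role_probs:
        # Steps 2/3: find the (node, role) pair with the highest probability and assign it.
        node, role = max(((n, r) for n, probs in role_probs.items() for r in probs),
                         key=lambda nr: role_probs[nr[0]][nr[1]])
        assignment[node] = role
        del role_probs[node]
        # Steps 4/5: remove the assigned role everywhere else and renormalize.
        for probs in role_probs.values():
            probs.pop(role, None)
            total = sum(probs.values())
            if total > 0:
                for r in probs:
                    probs[r] /= total
    return assignment


# Example with hypothetical probabilities:
# assign_unique_roles({0: {"head_surgeon": 0.7, "assistant": 0.3},
#                      1: {"head_surgeon": 0.6, "assistant": 0.4}})
# -> {0: "head_surgeon", 1: "assistant"}
```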

Surgical phase recognition: To detect the different phases of the surgical procedure, we first divide the surgery into eight distinct phases, as listed in Table 3. For defining the phases, we follow the definitions of Sharghi et al. [23]. The phases with a “Surgery:” prefix denote the main surgical operations, i.e., when the patient would be under anesthesia. Given the predicted scene graphs G from a surgery, we first enrich them by predicting the clinical roles of the medical staff. Then, we determine the phase corresponding to each scene by querying the scene graphs for specific triplets, such as “head surgeon sawing patient,” which we map to certain surgical phases. As our surgical phase recognition algorithm itself does not rely on a learning-based approach, it is transparent and does not need any additional annotations. Since our semantic scene graphs already summarize the surgery at a high level, the phases can be detected with the following heuristics (a code sketch follows the list):

1. OR Preparation: SG does not include \(\textbf{patient}\) and surgery did not start

2. Patient Roll-In: SG includes \(\textbf{patient}\) and \(\textbf{operating table}\)

3. Patient Preparation: SG includes two \(preparing\ \textbf{patient}\) relations with different staff members as subjects

4. Implant Placement Preparation: SG includes \(\textbf{head surgeon}\ cutting\ \textbf{patient}\)

5. Implant Placement: SG includes \(\textbf{head surgeon}\ hammering\ \textbf{patient}\)

6. Conclusion: SG includes \(\textbf{head surgeon}\ cementing\ \textbf{patient}\)

7. Patient Roll-Out: SG includes \(\textbf{patient}\) and \(\textbf{operating table}\) and surgery is finished

8. OR Cleanup: SG does not include \(\textbf{patient}\) and surgery is finished
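As a minimal illustration of how these rules could be queried against a predicted, role-augmented scene graph, the sketch below assumes triplets are stored as (subject, relation, object) strings; the exact spellings, the rule ordering and the surgery-started/finished flags are simplifications for illustration, not the full rule set.

```python
def recognize_phase(graph_triplets, surgery_started, surgery_finished):
    """Map a role-augmented scene graph to a surgical phase.

    graph_triplets: set of (subject, relation, object) strings.
    """
    has_patient = any("patient" in triplet for triplet in graph_triplets)
    if not has_patient:
        return "OR Preparation" if not surgery_started else "OR Cleanup"
    if ("head surgeon", "cementing", "patient") in graph_triplets:
        return "Conclusion"
    if ("head surgeon", "hammering", "patient") in graph_triplets:
        return "Implant Placement"
    if ("head surgeon", "cutting", "patient") in graph_triplets:
        return "Implant Placement Preparation"
    if any(rel == "preparing" and obj == "patient" for _, rel, obj in graph_triplets):
        return "Patient Preparation"
    return "Patient Roll-In" if not surgery_finished else "Patient Roll-Out"
```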

Fig. 4: SGG results on two sample scenes. Only one input view is visualized for clarity.
