NeRF-OR: neural radiance fields for operating room scene reconstruction from sparse-view RGB-D videos

Neural radiance fields

Neural radiance fields (NeRF) are a method for rendering synthetic images of a volumetric scene from arbitrary camera angles [11]. A NeRF represents the scene with a multilayer perceptron (MLP) \(F_\Theta \). A dynamic NeRF can render videos of a scene that changes over time. Training a dynamic NeRF requires a set of images that capture the scene at different time values t and viewing directions d, together with the intrinsic and extrinsic parameters of the cameras used to obtain these images. To optimize a NeRF, camera rays r are cast through the scene, originating from randomly selected image pixels. A set of spatial 3D locations \(\{x_i\}_{i=1}^{S}\) is sampled along each ray. The MLP is queried with the locations, time values, and viewing directions and returns a material density \(\sigma \) and a color c for each sample:

$$\begin{aligned} (c, \sigma ) = F_\Theta (\gamma (x, t), d), \end{aligned}$$

(1)

where \(\gamma \) is the positional encoding function applied to locations and time values before they are passed to the MLP. Thereafter, the material densities and colors are accumulated along the ray using quadrature, resulting in a predicted pixel color:

$$\begin{aligned} \hat{C}(r) = \sum ^{S}_{i=1} U_i \big (1 - \exp (-\sigma _i \delta _i)\big ) c_i \end{aligned}$$

(2)

where \(U_i = \exp \big (-\sum ^{i-1}_{j=1}\sigma _j\delta _j\big )\), and \(\delta _i\) is the distance between adjacent sampled points along the ray. The predicted colors are compared with the actual colors in the training images:

$$\begin{aligned} L_{\text {color}} = \sum _{r \in R} \Vert \hat{C}(r) - C(r) \Vert _2^2 \end{aligned}$$

(3)

Subsequently, the error is back-propagated to optimize the MLP and, if it contains learnable weights, the encoding function as well.
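To make the rendering step concrete, the following minimal sketch implements the quadrature of Eq. 2 and the color loss of Eq. 3 in PyTorch, assuming the densities and colors for the sampled points have already been predicted by the MLP; the tensor layout and function names are illustrative and not taken from the NeRF-OR implementation.

```python
import torch

def render_rays(sigmas, colors, deltas):
    """Quadrature of Eq. 2 (illustrative sketch, not the NeRF-OR code).

    sigmas: (R, S)    material densities for S samples on R rays
    colors: (R, S, 3) RGB color per sample
    deltas: (R, S)    distance delta_i between consecutive samples
    """
    alpha = 1.0 - torch.exp(-sigmas * deltas)                  # per-sample opacity
    acc = torch.cumsum(sigmas * deltas, dim=-1)
    acc = torch.cat([torch.zeros_like(acc[:, :1]), acc[:, :-1]], dim=-1)
    trans = torch.exp(-acc)                                    # U_i: transmittance up to sample i
    weights = trans * alpha                                    # (R, S)
    rgb = (weights.unsqueeze(-1) * colors).sum(dim=-2)         # predicted pixel colors, (R, 3)
    return rgb, weights

def color_loss(pred_rgb, gt_rgb):
    # Eq. 3: squared L2 distance between rendered and ground-truth pixel colors
    return ((pred_rgb - gt_rgb) ** 2).sum()
```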

Fig. 2

Inner workings of NeRF-OR. Patches of size (M, N) are randomly sampled from training views with viewing angle d. A ray r is cast from each pixel in the patch through the 3D scene. Points x are sampled along the ray and embedded with hash encoding, together with time value t. NeRF-OR uses element-wise addition of static and dynamic encoding. An MLP returns color c and material density \(\sigma \), used for volume rendering. The method outputs predicted color, depth, and surface normals

NeRF-OR

We adapt the original NeRF architecture such that it learns geometrically accurate representations of scenes from sparse training views captured from very different vantage points. The most prominent adaptations can be seen in Fig. 1: the method requires additional input in the form of ToF sensor depth and surface normals calculated from dense depth. An overview of how NeRF-OR works internally is given in Fig. 2. During training, patches are randomly sampled from training views at time step \(t \in \{1, \ldots , T\}\). For each pixel in the patch, a ray r is cast through the virtual scene, along which S points are sampled: \(\{x_i\}_{i=1}^{S}\). To speed up training, we use hash encoding [24], where spatial locations are encoded with a 3D hash grid and spatiotemporal locations with a 4D hash grid. The camera viewing angle d is encoded with spherical harmonics (SH). The MLP learns a function that outputs colors and material densities, which are used during volumetric rendering to produce a predicted color for each pixel. Contrary to the original NeRF design, NeRF-OR outputs predicted depth maps and surface normals as well.
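As an illustration of the ray-casting step, the sketch below generates one ray per pixel of an (M, N) patch from pinhole-camera intrinsics and a camera-to-world matrix; the function name, the stride argument, and the OpenCV-style sign convention are assumptions rather than details given in the text.

```python
import torch

def patch_rays(K, cam2world, u0, v0, M, N, stride=1):
    """Cast one ray per pixel of an (M, N) patch whose top-left pixel is (u0, v0).

    K:         (3, 3) camera intrinsics
    cam2world: (4, 4) camera-to-world (extrinsic) matrix
    Returns ray origins and unit directions in world coordinates, shape (M, N, 3).
    """
    vs, us = torch.meshgrid(
        torch.arange(v0, v0 + M * stride, stride, dtype=torch.float32),
        torch.arange(u0, u0 + N * stride, stride, dtype=torch.float32),
        indexing="ij",
    )
    # Pixel -> camera-space direction for a pinhole camera (OpenCV convention;
    # the actual sign convention depends on the dataset's calibration).
    dirs_cam = torch.stack(
        [(us - K[0, 2]) / K[0, 0], (vs - K[1, 2]) / K[1, 1], torch.ones_like(us)],
        dim=-1,
    )
    dirs_world = dirs_cam @ cam2world[:3, :3].T
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)
    origins = cam2world[:3, 3].expand_as(dirs_world)   # all rays start at the camera center
    return origins, dirs_world
```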

Sensor depth supervision

Similar to our earlier work [20], we use an additional loss function in which the predicted depth \(\hat{D}(r)\) is compared to the ToF depth measurement \(D_{\text {sensor}}(r)\):

$$\begin{aligned} \hat{D}(r) = \sum ^{S}_{i=1} U_i \big (1 - \exp (-\sigma _i \delta _i)\big ) a_i \end{aligned}$$

(4)

$$\begin{aligned} L_{\text {depth}} = \sum _{r \in R} \big ( \hat{D}(r) - D_{\text {sensor}}(r) \big )^2 \end{aligned}$$

(5)

where \(a_i\) is the distance between sampled point i and the camera. Note that these distances should be expressed in the same unit as the sensor depth, e.g., millimeters, and that the depth images should be transformed from the viewpoint of the sensor to the viewpoint of the RGB camera.
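A minimal sketch of the depth supervision of Eqs. 4 and 5, reusing the volume-rendering weights from the earlier sketch; masking out zero-valued (missing) ToF pixels is an assumption based on the incompleteness of sensor depth discussed in the next subsection, not something the equations state explicitly.

```python
import torch

def render_depth(weights, dists):
    """Eq. 4: expected ray termination distance.

    weights: (R, S) volume-rendering weights U_i * (1 - exp(-sigma_i * delta_i))
    dists:   (R, S) distance a_i from the camera to each sampled point
    """
    return (weights * dists).sum(dim=-1)

def depth_loss(pred_depth, sensor_depth):
    """Eq. 5, restricted to pixels where the ToF sensor returned a measurement.

    Both depths must share the same unit (e.g., millimeters), and the sensor depth
    must already be reprojected into the RGB camera's viewpoint.
    """
    valid = sensor_depth > 0   # assumption: zeros mark missing ToF values
    return ((pred_depth[valid] - sensor_depth[valid]) ** 2).sum()
```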

Surface normals regularization

A disadvantage of ToF depth is that its images are incomplete, i.e., they contain zero values for pixels outside the sensor border, at reflective materials, and around object boundaries. Such missing values can be seen in the example depth image in Fig. 4, e.g., at the table in the upper-left corner. As an alternative, it is possible to use dense depth derived from color images with monocular depth estimators. Several such estimators that achieve good results have recently been proposed [23, 25, 26]. In contrast to ToF depth, these images guarantee values for all pixels, their values are relatively smooth, and object boundaries are particularly well represented. However, in contrast to ToF depth, dense depth estimation does not provide absolute depth values: the scale of depth is relative and not related to the camera coordinate system. Typically, these depth values lie between 0 and 1 [23]. Therefore, we propose to use a combination of measured ToF depth and depths estimated from RGB images. Depth values from ToF result in material density that is correctly positioned in the 3D scene, while estimated depth helps to find the relative depths for areas where no sensor depth is available. Because we are interested in the relative change of depth, we use surface normals calculated from the estimated depth [21]. These normals represent the direction of change, which is distinctive at object boundaries and smooth at surfaces. We compute the surface normals N(r) by normalizing the depth gradient in the horizontal and vertical directions to a unit vector and construct the loss function as follows:

$$\begin{aligned} N(r) = \dfrac{\nabla D(r)}{\Vert \nabla D(r) \Vert }, \quad \nabla D(r) = \big ( \nabla _u D(r), \nabla _v D(r) \big ) \end{aligned}$$

(6)

$$\begin{aligned} L_{\text {normals}} = \sum _{r \in R} \Vert \hat{N}(r) - N(r) \Vert ^2 \end{aligned}$$

(7)

In Eq. 6, D is the depth map estimated from a color image. For the final loss, we combine the three losses with weighting factors \(\alpha \) and \(\beta \):

$$\begin{aligned} L = L_{\text {color}} + \alpha L_{\text {depth}} + \beta L_{\text {normals}} \end{aligned}$$

(8)
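To illustrate Eqs. 6-8, the sketch below derives unit-length normals from finite-difference gradients of an estimated depth map and combines the three losses; the finite-difference operator, the zero padding at the image border, and the two-component normal representation are assumptions, since the exact gradient computation is not specified here.

```python
import torch
import torch.nn.functional as F

def normals_from_depth(depth, eps=1e-6):
    """Unit vectors from horizontal/vertical depth gradients (cf. Eq. 6).

    depth: (H, W) relative depth estimated from a color image (e.g., a Marigold output)
    Returns an (H, W, 2) field; forward finite differences are one possible gradient choice,
    and the last row/column is zero-padded for simplicity.
    """
    du = depth[:, 1:] - depth[:, :-1]       # horizontal gradient
    dv = depth[1:, :] - depth[:-1, :]       # vertical gradient
    du = F.pad(du, (0, 1))                  # pad back to (H, W)
    dv = F.pad(dv, (0, 0, 0, 1))
    grad = torch.stack([du, dv], dim=-1)
    return grad / (grad.norm(dim=-1, keepdim=True) + eps)

def total_loss(l_color, l_depth, l_normals, alpha=0.5, beta=0.5):
    # Eq. 8 with the weighting factors reported in the implementation details
    return l_color + alpha * l_depth + beta * l_normals
```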

Patch-based training

To enable supervision with surface normals, we shift from standard ray-based to patch-based training. Instead of selecting individual rays at random, we select random patches P of size (M, N). We sample all rays \(r_{i,j}\) in the patch, where \(i \in \{1, \ldots , M\}\) and \(j \in \{1, \ldots , N\}\), and let NeRF-OR return a color, depth, and surface normal for each pixel in the patch. An illustration of this approach is given in Fig. 2. To supervise surface normals at different scales, we sample the patches at \(P_{\text {scales}}\) multiple resolutions. At each scale, the patch is sampled with a stride of \(2^l\) with \(l \in \{0, \ldots , P_{\text {scales}}-1\}\).
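A sketch of the multi-scale patch sampling under two assumptions: pixel locations are drawn uniformly at random, and a single scale l is drawn per patch (the text leaves open whether all scales are used in every iteration).

```python
import random

def sample_patch_indices(height, width, M, N, num_scales):
    """Pick a random scale l, then an (M, N) patch of pixel indices with stride 2**l."""
    l = random.randrange(num_scales)          # l in {0, ..., P_scales - 1}
    stride = 2 ** l
    # Top-left corner such that the strided patch stays inside the image
    v0 = random.randint(0, height - (M - 1) * stride - 1)
    u0 = random.randint(0, width - (N - 1) * stride - 1)
    rows = [v0 + i * stride for i in range(M)]
    cols = [u0 + j * stride for j in range(N)]
    return [(v, u) for v in rows for u in cols], stride
```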

4D hash encoding for dynamic scenes

A drawback of building dynamic NeRF representations is the generally long training time, which can reach up to 1300 GPU hours for a video of 300 frames [27]. To accelerate optimization, we adopt hash encoding from Müller et al. [24] and extend this approach to encode spatiotemporal positions instead of spatial positions only. Similar to the work of Park et al. [28], NeRF-OR creates a static feature embedding using 3D hash encoding and a dynamic feature embedding using 4D hash encoding, in which the hash grid is 4-dimensional. We combine the two such that static scene elements are encoded with a 3D hash grid, since they do not move over time, while the dynamic elements are encoded with a 4D hash grid; encoding all material with a 4D hash grid would require a lot of unnecessary memory. Importantly, the maximum resolution of the grid (\(N_{\text {max}}\)) differs between the spatial dimensions and the temporal dimension, for which we always choose the video length T as the maximum resolution. To obtain the final embeddings, we add the static and dynamic feature embeddings element-wise, as we empirically found no added value in concatenating the two:

$$\begin{aligned} \gamma (x, t) = \gamma _{\text {static}}(x) + \gamma _{\text {dynamic}}(x, t) \end{aligned}$$

(9)
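The following sketch shows the element-wise combination of Eq. 9, with placeholder `static_enc` and `dynamic_enc` modules standing in for the 3D and 4D hash grids of Müller et al. [24]; only the addition itself is taken from the text.

```python
import torch
import torch.nn as nn

class CombinedEncoding(nn.Module):
    """gamma(x, t) = gamma_static(x) + gamma_dynamic(x, t), cf. Eq. 9."""

    def __init__(self, static_enc: nn.Module, dynamic_enc: nn.Module):
        super().__init__()
        self.static_enc = static_enc      # e.g., a 3D hash grid over x
        self.dynamic_enc = dynamic_enc    # e.g., a 4D hash grid over (x, t)

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # x: (..., 3) spatial locations, t: (..., 1) time values
        xt = torch.cat([x, t], dim=-1)    # spatiotemporal input for the 4D grid
        # Element-wise addition; both encoders must emit features of the same width
        return self.static_enc(x) + self.dynamic_enc(xt)
```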

Fig. 3

Qualitative comparison between SparseNeRF [22], Dynamic DS-NeRF [20], and NeRF-OR. Where SparseNeRF collapses to a sub-optimal solution, the other methods generate geometrically correct reconstructions. Below, three zoom-ins are presented of images synthesized by the latter two methods. In contrast to Dynamic DS-NeRF, the proposed method captures fine details such as the keyboard, small instruments, or the surgeon’s face

Datasets

We evaluate our method on two datasets. First, NeRF-OR is applied to dynamic scenes in the 4D-OR dataset [6], which displays acted-out knee surgeries. This dataset consists of videos acquired by six fixed Azure Kinect RGB-D cameras spread throughout the OR, capturing the scene from very different perspectives. Camera locations and viewing directions are calculated with external calibration. Second, we use the NVS-RGBD benchmark dataset [22]. This benchmark includes eight static non-surgical scenes captured with a moving Azure Kinect RGB-D camera. Additional scenes are available in which depth is recorded with the ZED 2 or iPhone LiDAR; however, we do not use these because their depth maps are not stored in absolute values or because those scenes were not used in earlier benchmarks. Each scene consists of three randomly selected training images and various test images captured from unobserved camera angles. We assess image synthesis quality by comparison with the test images in terms of PSNR, SSIM [29], and LPIPS [30].
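For reference, PSNR can be computed directly from the mean squared error between a synthesized image and a test image, as in the sketch below; SSIM and LPIPS rely on the reference implementations cited in [29, 30] and are not reproduced here.

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio for images with values in [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return (10.0 * torch.log10(max_val ** 2 / mse)).item()
```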

Implementation details

We train all NeRF-OR models with a patch size of \(8\times 8\) pixels, \(P_{\text {scales}} = 4\), loss weights \(\alpha \) and \(\beta \) equal to 0.5, a batch size of 64 rays with 16,384 points sampled along each ray, and a learning rate of 0.01. For the MLPs, we choose 3 layers with 256 neurons each. For the hash encoding, we choose \(N_{\text {min}} = 16\), \(N_{\text {max}} = 2048\), and \(T = 2^\). When learning static scenes, NeRF-OR is trained for 2K iterations in 40 min on a single GPU. For dynamic scenes, we train for 10K iterations in 12 h on 4 GPUs. We use 48GB NVIDIA A40 GPUs and synthesize a single \(640 \times 368\) image in 6 s. We use Marigold [23] as depth estimator, although other methods could be used as well.
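The hyperparameters above could be collected into a configuration dictionary such as the hypothetical one below; all key names are illustrative and do not come from the NeRF-OR code base, and the hash-table-size exponent that is unreadable in the published text is deliberately left out.

```python
# Hypothetical training configuration mirroring the reported settings
config = {
    "patch_size": (8, 8),           # M x N pixels
    "num_scales": 4,                # P_scales
    "alpha": 0.5,                   # weight of the depth loss
    "beta": 0.5,                    # weight of the surface-normals loss
    "batch_rays": 64,
    "samples_per_ray": 16384,
    "learning_rate": 0.01,
    "mlp_layers": 3,
    "mlp_width": 256,
    "hash_n_min": 16,               # coarsest grid resolution
    "hash_n_max": 2048,             # finest spatial grid resolution
    # hash table size T: exponent not recoverable from the extracted text, left out
    "iters_static": 2_000,          # ~40 min on one NVIDIA A40
    "iters_dynamic": 10_000,        # ~12 h on four NVIDIA A40s
    "depth_estimator": "Marigold",  # [23]; other estimators could be used
}
```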
