The network architecture of our SG-CNN. The yellow region presents the main segmentation stream, and the green region presents the skeleton-guided stream. Feature maps in main segmentation stream are fed into the skeleton-guided stream for skeleton extraction. Extracted skeleton feature maps are fed back to main segmentation stream to strengthen tubular structure segmentation
Skeleton-guided convolutional neural network (SG-CNN)We first introduce the architecture of our skeleton-guided convolutional neural network (SG-CNN) as shown in Fig. 1. The SG-CNN has two streams: the main segmentation stream and the skeleton-guided stream. The main segmentation stream is used for tubular structure segmentation, and the skeleton-guided stream is specifically designed for extracting the skeleton map of the tubular structure.
Fig. 2Details of the residual block and the skeleton-attention block in our SG-CNN. a residual block b skeleton-attention block
Main segmentation streamThe main segmentation stream refers to the generic encoder–decoder architecture [5] with skip connections. The encoder module is composed of three downsampling blocks, each containing two \(3 \times 3 \times 3\) kernel of convolutional layers, followed by a max pooling layer. The decoder module symmetrically has three upsampling blocks in respect to the encoder module. Each upsampling block includes two convolutional layers with \(3 \times 3 \times 3\) kernels, followed by a transpose convolutional layer with a \(2 \times 2 \times 2\) kernel for upsampling, and concatenation with corresponding feature maps from the encoder module and the skeleton-guided stream. Finally, a convolutional layer with a \(1 \times 1 \times 1\) kernel and a sigmoid layer is utilized to predict segmentation likelihood maps.
Skeleton-guided streamThe skeleton-guided stream comprises two skeleton-guided blocks corresponding to the feature maps generated by the encoder module. Each skeleton-guided block has a residual block, followed by a skeleton-attention block.
The residual block consists of two \(3 \times 3 \times 3\) kernel-based convolutional layers with a skip connection, as shown in Fig. 2a. Firstly, the high-level feature map of the encoder module is fed into the residual block to enhance feature representations. The skeleton-attention block necessitates a pair of inputs, derived from the corresponding feature maps of both the encoder module and the preceding residual block.
The skeleton-attention block guides to extract skeleton information of tubular structures via an attention mechanism, as shown in Fig. 2b. Let \(\textbf_\textrm^i\) and \(\textbf_\textrm^i\) denote the input feature maps from the skeleton-guided stream and the main segmentation stream at the \(i-\)th skeleton-attention block, respectively. We first calculate an attention map \(\textbf^i\) that is written as
$$\begin \textbf^i = \sigma (\textrm(\textrm_(\textbf_\textrm^i) + \textrm_(\textbf_\textrm^i))). \end$$
(1)
where \(\textrm_\) represents \(1 \times 1 \times 1\) kernel convolutional layer. \(\sigma \) and \(\textrm\) represent a rectified linear unit (ReLU) and a sigmoid function \(\sigma (\cdot )\), respectively. Note here that our attention map is acquired by leveraging two input feature maps, which is different from the self-attention mechanism [6]. Subsequently, the attention map \(\textbf^i\) is multiplied by the input \(\textbf_\textrm^i\) to obtain an attention-guided feature map, and the map then propagates to the next skeleton-guided block for extracting skeleton information from additional feature maps. The prediction of the skeleton-guided stream is acquired similarly to the main segmentation stream. The extracted skeleton features were fed back to the main segmentation stream through concatenation. We employ a skeletonization algorithm [7] to tubular structure ground truth to acquire binary skeleton ground truth. For 3D binary skeleton ground truth, we use 6-adjacency for the foreground and 26-adjacency for the background. An example of the skeleton ground truth is shown in Fig. 3.
Loss function for skeleton segmentationIn a skeleton ground truth map, the foreground voxels are much fewer than the background voxels, which might cause the gradient imbalance issue [8]. Given a foreground voxel and a background voxel, the gradient ratio represents the ratio of the foreground gradient’s magnitude to the background gradient’s magnitude. If a foreground voxel is surrounded by background voxels and the gradient ratio is small, the foreground gradients might be eroded by the surrounding background gradients, which denotes the gradient imbalance issue [8]. Our proposed sigmoid-adaptive Tversky loss function (STLoss) aims to mitigate the gradient imbalance issue and improve segmentation performance on skeletons.
Gradient imbalanceThe gradient imbalance could sensibly impact the segmentation performance. In this paper, we use the Dice loss [9] to analyze the gradient imbalance of a prediction map. The Dice loss is written as
$$\begin L_D (\textbf, \textbf) = 1 - \frac^p_ig_i}^p_i + \sum _^g_i}, \end$$
(2)
where \(p_i \in \textbf\) and \(g_i \in \textbf\) denote the predicted likelihood result and the ground truth at the \(i-\)th voxel. \(p_i\) is in the range (0, 1), and \(g_i\) takes a binary value of 0 or 1. The voxel index is represented by \(i\in [1, N]\). This loss assigns the same foreground gradient to each foreground voxel, denoted as \(p_f\), and the same background gradient to each background voxel denoted as \(p_b\). The gradient ratio of the foreground voxel to the background voxel is
$$\begin R_ (\textbf, \textbf)&= \left|\frac \Big / \frac\right|\nonumber \\ &= \left|\frac^p_i + \sum _^g_i\right) -\sum _^p_ig_i\right) }^p_ig_i}\right| \nonumber \\&= \frac, \textbf)} - 1. \end$$
(3)
Equation (3) indicates that the gradient ratio \(R_\) is proportion to the Dice loss. When the overall Dice loss of a segmentation map is relatively small, the corresponding gradient ratio \(R_D\) tends to be small. This scenario can lead to a gradient erosion problem. Consequently, the training process may struggle to classify foreground voxels surrounded by background voxels.
Sigmoid-adaptive Tversky loss (STL)As aforementioned, the gradient ratio tends to be small as the Dice loss decreases. Consequently, it is difficult for the network to segment foreground voxels surrounded by background voxels.
Fig. 3An example of the airway ground truth and the corresponding skeleton. a airway ground truth. b the corresponding skeleton ground truth. The skeleton ground truth is calculated by applying a skeletonization algorithm [7] to the airway ground truth
The Tversky loss [10] mitigates the gradient imbalance issue by adding hyper-parameters \(\alpha \) and \(\beta \). The Tversky loss could be written as
$$\begin L_T (\textbf, \textbf) = 1 - \frac^p_ig_i}^\alpha p_i + \sum _^\beta g_i}, \end$$
(4)
where \(\alpha +\beta =1\). Similarly to the Dice loss, we could calculate the gradient ratio of the Tversky loss is
$$\begin R_ (\textbf, \textbf) = \left|\frac \Big / \frac\right| =\frac\cdot \frac, \textbf)} - 1. \end$$
(5)
We observe that the gradient ratio in the Tversky loss is influenced not only by the overall loss but also by the hyper-parameter \(\alpha \). However, relying on adjusting hyper-parameters cannot adaptively modify the gradient ratio for every voxel. The final prediction results could be dilated if the gradient ratio is too large [8].
We expect a loss function that could adaptively adjust the gradient ratio in a voxel-wise manner. To this end, we propose our sigmoid-adaptive Tversky loss (STL) which could voxel-wisely adjust the gradient ratio. Our loss function focuses on foreground voxels and background voxels with low prediction confidence, i.e., the prediction likelihood is close to 0.5. These voxels are generally hard to segment. The STL achieves a voxel-wise focal function by integrating a modified sigmoid function,
$$\begin L_ (\textbf, \textbf) = 1 - \frac^\sigma (\gamma (p_i-0.5))g_i}^\alpha p_i + \sum _^\beta g_i}, \end$$
(6)
where \(\sigma (\cdot )\) represents the sigmoid function. \(\alpha \) and \(\beta \) are two hyper-parameters that are similar to those of the Tversky loss, and \(\alpha + \beta =1\). \(\gamma \) is the third hyper-parameter and the \(\gamma (p_i - 0.5)\) term in Eq. (6) mapping the \(\sigma (\gamma (p_i-0.5))\) term to the value range (0, 1). A larger \(\gamma \) causes a larger gradient magnitude when \(p_i\) closes to 0.5. The gradient ratio of our STL loss is calculated by,
$$\begin R_ (\textbf, \textbf)&= \left|\frac} \Big / \frac}\right| \nonumber \\&\quad =\frac0.5))(1\sigma (\gamma (p_i-0.5)))}\nonumber \\&\quad \cdot \frac L_(\textbf, \textbf)} 1, \end$$
(7)
where \(p_i\) is the prediction likelihood of the \(i-\)th voxel. In our STL, voxels are adaptively assigned a gradient ratio according to the prediction likelihood \(p_i\). Our loss function is different from the general class-wise weighting loss function with a voxel-wise adaption manner. Foreground voxels with low prediction confidence have a large gradient ratio, encouraging the network to learn from and improve the segmentation performance. An intuitive comparison of different losses is shown in Fig. 4. The hyper-parameters were empirically selected. We could observe that the loss variation trend of our STL is more pronounced compared to the other two losses when the prediction probability approaches 0.5. This suggests that the voxel with low prediction confidence is assigned a larger gradient magnitude compared to the voxel with high prediction confidence, achieving a voxel-wise gradient ratio adaptation.
Fig. 4The comparison of losses. For the Tversky loss, the \(\alpha \) and \(\beta \) were set to 0.1 and 0.9, respectively. For our STL, the \(\alpha \), \(\beta \), and \(\gamma \) were set to 0.1, 0.9, and 10, respectively. We noted that our STL exhibits more intense variability compared to other losses when the prediction probability approaches 0.5
Table 1 The acquisition parameters of datasetsIn our overall loss function, we utilize the Dice loss for tubular structure segmentation, denoted as \(L_\textrm\), and the STL for the loss function of the skeleton structure segmentation, denoted as \(L_\textrm\). The overall loss function is
$$\begin (\textbf, \textbf, }}, }})} =\,& (\textbf, \textbf)} + (}}, }})}\nonumber \\ =\,& (\textbf, \textbf)} + (}}, }})}, \end$$
(8)
where \(\textbf\) and \(\textbf\) represent the segmentation likelihood maps of the tubular structure and the corresponding ground truth, respectively. \(}}\) and \(}}\) represent the segmentation likelihood maps of the skeleton and the corresponding ground truth, respectively.
Comments (0)