In computer vision, a growing body of research highlights the importance of convolutional neural networks. Properly trained convolutional neural networks outperform traditional computer vision methods in image generation, image segmentation, and other tasks. At the same time, convolutional neural networks can automatically extract image features through training. Based on these studies, we build a fully convolutional neural network whose architecture resembles the U-Net architecture commonly used in medical image segmentation. An encoder first downsamples the original 1-channel SUV images, encoding the high-level semantic information of the image through a series of convolutional and pooling operations to obtain an image feature vector. This feature vector is then sent to a decoder, which recovers the information extracted by the encoder. This process suppresses noise, which is harder for the network to learn or fit than useful signal. We replace the batch normalization operation in the network with group normalization and add skip connections, similar to those in residual networks, to speed up training and improve the quality of the generated images. More specifically, a basic block called the DoubleConv block serves as the building block of the other blocks. In the DoubleConv block, each of the two convolutional layers is followed by a GroupNorm layer and an activation layer. The rectified linear unit (ReLU) is chosen as the activation function, and the number of channels per group is set to 16 for GroupNorm. The encoder comprises one DoubleConv block and four DownConv blocks, each consisting of one max pooling layer and one DoubleConv block. The max pooling layer performs downsampling by a factor of 2. The first DoubleConv block maps the channel size of the input image to the target channel size for the subsequent calculations. The target channel sizes of the blocks are set to 64, 128, 256, 512, and 1024, so the final encoder feature map is 1/16 the size of the original image. The decoder then upsamples the last feature map 4 times using UpConv blocks, each comprising one transposed convolutional layer and one DoubleConv block. For each pair of blocks at the same level, skip connections are made between the encoder and decoder. The architecture of the feature extraction network is shown in Fig. 1. The dimensionality of the input is described in the "Training setup" section, and the flow of data from one network to the other is shown in Fig. 2. In Fig. 2, we refer to the following kinetic model network as a pointwise neural network (Fig. 3).
Fig. 1The architecture of the feature extraction network. To demonstrate the effectiveness of our approach, we did not make many structural improvements to U-Net
Fig. 2The relation between the feature extraction network and the kinetic model network (pointwise neural network). We obtained each frame's feature maps using U-Net. All 220 feature maps were then input into a pointwise neural network, which was implemented by a stack of convolutions with a kernel size of 1. Thus, each voxel was represented by a 220-dim vector that was fed into the kinetic model network
Fig. 3The architecture of the kinetic model network. Each colored rectangle surrounding a group of neurons represents a hidden layer; layers of the same color have the same number of neurons
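For concreteness, a minimal PyTorch sketch of the blocks described above is given below. It assumes GroupNorm and ReLU follow each convolution; the class names, the `FeatureExtractor` wrapper, and the 10-channel output head (cf. Fig. 2) are illustrative choices of ours, not the exact implementation.

```python
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two 3x3 convolutions, each followed by GroupNorm (16 channels per group) and ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.GroupNorm(out_ch // 16, out_ch),  # 16 channels per group
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.GroupNorm(out_ch // 16, out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class DownConv(nn.Module):
    """2x2 max pooling (downsampling by a factor of 2) followed by a DoubleConv block."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(nn.MaxPool2d(2), DoubleConv(in_ch, out_ch))

    def forward(self, x):
        return self.block(x)

class UpConv(nn.Module):
    """Transposed convolution for 2x upsampling, concatenation with the encoder skip, then DoubleConv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, in_ch // 2, 2, stride=2)
        self.conv = DoubleConv(in_ch, out_ch)

    def forward(self, x, skip):
        x = self.up(x)
        return self.conv(torch.cat([skip, x], dim=1))

class FeatureExtractor(nn.Module):
    """U-Net-like encoder-decoder with target channel sizes 64, 128, 256, 512, and 1024."""
    def __init__(self, in_ch=1, feat_ch=10):
        super().__init__()
        self.inc = DoubleConv(in_ch, 64)
        self.down1 = DownConv(64, 128)
        self.down2 = DownConv(128, 256)
        self.down3 = DownConv(256, 512)
        self.down4 = DownConv(512, 1024)
        self.up1 = UpConv(1024, 512)
        self.up2 = UpConv(512, 256)
        self.up3 = UpConv(256, 128)
        self.up4 = UpConv(128, 64)
        self.head = nn.Conv2d(64, feat_ch, 1)  # 10 feature maps per frame (Fig. 2)

    def forward(self, x):
        x1 = self.inc(x)
        x2 = self.down1(x1)
        x3 = self.down2(x2)
        x4 = self.down3(x3)
        x5 = self.down4(x4)
        x = self.up1(x5, x4)
        x = self.up2(x, x3)
        x = self.up3(x, x2)
        x = self.up4(x, x1)
        return self.head(x)
```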
Kinetic model network
In PET, tracer kinetic modeling is based on compartmental analysis: the physiological system of dynamic processes in the tissue of interest is decomposed into several interacting compartments. Ordinary differential equations (ODEs) represent the compartmental system continuously and deterministically. Each equation describes the temporal rate of change of the material in a compartment; these rates are governed by the physical and chemical rules that control how materials move between compartments, such as diffusion, temperature, and chemical reactions [18]. Most studies analyze dynamic PET images using the 2-tissue compartment model (2TCM) combined with the Patlak method. Since the 2TCM has been extensively investigated and validated [7], our method also builds the network on the 2TCM. The ODEs of the 2TCM are as follows:
$$\frac{\mathrm{d}C_{1}(t)}{\mathrm{d}t} = K_{1} C_{p}(t) - (k_{2} + k_{3})C_{1}(t) + k_{4} C_{2}(t)$$
(1)
$$\frac{\mathrm{d}C_{2}(t)}{\mathrm{d}t} = k_{3} C_{1}(t) - k_{4} C_{2}(t)$$
(2)
where \(K_1\) is a constant that represents the rate of influx from plasma to tissue, \(k_2\) is a constant that represents the rate of efflux from the first compartment in the 2TCM, \(k_3\) is the rate of transfer from the nonspecific compartment to the specific compartment in a reversible or irreversible 2TCM, and \(k_4\) is the rate of transfer from the specific compartment to the nonspecific compartment in the reversible 2TCM. To increase the complexity and diversity of the TACs generated by the network, we do not fix \(k_4\) at 0; however, the network is still capable of generating TACs when \(k_4\) is equal to 0. \(C_p(t)\) is the blood input function, \(C_1(t)\) is the concentration in the nondisplaceable compartment, and \(C_2(t)\) is the concentration of the bound radiotracer in the specific compartment; the tissue concentration \(C_T(t)\) is the sum of the nondisplaceable and specific compartment concentrations [19].
The solution of these ODEs is a sum of convolutions of exponential functions with the input function. The equations are as follows:
$$C_{T}(t) = aC_{p}(t) \otimes e^{-\alpha_{1} t} + bC_{p}(t) \otimes e^{-\alpha_{2} t}$$
(3)
$$\begin{aligned} \alpha_{1} & = \left( k_{2} + k_{3} + k_{4} - \sqrt{\left( k_{2} + k_{3} + k_{4} \right)^{2} - 4k_{2} k_{4}} \right)/2 \\ \alpha_{2} & = \left( k_{2} + k_{3} + k_{4} + \sqrt{\left( k_{2} + k_{3} + k_{4} \right)^{2} - 4k_{2} k_{4}} \right)/2 \\ a & = K_{1} \left( k_{3} + k_{4} - \alpha_{1} \right)/\left( \alpha_{2} - \alpha_{1} \right) \\ b & = K_{1} \left( \alpha_{2} - k_{3} - k_{4} \right)/\left( \alpha_{2} - \alpha_{1} \right) \end{aligned}$$
(4)
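As a worked illustration of Eqs. (3)-(4), the following sketch computes the tissue TAC by discrete convolution. The uniform time grid, the trapezoid-free Riemann-sum quadrature, and the function name are our assumptions for illustration only.

```python
import numpy as np

def tac_2tcm(K1, k2, k3, k4, Cp, dt):
    """Sketch of Eqs. (3)-(4): tissue TAC C_T(t) as the convolution of the
    input function Cp with two exponentials.  Cp is assumed to be sampled
    on a uniform time grid with spacing dt (min)."""
    s = k2 + k3 + k4
    # discriminant is nonnegative for nonnegative rate constants
    disc = np.sqrt(s ** 2 - 4.0 * k2 * k4)
    alpha1 = (s - disc) / 2.0
    alpha2 = (s + disc) / 2.0
    a = K1 * (k3 + k4 - alpha1) / (alpha2 - alpha1)  # assumes alpha1 != alpha2
    b = K1 * (alpha2 - k3 - k4) / (alpha2 - alpha1)
    t = np.arange(len(Cp)) * dt
    kernel = a * np.exp(-alpha1 * t) + b * np.exp(-alpha2 * t)
    # Riemann-sum approximation of the continuous convolution
    return np.convolve(Cp, kernel)[: len(Cp)] * dt
```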
The total activity concentration (e.g., in nCi/ml) for a voxel at a given time is denoted by
$$C_{\mathrm{PET}}\left( \theta, t \right) = \left( 1 - f_{v} \right)C_{T}(t) + f_{v}\, C_{\mathrm{wb}}(t)$$
(5)
where \(\theta\) represents the parameters of the kinetic model. The volume fraction of a voxel that is made up of blood is denoted by the constant \(f_v\). \(C_{\mathrm{wb}}(t)\) (nCi/ml) is the concentration of tracer activity in whole blood (i.e., plasma plus blood cells plus other particulate matter) [20].
Our method uses the blood input function \(C_p(t)\) as the whole blood function \(C_{\mathrm{wb}}(t)\). We form the kinetic model network as a convolutional neural network with 1 × 1 convolutional layers, which lets each voxel be computed separately while reducing the number of parameters and increasing the training speed of the network. We applied the feature extraction network to each individual time frame of the dynamic PET data, extracting 10 feature maps per time frame. With a total of 22 time frames, there are 220 feature maps in total, which means that each voxel is represented by a 220-dimensional vector, as shown in Fig. 2. The feature vectors output by the feature extraction network are fed into the kinetic model network, which predicts the five parameters \((K_1, k_2, k_3, k_4, f_v)\) of the 2TCM for each voxel. After obtaining the parameters generated by the network, we can use them to reconstruct the whole TAC through the 2TCM. Once we have obtained the time activity curve of each voxel, we can calculate the dynamic PET image at any desired time frame. To be more specific, the intensity of the image at pixel j in time frame m, \(x_j(\theta_j)\), is determined by
$$x_{j}\left( \theta_{j} \right) = \int_{t_{m,s}}^{t_{m,e}} C_{\mathrm{PET}}\left( \tau; \theta_{j} \right) e^{-\lambda \tau}\,\mathrm{d}\tau$$
(6)
where \(t_{m,s}\) represents the starting time of frame m, \(t_{m,e}\) represents the ending time of frame m, and \(\lambda\) represents the decay constant of the radiotracer. \(C_{\mathrm{PET}}(t;\theta_j)\) denotes the tracer concentration in pixel j at time t, which is determined using the aforementioned kinetic model with the parameter vector \(\theta_j \in \mathbb{R}^{n_\theta \times 1}\).
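A minimal sketch of the pointwise kinetic model network described above is given below: a stack of 1 × 1 convolutions maps each voxel's 220-dim feature vector to the five 2TCM parameters. The hidden-layer widths are illustrative guesses (Fig. 3 shows the actual layer sizes); the sigmoid on the final layer follows the ablation note in the "Training setup" section.

```python
import torch.nn as nn

class KineticModelNet(nn.Module):
    """Pointwise network: 1x1 convolutions operate on each voxel's 220-dim
    feature vector independently and predict the five 2TCM parameters
    (K1, k2, k3, k4, fv).  Hidden widths here are illustrative, not exact."""
    def __init__(self, in_ch=220, hidden=(128, 64, 32), n_params=5):
        super().__init__()
        layers, ch = [], in_ch
        for h in hidden:
            layers += [nn.Conv2d(ch, h, 1), nn.ReLU(inplace=True)]
            ch = h
        # sigmoid on the last layer bounds the outputs (cf. "Training setup")
        layers += [nn.Conv2d(ch, n_params, 1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, feats):        # feats: (N, 220, H, W)
        return self.net(feats)       # (N, 5, H, W)
```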
Then, we can estimate \(K_i\) using the Patlak plot method:
$$\frac{x(t)}{C_{p}(t)} = K_{i}\,\frac{\int_{0}^{t} C_{p}\left( \tau \right)\mathrm{d}\tau}{C_{p}(t)} + V_{0}$$
where \(K_i\) is the constant rate of irreversible binding, \(V_0\) is the distribution volume of the nonspecifically bound tracer in the tissue, \(x(t)\) is the integrated activity of the tissue up to time t, and \(C_p(t)\) is the plasma concentration of the tracer at time t.
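The following sketch shows how \(K_i\) and \(V_0\) can be estimated from the transformed data by a linear fit over the late, linear portion of the plot. The 20-min onset of linearity and the function name are assumptions for illustration, not values from this study.

```python
import numpy as np

def patlak_ki(tissue, Cp, t, t_linear=20.0):
    """Sketch of the Patlak estimate: fit a line to the transformed data
    over the late, linear portion of the plot.  Assumes Cp > 0 at the
    sampled times; t_linear=20 min is an illustrative assumption."""
    # running integral of the plasma input (trapezoidal rule)
    int_cp = np.concatenate(([0.0], np.cumsum(0.5 * (Cp[1:] + Cp[:-1]) * np.diff(t))))
    xs = int_cp / Cp          # Patlak abscissa
    ys = tissue / Cp          # Patlak ordinate
    late = t >= t_linear
    Ki, V0 = np.polyfit(xs[late], ys[late], 1)   # slope = Ki, intercept = V0
    return Ki, V0
```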
Training setup
Only the dynamic images acquired during the first 30 min were fed into the whole neural network, and all inputs were normalized to SUV images. The input matrix had a shape of \(T_{\mathrm{in}} \times 1 \times H \times W\), where \(T_{\mathrm{in}}\) is the total number of time frames within the initial 30 min, and H and W denote the height and width of the image, respectively. The number of input channels was 1. The output matrix of the whole network had a size of \(T \times H \times W\), where T is the number of time frames in the dynamic PET images. The loss function was the Huber loss [21], which is robust to outliers.
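Putting the pieces together, the tensor shapes flow as sketched below. This chains the two illustrative sketches above (it assumes `FeatureExtractor` and `KineticModelNet` are in scope), and the batch handling is simplified.

```python
import torch

T_in, H, W = 22, 256, 256
frames = torch.rand(T_in, 1, H, W)          # normalized SUV frames, first 30 min
extractor = FeatureExtractor()               # from the sketch above
feats = extractor(frames)                    # (22, 10, H, W): 10 maps per frame
feats = feats.reshape(1, T_in * 10, H, W)    # 220 feature channels per voxel
params = KineticModelNet()(feats)            # (1, 5, H, W): kinetic parameters
```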
To train the kinetic model network, we calculated the loss between the generated images and the ground truth:
$$\mathcal{L}_{\mathrm{SUV}} = \frac{1}{TMN}\sum\limits_{t,x,y} \mathrm{Huber}\left( \hat{x}_{\mathrm{SUV}}^{t}(x,y),\, gt_{\mathrm{SUV}}^{t}(x,y) \right)$$
(7)
where \(\hat{x}_{\mathrm{SUV}}^{t}(x,y)\) is the pixel value of the generated image at position (x, y) in the t-th frame, \(gt_{\mathrm{SUV}}^{t}(x,y)\) is the pixel value of the ground truth at position (x, y) in the t-th frame, T is the total number of frames, and M and N are the height and width of the images, respectively. Because the first 30 min of dynamic PET images (the first 22 frames) were already used as network inputs, we utilized only the images from the subsequent 30 min (the last 6 frames) as the training targets.
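In a framework such as PyTorch, Eq. (7) reduces to the built-in Huber loss; with `reduction='mean'`, the averaging over all frames and pixels supplies the 1/(TMN) factor. The tensor shapes below are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

pred = torch.rand(6, 256, 256)   # predicted late frames (illustrative shapes)
gt = torch.rand(6, 256, 256)     # ground-truth late frames
# Eq. (7): mean Huber loss over all frames and pixels
loss_suv = F.huber_loss(pred, gt, reduction='mean')
```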
Additionally, we added a time difference loss function for the linear part of the Patlak model.
$$\begin{aligned} y(t,x,y) & \triangleq \frac{\hat{x}_{\mathrm{SUV}}(t,x,y)}{C_{p}(t)} \\ x(t) & \triangleq \frac{\int_{0}^{t} C_{p}(\tau)\,\mathrm{d}\tau}{C_{p}(t)} \\ \mathrm{d}_{t}(f(t,x,y)) & \triangleq f(t_{k+1},x,y) - f(t_{k},x,y) \\ \mathcal{L}_{K_{i}} & = \frac{1}{T_{\mathrm{lin}} MN}\sum\limits_{t,x,y} \mathrm{Huber}\left( \mathrm{d}_{t}(y(t,x,y)),\, \mathrm{d}_{t}(K(x,y)\,x(t)) \right) \end{aligned}$$
(8)
where \(C_p(t)\) is the blood input function, \(K(x,y)\) is the \(K_i\) parameter of the Patlak plot at position (x, y), and x(t), y(t,x,y), and \(\mathrm{d}_t(\cdot)\) are defined in Eq. (8). \(T_{\mathrm{lin}}\) is the number of frames that constitute the linear portion of the Patlak plot, and \(t_k\) represents the kth time frame.
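A minimal sketch of this time-difference loss is given below; the function and argument names are ours, and the inputs are assumed to be precomputed on the frames of the linear portion.

```python
import torch
import torch.nn.functional as F

def ki_loss(x_suv, Cp, int_cp, Ki):
    """Sketch of Eq. (8).  x_suv: (T_lin, H, W) predicted frames in the
    linear portion; Cp, int_cp: (T_lin,) plasma activity and its running
    integral at the frame times; Ki: (H, W) Patlak slope map."""
    y = x_suv / Cp[:, None, None]           # Patlak ordinate y(t, x, y)
    x = (int_cp / Cp)[:, None, None]        # Patlak abscissa x(t)
    target = Ki[None] * x                   # model line K(x, y) * x(t)
    # d_t: difference between consecutive frames, f(t_{k+1}) - f(t_k)
    return F.huber_loss(y[1:] - y[:-1], target[1:] - target[:-1], reduction='mean')
```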
Thus, the total loss is as follows:
$$\mathcal{L} = \mathcal{L}_{\mathrm{SUV}} + \lambda \times \mathcal{L}_{K_{i}}$$
(9)
where \(\lambda\) is a hyperparameter that adjusts the weight between fitting \(K_i\) and fitting the SUV.
We did not train the feature extraction network and the kinetic model network separately; instead, we treated them as one end-to-end network and trained them together. We chose adaptive moment estimation (Adam) [22] as the optimizer and set the learning rate to 1e-4, with a schedule that reduced the learning rate to one-tenth of its value every 10,000 iterations, down to a lower bound of 1e-7. We trained our network on an NVIDIA GeForce RTX 3090 GPU for a total of 10 epochs, each comprising 7171 iterations. To validate the effectiveness of incorporating a kinetic model, we compared our method to one with the exact same network structure but without the kinetic model; that is, we directly predicted the SUV images of the last 30-min time frames without performing the kinetic model steps. For this variant, we kept the rest of the network architecture unchanged, removed the sigmoid activation function from the final layer, and still employed a pointwise neural network. We refer to this method as "without model" in the figures. The method without the kinetic model used the same hyperparameter settings and loss function as the full model to minimize the influence of other factors and ensure the validity of the conclusions.
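The optimization schedule can be reproduced with a LambdaLR-style scheduler, as sketched below; the model and loss here are stand-in placeholders, not the actual network.

```python
import torch

model = torch.nn.Linear(2, 2)  # placeholder for the end-to-end network
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
# one-tenth decay every 10,000 iterations, clipped below at 1e-7
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lr_lambda=lambda it: max(0.1 ** (it // 10_000), 1e-7 / 1e-4)
)
for it in range(10 * 7171):                # 10 epochs x 7171 iterations
    opt.zero_grad()
    loss = model(torch.rand(1, 2)).sum()   # stand-in for the total loss (Eq. 9)
    loss.backward()
    opt.step()
    sched.step()
```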
Patient PET data
The network's training dataset was obtained from the Cancer Hospital of the Chinese Academy of Medical Sciences Shenzhen Center and included 7313 slices of data from 103 patients acquired with the GE Healthcare Discovery MI Dr PET/CT scanner. All patients had space-occupying lung lesions, also called pulmonary nodules; both benign and malignant lesions were present. We randomly selected 10 patients as the test set and 93 patients as the training set. The patients' heights were 1.641 ± 0.089 m, and their weights were 63.0 ± 10.36 kg. Information on patient sex and age was unavailable because the patient data were anonymized and deidentified. The dynamic PET data were divided into 28 frames (6 × 10 s, 4 × 30 s, 4 × 60 s, 4 × 120 s, and 10 × 300 s), with total injected \(^{18}\)F-FDG doses ranging from 201.83 MBq to 406.46 MBq across patients. Each time frame of the dynamic PET data was an image array of 256 × 256 × 71 voxels with a voxel size of 1.95 × 1.95 × 2.79 mm\(^3\). The blood input function was manually extracted from the image region of the descending aorta.