Towards deep-learning (DL) based fully automated target delineation for rectal cancer neoadjuvant radiotherapy using a divide-and-conquer strategy: a study with multicenter blind and randomized validation

Data collection and preparation

As a pilot study approved by the institutional review board, a cohort of 141 patients treated at our institute (Peking University Cancer Hospital) between March 2020 and May 2022 was retrospectively included. The patients were diagnosed with Stage II to III rectal cancer and received neoadjuvant chemoradiotherapy, the standard treatment for locally advanced rectal cancer. The cohort was split into a training group (121/141) and a testing group (20/141) by random sampling (Fisher-Yates shuffle). In the training group, 69 patients were female and 52 male, with ages ranging from 33 to 74 (median 61). In the testing group, 13 patients were female and 7 male, with ages ranging from 39 to 72 (median 63).

All patients were immobilized with a pelvic thermoplastic mask in the supine position and underwent CT and MRI simulations. The CT images were acquired on a big-bore RT-specific CT scanner (SOMATOM Sensation Open, Siemens Healthineers, Germany) with 5-mm slice thickness, and the MRI T2- and T1-weighted images on a 3.0-T MR-Sim scanner (MAGNETOM Skyra, Siemens Healthineers, Germany), also with 5-mm slice thickness. The CT and MRI images were imported into the Eclipse Treatment Planning System (Varian Medical Systems Inc., USA) for target and organs-at-risk (OAR) delineation. Because of the different imaging contrast properties of CT and MRI, CTV structures were contoured on CT images and GTV structures on MRI T2 images. The CTV and GTV definitions in this study were consistent with the NCCN and ESMO guidelines. The CTV and GTV contours of all patients were reviewed by two senior physicians and were therefore used as ground truth (GT) herein.

DL model for CTV and GTV segmentation

The kernel DL network herein was DpnUnet, a highly capable network that has demonstrated strong performance in segmentation tasks with fuzzy boundaries and has been validated for CTV and OAR segmentation in cervical cancer [19]. Since the CTV and GTV structures were contoured in two disparate image domains (CT and MRI), we adopted a divide-and-conquer strategy: two DL models were built, one taking CT images as input for CTV segmentation and the other taking MRI images for GTV segmentation. Despite the identical network architecture, their inputs, network parameters (weights), and outputs were completely different.

DpnUnet architecture

The DpnUnet was a U-net variant characterized by the typical U-shaped encoder-decoder design and locally integrated dual-path-network (DPN) modules. The overall architecture of the DpnUnet network is illustrated in Fig. 1. Briefly, the original U-net encoder was replaced with the DPN92 model, and the decoder embedded the micro-blocks of the DPN92 network to achieve comparable performance in abstract feature recovery. The input layer took in 3 adjacent slices (an empirical value) to incorporate 3D anatomical context, and the network output the predicted regions-of-interest (ROIs) of the middle slice. Overall, DpnUnet was an end-to-end segmentation framework that could achieve pixel-wise labeling on both CT and MRI images. Once the two models were trained, CTV and GTV regions were auto-segmented slice by slice.
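The slice-triplet input scheme described above can be sketched as follows (a minimal illustration; the helper names and the clamping of edge slices are our assumptions, not from the paper):

```python
import numpy as np

def make_triplet_input(volume, i):
    """Stack slice i with its two neighbors as a 3-channel input.

    `volume` is a (num_slices, H, W) array; edge slices are clamped
    so the triplet stays inside the volume (an assumed convention).
    """
    lo = max(i - 1, 0)
    hi = min(i + 1, volume.shape[0] - 1)
    return np.stack([volume[lo], volume[i], volume[hi]], axis=0)  # (3, H, W)

def segment_volume(model, volume):
    """Run the per-slice model over every slice; each prediction is
    attributed to the middle slice of its triplet."""
    return np.stack([model(make_triplet_input(volume, i))
                     for i in range(volume.shape[0])], axis=0)
```

Feeding each triplet through the trained model slice by slice then yields the full 3D label map.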

Fig. 1figure 1

Schematic of the kernel DpnUnet network architecture

Model training

The training processes of the CTV and GTV DpnUnet networks were identical but carried out independently with different training data. There were 121 patient cases in the training group. The DpnUnet networks were trained with 11-fold (10:1) cross-validation: 110 cases for training and the remaining 11 cases for validation. Generic data augmentation techniques, including flipping, translation, and rotation, were used. The networks were implemented with PyTorch 1.12.0 and Python 3.6, and trained on an NVIDIA P100 GPU (16 GB memory). Both the CTV and GTV kernel networks were initialized from a pre-trained network for OAR segmentation in cervical cancer CT images [20]. The optimizer was Adam. The learning rate was initialized to 0.0001 and decayed exponentially with gamma 0.9 after every epoch. The total number of epochs was 100 with a batch size of 4, and the model with the lowest validation loss was selected as the output for further testing. The optimizer, learning rate, and batch size were the same for both CTV and GTV model training.
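The stated learning-rate schedule and model-selection rule can be written down directly (a pure-Python sketch of the hyperparameters above; in a PyTorch implementation this would correspond to `torch.optim.Adam` with `ExponentialLR(gamma=0.9)` stepped once per epoch):

```python
def lr_schedule(initial_lr=1e-4, gamma=0.9, epochs=100):
    """Learning rate per epoch: initialized at 1e-4 and multiplied
    by gamma = 0.9 after every epoch (exponential decay)."""
    return [initial_lr * gamma ** e for e in range(epochs)]

def select_best_epoch(val_losses):
    """Model selection: the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)
```

By epoch 100 the learning rate has decayed to roughly 1e-4 × 0.9^99, i.e. about three orders of magnitude below its initial value.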

Performance evaluation

There were 20 patient cases in the testing group. We adopted the three-level evaluation design proposed in [14] to assess the CTV and GTV DL model performance from multiple perspectives. The evaluation procedure is depicted in Fig. 2. The Level-1 evaluation focused on objective metrics, while Level-2 and Level-3 focused on oncologists' subjective assessment of clinical viability. Moreover, to enhance the generalizability of the subjective evaluation of the proposed method, we invited 8 senior radiation oncologists from 8 different cancer centers to score contours blindly and independently.

Fig. 2figure 2

3-Level evaluation design for DL-based CTV and GTV auto-segmentation

Level 1: quantitative metrics based objective evaluation

The Dice similarity coefficient (DSC) and the 95th percentile Hausdorff distance (95HD) [21] were used in Level-1 to quantify contouring accuracy. The DSC index, defined in Eq. (1), measures the relative volumetric overlap between two contours; its value equals 1 when the two contours are identical.

$$\mathrm{DSC}=\frac{2\left|P\cap G\right|}{\left|P\right|+\left|G\right|}$$

(1)

where P and G represent the predicted and ground-truth contours respectively, and |P∩G| represents the volume of the intersection of P and G.

The 95HD index, defined in Eqs. (2–3), reflects the mismatch between two contours as a distance; a higher value indicates a larger contour difference.

$$\mathrm{95HD}\left(P,G\right)=\mathrm{percentile}_{95}\left(h\left(P,G\right)\cup h\left(G,P\right)\right)$$

(2)

$$h\left(P,G\right)=\left\{\underset{g\in G}{\min}\left\Vert p-g\right\Vert \ \middle|\ p\in P\right\}$$

(3)

where ‖·‖ is the Euclidean distance between points p and g.

The DSC and 95HD values were calculated in each testing case, as well as the mean and standard deviation (SD) over the entire testing group.
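For concreteness, the two metrics can be computed as in the following sketch (a brute-force NumPy illustration of Eqs. (1)–(3); production pipelines would typically use an optimized medical-imaging library):

```python
import numpy as np

def dice(p, g):
    """Dice similarity coefficient (Eq. 1) for binary masks.
    Returns 1.0 for two empty masks, by convention."""
    p, g = p.astype(bool), g.astype(bool)
    denom = p.sum() + g.sum()
    return 2.0 * np.logical_and(p, g).sum() / denom if denom else 1.0

def hd95(p_pts, g_pts):
    """95th-percentile Hausdorff distance (Eqs. 2-3) between two
    contour point sets of shape (N, dim), brute force."""
    d = np.linalg.norm(p_pts[:, None, :] - g_pts[None, :, :], axis=-1)
    forward = d.min(axis=1)   # each point of P to its nearest point of G
    backward = d.min(axis=0)  # each point of G to its nearest point of P
    return np.percentile(np.concatenate([forward, backward]), 95)
```

Taking the 95th percentile rather than the maximum makes the distance robust to a few outlier points on either contour.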

Level 2: blind & randomization expert scoring

Ten of the 20 testing patients were randomly selected by Fisher-Yates shuffle for the Level-2 evaluation. For CTV evaluation, five patients were randomly selected, and for each patient five CT slices were chosen to display the GT contours (CTV-GT: 5 × 5 = 25 slices); likewise, the remaining 5 patients were used to generate CT slices with DL contours (CTV-DL: 5 × 5 = 25 slices in one folder). Similarly, five MRI slices from five randomly selected patients were extracted to display GT GTV contours (GTV-GT: 5 × 5 = 25 slices) and five MRI slices from the remaining 5 patients to display DL GTV contours (GTV-DL: 5 × 5 = 25 slices in another folder). The DICOM-RT slices were exported as non-compressed TIFF images. In total, two folders of 50 images each were prepared for CTV (GT = 25, DL = 25) and GTV (GT = 25, DL = 25) evaluation. The images in each folder were reshuffled (in Python) and anonymized by ordering numbers each time before being sent to an external expert for independent scoring.
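The reshuffle-and-anonymize step can be sketched as below (illustrative only; the file-naming pattern and seed handling are our assumptions, not from the paper; `random.shuffle` implements the Fisher-Yates algorithm internally):

```python
import random

def anonymize_and_shuffle(filenames, seed=None):
    """Shuffle the image list (Fisher-Yates via random.shuffle), then
    rename by ordering number so experts cannot trace the source.
    Returns the shuffled order and an anonymized-name -> original map."""
    rng = random.Random(seed)
    order = list(filenames)
    rng.shuffle(order)  # Fisher-Yates shuffle under the hood
    mapping = {f"{i + 1:03d}.tiff": name for i, name in enumerate(order)}
    return order, mapping
```

Keeping the mapping on the organizers' side allows scores to be linked back to GT or DL contours after the blind scoring is complete.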

The scoring rubric was graded (Table 1): 3 for Accept, 2 for Minor Revision, 1 for Major Revision, and 0 for Reject. Scores ≥ 2 were defined as viable for clinical application. In addition, the scores in the GT and DL groups were statistically compared by the Mann-Whitney U-test (significance level: p < 0.05).
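The Mann-Whitney U statistic behind this comparison counts pairwise "wins" between the two score groups (a pure-Python sketch of the statistic only; in practice the p-value would come from a statistics package such as `scipy.stats.mannwhitneyu`):

```python
def mann_whitney_u(x, y):
    """U statistic for sample x versus y: one point per pair where
    the x value beats the y value, half a point per tie."""
    return float(sum((xi > yi) + 0.5 * (xi == yi) for xi in x for yi in y))
```

With ordinal 0-3 rubric scores, many ties are expected, which is why the tie-aware formulation (and a tie-corrected p-value) matters here.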

Table 1 Rubric for expert scoring

Level 3: blind & randomization based head-to-head Turing test

The remaining ten testing patients were used for the Level-3 evaluation. For each testing patient, five CT slices were randomly selected to display both the CTV-GT and CTV-DL contours simultaneously (CTV: 10 × 5 = 50 slices), and likewise five MRI slices to display both the GTV-GT and GTV-DL contours (GTV: 10 × 5 = 50 slices). In total, two folders of 50 slices each were prepared for CTV and GTV evaluation. The DL and GT contour colors (red/green) in each image were randomized (Fisher-Yates shuffle), and the images in each folder were reshuffled by random.shuffle() in Python and anonymized by ordering numbers, each time before the dataset was distributed along with the Level-2 dataset.

For each testing image, the external experts were asked to choose the contour they considered optimal (positive) for clinical application. The positive rates of the CTV-DL and GTV-DL contours were then calculated; the threshold for passing the Turing test was 30%, an empirical value [22].
