A Longitudinal MRI-Based Artificial Intelligence System to Predict Pathological Complete Response After Neoadjuvant Therapy in Rectal Cancer: A Multicenter Validation Study

Neoadjuvant chemoradiotherapy (NCRT) followed by total mesorectal excision (TME) is the standard treatment for patients with locally advanced rectal cancer (LARC).1,2 After NCRT, about 15% to 27% of patients achieve a pathological complete response (pCR). Considering the risk of postoperative complications after TME, organ preservation strategies such as watch-and-wait and local excision have been proposed so that carefully selected patients with a confident prediction of pCR can avoid radical resection.3–5 Despite promising initial results,6 their wide adoption is hindered by the lack of tools to reliably predict pCR before surgery.7 Therefore, in this work, we aimed to develop and validate a model that could robustly predict pCR after NCRT.

MRI with diffusion-weighted images (DWIs) is a common clinical tool to assess local tumor invasion and nodal stage at diagnosis.8–10 However, its practical use in post-NCRT restaging is very challenging.11 Meta-analyses that included more than 1500 patients with rectal cancer reported disappointing performance of radiologists, with a pooled sensitivity of 19% for ypT0 prediction after NCRT.12,13 Specifically, the small amount of residual tumor cells overwhelmed by fibrotic, mucinous, and edematous tissues after NCRT often results in overstaging, whereas the dramatic shrinkage of the tumor bed might result in understaging.

Artificial intelligence (AI) approaches have been shown to achieve expert-level performance in diagnostic applications using medical imaging data.14–17 Unlike traditional approaches,18,19 deep learning takes raw image data as inputs and integrates multiple processing layers to automatically learn in-depth features and build connections with diagnostic labels. We recently developed a novel multitask deep learning model, which incorporated a longitudinal comparison of MRI before and after NCRT, and found promising results for response prediction in a proof-of-principle study.20 Nonetheless, the model’s performance requires further validation in a real-world setting using data sets that exhibit high patient heterogeneity. In this work, complying with the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement advocated for deep learning studies,21,22 we refined and optimized the deep learning model and evaluated its performance in a large multicenter study.

PATIENTS AND METHODS

Study Population

Consecutive patients with LARC were recruited from 3 hospitals in South China (from January 2013 to July 2019 at the Sixth Hospital of Sun Yat-sen University, Guangzhou [SYSU6H set]; from April 2016 to February 2018 at Cancer Center of Sun Yat-sen University, Guangzhou [SYSUCC set]; and from January 2016 to December 2017 at Nanfang Hospital, Southern Medical University, Guangzhou [NFH set]) and 1 hospital in Eastern China (from February 2015 to December 2020 at the First Affiliated Hospital of Soochow University, Suzhou [SDFYY set]). The inclusion criteria were as follows: 1) MRI scans available 1 week before the start of NCRT and 6 to 8 weeks after the end of NCRT; 2) pretreatment clinical stage ≥T3, N positive, or mesorectal fascia invasion (MRF) as assessed by MRI; 3) histopathologically confirmed rectal adenocarcinoma; 4) inferior tumor margin <15 cm above the anal verge as assessed by MRI; and 5) receipt of chemoradiation or chemotherapy-based neoadjuvant treatment before TME. The exclusion criteria were as follows: 1) distant metastasis; 2) histopathologically confirmed anal adenocarcinoma or anal squamous-cell carcinoma; and 3) incomplete image data before and after neoadjuvant therapy or missing relevant clinical and pathological information. Patients from SYSU6H were followed up. All patients provided written informed consent before NCRT. The study was approved by the local medical ethics committees (2020ZSLYEC-075, B2020-763-01, NFEC-2020-756, 2021-042) and was conducted in accordance with the Declaration of Helsinki and good clinical practice.

Image Acquisition and Processing

All patients enrolled in this study underwent multiparametric MRI scans. Across the 4 centers, 1.5-T or 3.0-T MRI scanners manufactured by Philips, Toshiba, GE Healthcare, and United Imaging Healthcare were used to generate oblique axial pelvic MRIs with heterogeneous parameters (see Supplemental Table 1 at https://links.lww.com/DCR/C232). Cross-sectional MRI sequences included T1-weighted imaging with and without contrast, T2-weighted imaging, and diffusion-weighted imaging. Images acquired with other sequences or in the coronal or sagittal plane were excluded. Gadolinium-based agents were used for acquiring contrast-enhanced T1-weighted MRI. DWIs were obtained with 2 b values: 0 and 1000 s/mm². All MRIs were in DICOM format. The regions of interest of tumors were manually delineated on all T2-weighted slices by 2 radiologists in consensus using ITK-SNAP software (www.itksnap.org) and were coregistered to the other parametric images by rigid 3-dimensional (3D) registration. All regions of interest were checked by a senior radiologist with 8 years of experience.

Development of the Deep Learning Model

The deep learning neural network consists of 2 main parts: a Siamese subnetwork for feature extraction and segmentation and an NCRT response prediction subnetwork (Fig. 1B). For each patient, the model takes the pre- and posttreatment multiparametric MRI as input and outputs both a tumor segmentation and a response prediction score. The manual annotation of the tumor contour and the dichotomized pCR status were used as ground truth. The segmentation subnetwork consists of 2 networks with the same structure and shared parameters and is used to segment the pre- and post-NCRT MR images. Inspired by the 3D U-net,23 it includes a contraction path, an expansion path, and skip connections between the corresponding layers, all of which allow feature maps at different scales to be recombined throughout the depth of the network. The response prediction subnetwork uses features extracted at different depths: the middle level of the contracting path, the intermediate layer of the element-wise summation combination module at the end of the network, and the bottom of the U-shaped network. In this way, feature representations at different depths obtained by the deep learning model can be fully integrated to perform the pCR prediction task. More details about the model training are described in the study by Jin et al.20 Unlike the original model, which was trained on 321 patients, the updated model was trained on 638 patients.
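The data flow described above (one encoder with shared parameters applied to both time points, features taken from more than one depth, and a prediction head on the paired features) can be illustrated with a deliberately tiny sketch. This is not the DeepRP-RC network: there are no convolutions and no segmentation branch, and every name and number here (`encode`, `predict_response`, the weights, the 3-value "scans") is a hypothetical stand-in chosen only to show weight sharing.

```python
import math

def encode(volume, weights):
    """Shared encoder: return features from two 'depths' of a toy network."""
    # Shallow feature: weighted sum of the inputs passed through a ReLU.
    shallow = max(0.0, sum(w * x for w, x in zip(weights["shallow"], volume)))
    # Deeper feature: computed from the shallow one, mimicking a later layer.
    deep = math.tanh(weights["deep"] * shallow)
    return [shallow, deep]

def predict_response(pre_volume, post_volume, weights):
    """Run the same encoder on both time points and score the paired features."""
    pre_features = encode(pre_volume, weights)    # branch 1
    post_features = encode(post_volume, weights)  # branch 2 reuses the same weights
    paired = pre_features + post_features         # concatenate features of both scans
    logit = sum(w * f for w, f in zip(weights["head"], paired)) + weights["bias"]
    return 1.0 / (1.0 + math.exp(-logit))         # sigmoid maps the logit into (0, 1)

# Hypothetical parameters and inputs, for illustration only.
weights = {
    "shallow": [0.5, -0.2, 0.1],
    "deep": 0.8,
    "head": [-0.3, 0.4, 0.6, -0.5],
    "bias": 0.1,
}
pre_scan = [1.0, 0.5, 2.0]
post_scan = [0.2, 0.1, 0.3]
score = predict_response(pre_scan, post_scan, weights)
```

Because both branches call `encode` with the same `weights`, any update to the encoder affects the pre- and posttreatment pathways identically, which is the defining property of a Siamese design. The head, by contrast, treats the two time points asymmetrically, so swapping the scans changes the score.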

FIGURE 1.

Workflow diagram and the structure of the paired MRI-based DeepRP-RC model. A, Workflow for the training and validation steps in this multicenter study. B, Using paired MRI before and after NCRT as inputs, the multitask deep learning network performs 2 tasks simultaneously: 1 for feature extraction and tumor segmentation and 1 for response prediction. The paired features extracted from pre- and post-NCRT MRI were passed through depth-wise convolution into the response prediction network. DWI = diffusion-weighted image; LARC = locally advanced rectal cancer; NCRT = neoadjuvant chemoradiotherapy; NFH = Nanfang Hospital; pCR = pathological complete response; SDFYY = First Affiliated Hospital of Soochow University; SYSUCC = Cancer Center of Sun Yat-sen University; SYSU6H = Sixth Hospital of Sun Yat-sen University; T1+C = contrast-enhanced T1 weighted; T1W = T1 weighted; T2W = T2 weighted.

Neoadjuvant Treatment

In all the included cohorts, besides the standard 5-fluorouracil–based neoadjuvant chemoradiation,24 some patients were treated with oxaliplatin-based doublet chemotherapy (mFOLFOX625 or capecitabine and oxaliplatin) or irinotecan-containing triplet chemotherapy (FOLFOXIRI) for 4 to 6 cycles.26 Long-course intensity-modulated radiation therapy was delivered at a dose of 1.8 to 2.0 Gy/day with 5 fractions per week for a total of 23 to 28 fractions over the course of 5 to 6 weeks and a total dose of 46.0 to 50.4 Gy. TME was performed 6 to 8 weeks after completion of radiation. After neoadjuvant therapy, a patient whose tumor was confined to the rectal wall (ycT0-2) on MRI examination was defined as a good responder (c prefix indicates clinical stage before NCRT, yc prefix indicates clinical classification after NCRT). No patients received total neoadjuvant therapy. Adjuvant chemotherapy was prescribed at the discretion of the physician.
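As a quick arithmetic check on the radiotherapy schedule above, the stated total-dose range of 46.0 to 50.4 Gy is reproduced by pairing 2.0 Gy with 23 fractions and 1.8 Gy with 28 fractions. That pairing is an assumption made here for illustration (the text gives only the ranges), and the helper names are hypothetical.

```python
def total_dose_gy(dose_per_fraction_gy, fractions):
    """Total dose = dose per fraction x number of fractions (rounded to 0.1 Gy)."""
    return round(dose_per_fraction_gy * fractions, 1)

def course_weeks(fractions, fractions_per_week=5):
    """Course length in weeks at the stated 5 fractions per week."""
    return fractions / fractions_per_week

# Assumed pairings that match the endpoints of the reported dose range.
lower_total = total_dose_gy(2.0, 23)
upper_total = total_dose_gy(1.8, 28)

shortest_course = course_weeks(23)
longest_course = course_weeks(28)
```

Under these assumptions the schedule spans 4.6 to 5.6 weeks of treatment, in line with the reported 5 to 6 weeks.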

Pathological Assessment

Surgical specimens were histopathologically assessed by an experienced pathologist and further reviewed by a senior GI pathologist, both of whom were blinded to the clinical outcomes. pCR was defined as the absence of any viable tumor cells and no lymph node metastasis. Poor response was defined as grade 3 on the pathological tumor regression grade (TRG), which measures the proportion of tumor mass replaced by fibrosis, according to the American Joint Committee on Cancer TRG system.27

Comparison With Rater Evaluation

Patients from the SYSU6H internal validation set, SYSUCC set, and NFH set were selected by simple random sampling. Two raters from SYSU6H, both experienced in rectal cancer diagnosis, were blinded to the pathological results and independently evaluated the entire selected MRI set. The criteria for MR-pCR were no high signal intensity on T2-weighted imaging and no high signal on DWIs with a large b value (b = 1000 s/mm²). After the initial assessment, the raters repeated the subjective evaluation with the assistance of the DeepRP-RC results.
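Agreement between two independent raters such as those described above is commonly summarized with Cohen's kappa, the statistic named in the Statistical Analysis section. The following is a minimal sketch of that calculation in plain Python; the function name and the ten example ratings are hypothetical and are not study data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of cases where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every case
    return (observed - expected) / (1.0 - expected)

# Hypothetical MR-pCR calls (1 = pCR, 0 = non-pCR) for ten cases.
rater_1 = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
rater_2 = [1, 0, 0, 0, 0, 0, 1, 0, 1, 0]
kappa = cohens_kappa(rater_1, rater_2)
```

In this example the raters agree on 8 of 10 cases, but because both call most cases non-pCR, chance agreement is already 0.58, so kappa is noticeably lower than the raw agreement.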

Statistical Analysis

The predictive accuracy was evaluated by receiver operating characteristic analysis. Area under the curve (AUC) values were calculated using the DeLong method. Sensitivity, specificity, positive and negative predictive values, and accuracy were obtained by selecting the cutoff at the maximum Youden index. Clinical parameters associated with pCR were selected by univariate and multivariate logistic regression. Significantly associated factors were used to develop an integrated model in the training cohort using the random forest algorithm, and its predictive value was tested in the other cohorts. Interrater agreement was measured using Cohen's kappa coefficient. In the survival analysis, the SYSU6H set was separated into 2 groups using the cutoff for pCR prediction, and the non-pCR prediction group was further classified into 2 groups at the best operating point by the Youden method. The 3-tiered DeepRP-RC–based groups were then assessed using Kaplan-Meier analysis, with p values obtained from a stratified log-rank test. The HR was calculated from a multivariate Cox proportional hazards regression after other parameters were selected by univariate regression using a threshold of p < 0.05. Statistical analyses were performed using R software (version 3.4.3).
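To make the cutoff selection concrete, the sketch below computes an AUC point estimate and the cutoff that maximizes the Youden index, J = sensitivity + specificity - 1. It is an illustration in plain Python rather than the R workflow used in the study, it does not implement the DeLong variance estimate used for confidence intervals, and the function names, scores, and labels are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the chance that a positive case outscores a negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def youden_cutoff(scores, labels):
    """Return (cutoff, J) maximizing J, calling a case positive when score >= cutoff."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(scores)):
        tp = sum(s >= cutoff and y == 1 for s, y in zip(scores, labels))
        tn = sum(s < cutoff and y == 0 for s, y in zip(scores, labels))
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j

# Hypothetical prediction scores and pCR labels (1 = pCR); not study data.
scores = [0.95, 0.80, 0.62, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
auc = roc_auc(scores, labels)
cutoff, j = youden_cutoff(scores, labels)
```

With these toy values the best cutoff is 0.55, where all 3 positive cases are detected and 4 of 5 negative cases are correctly rejected.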

RESULTS

Baseline Characteristics of the Cohorts

A flowchart depicting the processes of this study is shown in Figure 1A, and the checklist of the modified Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis statement is provided in Supplemental Table 2 at https://links.lww.com/DCR/C232. A total of 112,596 MR images from 638 patients from the imaging database at SYSU6H were obtained for model training (SYSU6H training set). A total of 32,912 images from 186 patients from the same hospital were obtained for internal validation (SYSU6H validation set). Moreover, the external validation sets included 41,114 images from 235 patients for the SYSUCC set, 13,668 images from 79 patients for the NFH set, and 11,932 images from 63 patients for the SDFYY set. pCR rates were similar across the cohorts, ranging from 19.7% to 30.2% (Table 1).

Major variability existed among the cohorts in terms of age, tumor diameter, distance from the anal verge, MRF and extramural venous invasion (EMVI) before NCRT, clinical T/N stage before and after NCRT, neoadjuvant chemotherapy regimen, receipt of radiotherapy, pathological T/N stage, and TRG classification (Table 1). In addition, MRI manufacturers, magnetic field strength, and other scanning parameters also varied across the data sets (see Supplemental Table 1 at https://links.lww.com/DCR/C232).

TABLE 1. Baseline characteristics in the training and validation cohorts

| Variable | SYSU6H training set (N = 638) | SYSU6H internal validation set (N = 186) | SYSUCC external validation set (N = 235) | NFH external validation set (N = 79) | SDFYY external validation set (N = 63) | p |
|---|---|---|---|---|---|---|
| Sex, n (%) | | | | | | |
| Male | 441 (69.1) | 133 (71.5) | 150 (63.8) | 62 (78.5) | 47 (74.6) | 0.103 |
| Female | 197 (30.9) | 53 (28.5) | 85 (36.2) | 17 (21.5) | 16 (25.4) | |
| Age, y, n (%) | | | | | | |
| ≤50 | 224 (35.1) | 68 (36.6) | 49 (20.9) | 26 (32.9) | 17 (27.0) | 0.001 |
| >50 and ≤65 | 293 (45.9) | 81 (43.5) | 129 (54.9) | 43 (54.4) | 27 (42.9) | |
| >65 | 121 (19.0) | 37 (19.9) | 57 (24.3) | 10 (12.7) | 19 (30.2) | |
| Tumor size, mm, n (%) | 42.83 (15.30) | 42.60 (14.36) | 51.51 (18.62) | 48.48 (14.79) | 58.30 (12.63) | <0.001 |
| Tumor location, n (%) | | | | | | |
| Low | 348 (54.5) | 90 (48.4) | 104 (44.4) | 25 (31.6) | 30 (47.6) | 0.001 |
| Medium/high | 290 (45.5) | 96 (51.6) | 131 (55.6) | 54 (68.4) | 33 (52.4) | |
| Pre–T stage, n (%) | | | | | | |
| T2 | 32 (5.0) | 11 (5.9) | 15 (6.4) | 2 (2.5) | 1 (1.6) | <0.001 |
| T3 | 482 (75.5) | 135 (72.6) | 146 (62.1) | 41 (51.9) | 45 (71.4) | |
| T4 | 124 (19.4) | 40 (21.5) | 74 (31.5) | 36 (45.6) | 17 (27.0) | |
| Pre–N stage, n (%) | | | | | | |
| N0 | 123 (19.3) | 35 (18.8) | 20 (8.5) | 2 (2.7) | 8 (12.7) | <0.001 |
| N1 | 245 (38.5) | 70 (37.6) | 39 (16.7) | 3 (4.1) | 17 (27.0) | |
| N2 | 269 (42.2) | 81 (43.5) | 175 (74.8) | 68 (93.2) | 38 (60.3) | |
| Pre-EMVI, n (%) | | | | | | |
| Negative | 613 (96.1) | 140 (75.3) | 106 (45.9) | 43 (54.4) | 48 (76.2) | <0.001 |
| Positive | 25 (3.9) | 46 (24.7) | 125 (54.1) | 36 (45.6) | 15 (23.8) | |
| Pre-MRF, n (%) | | | | | | |
| Negative | 494 (77.4) | 138 (74.2) | 113 (48.5) | 40 (50.6) | 42 (66.7) | <0.001 |
| Positive | 144 (22.6) | 48 (25.8) | 120 (51.5) | 39 (49.4) | 21 (33.3) | |
| Pre-CEA, ng/mL | 19.70 (84.17) | 13.13 (56.53) | 11.59 (23.35) | 15.56 (25.40) | 17.59 (72.98) | 0.557 |
| Neoadjuvant radiotherapy, n (%) | | | | | | |
| No | 313 (49.1) | 106 (57.0) | 0 (0.0) | 6 (7.6) | 5 (7.9) | <0.001 |
| Yes | 325 (50.9) | 80 (43.0) | 235 (100.0) | 73 (92.4) | 58 (92.1) | |
| Neoadjuvant chemotherapy, n (%) | | | | | | |
| 5-FU | 123 (19.3) | 4 (2.2) | 95 (40.4) | 4 (5.1) | 16 (25.4) | <0.001 |
| 5-FU + OXA | 395 (61.9) | 136 (73.1) | 138 (58.7) | 75 (94.9) | 47 (74.6) | |
| 5-FU + OXA + IRI | 120 (18.8) | 46 (24.7) | 2 (0.9) | 0 (0.0) | 0 (0.0) | |
| Post–T stage, n (%) | | | | | | |
| T0 | 18 (2.8) | 8 (4.3) | 12 (5.1) | 0 (0.0) | 1 (1.6) | <0.001 |
| T1 | 38 (6.0) | 2 (1.1) | 1 (0.4) | 0 (0.0) | 4 (6.3) | |
| T2 | 133 (20.8) | 34 (18.3) | 66 (28.1) | 7 (8.9) | 10 (15.9) | |
| T3 | 379 (59.4) | 117 (62.9) | 123 (52.3) | 67 (84.8) | 41 (65.1) | |
| T4 | 70 (11.0) | 25 (13.4) | 33 (14.0) | 5 (6.3) | 7 (11.1) | |
| Post–N stage, n (%) | | | | | | |
| N0 | 370 (62.0) | 88 (55.7) | 52 (22.9) | 57 (74.0) | 14 (22.2) | <0.001 |
| N1 | 174 (29.1) | 52 (32.9) | 96 (42.3) | 15 (19.5) | 34 (54.0) | |
| N2 | 53 (8.9) | 18 (11.4) | 79 (34.8) | 5 (6.5) | 15 (23.8) | |
| Post-CEA, ng/mL, n (%) | 11.91 (191.37) | 6.10 (25.25) | 2.89 (2.37) | 2.55 (3.00) | 3.03 (3.02) | 0.910 |
| pT stage, n (%) | | | | | | |
| T0 | 131 (20.5) | 43 (23.1) | 62 (26.4) | 21 (26.6) | 19 (31.1) | <0.001 |
| T1 | 41 (6.4) | 14 (7.5) | 8 (3.4) | 3 (3.8) | 3 (4.9) | |
| T2 | 123 (19.3) | 42 (22.6) | 58 (24.7) | 15 (19.0) | 16 (26.2) | |
| T3 | 340 (53.3) | 87 (46.8) | 103 (43.8) | 31 (39.2) | 23 (37.7) | |
| T4 | 3 (0.5) | 0 (0.0) | 4 (1.7) | 9 (11.4) | 0 (0.0) | |
| pN stage, n (%) | | | | | | |
| N0 | 476 (74.7) | 126 (69.6) | 189 (80.4) | 60 (75.9) | 54 (85.7) | 0.061 |
| N1 | 126 (19.8) | 43 (23.8) | 29 (12.3) | 15 (19.0) | 8 (12.7) | |
| N2 | 35 (5.5) | 12 (6.6) | 17 (7.2) | 4 (5.1) | 1 (1.6) | |
| TRG, n (%) | | | | | | |
| TRG0 | 130 (20.4) | 42 (22.6) | 61 (26.0) | 20 (25.3) | 19 (30.2) | <0.001 |
| TRG1 | 147 (23.0) | 38 (20.4) | 72 (30.6) | 34 (43.0) | 24 (38.1) | |
| TRG2 | 288 (45.1) | 81 (43.5) | 83 (35.3) | 17 (21.5) | 10 (15.9) | |
| TRG3 | 73 (11.4) | 25 (13.4) | 19 (8.1) | 8 (10.1) | 10 (15.9) | |
| pCR, n (%) | | | | | | |
| Negative | 512 (80.3) | 146 (78.5) | 178 (75.7) | 60 (75.9) | 44 (69.8) | 0.264 |
| Positive | 126 (19.7) | 40 (21.5) | 57 (24.3) | 19 (24.1) | 19 (30.2) | |
| MSI-H, n (%) | | | | | | |
| Negative | 476 (74.6) | 129 (69.4) | 155 (66.0) | 56 (70.9) | 61 (96.8) | <0.001 |
| Positive | 26 (4.1) | 6 (3.2) | 9 (3.8) | 3 (3.8) | 2 (3.2) | |
| Missing^a | 136 (21.3) | 51 (27.4) | 71 (30.2) | 20 (25.3) | 0 (0.0) | |

P values in boldface highlight variables with significant baseline differences.

EMVI = extramural venous invasion; 5-FU = 5-fluorouracil; IRI = irinotecan; MRF = mesorectal fascia invasion; MSI-H = microsatellite instability-high; NFH = Nanfang Hospital; OXA = oxaliplatin; pCR = pathological complete response; SDFYY = First Affiliated Hospital of Soochow University; SYSUCC = Sun Yat-sen University Cancer Center; SYSU6H = The Sixth Affiliated Hospital of Sun Yat-sen University; TRG = tumor regression grade.

^a Some MSI statuses were unavailable because of complete tumor response.


Performance of DeepRP-RC in pCR Prediction

The DeepRP-RC model consists of 2 subnetworks: 1 for tumor segmentation and 1 for response prediction (Fig. 1B). Tumor segmentation by the model was in good agreement with the experts' delineation, with a mean Dice coefficient of 0.89 to 0.93, which was similar to that of other deep learning–based segmentation tools such as 3D U-net23 and V-net28 (see Supplemental Fig. 1 at https://links.lww.com/DCR/C231).
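For readers unfamiliar with the overlap measure reported above, the Dice coefficient between two binary masks A and B is 2|A ∩ B| / (|A| + |B|). A minimal sketch follows, assuming the masks are flattened to equal-length sequences of 0 and 1; the function name and the toy masks are hypothetical.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice overlap for two binary masks given as flat 0/1 sequences."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    # Voxels marked as tumor in both masks.
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    # Sum of the two mask sizes (tumor voxels in each).
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    if total == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * intersection / total

# Toy 8-voxel masks: a manual contour and a model contour shifted by one voxel.
manual = [0, 1, 1, 1, 1, 0, 0, 0]
model = [0, 0, 1, 1, 1, 1, 0, 0]
dice = dice_coefficient(manual, model)
```

Here the masks share 3 voxels and each contains 4, giving a Dice value of 0.75; identical masks give 1.0 and disjoint masks give 0.0.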

DeepRP-RC achieved high performance in all 4 validation sets, with AUC values of 0.969 (95% CI, 0.942–0.996) for the SYSU6H internal validation set, 0.946 (95% CI, 0.915–0.977) for the SYSUCC set, 0.943 (95% CI, 0.888–0.998) for the NFH set, and 0.919 (95% CI, 0.840–0.997) for the SDFYY set (Fig. 2A). Using the optimal operating point as the cutoff (0.507), the sensitivity and specificity of the DeepRP-RC model were generally higher than 0.9 in all the validation sets. The positive and negative predictive values were, respectively, 0.907 (95% CI, 0.822–0.976) and 0.986 (95% CI, 0.966–1) for the SYSU6H internal validation set; 0.831 (95% CI, 0.753–0.914) and 0.977 (95% CI, 0.953–0.994) for the SYSUCC set; 0.826 (95% CI, 0.680–0.950) and 0.983 (95% CI, 0.946–1) for the NFH set; and 0.850 (95% CI, 0.714–1) and 0.955 (95% CI, 0.891–1) for the SDFYY set (Fig. 2B). Confusion matrices showed that the true positive rates (ie, true pCR cases correctly predicted as pCR by DeepRP-RC) were higher than 90%, and the false positive rates (ie, true non-pCR cases falsely predicted as pCR) were less than 7% in all the validation sets (see Supplemental Fig. 2C at https://links.lww.com/DCR/C231). Figure 3A shows 2 examples of pCR cases correctly predicted by the DeepRP-RC model. This predictive accuracy is supported by the close correlation between the continuous DeepRP-RC score and the TRG classification and pathological T and N stages in all the data sets (Fig. 3B). Notably, the most common reason for false-positive predictions was a small number of residual tumor cells embedded in a mucin pool or fibrotic remnant: the tumors were pathologically classified as TRG1 in 3 of 4, 9 of 11, 4 of 4, and 3 of 3 false-positive cases in the 4 validation sets (see Supplemental Fig. 3 at https://links.lww.com/DCR/C231).
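The rates quoted above all derive from the four cells of a confusion matrix. The sketch below shows those relationships in plain Python; the function name is hypothetical, and the counts are invented for illustration and are not taken from any cohort in this study.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard rates from confusion-matrix counts (pCR is the positive class)."""
    return {
        "sensitivity": tp / (tp + fn),          # true positive rate
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),  # equals 1 - specificity
        "ppv": tp / (tp + fp),                  # positive predictive value
        "npv": tn / (tn + fn),                  # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Invented counts: 19 true pCR cases and 61 true non-pCR cases.
metrics = diagnostic_metrics(tp=18, fp=2, fn=1, tn=59)
```

Sensitivity and specificity are properties of the classifier at a given cutoff, whereas PPV and NPV also depend on how common pCR is in the cohort, which is one reason predictive values can differ between sets with different pCR rates even when sensitivity and specificity are similar.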

FIGURE 2.

Performance of DeepRP-RC in predicting pCR after NCRT. A, ROC curves for the prediction of pathological complete response. B, PPV and NPV for pCR prediction in the training and validation cohorts. Error bars indicate 95% CIs. AUC = area under the curve; NFH = Nanfang Hospital; NPV = negative predictive value; pCR = pathological complete response; PPV = positive predictive value; ROC = receiver operating characteristic; SDFYY = First Affiliated Hospital of Soochow University; SYSU6H-T = The Sixth Affiliated Hospital of Sun Yat-sen University; SYSU6H-V = SYSU6H internal validation set.

FIGURE 3.

Representative cases of pCR prediction by DeepRP-RC and its correlation with clinicopathological parameters. A, Two examples of pCR cases correctly identified by DeepRP-RC. The attention heat map of the model is displayed as a semitransparent overlay over the original MR images, in which the overlaid regions range from red (high attention and high diagnostic relevance) to blue (low attention and low diagnostic relevance). Patient 1: A 45-y-old woman with middle rectal cancer was staged as cT4N2b, MRF (+), and EMVI (+) by MRI before NCRT. Radiologist restaged as non-pCR after NCRT, whereas DeepRP-RC predicted as pCR by a score of 0.783. Patient 2: A 70-y-old woman with middle rectal cancer staged as cT3N2, MRF (+), and EMVI (+) by MRI before NCRT. Radiologist restaged as near-complete response, and DeepRP-RC predicted as pCR by a score of 0.983. Microscopically, both of the tumors were replaced by large amount of fib
