We enrolled 104 patients, of whom 60 (57.7%) showed positive Reg IV expression in breast cancer tissue. The rate of Reg IV expression varied across breast cancer subtypes: 61.9% (13/21) in triple-negative breast cancer (TNBC), 45.8% (22/48) in the luminal type, and 71.4% (25/35) in the HER-2 positive type. Among Reg IV-positive patients, 48.3% (29/60) achieved pCR after NACT, whereas about 20% (8/40) of Reg IV-negative patients achieved pCR.
The patients were randomly divided into training and validation datasets at a 7:3 ratio (see Table S1 for details). They were then categorized based on whether they received NACT, and individual indicators were compared between groups within each dataset. The overall characteristics of the data were comparable across the datasets (Table 1).
We calculated the positive expression of Reg IV protein in each subtype. Of the 21 patients with TNBC, 13 were Reg IV positive (61.9%); of the 35 patients with the luminal type, 25 were Reg IV positive (71.4%); and among the HER-2 positive patients, 22 were Reg IV positive (45.8%). Reg IV expression therefore differed between subtypes, most notably between the luminal and HER-2 positive types.
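The random 7:3 division described above can be sketched as follows. This is only an illustration: the seed, the patient identifiers, and the rounding rule are assumptions, and the actual group sizes are those reported in Table S1.

```python
import random


def split_train_test(ids, train_fraction=0.7, seed=0):
    """Shuffle ids reproducibly and split them into training and test lists."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded so the split can be repeated
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical identifiers 1..104 standing in for the enrolled patients.
patient_ids = list(range(1, 105))
train_ids, test_ids = split_train_test(patient_ids)
print(len(train_ids), len(test_ids))
```

Fixing the seed keeps the split reproducible, which matters when several models are later compared on the same training and test sets.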
Analysis of correlation and covariance between variables
Cluster analysis revealed significant clustering among ER, PR, and Type, as well as between Treatment and HER-2, Age and Menstrual status, and T-stage and Clinical stage. This indicates the presence of high-dimensional data. We then examined the variance inflation factor (VIF) of each variable, which indicated a generally fair level of collinearity. However, the VIF of Type exceeded 5 and the VIF of ER was close to 5, indicating the need for further screening of the included variables (Fig. 1A, B). We further explored the correlation between pCR and Reg IV: rho = 0.311, p = 0.001304, indicating a statistically significant correlation between the two.
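For reference, the VIF of the j-th predictor is conventionally defined from the coefficient of determination obtained when that predictor is regressed on all the others. The symbol \(R_j^2\) is introduced here for illustration and does not appear in the original text.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^{2}}
```

Under this definition, a VIF of 5 corresponds to \(R_j^2 = 0.8\), meaning that 80% of the variance of that predictor is explained by the remaining predictors.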
Fig. 1 A Significant clustering is observed among ER, PR, and Type, as well as between Treatment and HER-2, Age and Menstrual status, and T_stage and Clinic_stage. B The VIFs between variables indicate that overall collinearity is fair. However, the VIF of Type slightly exceeds 5, and the VIF of ER approaches 5. C The AUC of the logistic regression model (Lm) in the training dataset is 0.822 (0.711–0.933). D The AUC of the logistic regression model (Lm) in the test dataset is 0.909 (0.812–1.000). E, F In the training dataset, model Lm exhibits a Dxy value of 0.643, an R2 value of 0.377, a Brier value of 0.153, and a C-index value of 0.822. In the test dataset, these values are Dxy 0.817, R2 0.516, Brier 0.137, and C-index 0.909, respectively
Modeling
Model Lm was built by logistic regression. Variables with p < 0.05 in the univariate analysis and an AUC greater than 0.6 were entered into the multivariate analysis (Table 2, Figure S2). The AUC of model Lm was 0.822 (0.711–0.933) in the training dataset and 0.909 (0.812–1.00) in the test dataset (Fig. 1C, D). The calibration curves showed that model Lm had good predictive performance and stability, with a Dxy value of 0.643, an R2 value of 0.377, a Brier value of 0.153, and a C-index value of 0.822 in the training dataset, and a Dxy value of 0.817, an R2 value of 0.516, a Brier value of 0.137, and a C-index value of 0.909 in the test dataset (Fig. 1E, F).
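The discrimination and calibration summaries quoted above are related in a simple way: for a binary outcome the C-index equals the AUC, and Somers' Dxy is 2 × (C − 0.5). The sketch below computes these quantities and the Brier score on toy data. The outcome and probability values are made up for illustration and are not the study data.

```python
def c_index(y_true, y_prob):
    """Share of (event, non-event) pairs ranked correctly; ties count as 0.5."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    concordant = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                concordant += 1.0
            elif p == q:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


def somers_dxy(c):
    """Somers' Dxy as a rescaling of the C-index to the range [-1, 1]."""
    return 2.0 * (c - 0.5)


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy outcomes (1 = pCR) and predicted probabilities, purely illustrative.
y_true = [1, 0, 1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.7, 0.8, 0.1]

c = c_index(y_true, y_prob)
print(round(c, 3), round(somers_dxy(c), 3), round(brier_score(y_true, y_prob), 3))
```

Applying the same rescaling to the reported training C-index of 0.822 gives approximately 0.644, in line with the reported Dxy of 0.643 once rounding of the C-index is taken into account.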
Table 2 Univariate and multivariate analysis and ROC analysis
The RF model was trained in the training dataset using the random forest method, giving an AUC of 0.702 (0.556–0.848) in the training dataset and 0.833 (0.659–0.990) in the test dataset (Fig. 2B, C). The random forest method performed slightly worse than logistic regression in the training dataset (AUC 0.702 vs 0.822) and similarly in the test dataset (0.833 vs 0.909). We calculated the variable weights in the random forest model and used the top five variables to build model 1, which consisted of HER-2, ER, Type, Treatment, and Reg IV (Fig. 2D). The AUC of model 1 was 0.829 (0.721–0.937) in the training dataset and 0.881 (0.758–1.00) in the test dataset (Fig. 2E, F). The calibration curve for model 1 showed a Dxy value of 0.658, an R2 value of 0.407, a Brier value of 0.149, and a C-index value of 0.829 in the training dataset (Fig. 2G), and a Dxy value of 0.762, an R2 value of 0.483, a Brier value of 0.142, and a C-index value of 0.881 in the test dataset (Fig. 2H).
We further explored the imbalance in the data. We randomly resampled the data tenfold and analyzed the relationship between Reg IV protein and pCR in each resampled sample. Fixed- or random-effect models were then used to combine the results. In both the individual samples and the combined estimate, positive Reg IV expression favored achieving pCR after neoadjuvant chemotherapy (Fig. 2I).
Fig. 2 A The random forest error consistently decreases with an increasing number of trees. B, C The random forest model exhibits an AUC of 0.702 (0.556–0.848) in the training dataset and 0.833 (0.659–0.990) in the test dataset. D The variable importance ordering in the random forest. E, F The AUC for model 1 is 0.829 (0.721–0.937) in the training dataset and 0.881 (0.758–1.00) in the test dataset. G, H The calibration curve for model 1 indicates that in the training dataset, the Dxy value was 0.658, the R2 value 0.407, the Brier value 0.149, and the C-index value 0.829, while in the test dataset, the Dxy was 0.762, the R2 0.483, the Brier 0.142, and the C-index 0.881. I Meta-integration of the tenfold resampling data through random-effect and fixed-effect models suggested that Reg IV protein expression was beneficial to pCR after neoadjuvant chemotherapy
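The fixed- and random-effect combination of the resampled estimates can be sketched as below. The text does not state which effect measure or estimator was used, so this example assumes log odds ratios from 2×2 tables, inverse-variance weighting for the fixed-effect estimate, and the DerSimonian-Laird estimator for the random-effect estimate. The counts are hypothetical.

```python
import math


def log_odds_ratio(a, b, c, d):
    """Log odds ratio and its variance from a 2x2 table.

    a: Reg IV+ with pCR, b: Reg IV+ without pCR,
    c: Reg IV- with pCR, d: Reg IV- without pCR.
    """
    if 0 in (a, b, c, d):  # continuity correction for empty cells
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    log_or = math.log((a * d) / (b * c))
    variance = 1 / a + 1 / b + 1 / c + 1 / d
    return log_or, variance


def pool_effects(effects, variances):
    """Return (fixed-effect estimate, random-effect estimate, tau squared)."""
    weights = [1 / v for v in variances]
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = sum(weights) - sum(w * w for w in weights) / sum(weights)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # DerSimonian-Laird
    re_weights = [1 / (v + tau2) for v in variances]
    random_effect = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    return fixed, random_effect, tau2


# Hypothetical 2x2 tables, one per resample (not the study data).
tables = [(12, 10, 4, 14), (11, 11, 5, 13), (13, 9, 3, 15)]
effects, variances = zip(*(log_odds_ratio(*t) for t in tables))
fixed, random_effect, tau2 = pool_effects(effects, variances)
print(math.exp(fixed), math.exp(random_effect), tau2)
```

An odds ratio above 1 on the exponentiated scale would correspond to Reg IV positivity favoring pCR. When the between-sample variance estimate is zero, the two pooled estimates coincide.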
The XGBoost method was then used to train a model, giving an AUC of 0.980 (0.958–1.000) in the training dataset (Fig. 3A). However, the model performed poorly in the test dataset, with an AUC of 0.603 (0.042–0.786) (Fig. 3B). Given this inadequacy in the validation set, the model was interpreted using SHAP to assess variable importance, and the five most important variables were selected to establish model 2: HER-2, ER, T_Stage, Reg IV, and Treatment (Fig. 3C–E). The AUC of model 2 was 0.837 (0.734–0.941) in the training dataset and 0.897 (0.775–1.00) in the test dataset (Fig. 3F, G). The calibration curve for model 2 indicated a Dxy value of 0.675, an R2 value of 0.428, a Brier value of 0.147, and a C-index value of 0.837 in the training dataset (Fig. 3H). In the test dataset, the calibration curve revealed a Dxy value of 0.794, an R2 value of 0.463, a Brier value of 0.147, and a C-index value of 0.897 (Fig. 3I).
Fig. 3 A, B The AUC of the XGBoost model is 0.980 (0.958–1.000) in the training dataset and 0.603 (0.042–0.786) in the test dataset. C–E The model was interpreted using SHAP to assess the importance of variables, and the top five variables included in model 2 are HER-2, ER, T_Stage, Reg IV, and Treatment. F, G Model 2 exhibits an AUC of 0.837 (0.734–0.941) in the training dataset and 0.897 (0.775–1.00) in the test dataset. H, I The calibration curve for model 2 indicates a Dxy value of 0.675, an R2 value of 0.428, a Brier value of 0.147, and a C-index value of 0.837 in the training dataset. In the test dataset, the Dxy value is 0.794, the R2 value is 0.463, the Brier value is 0.147, and the C-index is 0.897
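One common way to turn SHAP values into a variable ranking is to average the absolute SHAP value of each feature over all patients and keep the highest-ranked features. The text does not say which summary was used here, so the sketch below is an assumption. It takes a precomputed matrix of SHAP values as input, and the numbers in it are invented for illustration.

```python
def rank_by_mean_abs_shap(shap_rows, feature_names, top_k=5):
    """Rank features by mean absolute SHAP value and return the top_k names.

    shap_rows: one list of per-feature SHAP values per patient.
    """
    n = len(shap_rows)
    importance = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:top_k], importance


# Made-up SHAP values for three patients and three features.
feature_names = ["HER-2", "ER", "Age"]
shap_rows = [
    [0.30, -0.10, 0.02],
    [-0.25, 0.15, -0.01],
    [0.20, -0.05, 0.03],
]
top_features, importance = rank_by_mean_abs_shap(shap_rows, feature_names, top_k=2)
print(top_features)
```

Taking absolute values before averaging keeps features whose effect changes sign between patients from cancelling out in the ranking.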
Comparison of models and their validation, and display of nomograms, DCA, and CIC curves
Three models were established: the logistic regression-based model (Lm), model 1, and model 2. Although their AUCs did not differ significantly, the included variables were not entirely consistent. The three models were therefore compared further using the net reclassification improvement (NRI) and integrated discrimination improvement (IDI) in both the training and validation sets. In the training set, comparing Lm with model 1 yielded NRI: 0.08 [−0.069 to 0.229], p value: 0.29266; IDI: 0.0255 [−0.0187 to 0.0697], p value: 0.25861, indicating no statistically significant difference between the two models (Fig. 4A). Similarly, comparing Lm with model 2 resulted in NRI: −0.0252 [−0.275 to 0.2246], p value: 0.84315; IDI: 0.0356 [−0.0193 to 0.0906], p value: 0.20367, with no statistical difference (Fig. 4B). When comparing model 1 with model 2, the results were NRI: −0.1052 [−0.3208 to 0.1104], p value: 0.33875; IDI: 0.0102 [−0.0308 to 0.0512], p value: 0.62684, indicating that model 1 was slightly inferior to model 2, but with no statistically significant difference (Fig. 4C). In the validation set, comparing Lm with model 1 resulted in NRI: 0.2381 [−0.0849 to 0.5611], p value: 0.14851; IDI: 0.0477 [−0.0649 to 0.1603], p value: 0.40603, indicating no statistically significant difference (Fig. 4D). The comparison between Lm and model 2 revealed NRI: 0.3333 [0.0078 to 0.6589], p value: 0.04477; IDI: 0.2671 [0.1004 to 0.4338], p value: 0.00168, indicating that model 2 was superior to Lm (Fig. 4E). Additionally, the comparison between model 1 and model 2 indicated NRI: −0.1786 [−0.3791 to 0.022], p value: 0.08094; IDI: −0.2194 [−0.3498 to −0.089], p value: 0.00098, signifying that model 2 outperformed model 1 (Fig. 4F). Based on these comparisons, model 2 was selected as the best-performing model, and a nomogram based on model 2 was built in the full dataset (Fig. 4G).
The final DCA and CIC curves reflected the excellent performance of the model in clinical applications (Fig. 4H, I).
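To make the two comparison statistics concrete, the sketch below computes a category-free NRI and the IDI for a pair of models. Whether the analysis above used a categorical or category-free NRI is not stated, so the category-free form is an assumption, and the probabilities are toy values.

```python
def nri_idi(y_true, p_old, p_new):
    """Category-free NRI and IDI of a new model relative to an old one."""
    events = [(o, n) for y, o, n in zip(y_true, p_old, p_new) if y == 1]
    nonevents = [(o, n) for y, o, n in zip(y_true, p_old, p_new) if y == 0]

    def net_upward(pairs):
        # Share of patients whose predicted risk rose minus share whose risk fell.
        up = sum(1 for o, n in pairs if n > o)
        down = sum(1 for o, n in pairs if n < o)
        return (up - down) / len(pairs)

    def mean_gain(pairs):
        # Average change in predicted probability from old to new model.
        return sum(n - o for o, n in pairs) / len(pairs)

    # Upward moves are desirable for events and undesirable for non-events.
    nri = net_upward(events) - net_upward(nonevents)
    idi = mean_gain(events) - mean_gain(nonevents)
    return nri, idi


# Toy outcomes and predicted probabilities from two hypothetical models.
y_true = [1, 1, 0, 0]
p_old = [0.6, 0.5, 0.4, 0.5]
p_new = [0.8, 0.4, 0.2, 0.3]
nri, idi = nri_idi(y_true, p_old, p_new)
print(round(nri, 3), round(idi, 3))
```

With this sign convention, positive values favor the new model. Confidence intervals and p values such as those reported above would additionally require a variance estimate or bootstrap, which is omitted here.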
Fig. 4 A–C Pairwise comparisons of model 1, model 2, and model Lm in the training dataset, indicating no significant differences. D–F Pairwise comparisons of model 1, model 2, and model Lm in the test dataset. Model 2 is superior to model Lm; between model 1 and model 2, the NRI difference is not significant (p value: 0.08094), while the IDI favors model 2 (p value: 0.00098). G Nomogram based on model 2 in the full dataset. H, I DCA and CIC curves of model 2, respectively
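A decision curve plots the net benefit of a model across threshold probabilities. A minimal sketch of the standard net benefit calculation at a single threshold is given below, together with the "treat all" reference strategy. The outcomes and probabilities are toy values and not the study data.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating patients whose predicted risk is >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of the reference strategy that treats every patient."""
    return net_benefit(y_true, [1.0] * len(y_true), threshold)


# Toy outcomes (1 = pCR) and predicted probabilities, purely illustrative.
y_true = [1, 0, 1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.7, 0.8, 0.1]
for threshold in (0.3, 0.5, 0.7):
    print(threshold, net_benefit(y_true, y_prob, threshold),
          net_benefit_treat_all(y_true, threshold))
```

A model is considered useful at a given threshold when its net benefit exceeds both the treat-all value and zero, the net benefit of treating no one.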