PROACTING: predicting pathological complete response to neoadjuvant chemotherapy in breast cancer from routine diagnostic histopathology biopsies with deep learning

In this section, we first introduce the data used for development and validation in the different parts of this study, and then describe the methodology.

Clinical focus and definitions

Breast cancer subtypes

The primary focus of our study is on triple-negative (TNBC: HR−, HER2−) and ‘surrogate’ Luminal B (HR+, HER2−, grade 2/3) invasive breast cancers. As gene expression data and Ki67 were not available in our cohorts, we discriminated between ‘surrogate’ Luminal A and B based on the grade; this definition has been shown to identify patients who benefit from chemotherapy [27]. For the sake of compactness, in the rest of the paper, we will refer to ‘surrogate’ Luminal B simply as Luminal B.

Additionally, we evaluated the developed biomarkers on an external public dataset from the IMPRESS study [24], which contains both TNBC and HER2+ cases.

Definition of pCR

We define here the pathological complete response (pCR) to NAC as the absence of invasive cancer in the breast only (ypT0/is [26]). Focusing on the breast only provides the closest readout when biopsies from the primary tumor in the breast are analyzed, while still providing sufficient predictive value to support treatment planning.

Method overview

Our approach consists of two parts, visualized in Fig. 1. First, we trained a CNN to segment the slides into the classes tumor, stroma, lymphocytes, necrosis, fat and rest. We also used an existing CNN model for mitosis detection developed by Tellez et al., previously validated in clinical studies [28, 29]. The output of this deep learning pipeline for a slide is a segmentation mask for the six classes and the coordinates of detected mitoses in the tumor regions. Second, we derived biomarkers from the tissue segmentation and mitosis detections and assessed their predictive value for pCR. In the following, we first introduce the data used and then the methods developed.

Fig. 1

Method overview: (1) Segment slides into different tissue types and detect mitoses. (2) Compute biomarkers from the segmentation prediction of tumor, stroma and lymphocytes and detected mitoses within tumor regions. LTR: lymphocyte-tumor ratio, cTILs: computational tumor infiltrating lymphocytes score, ITR: inflamed tumor ratio (proportion of tumor close to lymphocytes), MTR: mitoses-tumor ratio

Data

In this section, we first introduce the cohorts included in this study as well as the case inclusion and exclusion criteria. Based on that, we then describe the datasets used in the multiple phases of development and evaluation of the proposed work. In particular, we have defined (1) a dataset for the training of our segmentation algorithm on H&E slides; (2) a dataset for the development and tuning of the computational biomarkers; (3) an internal evaluation set; and (4) an external independent evaluation set. The data split is visualized in Fig. 2.

Fig. 2

Biomarker development and evaluation data: visualization of the data split per type (TNBC, Luminal B), center (NKI, RUMC+SCDC, IMPRESS) and data subset (development, evaluation), starting from the exclusion of cases due to quality (in gray) and for training of the segmentation model (in blue, part of \(\mathcal{D}_{\text{seg}}^{\text{train}}\)) to the definition of the development (in green, \(\mathcal{D}_{\text{dev}}\)) and evaluation (in yellow, \(\mathcal{D}_{\text{eval}}\)) datasets. Also shown is the additional IMPRESS [24] evaluation data (in orange, \(\mathcal{D}_{\text{ext}}\)). Not included is the additional data for segmentation model training

Cohorts For model development and internal evaluation, we collected 926 cases from three European centers: 741 from the Netherlands Cancer Institute (NKI, Amsterdam, the Netherlands), 123 from the Radboud University Medical Center (RUMC, Nijmegen, the Netherlands) and 62 from the IRCCS Sacro Cuore Don Calabria Hospital (SCDC, Verona, Italy). All slides are H&E-stained diagnostic biopsies obtained by core-needle biopsy before NAC. For the NKI TNBC and the RUMC cases, multiple slides per case are available, while the other cohorts have only one slide per case. All cohorts included both Luminal B (defined as HR+, HER2−, grade 2/3) and triple-negative breast cancer (TNBC, defined as HR−, HER2−) cases. For all cohorts, information about the NAC response was available; additional available clinical information (after exclusion) is listed in Tables 1 and 2.

Table 1 Clinical information for the TNBC cohorts per center (NKI, RUMC and SCDC)

Table 2 Clinical information for the Luminal B cohorts per center (NKI, RUMC and SCDC)

Slides from NKI were obtained from retrospective studies and include old glass slides. Therefore, after digitization, slides were visually inspected by pathologists, who excluded 101 slides with washed-out staining or too few tumor cells. Slides from SCDC were checked by pathologists at the time of inclusion in this study, and the RUMC slides were scanned for the purpose of this study and visually checked for quality before and after scanning, resulting in no exclusion due to quality issues. All slides were digitized in the originating clinical center using multiple scanners: the NKI TNBC slides with an Aperio AT2 (Leica Biosystems) at 40×, the NKI Luminal B slides with a Pannoramic 1000 (3DHISTECH) at 40×, the RUMC slides likewise with a Pannoramic 1000 (3DHISTECH) at 40×, and the SCDC slides with a Ventana DP 200 slide scanner at 20×.

For external evaluation, we used data from the public dataset recently published by Huang et al. [24] (IMPRESS). This cohort contains 64 TNBC cases and 62 HER2+ cases. The slides contain core-needle biopsies of breast cancer tissue samples, scanned at 20× magnification with a Hamamatsu scanner.

Development set for segmentation algorithm To train the multi-class tissue segmentation model, we assembled manually annotated cases from three different types of datasets to form a development dataset. First, we used n = 110 biopsy slides from the NKI and the RUMC cohorts assembled within this project. In detail, we included breast biopsies from 89 NKI cases with n = 95 slides (82 TNBC, 13 Luminal B, as some cases have multiple slides), and from 15 RUMC cases, where one slide per patient was selected. Since these slides were used for training the segmentation model, they were excluded from the biomarker evaluation. Research assistants, instructed and supervised by pathologists, annotated small tissue regions on these slides as tumor, stroma, lymphocytes, necrosis, fatty tissue or rest/normal. Differentiating between tumor, stroma and lymphocytes is essential for the characterization of features of the TME, such as the assessment of TILs, whereas the other classes were added for a more comprehensive tissue differentiation. An example of two annotations is shown in Fig. 3. Second, we included n = 92 slides from the public Breast Cancer Semantic Segmentation study (BCSS [30]), with annotations for TNBC resections from TCGA [31]. These slides were densely annotated in regions of interest (i.e., all pixels in the ROI were labeled) with 18 different tissue types, which we mapped onto the six targeted classes for consistency with the rest of the data.

Fig. 3

Segmentation and detection examples. On the top left is an example from a test slide with the segmentation overlay on the right. Predicted tumor is hued blue, necrosis magenta, lymphocytes purple, stroma orange and the rest green. The drawn polygons are the tissue annotations (red: lymphocytes, black: tumor). The slides were annotated using ASAP (https://github.com/computationalpathologygroup/ASAP). On the bottom are examples of kept (top) and filtered out (bottom) mitosis detections

Third, we included 73 slides from a RUMC cohort used in previous work to develop the “HookNet” model [32]. This dataset consisted of surgical resection slides which were manually annotated with sparse annotations of six classes of multiple tissue types.

Overall, 275 slides (165 resections, 110 biopsies) were used for model training, which we refer to as \(\mathcal{D}_{\text{seg}}^{\text{train}}\), and 74 slides (59 resections from the BCSS dataset and 15 biopsies from the NKI TNBC dataset) were used as a test set to assess the performance of the segmentation model, which we refer to as \(\mathcal{D}_{\text{seg}}^{\text{test}}\).

Development set for computational biomarkers We used data from 352 NKI cases (76 TNBC, 276 Luminal B) for the development and fine-tuning of the computational biomarkers. Clinical and outcome data in terms of pCR were made available by the NKI. We used these data to design our computational biomarkers and fine-tune their parameters, e.g., choosing thresholds to maximize pCR prediction performance. We refer to this set as the \(\mathcal{D}_{\text{dev}}\) dataset. It includes 15 slides with manual tissue annotations, which are also part of \(\mathcal{D}_{\text{seg}}^{\text{test}}\).

Internal evaluation set We defined an internal evaluation set of 369 cases: 199 cases from NKI (66 TNBC, 133 Luminal B) and 170 cases from the combined RUMC and SCDC cohorts (36 TNBC, 134 Luminal B). These cases were not used in any learning procedure, and the models’ predictions on them were evaluated externally by statisticians involved in this project only at the end of the fine-tuning phase of the computational biomarkers. We refer to this set as the \(\mathcal{D}_{\text{eval}}\) dataset.

External evaluation set We also considered an external public dataset of breast cancer biopsies, recently published by Huang et al. [24] (IMPRESS). This cohort contains 64 TNBC cases and 62 HER2+ cases stained with H&E. Although HER2+ was not a subtype explicitly considered in the learning phase of our method, given the general applicability of the proposed PROACTING biomarkers, we validated their predictive value on this subtype as well. We refer to this set as the \(\mathcal{D}_{\text{ext}}\) dataset.

Deep learning for tissue segmentation and mitosis detection

For tissue segmentation, we chose U-Net [33], a CNN architecture for medical image segmentation. The details of the model and its hyperparameters are described in Additional file 1: Section S1.1. At test time, every slide was pre-processed to exclude background and out-of-focus regions using a network previously developed and validated by Bándi et al. [34], so that a segmentation output was produced only for pixels belonging to the biopsy tissue.

The mitosis detection network had been previously presented by Tellez et al. [28] and was used off-the-shelf in this work. In brief, the network predicts the location of mitotic figures across the entire H&E slide. Since the network operates at 40× magnification, to apply it to the SCDC dataset scanned at 20×, we first upsampled the slides to 40× using bilinear interpolation. Initial visual inspection of the mitosis predictions for slides from the \(\mathcal{D}_{\text{dev}}\) set showed the presence of false positive detections outside of tumor regions. To address this issue, we combined the mitosis detection with the multi-class segmentation results and kept only mitoses surrounded by a tumor region at least 20 μm wide. This distance was determined empirically.
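The filtering step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one possible reading of the criterion, namely that a detection is kept only if every pixel within a circular neighborhood of the detection is labeled tumor, and it treats the 20 μm value as the radius of that neighborhood. The label id `TUMOR`, the function names and the pixel spacing argument `um_per_px` are hypothetical.

```python
from math import ceil

TUMOR = 1  # hypothetical label id of the tumor class in the segmentation mask


def surrounded_by_tumor(mask, row, col, margin_um, um_per_px):
    """Return True if every pixel within margin_um of (row, col) is tumor.

    Assumption: the 20 um criterion is read as the radius of a circular
    neighborhood; pixels outside the mask count as non-tumor.
    """
    radius_px = margin_um / um_per_px
    reach = ceil(radius_px)
    for dr in range(-reach, reach + 1):
        for dc in range(-reach, reach + 1):
            if dr * dr + dc * dc > radius_px * radius_px:
                continue  # outside the circular neighborhood
            rr, cc = row + dr, col + dc
            inside = 0 <= rr < len(mask) and 0 <= cc < len(mask[0])
            if not inside or mask[rr][cc] != TUMOR:
                return False
    return True


def filter_mitoses(mask, detections, margin_um, um_per_px):
    """Keep only the (row, col) detections fully surrounded by tumor."""
    return [(r, c) for r, c in detections
            if surrounded_by_tumor(mask, r, c, margin_um, um_per_px)]
```

On a toy mask whose left part is tumor and whose right part is another class, a detection deep inside the tumor is kept, while detections at the tumor border or outside the tumor are discarded.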

Computational biomarkers

The segmentation maps and mitosis detections from the deep learning pipeline allow biomarkers to be defined based on different counts and ratios of the predicted tissues. Based on hypotheses on the role of tissue compartments in the TME, we designed four morphologically interpretable biomarkers, which we refer to as the PROACTING biomarkers: three related to TILs and one related to mitotic count. The hyper-parameters for the biomarkers, such as values for distances and thresholds, were tuned empirically on the \(\mathcal{D}_{\text{dev}}\) set to increase pCR prediction performance.

Computational TILs The biomarker cTILs (computational TILs) aims to emulate the visual estimation of stromal TILs as proposed by the International TILs Working Group [6]. To this end, the tumor bulk is determined by joining tumor regions within a 100 μm clustering distance and creating an outlining envelope with a 50 μm margin around them. This is done via the morphological closing operation on the predicted tumor mask using a circular kernel with the clustering distance as radius. Then, the tumor mask is dilated by the margin distance (see Fig. 4 top). In the resulting tumor bulk, lymphocytes and stroma are counted:

$$\text{cTILs} = \frac{\text{lymphocytes}\ [\text{mm}^2]}{(\text{lymphocytes}+\text{stroma})\ [\text{mm}^2]}. \tag{1}$$

Fig. 4

Visualization of the cTILs bulk (top) and the ITR radius (bottom) via blue polygons. In the overlays (right), tumor is hued blue, stroma orange, lymphocytes purple, necrosis magenta, fatty tissue yellow and the rest green

Tumor regions smaller than 0.1 mm\(^2\) were excluded from the tumor bulk formation to suppress small false-positive tumor predictions.
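A minimal sketch of the tumor bulk construction and the cTILs ratio is given below. It is an illustration under stated assumptions rather than the authors' code: the segmentation is represented as a hypothetical dictionary mapping `(row, col)` pixel coordinates to class names, the morphological operations are written in plain Python for small toy inputs, and the exclusion of tumor regions below 0.1 mm\(^2\) is omitted for brevity. Because both areas use the same pixel size, the ratio can be computed from pixel counts.

```python
def disk_offsets(radius_px):
    """Pixel offsets (d_row, d_col) lying inside a disk of the given radius."""
    reach = int(radius_px)
    return [(dr, dc)
            for dr in range(-reach, reach + 1)
            for dc in range(-reach, reach + 1)
            if dr * dr + dc * dc <= radius_px * radius_px]


def dilate(pixels, radius_px):
    """Binary dilation of a set of pixels with a disk kernel."""
    offsets = disk_offsets(radius_px)
    return {(r + dr, c + dc) for r, c in pixels for dr, dc in offsets}


def erode(pixels, radius_px):
    """Binary erosion of a set of pixels with a disk kernel."""
    offsets = disk_offsets(radius_px)
    return {(r, c) for r, c in pixels
            if all((r + dr, c + dc) in pixels for dr, dc in offsets)}


def tumor_bulk(tumor, um_per_px, cluster_um=100.0, margin_um=50.0):
    """Closing with the clustering distance, then dilation by the margin."""
    cluster_px = cluster_um / um_per_px
    closed = erode(dilate(tumor, cluster_px), cluster_px)
    return dilate(closed, margin_um / um_per_px)


def ctils(labels, um_per_px):
    """Lymphocyte area over lymphocyte plus stroma area inside the bulk."""
    tumor = {p for p, lab in labels.items() if lab == "tumor"}
    bulk = tumor_bulk(tumor, um_per_px)
    lymph = sum(1 for p in bulk if labels.get(p) == "lymphocytes")
    stroma = sum(1 for p in bulk if labels.get(p) == "stroma")
    total = lymph + stroma
    return lymph / total if total else 0.0


# Toy one-row "slide" at an assumed 50 um per pixel.
example_row = ["lymphocytes", "lymphocytes", "tumor", "tumor",
               "stroma", "stroma", "stroma"]
example_labels = {(0, col): lab for col, lab in enumerate(example_row)}
```

In the toy example, only the lymphocyte and stroma pixels directly adjacent to the tumor fall inside the bulk, so `ctils(example_labels, um_per_px=50.0)` differs from the slide-global lymphocyte fraction. For whole-slide masks, an array-based morphology library would be the practical choice.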

Lymphocytes to tumor ratio This biomarker measures the slide-global lymphocytes to tumor ratio (LTR):

$$\text{LTR} = \frac{\text{lymphocytes}\ [\text{mm}^2]}{(\text{lymphocytes}+\text{tumor})\ [\text{mm}^2]}, \tag{2}$$

where lymphocytes and tumor are the predicted area in mm\(^2\) for the corresponding tissue type from all cores containing tumor predictions.

Inflamed tumor ratio The ‘inflamed’ tumor ratio biomarker (ITR) measures the ratio of tumor near lymphocytes to the overall tumor amount:

$$\text{ITR} = \frac{\text{tumor within}\ 80\,\upmu\text{m}\ \text{of lymphocytes}\ [\text{mm}^2]}{\text{tumor}\ [\text{mm}^2]} \tag{3}$$

The value for the lymphocyte-tumor ‘interaction’ distance of 80 μm was chosen empirically. (An example is shown in Fig. 4 bottom.)
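As an illustration of the ITR definition, the following sketch counts the tumor pixels that lie within the interaction distance of at least one lymphocyte pixel. The pixel dictionary, the function name and the pixel spacing `um_per_px` are hypothetical, and the brute-force distance check is chosen for readability only; on real slides, a distance transform of the lymphocyte mask would be far more efficient.

```python
def itr(labels, um_per_px, distance_um=80.0):
    """Fraction of tumor pixels within distance_um of a lymphocyte pixel.

    labels maps (row, col) to a class name. Pixel areas cancel in the
    ratio, so pixel counts are used instead of areas in mm^2.
    """
    tumor = [p for p, lab in labels.items() if lab == "tumor"]
    lymph = [p for p, lab in labels.items() if lab == "lymphocytes"]
    if not tumor:
        return 0.0
    max_d2 = (distance_um / um_per_px) ** 2  # squared distance in pixels
    near = sum(
        1 for r, c in tumor
        if any((r - lr) ** 2 + (c - lc) ** 2 <= max_d2 for lr, lc in lymph)
    )
    return near / len(tumor)


# Toy one-row "slide" at an assumed 40 um per pixel (80 um = 2 pixels).
example_row = ["lymphocytes", "tumor", "tumor", "tumor", "tumor", "stroma"]
example_labels = {(0, col): lab for col, lab in enumerate(example_row)}
```

In the toy row, the two tumor pixels closest to the lymphocyte pixel are within 80 μm and the other two are not, so half of the tumor counts as inflamed.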

Mitotic rate The mitotic rate (MTR) measures the mitosis to tumor rate:

$$\text{MTR} = \frac{\text{mitoses}}{\text{tumor}\ [\text{mm}^2]}, \tag{4}$$

where mitoses is the number of detected mitoses inside the segmented tumor regions and tumor is the predicted tumor area in mm\(^2\).

Handling multiple biopsies and cores Usually, a core needle biopsy procedure produces several cores. Only cores containing predicted tumor were considered for the biomarker computation; the rest were excluded. When multiple slides per case were present, the computational biomarkers were computed per case, as if all cores were present on a single slide.
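The case-level pooling can be sketched as follows, here for the two area-based ratios LTR and MTR. The per-core record layout (keys `tumor_mm2`, `lymphocytes_mm2`, `mitoses`) and the function names are hypothetical; the sketch only assumes that per-core areas and mitosis counts are already available from the segmentation and detection steps.

```python
def pool_case(cores):
    """Sum areas and mitosis counts over all tumor-containing cores of a case.

    cores is a list of per-core dictionaries collected from all slides of
    the case; cores without predicted tumor are ignored.
    """
    kept = [core for core in cores if core["tumor_mm2"] > 0]
    keys = ("tumor_mm2", "lymphocytes_mm2", "mitoses")
    return {key: sum(core[key] for core in kept) for key in keys}


def ltr(totals):
    """Lymphocytes to tumor ratio from pooled areas."""
    denom = totals["lymphocytes_mm2"] + totals["tumor_mm2"]
    return totals["lymphocytes_mm2"] / denom if denom else 0.0


def mtr(totals):
    """Mitoses per mm^2 of predicted tumor from pooled values."""
    return totals["mitoses"] / totals["tumor_mm2"] if totals["tumor_mm2"] else 0.0


# Toy case with two slides; the second core of slide A contains no tumor.
example_cores = [
    {"slide": "A", "tumor_mm2": 2.0, "lymphocytes_mm2": 0.5, "mitoses": 3},
    {"slide": "A", "tumor_mm2": 0.0, "lymphocytes_mm2": 1.0, "mitoses": 0},
    {"slide": "B", "tumor_mm2": 1.0, "lymphocytes_mm2": 0.5, "mitoses": 0},
]
example_totals = pool_case(example_cores)
```

Pooling the raw areas and counts before forming the ratios, rather than averaging per-slide ratios, matches the description of treating all cores as if they were on a single slide.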

Visual TIL scoring

To compare our PROACTING computational biomarkers with visual TIL-scoring according to the recommendations of the TIL Working Group [6], we set up reader studies for two pathologists to score the NKI TNBC (scored by JS and EM) and the NKI Luminal B cohorts (scored by EM and HMH) using the web-based platforms SlideScore and CIRRUS Pathology. Pathologists were presented with a web view of a slide, where they could navigate the entire slide and inspect the tissue at different magnifications, but without access to the clinical variables. The pathologists could either give a score from 0 to 100 or mark the slide as not scorable. Only slides scored by both pathologists were used for biomarker development and evaluation; the rest were excluded (see Fig. 2). When multiple slides per patient were available, the slide-level scores were averaged to obtain a single case-level score. We refer to the averaged visual score as vTILs.
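The aggregation of reader scores can be sketched as below. This is an assumption-laden illustration: the text does not state how the two pathologists' scores are combined per slide, so the sketch assumes they are averaged, and the nested dictionary layout and the use of `None` for "not scorable" are hypothetical.

```python
def vtils(scores):
    """Case-level visual TIL score from two readers.

    scores maps case id -> {slide id: (score_reader_1, score_reader_2)},
    with None marking a slide a reader found not scorable. Slides lacking
    a score from either reader are dropped; cases without any remaining
    slide are omitted from the result.
    """
    result = {}
    for case, slides in scores.items():
        slide_means = [
            (first + second) / 2  # assumption: readers are averaged per slide
            for first, second in slides.values()
            if first is not None and second is not None
        ]
        if slide_means:
            result[case] = sum(slide_means) / len(slide_means)
    return result


example_scores = {
    "case1": {"s1": (10, 20), "s2": (30, None)},
    "case2": {"s1": (None, None)},
    "case3": {"s1": (40, 60), "s2": (20, 40)},
}
```

In the toy input, `case1` keeps only its first slide, `case2` is dropped because no slide was scored by both readers, and `case3` averages its two slide-level means.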

Evaluation and statistical analysis

In order to evaluate the predictive performance of the biomarkers for pCR, we calculated the area under the receiver operating characteristic curve (AUC) and performed multivariable logistic regression, always separately for TNBC and Luminal B. The AUC was computed for the NKI development and evaluation sets and the combined RUMC and SCDC cohorts. The provided p values were not corrected for multiple testing, since all tested biomarkers are based on validated knowledge of the biology of breast cancer.
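For reference, the AUC of a continuous biomarker against a binary pCR label can be computed with the rank-based (Mann-Whitney) formulation: the fraction of pCR/non-pCR pairs in which the pCR case has the higher biomarker value, with ties counted as one half. The sketch below is a generic implementation of that formulation, not the software used in the study; the function name and the 0/1 label encoding are assumptions.

```python
def auc(values, labels):
    """Area under the ROC curve via pairwise comparison.

    values are biomarker values, labels are 1 (pCR) or 0 (no pCR).
    """
    positives = [v for v, y in zip(values, labels) if y == 1]
    negatives = [v for v, y in zip(values, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both outcome classes are required")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

For example, with values `[0.1, 0.4, 0.35, 0.8]` and labels `[0, 0, 1, 1]`, three of the four pCR/non-pCR pairs are ordered correctly, giving an AUC of 0.75.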

The multivariable logistic regression was performed using the NKI evaluation sets only. The RUMC and SCDC cohorts had too small sample sizes and incomplete clinical information for a proper multivariable analysis. All biomarkers were dichotomized at their median, except for MTR, which was dichotomized as 0 or >0 because approximately 60% of the values were 0. The clinical covariates age, grade, T-stage and N-stage were tested as confounding factors. For the MTR biomarker, grade was not tested as a confounder, since the mitotic count is part of grading and therefore naturally correlated with grade. For Luminal B, numbers per category were too small in the evaluation set, so no adjusted ORs could be calculated. The covariates were categorized as follows: age, ≤50 or >50; grade, 2 or 3; T-stage, 1+2 or 3+4; and N-stage, 0 or 1. A covariate was considered a confounder and added to the final multivariable logistic regression model if it changed the odds ratio (Exp(B)) by at least 10%. The statistical analyses were performed using IBM SPSS Statistics version 27. The p values in the multivariable analysis were determined by the Wald test per variable.
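The dichotomization and the change-in-estimate rule can be sketched as follows. This is not the SPSS procedure used in the study. Several points are assumptions: values strictly above the median are coded as high, the crude odds ratio is computed from a 2×2 table (which equals Exp(B) of a univariable logistic regression on a binary predictor), the adjusted odds ratio is taken as an input obtained from a separately fitted multivariable model, and the 10% change is measured relative to the crude odds ratio.

```python
from statistics import median


def dichotomize_median(values):
    """Code values above the median as 1 (high) and all others as 0."""
    cutoff = median(values)
    return [1 if v > cutoff else 0 for v in values]


def dichotomize_mtr(values):
    """Code MTR as 0 (no mitoses detected) or 1 (MTR > 0)."""
    return [1 if v > 0 else 0 for v in values]


def odds_ratio(exposed, outcome):
    """Crude odds ratio from paired binary lists via the 2x2 table."""
    pairs = list(zip(exposed, outcome))
    a = pairs.count((1, 1))  # high biomarker, pCR
    b = pairs.count((1, 0))  # high biomarker, no pCR
    c = pairs.count((0, 1))  # low biomarker, pCR
    d = pairs.count((0, 0))  # low biomarker, no pCR
    if b * c == 0:
        raise ValueError("odds ratio undefined for an empty table cell")
    return (a * d) / (b * c)


def is_confounder(crude_or, adjusted_or, threshold=0.10):
    """Change-in-estimate rule: relative OR change of at least 10%."""
    return abs(adjusted_or - crude_or) / crude_or >= threshold
```

A covariate for which `is_confounder` returns `True` would then be kept in the final multivariable model; fitting that model itself is outside the scope of this sketch.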
