Boosting predictive accuracy in tumor cellularity evaluation with AI-powered ensemble methods

2.1 Dataset

The study cohort comprised individuals diagnosed with ductal adenocarcinoma of the pancreas or adenocarcinoma of the colon. The dataset comes from PAIP2023 [17], which provided data in three phases: training, validation, and testing, with mask annotations available only in the training phase. This study leveraged 53 images from the PAIP2023 training phase to develop deep learning methods for TC analysis. Additionally, the PAIP2023 testing phase comprised 20 pancreas images and 20 colon images, which were used for external testing in this study. High-resolution images require significant memory, especially in deep learning where multiple layers process the images, so this study processed the original 1024 × 1024 images provided by PAIP2023 into a final size of 512 × 512. The total number of processed images and mask annotations is provided in Table 1.

Table 1 Data distribution across the training, validation, and testing sets in this study, derived from the PAIP2023 training phase. The testing set here is for internal use only. I denotes the number of processed images, T the number of annotated tumor cells, and N the number of non-tumor cells

Notably, each image had a resolution of 1024 × 1024 pixels, with microns per pixel (MPP) varying from 0.263 µm to 0.502 µm. The images were annotated to delineate the boundaries between tumor and non-tumor cells, with tumor cellularity (TC) values ranging from 0 to 100. These details are illustrated in Fig. 1 and described in Eq. (1).

Fig. 1

The sample displays the images and annotations, in which yellow denotes tumor cells and blue denotes non-tumor cells, with the corresponding TC value given. Note that TC represents the tumor cellularity value of an image, T the number of tumor instances, and N the number of non-tumor instances

$$Tumor\;cellularity\;\left(TC\right)=\frac{T}{T+N}\times 100\%$$

(1)
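For concreteness, Eq. (1) reduces to a one-line function. The following minimal Python sketch illustrates it; the guard for patches with zero annotated cells is an added assumption, not stated in the source:

```python
def tumor_cellularity(t: int, n: int) -> float:
    """Eq. (1): percentage of tumor cells among all annotated cells.

    t: number of tumor cell instances, n: number of non-tumor instances.
    The zero-cell guard is an added assumption for empty patches.
    """
    total = t + n
    return 100.0 * t / total if total > 0 else 0.0

# Example: 150 tumor cells and 350 non-tumor cells give TC = 30.0
print(tumor_cellularity(150, 350))
```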

2.2 Pre-processing

This study adopted patch sampling, stain normalization, and general augmentation techniques as pre-processing. The patch sampling technique entailed dividing the original 1024 × 1024 image into multiple 512 × 512 patches with 1/8 overlap, facilitating comprehensive analysis at a finer level of granularity. Based on the method proposed by Macenko et al. [45], stain normalization was implemented to standardize the color distribution and improve image consistency. This technique utilized singular value decomposition to separate the hematoxylin and eosin components, with color adjustments constrained to a specified concentration range of [1.3, 1.5]. The resulting normalized images exhibited a more uniform appearance, effectively mitigating color variations, as illustrated in Fig. 2. An adaptive normalization criterion was established to maintain image quality and avoid detrimental effects such as excessive darkening during training: images with optical density below 20 were exempt from normalization to preserve their original characteristics. Furthermore, several augmentation strategies were integrated to diversify the dataset and enhance model robustness, including horizontal and vertical flips, rotations of 90°, 180°, and 270°, and Gaussian blur, collectively enriching the dataset with a wider spectrum of variations for more effective training. A sketch of the normalization step is given after Fig. 2.

Fig. 2

Exhibits stain-normalized images of the colon and pancreas, clearly illustrating their similar stain colors after normalization
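The following is a minimal NumPy sketch of Macenko-style stain normalization as described above. It is not the authors' exact implementation: the reference H&E stain matrix and maximum-concentration values are commonly used illustrative defaults, and the paper's [1.3, 1.5] concentration constraint and optical-density exemption rule are omitted for brevity.

```python
import numpy as np

def macenko_normalize(img, alpha=1.0, beta=0.15):
    """Macenko-style stain normalization of an RGB uint8 patch."""
    # Reference stain vectors / max concentrations: illustrative values only.
    ref_he = np.array([[0.5626, 0.2159],
                       [0.7201, 0.8012],
                       [0.4062, 0.5581]])
    ref_max_c = np.array([1.9705, 1.0308])

    h, w, _ = img.shape
    # Optical density; +1 avoids log(0)
    od = -np.log10((img.reshape(-1, 3).astype(np.float64) + 1) / 255.0)
    od_hat = od[~np.any(od < beta, axis=1)]        # drop transparent pixels

    # Project OD onto the plane spanned by the two largest eigenvectors
    _, eigvecs = np.linalg.eigh(np.cov(od_hat.T))
    proj = od_hat @ eigvecs[:, 1:3]
    phi = np.arctan2(proj[:, 1], proj[:, 0])
    v1 = eigvecs[:, 1:3] @ np.array([np.cos(np.percentile(phi, alpha)),
                                     np.sin(np.percentile(phi, alpha))])
    v2 = eigvecs[:, 1:3] @ np.array([np.cos(np.percentile(phi, 100 - alpha)),
                                     np.sin(np.percentile(phi, 100 - alpha))])
    # Order stain vectors so hematoxylin comes first
    he = np.column_stack((v1, v2)) if v1[0] > v2[0] else np.column_stack((v2, v1))

    # Stain concentrations via least squares, rescaled to the reference
    c, *_ = np.linalg.lstsq(he, od.T, rcond=None)
    c *= (ref_max_c / np.percentile(c, 99, axis=1))[:, None]

    # Rebuild RGB using the reference stain matrix
    out = 255.0 * 10 ** (-(ref_he @ c))
    return np.clip(out.T.reshape(h, w, 3), 0, 255).astype(np.uint8)
```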

2.3 Feature ensemble model

2.3.1 Model architecture

A novel feature ensemble model is proposed for estimating the TC value through cell segmentation and classification. This model integrates two deep learning architectures, Learner1 and Learner2, with a Meta-classifier to refine predictions, as shown in Fig. 3. The detailed processing structures are explained as follows:

Learner1: A baseline feature extractor built upon Mask R-CNN [46], utilizing ResNet50 as its backbone to extract hierarchical features from H&E images. It also employs feature pyramids to capture cellular structures and morphology, generating region proposals through the region proposal network (RPN). The fully connected layers and mask branch produce classification scores and segmentation masks.

Learner2: An enhanced version of Learner1, incorporating attention blocks inspired by [47] to improve feature representation. Learner2 introduces three attention blocks at C2, C3 and C4 feature extraction stages to enhance the discriminative power of cellular features while suppressing background noise, highlighted in purple in Fig. 3.

Meta-classifier: An ensemble-based classifier that integrates outputs from both learners to determine the final prediction, generating the ultimate TC estimation. To resolve conflicting outputs, non-maximum suppression (NMS) was applied, prioritizing objects with higher probabilities among overlapping instances to enhance decision-making efficiency (a minimal sketch follows below).
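A minimal sketch of the Meta-classifier's merging step, assuming axis-aligned boxes in (x1, y1, x2, y2) format and greedy score-ordered NMS; the authors' exact merging logic may differ:

```python
import numpy as np

def meta_classifier_nms(boxes, scores, labels, iou_thr=0.5):
    """Greedily merge pooled detections from Learner1 and Learner2,
    keeping the higher-probability instance wherever predictions overlap.

    boxes: (M, 4) array of (x1, y1, x2, y2); scores, labels: (M,) arrays.
    """
    order = np.argsort(scores)[::-1]               # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection of the top box with the remaining candidates
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        order = rest[iou <= iou_thr]               # drop suppressed boxes
    return boxes[keep], scores[keep], labels[keep]
```

The TC value then follows from the kept labels via Eq. (1).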

Fig. 3

Illustrates the comprehensive workflow of the proposed method, encompassing pre-processing steps such as patch sampling, stain normalization, and common augmentation techniques. This study showcases a homogeneous ensemble learning approach, where Learner1 and Learner2 are leveraged to enhance feature extraction, forming the proposed Feature Ensemble Model. Subsequently, test-time augmentation is applied to improve the learners' predictions, and their aggregated outputs are combined by a Meta-classifier to generate the final prediction

2.3.2 Attention mechanism

To focus on tumor-relevant features, this study integrates three attention blocks at feature extraction stages C2, C3, and C4 of ResNet50 in Learner2. The attention block, highlighted in purple in Fig. 3, is integrated into Learner2's ResNet50 backbone at these stages to enhance feature representation. These blocks generate attention-weighted feature maps that improve the model's focus on relevant regions, where the C2 feature map corresponds to AP2, the C3 feature map to AP3, and the C4 feature map to AP4.

The attention block Att is formulated in Eq. (2), while Eq. (3) applies the resulting attention map to the feature map Xn from the feature extraction stage, producing a reweighted output that emphasizes relevant regions. The final output Y is obtained by element-wise multiplication of the attention map Att with the feature map, as defined in Eq. (3).

$$Att=\sigma \left(\delta \left(\varphi \left({\varnothing }_{n-1}\left({X}_{n-1}\right)\right)+\partial \left({\varnothing }_{n}\left({X}_{n}\right)\right)\right)\right)$$

(2)

$$Y=Att\otimes {X}_{n}$$

(3)

Here, Xn−1 and Xn represent feature maps from consecutive feature extraction stages. The attention block applies convolution operations \(\varnothing\)n−1 and \(\varnothing\)n with a kernel size of (2, 2) and stride 2, accompanied by the ReLU activation δ and sigmoid activation σ to suppress irrelevant background noise. Additionally, φ and ∂ are linear transformations implemented as 1 × 1 convolutions.
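As a minimal NumPy sketch of Eqs. (2) and (3), assuming the two convolved branches have already been computed and spatially aligned with Xn (the alignment details are not specified in the source):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(branch_prev, branch_curr, x_n):
    """branch_prev ~ phi(o_{n-1}(X_{n-1})) and branch_curr ~ d(o_n(X_n)):
    pre-computed, spatially aligned branch outputs (an assumption here).
    Returns the reweighted feature map Y."""
    att = sigmoid(relu(branch_prev + branch_curr))   # Eq. (2)
    return att * x_n                                 # Eq. (3): Y = Att (x) Xn
```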

2.4 Loss function

The total loss function of the learners was computed as the sum of the classification loss Lcls, the bounding box loss Lbox, and the mask loss Lmask. The classification loss Lcls and mask loss Lmask employed the cross-entropy loss in Eq. (4), where ti represents the ground-truth probability and pi denotes the probability predicted by the deep learning model. The bounding box loss Lbox utilized the smooth L1 loss function in Eq. (5), with j indexing the targets. Here, tj = (tx, ty, tw, th) represents the coordinates of the ground truth, and pj = (px, py, pw, ph) the coordinates predicted by the model. The variables x, y represent the center point of the bounding box, while w, h denote the width and height of the anchor box.

$$L_{cls},\;L_{mask}=-\sum_{i=1}^{n}{t}_{i}\,\mathrm{log}\left({p}_{i}\right)$$

(4)

$$L_{box}=\sum_{j}\left\{\begin{array}{ll}0.5\,{\left({t}_{j}-{p}_{j}\right)}^{2}, &amp; \mathrm{if}\;\left|{t}_{j}-{p}_{j}\right|<1\\ \left|{t}_{j}-{p}_{j}\right|-0.5, &amp; \mathrm{otherwise}\end{array}\right.$$

(5)
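A minimal NumPy sketch of Eqs. (4) and (5); the epsilon term is an added numerical-stability assumption:

```python
import numpy as np

def cross_entropy(t, p, eps=1e-12):
    """Eq. (4): cross-entropy between ground-truth probabilities t
    and predicted probabilities p (used for L_cls and L_mask)."""
    t, p = np.asarray(t, float), np.asarray(p, float)
    return -np.sum(t * np.log(p + eps))

def smooth_l1(t, p):
    """Eq. (5): smooth L1 loss over box coordinates (x, y, w, h)."""
    d = np.abs(np.asarray(t, float) - np.asarray(p, float))
    return np.sum(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5))

# Example: a box prediction off by (0.2, 0.2, 1.5, 0.0)
print(smooth_l1([10, 10, 5, 5], [10.2, 10.2, 6.5, 5]))  # 0.02+0.02+1.0+0 = 1.04
```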

2.5 Test-time augmentation

A test-time augmentation (TTA) strategy was implemented to enhance the model's adaptability and resilience. Diverging from traditional data augmentation performed during training, TTA intervenes at test time. By generating predictions on various transformations of the input data, TTA furnishes the model with a more diversified set of inputs during testing, bolstering its capacity for generalization. Our approach used the original image alongside three flipped versions (horizontal, vertical, and combined) as input, consolidating the predictions from these variations to refine the model's output. To optimize performance, test-time augmentation was applied exclusively to Learner1 within the proposed Feature Ensemble Model. This targeted application strengthens the predictive power of the baseline model, Learner1, culminating in an ensemble model that is both robust and reliable.
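The flip-based TTA scheme can be sketched as follows, assuming a placeholder model_fn that returns a per-pixel score map of the same height and width as its input; de-flipping real instance-level outputs (boxes and masks) would require additional bookkeeping:

```python
import numpy as np

def tta_predict(model_fn, img):
    """Average predictions over the original image and three flips.

    model_fn is a placeholder: it maps an (H, W, 3) image to an (H, W)
    score map. Each flip is undone on the prediction before averaging.
    """
    flips = [(False, False), (True, False), (False, True), (True, True)]
    preds = []
    for flip_h, flip_v in flips:
        x = img[:, ::-1] if flip_h else img     # horizontal flip
        x = x[::-1, :] if flip_v else x         # vertical flip
        y = model_fn(x)
        y = y[::-1, :] if flip_v else y         # undo vertical flip
        y = y[:, ::-1] if flip_h else y         # undo horizontal flip
        preds.append(y)
    return np.mean(preds, axis=0)
```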

Section 2.3 introduced the Feature Ensemble Model as a pivotal component of this research, aimed at refining the evaluation of tumor cellularity. By combining the outputs of Learner1 and Learner2, each specializing in distinct feature representations, the Feature Ensemble Model facilitated more precise and resilient predictions. The study adopted the stacking method, a homogeneous ensemble learning technique in which base models share congruent structures or algorithms. Additionally, the Meta-classifier operates in a simplified meta-testing setting, underscoring its versatility and capacity to generalize to novel tasks. This approach effectively combined diverse models, revealing unique insights into the dataset and markedly enhancing the predictive performance of the model.

2.6 Comparison model

The model selected for comparison in this study was YOLOv8, proposed by Ultralytics [48] and renowned for its versatility across tasks such as object detection, tracking, instance segmentation, image classification, and pose estimation. Noteworthy for its real-time processing efficiency, YOLOv8 stands as a state-of-the-art (SOTA) model that builds upon the successes of its predecessors. A comparative analysis between Mask R-CNN and YOLOv8 focusing on cell identification provides a comprehensive view of object segmentation and classification methodologies. The patch size of the input data was set to 512 × 512, and yolov8n-seg.pt was used as the pre-trained weight for the YOLOv8 models in training. The pre-processing steps for YOLOv8 in this study encompassed patch sampling and stain normalization, while post-processing methods for YOLOv8 were not explored and warrant further examination.
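A minimal sketch of this comparison setup using the standard Ultralytics Python API; the dataset configuration file name and the epoch count shown here are assumptions, not values reported for YOLOv8 in the source:

```python
from ultralytics import YOLO

# Load the yolov8n-seg.pt pre-trained segmentation weights named above.
model = YOLO("yolov8n-seg.pt")

# Train on 512 x 512 patches; "cells.yaml" (a hypothetical dataset config)
# and epochs=100 are illustrative assumptions.
model.train(data="cells.yaml", imgsz=512, epochs=100)

# Predict instance masks and classes on a held-out patch.
results = model.predict("patch.png", imgsz=512)
```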

2.7 Evaluation

Intraclass correlation coefficients (ICC) are widely recommended for evaluating the reliability of measurement scales; following PAIP2023, ICC was used to determine the consistency between the TC values computed by pathologists and by deep learning. In addition, this study adopted the mean absolute error (MAE) as a second evaluation metric, which measures the difference between the true and predicted values; a smaller MAE indicates predictions closer to the ground truth.
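A minimal NumPy sketch of the two metrics. MAE is unambiguous; for ICC, the ICC(2,1) form (two-way random effects, absolute agreement, single rater) is shown as one common choice, since the source does not state which ICC variant PAIP2023 used:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between ground-truth and predicted TC values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings is an (n_subjects, k_raters) array, e.g. one column of
    pathologist TC values and one column of model predictions.
    """
    x = np.asarray(ratings, float)
    n, k = x.shape
    grand = x.mean()
    msr = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)   # subjects
    msc = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)   # raters
    sse = np.sum((x - x.mean(axis=1, keepdims=True)
                    - x.mean(axis=0, keepdims=True) + grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))                             # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```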

2.8 Implementation details

Pre-trained weights from the ImageNet ILSVRC 2012 dataset were used for training. The SGD optimizer was used with a learning rate of 0.001. The model was trained for 120 epochs, and the checkpoint selected for testing was the one that achieved the lowest loss. The proposed model was implemented in Python 3.6.11 with TensorFlow 1.10.0 and Keras 2.2.4 on a Linux system with one NVIDIA GeForce RTX 2080 Ti GPU.
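These settings translate into a short configuration sketch, written against the modern tf.keras API rather than the Keras 2.2.4 version cited; the checkpoint path and the `model` object are placeholders for the Mask R-CNN pipeline described above:

```python
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import ModelCheckpoint

# SGD with the stated learning rate of 0.001.
optimizer = SGD(learning_rate=0.001)

# Keep only the checkpoint with the lowest observed loss across 120 epochs.
checkpoint = ModelCheckpoint("best_model.h5", monitor="loss",
                             save_best_only=True, mode="min")

# Placeholder calls: `model` and `train_data` come from the pipeline above.
# model.compile(optimizer=optimizer, loss=...)
# model.fit(train_data, epochs=120, callbacks=[checkpoint])
```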
