A comparative study on the implementation of deep learning algorithms for detection of hepatic necrosis in toxicity studies

Animal experiments

To induce hepatic necrosis in Sprague–Dawley (SD) rats, we conducted acute oral toxicity tests as described previously [17]. Male and female SD (Crl:CD) rats were obtained from Orient Bio, Inc. (Gyeonggi, Korea) at 9 weeks of age. The animals were allowed to acclimate for 2 days before the beginning of the study. Throughout the experiments, the rats were maintained under controlled conditions (23 ± 3 ℃, 30–70% relative humidity, 12 h light/12 h dark cycle of 150–300 lx, 10–20 cycles/h ventilation) and provided a standard rat pellet diet (gamma-ray irradiated; 5053, PMI Nutrition International, San Francisco, CA, USA) ad libitum. The animals had free access to municipal tap water that had been filtered and UV-irradiated; the water was analyzed for specific contaminants every 6 months by the Daejeon Regional Institute of Health and Environment (Daejeon, Korea).

An acute oral toxicity study was performed according to the Korea Ministry of Food and Drug Safety (MFDS) Test Guideline 2017-71 [23]. Animals were randomly assigned to three groups (n = 10 per group; 5 males and 5 females): a control group, a single-dose APAP group, and a repeated-dose APAP group. APAP (A7085, 99.0% purity; Sigma-Aldrich, MO, USA) was administered orally to induce acute liver injury in 10-week-old SD rats using two dosing regimens: a single dose of 2500 mg/kg or a repeated dose of 1000 mg/kg/day for 6 days. The APAP doses were chosen from previously published reports [24, 25]. Immediately prior to administration, 2500 or 1000 mg of APAP was dissolved in 10 mL of sterile distilled water, and the solution was administered at 10 mL/kg per dose. Sterile distilled water was administered as the vehicle control. The day of the first dose was designated day 1. Single-dose (including vehicle control) and 6-day repeated-dose animals were euthanized by isoflurane inhalation on days 3 and 7, respectively. Liver tissues were fixed in 10% formaldehyde. Hematoxylin and eosin (H&E) staining was performed on sections of the left lateral and median lobes of the paraffin-embedded livers, and the stained sections were digitally archived. The experiments were conducted in a facility accredited by the Association for Assessment and Accreditation of Laboratory Animal Care (AAALAC) International and were approved by the Institutional Animal Care and Use Committee (Approval ID: 20-1-0265). All animal treatments followed the Guide for the Care and Use of Laboratory Animals [22].

Data preparation

Slides of liver sections were prepared by three different research centers (Korea Institute of Toxicology, ChemOn Inc., and Biotoxtech) to account for variation in staining and sectioning techniques. Whole-slide images (WSIs) of the liver sections were scanned using an Aperio ScanScope XT (Leica Biosystems, Buffalo Grove, IL, USA) with a 20× objective lens and bright-field illumination. The scan resolution was 0.5 μm per pixel, and the images were saved as TIFF strips with JPEG2000 image compression. Data preparation for necrosis was performed as previously described [16]. The 20× magnified WSIs were then cropped into 448 × 448-pixel tiles. A total of 500 image tiles were obtained from the 14 WSIs that showed hepatic necrosis among the 193 selected WSIs. All lesions on the acquired image tiles were labeled using VGG Image Annotator 2.0.1.0 (Visual Geometry Group, Oxford University, Oxford, UK), yielding 510 annotations across the 500 tiles. The lesions were characterized by nuclear dissolution and fragmentation with pale eosinophilic cytoplasm, and by hemorrhage (Online Resource 1). These annotations were confirmed by an accredited toxicologic pathologist before algorithm training was initiated. The labeled lesions in these images were used to train and test the algorithms. The train_test_split function in the scikit-learn package was used to split the annotated image tiles into training, validation, and test datasets (at a ratio of 7:2:1, respectively). To enlarge the training dataset, 16-fold data augmentation was performed using a combination of image-augmentation techniques (reversal (flipping), rotation, and brightness adjustment). The total numbers of images used for training, validation, and testing were 5,600, 100, and 50, with 5,680, 104, and 51 annotations, respectively (Online Resource 2).
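As a minimal sketch of this 7:2:1 split, the two-step use of train_test_split below reproduces the reported set sizes (350 training tiles before augmentation, 100 validation, and 50 test). The file names, random seed, and two-call structure are illustrative assumptions, not taken from the study code.

```python
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the 500 tile/annotation pairs.
tiles = [f"tile_{i:03d}.png" for i in range(500)]    # hypothetical file names
labels = [f"tile_{i:03d}.json" for i in range(500)]  # hypothetical annotation files

# First split off the 10% test set.
train_val_x, test_x, train_val_y, test_y = train_test_split(
    tiles, labels, test_size=0.1, random_state=0)

# Then split the remaining 90% so that 20% of the whole set becomes validation:
# 2/9 of the remainder equals 20% of the full dataset, leaving 70% for training.
train_x, val_x, train_y, val_y = train_test_split(
    train_val_x, train_val_y, test_size=2 / 9, random_state=0)

print(len(train_x), len(val_x), len(test_x))  # 350 100 50
```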

Training of algorithms and metrics for model performance

Model structure

Three algorithms that have demonstrated strong performance in recognizing objects of interest in images, each in a different way, were trained (Fig. 1). Mask R-CNN (Fig. 1a), an instance segmentation model, was developed from Faster R-CNN. It is one of the best-known detection-based segmentation models; it uses region-of-interest alignment (RoIAlign) with bilinear interpolation to preserve the spatial correspondence between the input and the extracted features, and it adds a mask branch to achieve instance segmentation [15]. DeepLabV3+ (Fig. 1b) is a semantic segmentation model that uses the Xception model and applies depth-wise separable convolution to both the Atrous Spatial Pyramid Pooling and decoder modules. Atrous Spatial Pyramid Pooling controls the resolution of the features computed by the network by adjusting the field of view of the filter to capture multiscale information; atrous convolution thereby explicitly generalizes the standard convolution operation. As a result, DeepLabV3+ is a faster and stronger encoder-decoder network [14]. Finally, SSD (Fig. 1c), an object detection model, consists of a VGG16 base network and an additional auxiliary network; where the two networks are connected, detection speed is improved by replacing the fully connected layers with convolutional layers. The SSD model takes feature maps from convolutional layers in the middle of the network and uses a total of six feature maps of different scales for prediction. For each cell in a feature map, the position of an object is estimated using default boxes, i.e., bounding boxes with different scales and aspect ratios. This design makes SSD a fast and accurate one-stage detector that integrates predictions from multiple views in a single network [7]. By training these three algorithms, we sought to identify the optimal deep learning algorithm for detecting hepatic necrosis in non-clinical studies.
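To make the default-box matching step concrete, the sketch below computes the intersection over union (IoU) between one hypothetical ground-truth box and one default box; in the SSD matching strategy, a default box whose IoU with a ground truth box exceeds 0.5 is treated as a positive match. The coordinates and the standalone function are illustrative, not code from the ssd.pytorch package.

```python
# Illustrative IoU computation for SSD default-box matching.
# Boxes are given as (x_min, y_min, x_max, y_max) in pixels.
def iou(box_a, box_b):
    # Intersection rectangle.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

ground_truth = (100, 100, 220, 200)  # hypothetical necrosis bounding box
default_box = (90, 110, 210, 210)    # one default box from a feature-map cell
is_positive = iou(ground_truth, default_box) > 0.5  # True here (IoU ~ 0.70)
```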

Fig. 1

Structures of deep learning networks used in this study. The structure of Mask R-CNN (a), DeepLabV3+ (b), and SSD (c)

Model training

All procedures related to algorithm training were performed using open-source machine learning frameworks (TensorFlow 2.1.0 with Keras 2.4.3, and PyTorch) on an NVIDIA RTX 3090 24 GB GPU. Open-source packages were used for each algorithm (Mask R-CNN: torchvision [26]; DeepLabV3+: the jfzhang95 pytorch-deeplab-xception package [27]; SSD: the amdegroot ssd.pytorch package [28]), and their dependencies were installed as required. The hyperparameters for each algorithm were tuned (Online Resource 3), and the losses computed by each algorithm during training were recorded and saved.
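As an example of the torchvision-based setup for Mask R-CNN [26], the sketch below follows the standard torchvision fine-tuning pattern: the pretrained box and mask heads are replaced for a two-class problem (background plus hepatic necrosis). The two-class setting and the pretrained backbone are our assumptions; the study's actual hyperparameters are given in Online Resource 3.

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

# Load a Mask R-CNN with a ResNet-50 FPN backbone from torchvision.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

num_classes = 2  # background + hepatic necrosis (our assumption)

# Replace the box prediction head for the two-class problem.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Replace the mask prediction head likewise.
in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask, 256, num_classes)
```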

Loss

In machine learning, the loss quantifies the estimation error that occurs when a trained model is applied to data; models with smaller losses therefore offer better predictions. In object detection and segmentation for image analysis, different losses are calculated according to the type of algorithm. The total loss of Mask R-CNN is the sum of the classifier, box regression, mask, objectness, and region proposal network losses. The total loss of DeepLabV3+ is the cross-entropy loss calculated against the ground truth. In the case of SSD, the total loss is the sum of the localization loss and the confidence loss.
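In torchvision's Mask R-CNN, the components named above are returned as a dictionary during a training-mode forward pass, and the total loss is their sum. The sketch below shows one illustrative training step; the dummy image, target, and optimizer settings are assumptions made for the sake of a runnable example, and `model` refers to the network built in the previous sketch.

```python
import torch

model.train()  # `model` from the previous sketch
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

# Dummy inputs: a list of CHW float tensors and a list of target dicts with
# "boxes", "labels", and "masks" (illustrative shapes, not the study's data).
images = [torch.rand(3, 448, 448)]
masks = torch.zeros(1, 448, 448, dtype=torch.uint8)
masks[0, 32:128, 32:128] = 1  # mask filling the ground-truth box below
targets = [{
    "boxes": torch.tensor([[32.0, 32.0, 128.0, 128.0]]),
    "labels": torch.tensor([1]),
    "masks": masks,
}]

# In training mode the forward pass returns the component losses:
# loss_classifier, loss_box_reg, loss_mask, loss_objectness, loss_rpn_box_reg.
loss_dict = model(images, targets)
total_loss = sum(loss_dict.values())

optimizer.zero_grad()
total_loss.backward()
optimizer.step()
```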

Metrics for model performance

After training, the mean intersection over union (IoU) of each model was calculated by comparing the ground truth annotations with the lesions predicted on the test dataset using the model's trained weights; the IoUs calculated over the test images were averaged to give the mean IoU. In the case of SSD, the mean IoU is calculated differently from the segmentation algorithms: the IoUs of SSD are defined as 1, 0.5, and 0.33 when the number of predicted hepatic necrosis lesions equals 100%, 50%, and 33% of the number of ground truth labels, respectively. It is therefore difficult to compare the performance of the three algorithms in terms of the mean IoU alone. To overcome this limitation and confirm performance on large-scale images, we calculated and compared the precision, recall, and accuracy when predicting hepatic necrosis in 60 images (2688 × 2688 pixels) larger than the training images. Each 2688 × 2688 image was split into 448 × 448-pixel tiles (36 tiles per image). To calculate the precision, recall, and accuracy values, the ground truth of the test images was annotated using the same procedure as for the training data. True positive, false positive, false negative, and true negative predictions were defined by the detected presence or absence of the lesion in each tile compared with the ground truth labels. A schematic diagram of the calculation of precision, recall, and accuracy on the larger-scale test images is depicted in Fig. 2, and the values are calculated by the following Eqs. (1)–(3):

Precision = TP / (TP + FP) (1)

Recall = TP / (TP + FN) (2)

Accuracy = (TP + TN) / (TP + TN + FP + FN) (3)

where TP, TN, FP, and FN are the numbers of true positive, true negative, false positive, and false negative tiles, respectively. In addition, we calculated mask IoUs, i.e., the IoUs of the masks predicted by Mask R-CNN and DeepLabV3+, to confirm how precisely these models delineated the lesion area; the mask IoU is calculated by comparing the predicted area with the ground truth annotations.

Fig. 2

Procedure for calculating precision, recall, and accuracy values to evaluate each model's performance on large-scale images. The annotated 2688 × 2688 images are split into 448 × 448-pixel tiles, and each model predicts the presence or absence of the lesion in each tile. True and false predictions are then defined according to the ground truth annotation, and the precision, recall, and accuracy values for each 2688 × 2688 image are calculated
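A minimal sketch of this tile-level computation is given below, assuming that each 2688 × 2688 test image yields 36 tiles and that lesion presence per tile is available as a boolean for both prediction and ground truth. The function name and data layout are hypothetical.

```python
def tile_metrics(predicted, actual):
    """Precision, recall, and accuracy (Eqs. 1-3) over the 36 (6 x 6) tiles
    of one 2688 x 2688 test image. `predicted` and `actual` are parallel
    lists of booleans indicating lesion presence in each 448 x 448 tile."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    tn = sum((not p) and (not a) for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(predicted)
    return precision, recall, accuracy

# Hypothetical example: 36 tiles, 4 of which truly contain necrosis.
actual = [i in (7, 8, 13, 14) for i in range(36)]
predicted = [i in (7, 8, 14, 20) for i in range(36)]
print(tile_metrics(predicted, actual))  # (0.75, 0.75, 0.944...)
```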
