Classification of rib fracture types from postmortem computed tomography images using deep learning

Ethics

The data used in this retrospective cohort study are in accordance with Swiss laws and ethical standards. The ethics approval for this study was waived by the Ethics Committee of the Canton of Zurich (KEK ZH-No. 15–0686).

Case selection

A total of 340 consecutive autopsy cases were retrospectively retrieved from July 2017 to April 2018 from the archives of the Institute of Forensic Medicine, University of Zurich, Switzerland. We excluded cases with signs of advanced decomposition (using the RA-index defined by Egger et al. [17]), corpses that had undergone organ explantation, cases of severe trauma with extensive damage to the corpse (e.g., amputation or exenteration), cases without whole-body PMCT, cases where rib fractures were not visible in the rib unfolding tool or were located in the cartilaginous part of the rib, and cases that were still under investigation during this period. After these exclusion criteria were applied, a total of 195 cases remained (55 females, median age 64 years; 140 males, median age 54 years). Of the 195 cases, 85 showed acute rib fractures, 84 had no rib fractures, and 26 presented subacute and chronic fractures either in combination with acute fractures or independently. Both complete and incomplete rib fractures were included, independent of their location. Fractures were classified as either “displaced” or “nondisplaced”; displaced fractures were further categorized as “ad latus” (sideways), “ad axim” (with angulation), “ad longitudinem cum contractione” (in long axis compressed fracture), or “ad longitudinem cum distractione” (in long axis with gap between the fragments).

Postmortem computed tomography data

Whole-body imaging was performed on a 128-slice dual source CT scanner (SOMATOM Flash Definition, Siemens, Forchheim, Germany) using automated dose modulation software (CARE Dose4D™, Siemens, Forchheim, Germany); the slice thickness was 1 mm, and the increment was 0.5 mm. The images were reconstructed with both soft and hard kernels. A complete overview of the technical parameters used to acquire the CT scans can be found in Flach et al. [18].

Image treatment prior to classification

The rib fracture images were reconstructed from volumetric CT data using Syngo.via rib unfolding tool CT Bone Reading (Siemens Healthineers GmbH, Erlangen, Germany) with standard bone window setting (center 450, width 1500) (see Fig. 1 for more details). The tool used for this conversion was developed by Ringl et al. [19].

Fig. 1

Workflow of the automated rib fracture classification pipeline. Each volumetric PMCT scan of the rib cage was transformed into a corresponding 2D representation. If the representation did not display any fracture (“no fracture”), we collected a series of sample images (each measuring \(99\times 99\) pixels) using a sliding window. Then, we randomly drew from a subset of those samples. If the representation displayed rib fractures, we collected a sample at the exact position of the fracture with an additional set of 16 samples. The additional set was obtained using data augmentation by sliding the \(99\times 99\)-pixel window in each of the four cardinal directions in 10-pixel steps. The samples from the four fracture types and the “no fracture” samples were fed into a ResNet50 architecture for training and testing. We validated the performance of our model on three levels of hierarchical taxonomy: (1) a high-level task where the model distinguished between “fracture” and “no fracture,” (2) a mid-level task to assess how well the model could classify “nondisplaced” and “displaced” fractures, and (3) a low-level task to validate the performance of the model in classifying the three different types of displaced fractures “ad latus” (sideways), “ad longitudinem cum contractione” (in long axis compressed fracture), or “ad longitudinem cum distractione” (in long axis with gap between the fragments)

Data mining

To extract data containing fractures, we used 270 images of unfolded rib cages with fractures. Two readers, one a medical student under supervision and one a board-certified forensic pathologist and radiologist, classified each fracture as either “displaced” or “nondisplaced.” The “displaced” fractures were further divided into “ad latus” (sideways), “ad axim” (with angulation), “ad longitudinem cum contractione” (in long axis compressed fracture), and “ad longitudinem cum distractione” (in long axis with gap between the fragments). Due to the very small number of “ad axim” fractures, we excluded them from further analysis. First, we cropped the images to \(500\times 1000\) pixels to eliminate the background and then upscaled them to 300% of the original size with the INTER_AREA interpolation method from OpenCV, resulting in large images measuring \(1500\times 3000\) pixels. This preprocessing step yielded an image size large enough to divide into a sufficient number of patches while still capturing all fractures. All fractures were marked using their respective x- and y-coordinates on the large image. For each large image containing one or more fractures, we applied data augmentation by shifting the sliding window from the centered x- and y-coordinates in all four cardinal directions (up, down, right, and left) in steps of 10 pixels, resulting in 16 additional samples next to the original sample (centered around the fracture). For each fracture, we then manually removed the sample images for which the data augmentation resulted in a loss of information (e.g., the fracture was no longer visible). This sample curation led to 11,759 “displaced” (“ad latus” 1785, “ad longitudinem cum contractione” 6801, and “ad longitudinem cum distractione” 3173) and 18,462 “nondisplaced” samples, for a total of 30,221 “fracture” images.
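The augmentation scheme described above can be sketched as follows (a minimal example; the OpenCV upscaling step is omitted, and the function names are ours, not from the published pipeline):

```python
import numpy as np

PATCH = 99   # patch edge length in pixels
STEP = 10    # shift step in pixels
N_STEPS = 4  # 4 steps x 4 cardinal directions = 16 extra samples

def augmentation_offsets(step=STEP, n_steps=N_STEPS):
    """Center offset plus shifts in the four cardinal directions."""
    offsets = [(0, 0)]
    for s in range(1, n_steps + 1):
        d = s * step
        offsets += [(-d, 0), (d, 0), (0, -d), (0, d)]
    return offsets

def extract_patch(img, cx, cy, size=PATCH):
    """Crop a size x size patch centered on (cx, cy); None if out of bounds."""
    half = size // 2
    y0, y1 = cy - half, cy + half + 1
    x0, x1 = cx - half, cx + half + 1
    if y0 < 0 or x0 < 0 or y1 > img.shape[0] or x1 > img.shape[1]:
        return None
    return img[y0:y1, x0:x1]

# Example: one fracture centered at x = 200, y = 150 on a dummy 1500x3000 image
img = np.zeros((1500, 3000), dtype=np.uint8)
patches = [extract_patch(img, 200 + dx, 150 + dy)
           for dy, dx in augmentation_offsets()]
patches = [p for p in patches if p is not None]  # drop out-of-bounds samples
```

Patches that fall outside the image are discarded automatically here; in the study, samples where the fracture was no longer visible were additionally removed by hand.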

To extract samples with the label “no fracture,” we used 231 images of unfolded rib cages without any fractures and applied the same preprocessing steps (cropping and resizing) as for the images with fractures. Employing a \(99\times 99\)-pixel sliding window shifted in 25-pixel steps along both the \(x\)- and \(y\)-axes, we obtained 231,926 patches. From these, we randomly selected 30,251 “no fracture” images, resulting in an approximately balanced dataset of 60,472 samples in total.
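The “no fracture” sampling can be sketched as follows (window size and stride as stated above; helper names and the random seed are ours):

```python
import numpy as np

def sliding_window_positions(height, width, size=99, stride=25):
    """Top-left corners of every size x size window at the given stride."""
    return [(y, x)
            for y in range(0, height - size + 1, stride)
            for x in range(0, width - size + 1, stride)]

rng = np.random.default_rng(seed=42)

def sample_patches(img, n_samples, size=99, stride=25):
    """Randomly draw n_samples windows (without replacement) from one image."""
    positions = sliding_window_positions(*img.shape[:2], size, stride)
    chosen = rng.choice(len(positions), size=n_samples, replace=False)
    return [img[y:y + size, x:x + size]
            for y, x in (positions[i] for i in chosen)]
```

In the study the random draw was made from the pooled windows of all 231 fracture-free images; here the sampler is shown per image for brevity.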

Training, validation, and testing

For our study, we used a Windows workstation (Windows 10, Nvidia GeForce GTX 1660 SUPER, 64 GB of RAM). We split our data into ~ 70% training and ~ 30% test data. Representations of the same fracture were kept together in each partition to prevent data leakage into the test set; thus, the partition sizes varied slightly. We then ran a 5-fold cross-validation on the training dataset with different hyperparameters. We selected the best hyperparameters (see Section “Model architecture and hyperparameters”) by assessing the epochs with the highest validation F1 score. Finally, we trained the model with the best hyperparameters on the full training dataset and evaluated it on the test set. We assessed three levels of hierarchical taxonomy (see Fig. 1 for more details):

1. Performance of the model on the balanced binary task of classifying “no fracture” versus “fracture,” reported with the accuracy score (high-level task).

2. Performance of the model on the imbalanced binary task of classifying “displaced” versus “nondisplaced” fractures, reported with the F1, precision, and recall scores (mid-level task).

3. Performance of the model on the imbalanced multiclass task of classifying the displaced classes “ad latus,” “ad longitudinem cum contractione,” and “ad longitudinem cum distractione,” reported with the F1, precision, and recall scores (low-level task).
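The grouped ~ 70/30 split described above can be sketched as follows (a minimal pure-Python example; function and variable names are ours). Splitting by fracture ID rather than by sample guarantees that no representation of a test-set fracture is seen during training:

```python
import random

def grouped_split(samples, groups, test_frac=0.3, seed=0):
    """Split samples ~70/30 while keeping all representations of the
    same fracture (same group id) in a single partition."""
    rng = random.Random(seed)
    ids = sorted(set(groups))
    rng.shuffle(ids)
    n_test = round(len(ids) * test_frac)
    test_ids = set(ids[:n_test])
    train = [s for s, g in zip(samples, groups) if g not in test_ids]
    test = [s for s, g in zip(samples, groups) if g in test_ids]
    return train, test
```

Because whole groups are assigned to one side or the other, the resulting partitions only approximate the 70/30 ratio, which is why the partition sizes vary slightly.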

Additionally, we defined two types of assessment:

1. Performance measured on the individual fracture representations (referred to as the “standard” assessment), as in a simple image classification task.

2. Aggregation of the prediction values from multiple representations of the same fracture into a single prediction value. The aggregation procedure starts by applying a custom-made function \(Y\) to the predicted values. The function \(Y\) is defined as

$$Y=\left\{\begin{array}{ll}0, & \text{if } \sum\limits_{i=1}^{n}\hat{y}_{i}=0\\ 1, & \text{otherwise}\end{array}\right.$$

where \(\hat{y}_{i}\) denotes the label predicted by the model for representation \(i\), and \(n\) is the number of representations of the fracture. The variable \(\hat{y}_{i}\) can take any integer value from 0 to \(c\), where \(c\) is the number of fracture classes. Hence, \(Y=0\) if every representation \(i\) was classified into class 0 (“no fracture”); otherwise, \(Y=1\), i.e., at least one representation was classified into a nonzero class (“fracture”). When \(Y=1\), we used the maximum operator to determine the fracture type \(k\):

$$k= \underset{c}{\mathrm{argmax}}\left(\frac{1}{n}\sum_{i=1}^{n}\hat{z}_{i}^{c}\right)$$

where \(\hat{z}_{i}^{c}\) stands for the model output value for class \(c\) of representation \(i\) before entering the softmax function. In other words, the aggregated prediction value corresponding to a single fracture is the class with the highest mean pre-softmax output over all of its representations. This ensures that a fracture is detected even when only one representation carries a weak signal. We refer to this type of assessment as “aggregated.”
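A minimal NumPy sketch of this aggregation rule (assuming column 0 of the model output corresponds to “no fracture”; the function name is ours):

```python
import numpy as np

def aggregate_prediction(logits):
    """Aggregate pre-softmax outputs of all representations of one fracture.

    logits: array of shape (n_representations, c + 1), where column 0 is
    the "no fracture" class and columns 1..c are the fracture types.
    Returns 0 ("no fracture") or the index of the predicted fracture type.
    """
    preds = logits.argmax(axis=1)  # per-representation labels y_hat_i
    if preds.sum() == 0:           # Y = 0: every representation predicted 0
        return 0
    # Y = 1: pick the fracture type with the highest mean pre-softmax output
    mean_logits = logits[:, 1:].mean(axis=0)
    return 1 + int(mean_logits.argmax())
```

For example, with two representations whose outputs are `[0.1, 2.0, 0.5]` and `[0.2, 1.0, 3.0]`, both vote for a nonzero class, and class 2 has the highest mean output, so the fracture is assigned class 2.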

Model architecture and hyperparameters

We used the ResNet50 architecture [20] pretrained on the ImageNet database, combined with two additional dense layers of 198 neurons each and a dropout layer with a dropout rate of 0.5. Additionally, we included the EarlyStopping function to stop training when the validation loss was minimal (patience = 15), and we used the ReduceLROnPlateau function to downscale the learning rate when the validation loss stopped improving (patience = 2) [21]. The batch size was set to 16, and we used the categorical cross-entropy loss function with the Adam optimizer. We first froze the layers of the pretrained network and trained on our data (max = 100 epochs, depending on early stopping) with a learning rate of 0.0001. Then, we unfroze the layers and fine-tuned the network (max = 100 epochs, depending on early stopping) with a learning rate of 8e−05.
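With the stated hyperparameters, the setup could be sketched in Keras roughly as follows (a sketch, not the authors' code: the ReLU activations of the dense layers, the 99 × 99 RGB input shape, and the five-class output are our assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks, optimizers

def build_model(n_classes=5, weights="imagenet"):
    """ResNet50 backbone + two 198-neuron dense layers and dropout 0.5.

    n_classes=5 assumes "no fracture", "nondisplaced", and the three
    displaced fracture types.
    """
    base = tf.keras.applications.ResNet50(
        weights=weights, include_top=False, pooling="avg",
        input_shape=(99, 99, 3))
    base.trainable = False  # stage 1: train with a frozen backbone
    model = models.Sequential([
        base,
        layers.Dense(198, activation="relu"),
        layers.Dense(198, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

cbs = [
    callbacks.EarlyStopping(monitor="val_loss", patience=15),
    callbacks.ReduceLROnPlateau(monitor="val_loss", patience=2),
]
# Stage 1: model.fit(..., epochs=100, batch_size=16, callbacks=cbs)
# Stage 2: set base.trainable = True, recompile with Adam(8e-5), and
# fine-tune for up to another 100 epochs with the same callbacks.
```

Recompiling after unfreezing is required in Keras for the new trainable weights and the lower learning rate to take effect.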
