Artificial CT images can enhance variation of case images in diagnostic radiology skills training

Data

As a use-case for this study, we used the publicly available Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI) data set [11,12,13]. While radiographic lung images are readily available at sufficient volume for training, our choice was deliberate. First, this data set contains a large number of clinical CT scans: 1018 diagnostic and screening scans, both with and without pathological features (lung nodules). Second, the images were obtained using either standard or low-dose CT. Third, relevant lesions and features were annotated by four experienced thoracic radiologists via consensus, though images were not annotated with lesion attributes. LIDC-IDRI comprises CT scans from different institutions and individuals, making it a diverse data set in terms of patient demographics. However, specific demographic information about the patients, such as age, sex, and ethnicity, is not publicly available due to privacy concerns [13]. Thus, this data set provided a unique opportunity to test the reliability, realism, and utility of synthetic images for diagnostic radiology training; our methods (and results, below) are agnostic to the image modality, body part, and pathological features.

Upon retrieval, slices in each scan were harmonized by clipping the Hounsfield Units at [−1350, 150], the lung window [14]. The 2D slices were retained at their original resolution of 512 × 512 pixels; 17% of all 2D slices contained lung nodules. Detailed information about scan selection and pre-processing of 2D images can be found in Appendix 1 section A1.
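
As a concrete illustration, a minimal sketch of this windowing step is given below (the exact pipeline is described in Appendix 1 section A1; the rescaling to [0, 1] is our assumption, not a documented step):

```python
import numpy as np

# Sketch of the harmonization step described above; the full pipeline is
# in Appendix 1 section A1. `slice_hu` is a 512 x 512 slice in Hounsfield
# Units; rescaling to [0, 1] is our assumption, not a documented step.
LUNG_WINDOW = (-1350.0, 150.0)  # clipping bounds used in this study [14]

def harmonize_slice(slice_hu: np.ndarray) -> np.ndarray:
    """Clip a CT slice to the lung window and rescale intensities to [0, 1]."""
    lo, hi = LUNG_WINDOW
    clipped = np.clip(slice_hu, lo, hi)
    return (clipped - lo) / (hi - lo)
```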

Semantic image synthesis

Semantic image synthesis networks, a type of AI, provide a unique opportunity to realize precision education and to improve diagnostic skills. These networks constitute a specific type of generative adversarial network. They rely on additional “semantic” information to create (“synthesize”) the gross image features. This information is embedded in an image or map through labeling of, for instance, separate organs or pathological features. In this way, information can be introduced into the synthetic image that differs from, or is even absent in, the original, actual image (see Fig. 1).

Fig. 1

Workflow of obtaining the annotation maps and training and testing the semantic image synthesis network. The semantic image synthesis network consists of three models (light blue): the encoder, the generator, and the discriminator. It takes an annotation map as input and is additionally guided by an original image. The feedback from the discriminator allows all three models to learn during training; this feedback is not used during evaluation

In this work, we used the network developed by Park et al. [15], which seeks to better preserve the semantic information in the synthetic image. In this context, semantic information is described as the information about which pixel belongs to which object or group in an image. We chose this network mostly because of its potential to create a variety of images and partly because of the ease of implementation, not necessarily to create the “most realistic” synthetic image. The pipeline for training and evaluation of the network is depicted in Fig. 1. Further details of the training and network parameters are described in Appendix 1 section A1. All code necessary to replicate our results is provided at https://github.com/UT-RAM-AIM/Realism-Study.
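
The network of Park et al. preserves semantic information by letting the annotation map modulate normalized feature maps inside the generator (spatially-adaptive normalization). Below is a minimal, illustrative PyTorch sketch of such a normalization block; the channel sizes and layer choices are our assumptions and do not reflect the study's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADENorm(nn.Module):
    """Minimal sketch of a spatially-adaptive normalization block in the
    spirit of Park et al. [15]; sizes are illustrative only."""

    def __init__(self, feature_channels: int, n_labels: int, hidden: int = 128):
        super().__init__()
        # Parameter-free normalization of the generator features
        self.norm = nn.BatchNorm2d(feature_channels, affine=False)
        # The annotation map (one-hot, n_labels channels) predicts a
        # per-pixel scale (gamma) and shift (beta), so semantic information
        # survives normalization instead of being washed out.
        self.shared = nn.Sequential(
            nn.Conv2d(n_labels, hidden, kernel_size=3, padding=1), nn.ReLU()
        )
        self.gamma = nn.Conv2d(hidden, feature_channels, kernel_size=3, padding=1)
        self.beta = nn.Conv2d(hidden, feature_channels, kernel_size=3, padding=1)

    def forward(self, features: torch.Tensor, annotation: torch.Tensor) -> torch.Tensor:
        # Resize the annotation map to the current feature resolution
        annotation = F.interpolate(annotation, size=features.shape[2:], mode="nearest")
        hidden = self.shared(annotation)
        return self.norm(features) * (1 + self.gamma(hidden)) + self.beta(hidden)
```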

Annotation maps

The semantic image synthesis network utilizes a map of pre-determined features (an “annotation map”) to create images similar to, but distinct from, the original images. These maps allow the network to be guided to manipulate arbitrary features (in our case, anatomical and pathological characteristics) of the output synthetic image. In this work, we used annotation maps akin to segmentation maps, reflecting each of the major objects in the images. Each map is the same size as the 2D image slices, 512 × 512 pixels, and offers guidance and constraints on the shape and location of the target features.

The annotation maps in this work included five labels. We algorithmically segmented the original LIDC-IDRI images to obtain annotations that identify the full body, soft tissue, dense tissue, and total lung area (Appendix 1 section A1). Manual annotation provided by LIDC-IDRI was used to delineate lung nodules, if any, as the fifth label. These five labels were deemed to contain enough information, based on known anatomy, to guide the semantic synthesis network.
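
For illustration, a hypothetical integer encoding of these labels, and the conversion to the one-hot format commonly expected by semantic synthesis networks, could look as follows (the actual label values and ordering are defined in Appendix 1 section A1):

```python
import numpy as np

# Hypothetical integer encoding of the annotation-map labels; the actual
# values and ordering are defined in Appendix 1 section A1 (0 = background
# is our assumption).
LABELS = {"body": 1, "soft_tissue": 2, "dense_tissue": 3, "lung": 4, "nodule": 5}

def to_one_hot(annotation: np.ndarray, n_classes: int = 6) -> np.ndarray:
    """Convert an integer map (H, W) to one-hot (n_classes, H, W), the input
    format typically expected by semantic synthesis networks."""
    return (np.arange(n_classes)[:, None, None] == annotation[None]).astype(np.float32)
```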

Original images and corresponding annotation maps were randomly split into a training and validation set and an independent test set. Details about obtaining the annotation map, data split, and selection of the slices are described in Appendix 1 section A1.

Quality evaluation

To assess whether synthetic images (primary set) can be used alongside the original ones, we created additional synthetic images of lesser quality than the primary set. Specifically, as negative controls, we created one set of synthetic images that are of reasonable quality but have serious flaws (control set 1), and another set that is obviously not real (control set 2). This was critical to evaluate the extent to which the primary synthetic image set can blend in with the original image set, relative to images with deliberately low realism and images that are obviously unrealistic. To this end, we trained the network a second and a third time with, respectively, 2% and 0.3% of the main training data set. Details about the data split and network training for both subsets can be found in Appendix 1 section A1. This way, we ensured that the apparent quality of the primary set was not due to other intrinsic factors (e.g., the attention paid by the radiologists or their experience).
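
A minimal sketch of how such reduced training sets could be drawn is shown below; the actual split procedure (Appendix 1 section A1) may differ, and the fixed seed is a placeholder:

```python
import random

# Sketch of drawing the reduced training sets for the negative controls;
# `training_ids` and the fixed seed are placeholders.
def subsample(training_ids: list, fraction: float, seed: int = 0) -> list:
    """Randomly keep a `fraction` of the training cases (e.g., 0.02 or 0.003)."""
    rng = random.Random(seed)
    k = max(1, round(fraction * len(training_ids)))
    return rng.sample(training_ids, k)
```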

The degree to which the synthetic images are similar to original images can be assessed both quantitatively and qualitatively. However, for quantitative metrics, there is no consensus on a single best metric, nor on the validity of such metrics in clinical settings [16, 17]. In particular, quantitative metrics may not always reflect expert judgment [18,19,20,21]. Though qualitative, the perception of domain experts is a valid, reliable, and interpretable approach that also signifies the clinical relevance of the synthetic images [8, 19, 20, 22]. To determine whether these four sets ((1) original images, (2) primary synthetic set, (3) synthetic control set 1, and (4) synthetic control set 2) are distinguishable, the sets of images were evaluated both quantitatively and qualitatively.

As the main quantitative metric, we used the Structural Similarity Index Measure (SSIM). The SSIM is based on pairwise comparisons, i.e., comparison of the original image with the corresponding synthetic image. It ranges from 0 to 1, where 1 indicates identical images and 0 indicates completely dissimilar images. We chose the SSIM primarily for its interpretability; because different metrics assess different properties of a synthetic image relative to the original, we also derived four other common metrics (see Appendix 1 section A2). The impact of the size of the training set across the primary synthetic set and the two control sets was tested for statistical significance using a one-way ANOVA followed by a post hoc Tukey’s test. We tested the hypothesis that the primary synthetic set would achieve an SSIM score closer to 1 than control sets 1 and 2.
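
For reference, the pairwise SSIM described above can be computed with scikit-image; the snippet below is a sketch that assumes both slices are already windowed to the same intensity range:

```python
import numpy as np
from skimage.metrics import structural_similarity

# Pairwise SSIM between an original slice and its synthetic counterpart,
# assuming both are 2D arrays windowed to the same intensity range.
def pairwise_ssim(original: np.ndarray, synthetic: np.ndarray) -> float:
    return structural_similarity(
        original, synthetic,
        data_range=float(original.max() - original.min()))
```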

In addition to quantitative measures, we sought expert opinion based on previous approaches [19]. Five radiologists were asked to assess 60 quartets of 512 × 512 images. Three of the five were board-certified radiologists in the Netherlands and one was board-certified in the USA, all with > 10 years of experience in thoracic CT. The fifth was a radiology fellow in the Netherlands with 3 years of experience in thoracic CT. Every quartet contained one original image, selected randomly from the test data set, and three synthetic images (primary set, control set 1, control set 2). All synthetic images were generated using the same original annotation map that corresponds to the original image. The radiologists were presented with a quartet, with the location of each image within the quartet assigned randomly, and were blinded to which image was the original. First, the radiologists were asked to indicate which image they believed to be the original. Second, they were asked to score the quality of each image in a given quartet on a scale from 1 (unrealistic) to 4 (almost indistinguishable from the original image). Ordinal regression was used to test the ranking of the expert ratings across the original and synthetic image sets. We tested the hypothesis that radiologists can distinguish original from synthetic images.
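
As an illustration of this analysis, a sketch of an ordinal (proportional-odds) regression using statsmodels is given below; the toy ratings are invented, and the study's exact model specification may differ:

```python
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Illustrative sketch of the ordinal regression on the 1-4 realism
# ratings; the toy data below are invented for demonstration only.
ratings = pd.DataFrame({
    "score": [4, 3, 4, 3, 4, 2, 2, 1, 2, 1, 2, 1],
    "image_set": ["original"] * 3 + ["primary"] * 3
               + ["control1"] * 3 + ["control2"] * 3,
})
X = pd.get_dummies(ratings["image_set"], drop_first=True).astype(float)
result = OrderedModel(ratings["score"], X, distr="logit").fit(
    method="bfgs", disp=False)
print(result.summary())  # set coefficients test whether ratings differ by set
```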

Manipulating annotations

To explore the capabilities of the semantic image synthesis network, we manipulated the information in the annotation map that corresponds to the main pathology, lung nodules, as it is the most clinically salient feature. In particular, we used removal, insertion, and relocation of the lung nodule label in the annotation map. Through this approach, we investigated whether the resulting synthetic images adhere to these new constraints.
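
To make these manipulations concrete, the sketch below edits a hypothetical integer-encoded annotation map (using the illustrative label values from the earlier snippet, 4 = lung, 5 = nodule); the study's actual editing procedure may differ:

```python
import numpy as np

# Sketch of the three annotation-map manipulations described above;
# label values follow the hypothetical encoding used earlier.
LUNG, NODULE = 4, 5

def remove_nodule(annotation: np.ndarray) -> np.ndarray:
    edited = annotation.copy()
    edited[edited == NODULE] = LUNG   # relabel nodule pixels as lung
    return edited

def insert_nodule(annotation: np.ndarray, center: tuple, radius: int) -> np.ndarray:
    edited = annotation.copy()
    yy, xx = np.ogrid[:edited.shape[0], :edited.shape[1]]
    disk = (yy - center[0]) ** 2 + (xx - center[1]) ** 2 <= radius ** 2
    edited[disk & (edited == LUNG)] = NODULE  # only place the nodule in lung
    return edited

def relocate_nodule(annotation: np.ndarray, center: tuple, radius: int) -> np.ndarray:
    return insert_nodule(remove_nodule(annotation), center, radius)
```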

A second way to provide guiding information is to supply an original image to the network. In this case, the new synthetic image takes on the appearance of that particular guiding image. The adjusted synthetic image still adheres to the annotation map and therefore only reflects variability due to, e.g., scanner differences such as the visibility of a gantry or the overall intensity of the image. We also explored this “example-guided synthesis” by guiding the network with different original images while keeping the annotation map the same.
