Generalizable attention U-Net for segmentation of fibroglandular tissue and background parenchymal enhancement in breast DCE-MRI

Patient data

This retrospective study was approved by the local ethics committee. The main parameters of the datasets utilized in this study are provided in Table 1. Datasets 1–3 originate from our institution, whereas dataset 4 is a small subset of the public Duke-Breast-Cancer Dataset [31]. All data were acquired in the transverse plane with the patient in a prone position, with fat saturation applied to the DCE T1 sequences.

Table 1 Main parameters of the datasets used in this work

To curate datasets 1–3, we searched the Picture Archiving and Communication System (PACS) of our institution. The search aimed to identify DCE-MRI examinations fulfilling the following inclusion criteria: (a) age above 18 years, (b) absence of implants, (c) availability of assessments of the FGT and BPE classes in the corresponding radiological report, and (d) a BI-RADS assessment category of 1 or 2, corresponding to a 0% likelihood of malignancy [8]. Examinations lacking proper fat saturation or suffering from motion blur were excluded. The FGT and BPE classes, determined by a board-certified radiologist, were extracted from the radiological report.

Dataset 1 was curated by searching through examinations acquired between September 2013 and October 2015 using a 3.0-T scanner. This dataset was used for model training, validation, and testing. For the volumetric analysis, examinations acquired between September 2020 and October 2022 were searched until a total of 80 eligible examinations yielding approximately balanced distributions of the FGT and BPE classes were identified. In this way, dataset 2, containing examinations acquired with a 1.5-T scanner, and dataset 3, containing examinations acquired with a 3.0-T scanner, were obtained. Examinations from dataset 2 were additionally used for testing the models’ performance.

AI model development design

The design of the AI model development differed from previous works by utilizing two separate models: one for FGT segmentation from the native volume and the other for BPE segmentation from the subtraction volumes. This approach allows all subtraction images to be analyzed without the need for precise registration. This is important, as even a small misalignment between sequences (cf. Fig. S1) can have a significant impact on the volumetric analysis, especially in cases with almost entirely fat FGT and minimal BPE. Another advantage of using two separate models is that potential errors in the FGT segmentation do not affect the BPE segmentation. For instance, artifacts or extreme superior and inferior regions with higher intensity on the native volume may be mistakenly included in the FGT segmentation.

As the FGT and BPE are very complex and fine structures, it is important that the predicted mask has a resolution similar to that with which the MRI data were acquired. Hence, we opted for the highest resolution of our data, i.e., 448 × 448, as the input and output size. To accommodate a reasonable batch size for training on an NVIDIA GeForce RTX 3090 (24 GB), we chose to train the models slice by slice (Fig. 1). Our approach utilized a 2D implementation of the attention U-Net from the repository of Yingkai Sha [32], with spatial 2D dropout layers added in each convolution stack. An additional advantage of using a 2D model is that it can effectively incorporate volumes with varying numbers of slices, and any necessary rescaling is performed solely in 2D.
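As a rough illustration, such a model could be instantiated as sketched below, assuming the keras-unet-collection package from Yingkai Sha's repository [32]; the encoder widths, label count, and use of batch normalization are illustrative assumptions, not values reported in this work.

```python
# Minimal sketch of the 2D attention U-Net setup, assuming the
# keras-unet-collection package (Yingkai Sha's repository [32]).
# Encoder widths, label count, and batch_norm are illustrative assumptions.
from keras_unet_collection import models

model = models.att_unet_2d(
    input_size=(448, 448, 1),          # single 448 x 448 slice as input/output
    filter_num=[64, 128, 256, 512],    # assumed encoder widths
    n_labels=2,                        # e.g., FGT vs. fatty tissue
    stack_num_down=2,
    stack_num_up=2,
    attention='add',                   # additive attention gates
    output_activation='Softmax',
    batch_norm=True,
)
# The spatial 2D dropout layers described above are not exposed by this call;
# adding them requires editing the convolution stacks in the repository code.
```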

Fig. 1

Schematic representation of the model development pipeline. Two independent attention U-Net models are trained: the first is trained to segment the fibroglandular tissue (FGT) and the fatty tissue from the native DCE data; the second is trained to segment the BPE and non-enhancing tissue from the subtraction data. This separation ensures accurate segmentation even for poorly registered cases. In both cases, the segmentation is performed slice-wise, ensuring that, with the chosen hardware, the predicted mask has a resolution high enough to accurately capture the intricate details of the FGT and BPE structures (Icons made by Freepik and Netscript from flaticon.com)

Ground truth masks

The ground truth masks were created in 3D Slicer [33] by S.N. First, the breast was segmented without the skin using the Grow from Seeds algorithm, Gaussian smoothing, and fine-tuning with the Paint and Erase tools. Afterwards, the FGT and BPE were segmented by thresholding customized to each volume, followed by fine-tuning that also allowed for the removal of artifacts. A sample of the final segmentation masks was verified by A.L. (resident in radiology with more than 3 years of experience in breast imaging) and A.B. (board-certified radiologist with over 15 years of experience in breast imaging). Due to time and resource constraints, intra- and inter-reader variability was not investigated.

Dataset splitting

Dataset 1 was split into patient-stratified training, validation, and test sets: the FGT model was trained using 2112 slices from 20 patients, validated on 416 slices from 4 patients, and tested on 520 slices from 5 patients. The model was additionally evaluated on datasets 2 and 4, which comprised 1004 slices from 6 patients. Importantly, the subtraction volumes exhibit lower contrast and a lower signal-to-noise ratio than the native volumes. To account for this, a larger amount of data was utilized for the BPE model. The training set for the BPE model comprised 11,829 slices from 54 patients, the validation set 2469 slices from 12 patients, and the test set 3095 slices from 16 patients. Subsequently, the BPE segmentation model was tested on datasets 2 and 4, which collectively contained 2672 slices from 6 patients.
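A patient-stratified split of this kind can be sketched, for instance, with scikit-learn's GroupShuffleSplit, which keeps all slices of a patient within a single subset; the split fractions below only approximate the patient counts reported above, and the function name and variables are hypothetical.

```python
# Illustrative patient-stratified splitting; all slices of a patient stay in
# one subset. Split fractions approximate the reported patient counts.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def patient_split(X, y, patient_ids, test_frac=0.17, val_frac=0.14, seed=0):
    patient_ids = np.asarray(patient_ids)
    # Carve out the test patients first ...
    gss = GroupShuffleSplit(n_splits=1, test_size=test_frac, random_state=seed)
    trainval_idx, test_idx = next(gss.split(X, y, groups=patient_ids))
    # ... then split the remainder into training and validation patients.
    gss_val = GroupShuffleSplit(n_splits=1,
                                test_size=val_frac / (1.0 - test_frac),
                                random_state=seed)
    tr, va = next(gss_val.split(X[trainval_idx], y[trainval_idx],
                                groups=patient_ids[trainval_idx]))
    return trainval_idx[tr], trainval_idx[va], test_idx
```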

Model training

All data were rescaled to the 0–1 range prior to training. A subset of the dataset was used for hyperparameter tuning using fivefold cross-validation. The best hyperparameters obtained in this way were then fine-tuned during training on the entire dataset. Five rounds of training of the native and subtraction models were then performed using the best fine-tuned hyperparameters. Notably, the best performance was achieved with the focal Tversky loss [34], which harshly penalizes false negatives when the α parameter of the loss is set to 0.99 and the β parameter to 0.01. Additionally, brightness augmentation in the 0.2–1.8 range during training delivered the best performance on the test set. All final hyperparameters, together with average inference runtimes, are reported in Table S1.
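A minimal sketch of the focal Tversky loss with the stated α = 0.99 / β = 0.01, and of the brightness augmentation, is given below; the focal exponent γ, the smoothing constant, and the multiplicative interpretation of the brightness range are assumptions not specified in the text.

```python
# Focal Tversky loss [34] with alpha = 0.99 / beta = 0.01 as stated above;
# gamma and the smoothing constant are assumptions (not given in the text).
import tensorflow as tf

def focal_tversky_loss(alpha=0.99, beta=0.01, gamma=0.75, smooth=1e-6):
    def loss(y_true, y_pred):
        y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
        y_pred = tf.reshape(tf.cast(y_pred, tf.float32), [-1])
        tp = tf.reduce_sum(y_true * y_pred)
        fn = tf.reduce_sum(y_true * (1.0 - y_pred))  # weighted by alpha
        fp = tf.reduce_sum((1.0 - y_true) * y_pred)  # weighted by beta
        tversky = (tp + smooth) / (tp + alpha * fn + beta * fp + smooth)
        return tf.pow(1.0 - tversky, gamma)
    return loss

# Brightness augmentation in the 0.2-1.8 range, assuming a multiplicative
# factor applied to slices already rescaled to 0-1.
def random_brightness(image, low=0.2, high=1.8):
    factor = tf.random.uniform([], low, high)
    return tf.clip_by_value(image * factor, 0.0, 1.0)
```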

Model evaluation

The model obtained in each training run was evaluated on the test data from three datasets: datasets 1, 2, and 4 (cf. Table 1). First, the evaluation centered on application-relevant metrics, i.e., the breast volume and FGT(%) (Eq. 1)/BPE(%) (Eq. 2), derived from the ground truth and predicted masks.

$$\mathrm{FGT}_{(\%)}=\frac{V_{\mathrm{FGT}}}{V_{\mathrm{breast}}}\cdot 100\%$$

(1)

$$\mathrm{BPE}_{(\%)}=\frac{V_{\mathrm{BPE}}}{V_{\mathrm{breast}}}\cdot 100\%$$

(2)

Their correlation was plotted, followed by a linear fit and the calculation of the Pearson correlation coefficient (r). Second, the volumetric DSC was computed for the breast and the FGT/BPE masks. Additionally, a weighted DSC was calculated, with weights proportional to FGT(%)/BPE(%). This adjustment accounts for the stronger penalization of small spatial shifts at lower FGT(%)/BPE(%). The models were additionally evaluated with Bland–Altman plots. Lastly, the overlays of the ground truth and predicted masks were assessed visually.
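These metrics could be computed as sketched below; the weighting follows the description above (weights proportional to the ground-truth FGT(%)/BPE(%)), while the weighted averaging itself and all names are our assumptions.

```python
# Sketch of the volumetric evaluation: per-case DSC, a weighted mean DSC with
# weights proportional to FGT(%)/BPE(%), and the Pearson correlation.
import numpy as np
from scipy.stats import pearsonr

def dice(gt_mask, pred_mask):
    gt, pred = gt_mask.astype(bool), pred_mask.astype(bool)
    inter = np.logical_and(gt, pred).sum()
    return 2.0 * inter / (gt.sum() + pred.sum())

def weighted_mean_dice(dscs, tissue_percent):
    # Cases with low FGT(%)/BPE(%), where small shifts are penalized most,
    # contribute proportionally less to the average.
    w = np.asarray(tissue_percent, dtype=float)
    return float(np.sum(w * np.asarray(dscs)) / np.sum(w))

# Correlation between ground-truth and predicted FGT(%), e.g.:
# r, p = pearsonr(fgt_percent_gt, fgt_percent_pred)
```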

Volumetric analysis

The best-performing models were used to quantify the density of the healthy breast tissue and the percentage of it taking up contrast, using datasets 2 and 3. FGT(%) according to Eq. (1) and BPE/FGT(%) according to Eq. (3) were calculated from the predicted masks.

$$\mathrm{BPE/FGT}_{(\%)}=\frac{V_{\mathrm{BPE}}}{V_{\mathrm{FGT}}}\cdot 100\%$$

(3)

Next, the correlation between these quantitative measures and the qualitative assessment by radiologists was evaluated using the Spearman correlation coefficient (ρ), taking into account the errors in the calculation of FGT(%) and BPE/FGT(%). These errors were calculated by propagation of uncertainty from the mean absolute errors of the native and subtraction models on the test set.
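The propagation step can be sketched as below, assuming standard relative-error propagation for a quotient (our assumption of the propagation scheme); the sigma values stand in for the models' mean absolute errors, and the helper name is hypothetical.

```python
# Sketch of the correlation analysis: relative-error propagation for a
# quotient (assumed propagation scheme) and Spearman's rho.
import numpy as np
from scipy.stats import spearmanr

def ratio_percent_error(num, den, sigma_num, sigma_den):
    # e.g., FGT(%) = V_FGT / V_breast * 100%; sigmas are the mean absolute
    # errors of the native/subtraction models on the test set.
    value = num / den * 100.0
    return value * np.sqrt((sigma_num / num) ** 2 + (sigma_den / den) ** 2)

# Spearman correlation between quantitative measures and radiologist classes:
# rho, p = spearmanr(fgt_percent_pred, fgt_class_radiologist)
```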

All of the evaluation was performed by S.N.
