We have worked with three different public datasets, namely, BCD [25], B [26], and BUSIS [27]. BCD contains 780 images classified as normal (133), benign (487), or malignant (210); B has 163 images, labeled with the tumor type (cyst, fibroadenoma, etc.) and the malignancy classification (110 benign and 53 malignant); and BUSIS has 562 images with no label.
Our study revealed some limitations in BCD that are not reported in the literature. First, in images with several nodules, some of them are not segmented, especially simple cysts. Furthermore, using the Scale-Invariant Feature Transform (SIFT) algorithm to detect zoomed or rotated copies of the same image [28], we found that 150 images had at least one almost identical copy, 8 of them labeled with a different class (see Fig. 1): 6 of these were classified as both benign and malignant, and the other 2 as benign and normal. For this reason, 189 images were discarded. SIFT did not detect images corresponding to the same nodule taken at different times during the ultrasound scan (see Fig. 1); we also excluded these images from our dataset. We applied SIFT to the other two datasets, finding 2 duplicate images in B and 8 in BUSIS. After cleaning duplicates and images corresponding to the same tumor, we assumed that the remaining images corresponded to different tumors. This cleaning avoided having the same or very similar tumors in both training and testing, which could have biased the results.
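As an illustration of this duplicate-detection step, the following is a minimal sketch using OpenCV's SIFT implementation with Lowe's ratio test; the ratio and match-count thresholds are our own illustrative assumptions, not values reported here.

```python
import cv2

def sift_descriptors(path):
    """Detect SIFT keypoints and compute their descriptors for a grayscale image."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    return cv2.SIFT_create().detectAndCompute(img, None)[1]

def likely_duplicates(path_a, path_b, ratio=0.75, min_matches=50):
    """Flag two images as near-duplicates (e.g., zoomed or rotated copies)
    when many SIFT matches survive Lowe's ratio test. Both thresholds are
    illustrative and would need tuning on the actual datasets."""
    des_a, des_b = sift_descriptors(path_a), sift_descriptors(path_b)
    matches = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
    good = [m for m, n in matches if m.distance < ratio * n.distance]
    return len(good) >= min_matches
```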
Additionally, to avoid the “Clever Hans phenomenon” [29], in which correct classifications are based on “spurious” features, images with extra information, such as color maps or delimited ROIs, were only used for training (because they improved the results), not for validation or testing.
Fig. 1 Detection of duplicates. The two images in the upper row are detected by SIFT as copies in the BCD dataset; one was labeled as benign and the other as malignant. The images in the lower row contain the same nodule across time; SIFT does not detect them as duplicates, but we only used one of them for description and classification
All three databases have numerous simple cysts, which are the easiest to detect and describe; in fact, our model trained with only 75 images (25 of them were simple cysts) was able to detect and describe them correctly. We discarded most of these simple cysts and focused on more complex nodules.
In conclusion, we obtained from these datasets a total of 749 images: 154 from B, 339 from BCD, and 256 from BUSIS, with their corresponding malignancy classification, if available (306 were benign and 177 malignant). Given that these public datasets do not contain BI-RADS descriptors or ROIs, the images were annotated by one of the authors (MPJ), a breast radiologist with more than 30 years of experience. More information about the descriptors we used can be found in Appendix A.
Architecture

The core of our system is a model consisting of two elements: a multi-class classification network with an attention mechanism that returns the BI-RADS descriptors and the Boolean malignancy classification, and a multinomial logistic regression that returns the BI-RADS classification. It is preceded by a YOLO module that obtains the ROIs and followed by a rule-based model that fine-tunes the results and gives the final output in natural language.
Detecting ROIs with YOLO

We use YOLO [30] to detect the ROIs, i.e., the nodules, in each image. Since this is a fully convolutional algorithm, it can take images of different sizes and shapes as input. YOLO is very fast, and its eighth version, YOLOv8, runs at 50 frames per second. If the width or height of a detected ROI exceeds 450 pixels, the image is resized keeping the width-height ratio, which is relevant for some descriptors, such as the orientation. We then use zero-padding to fill the ROIs to 450\(\times\)450 pixels. This padding does not affect the CNN training procedure and has no negative impact on time performance [31].
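The resize-and-pad step can be sketched as follows with OpenCV and NumPy; the function and variable names are ours, and top-left placement of the ROI is an assumption (centering would work equally well):

```python
import cv2
import numpy as np

def pad_roi(roi, target=450):
    """Resize an ROI so that neither side exceeds `target` pixels, preserving
    the width-height ratio, and then zero-pad it to target x target."""
    h, w = roi.shape[:2]
    if max(h, w) > target:
        scale = target / max(h, w)
        roi = cv2.resize(roi, (int(w * scale), int(h * scale)))
        h, w = roi.shape[:2]
    padded = np.zeros((target, target) + roi.shape[2:], dtype=roi.dtype)
    padded[:h, :w] = roi  # top-left placement is an assumption, not the paper's choice
    return padded
```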
Extracting BI-RADS Descriptors

As mentioned above, the core of our system is the model shown in Fig. 2. The first of its two elements is a multi-class classification algorithm, which takes as input each nodule extracted by YOLO and outputs its BI-RADS descriptors and the Boolean malignancy classification. The multi-class classification algorithm has an encoder consisting of a convolutional layer with max-pooling and GELU activation function [32], and a VGG16 [11] with an output size of 14\(\times\)14\(\times\)512; i.e., for each image, it yields a 196\(\times\)512 feature-space matrix, \(\textbf{H}\), which is then batch-normalized.
Fig. 2 This model receives as input a ROI, passes it through a convolutional layer with max-pooling and a VGG16, which gives the feature space, and normalizes it. Then, the attention classification network returns the tumor descriptors, such as “oval”, “circumscribed”, etc., as well as the type of tumor (“fibroadenoma”) and the Boolean malignancy classification (“benign”). A multinomial logistic regression uses all these features, except the Boolean malignancy classification, to yield the BI-RADS classification
The attention component of the classification network computes a weighted average of the feature space, known as context [33], calculated as follows:
$$\begin{aligned} \text{context}=\textbf{c}=\sum _{i=1}^{196}a_{i}\cdot \textbf{h}_{i}\;, \end{aligned}$$
(1)
where \(\textbf{h}_{i}\) (a vector of dimension 512) is the i-th row of \(\textbf{H}\) and \(\textbf{a}\) is a vector of weights, such that
$$\begin{aligned} a_{i}=\frac{\exp (\tanh (\textbf{V}\cdot \textbf{h}_{i}))}{\sum _{j=1}^{196}\exp (\tanh (\textbf{V}\cdot \textbf{h}_{j}))}\;. \end{aligned}$$
(2)
\(\textbf{V}\) is a weight matrix learned during training. This means that we first input each feature-space vector \(\textbf{h}_{i}\) into a perceptron with a hyperbolic tangent (tanh) activation function that returns its “importance” in a range from \(-1\) to 1. We then feed all of these values into a softmax function that gives a normalized version of these “importances,” \(a_{i}\), ranging from 0 to 1 and adding up to a total of 1, so that we can calculate the weighted average of the feature space. Since the \(\textbf{h}_{i}\)’s are the output of the convolutional encoder and we calculated the “importance” of each one, \(a_{i}\), these weights indicate the regions of the image to which the model has paid more attention.
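For concreteness, a minimal PyTorch sketch of the attention pooling defined by Eqs. (1) and (2) follows; the paper does not name a framework, so the class and variable names are ours:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Attention over the 196x512 feature space (Eqs. 1-2): a scalar
    importance tanh(V.h_i) per row, a softmax over the 196 importances,
    and their weighted average as the context vector."""
    def __init__(self, dim=512):
        super().__init__()
        self.V = nn.Linear(dim, 1, bias=False)  # the learned weight matrix V

    def forward(self, H):                       # H: (batch, 196, 512)
        scores = torch.tanh(self.V(H))          # importances in [-1, 1]
        a = torch.softmax(scores, dim=1)        # weights in [0, 1], summing to 1
        context = (a * H).sum(dim=1)            # weighted average: (batch, 512)
        return context, a.squeeze(-1)
```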
This context, \(\textbf{c}\), is the input to six dense layers, one for each BI-RADS descriptor: shape, margin, orientation, echogenicity, posterior features, and halo. Each layer has one output neuron for each possible value of the descriptor, with a sigmoidal activation function for orientation, and a softmax for the other descriptors. The sigmoidal function was chosen because some nodules are neither parallel nor anti-parallel (for example, round tumors); therefore, our model only provides this descriptor when the result of the sigmoid function exceeds a certain threshold, empirically set to 0.3.
The dense layer that gives the suggestivity or tumor type, which in Fig. 2 returns the label “fibroadenoma,” receives the output from the layers of the six BI-RADS descriptors (because the suggestivity of a tumor depends on them) and the context, \(\textbf{c}\). We assigned to this layer a softmax activation function and an extra label, “no clear suggestivity,” because for 484 of the 749 nodules the radiologist could not choose a value for this descriptor; a sigmoid activation function would have returned a label for those nodules, regardless of the threshold.
Finally, the dense layer for the Boolean malignancy classification (which in Fig. 2 returns the label “benign”) receives the descriptors, the context, and the suggestivity as inputs and combines them with a softmax.
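To make the wiring of these dense heads concrete, the following sketch continues the PyTorch example above; the class counts per descriptor and for suggestivity are illustrative placeholders (the real label sets are in Appendix A):

```python
import torch
import torch.nn as nn

# Illustrative class counts per descriptor; the actual label sets are in Appendix A.
DESCRIPTOR_CLASSES = {"shape": 3, "margin": 5, "echogenicity": 5,
                      "posterior": 4, "halo": 2}

class DescriptorHeads(nn.Module):
    """Dense heads over the context vector: softmax heads for five descriptors,
    a sigmoid head for orientation (reported only when it exceeds 0.3), a
    suggestivity head fed by the descriptors plus the context, and a malignancy
    head fed by the descriptors, the context, and the suggestivity."""
    def __init__(self, dim=512, n_suggestivity=10):  # n_suggestivity is a guess
        super().__init__()
        self.heads = nn.ModuleDict(
            {name: nn.Linear(dim, n) for name, n in DESCRIPTOR_CLASSES.items()})
        self.orientation = nn.Linear(dim, 1)
        desc_dim = sum(DESCRIPTOR_CLASSES.values()) + 1
        self.suggestivity = nn.Linear(dim + desc_dim, n_suggestivity)
        self.malignancy = nn.Linear(dim + desc_dim + n_suggestivity, 2)

    def forward(self, context):                      # context: (batch, 512)
        desc = {name: torch.softmax(head(context), dim=-1)
                for name, head in self.heads.items()}
        orient = torch.sigmoid(self.orientation(context))   # report only if > 0.3
        desc_vec = torch.cat(list(desc.values()) + [orient], dim=-1)
        sugg = torch.softmax(
            self.suggestivity(torch.cat([context, desc_vec], dim=-1)), dim=-1)
        malig = torch.softmax(
            self.malignancy(torch.cat([context, desc_vec, sugg], dim=-1)), dim=-1)
        return desc, orient, sugg, malig
```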
BI-RADS Multinomial Logistic Regression

The second part of the model is a multinomial logistic regression that receives the descriptors and outputs the nodule’s final BI-RADS classification (the label “BI-RADS 3” in Fig. 2). We did not incorporate this into the multi-class classification model because it worsened the performance. In addition, basing the output only on the descriptors and not on the image itself allows the system to explain the classification. One of the benefits of using a multinomial logistic regression is the simplicity and explainability of the algorithm. For example, Fig. 3 shows the “importances”/weights of the descriptors for BI-RADS 3 and 4A for the example in Fig. 2.
Fig. 3 Descriptors’ weights for the example in Fig. 2. Red points are the weights for the BI-RADS 3 output, while blue points are the weights for the second option, BI-RADS 4A
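A sketch of this component with scikit-learn follows; the synthetic data, feature dimensionality, and encoding of the descriptors as one-hot vectors are placeholders for the descriptor outputs and expert labels described above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: one row per nodule with its concatenated (one-hot)
# descriptor outputs; labels are the expert's BI-RADS categories.
rng = np.random.default_rng(0)
X = rng.random((100, 20))            # 20 descriptor dimensions (illustrative)
y = rng.choice(["2", "3", "4A", "4B", "4C", "5"], size=100)

clf = LogisticRegression(multi_class="multinomial", max_iter=1000)
clf.fit(X, y)

probs = clf.predict_proba(X[:1])     # per-category probabilities for one nodule
weights = clf.coef_                  # per-category descriptor weights (cf. Fig. 3)
```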
Generating Natural Language Descriptions with a Rule-Based Module

Finally, two rules taken from the latest edition of the BI-RADS standard [34] are applied to fine-tune the output: (1) if the nodule is round, it has no orientation, neither parallel nor anti-parallel; (2) when the nodule is classified as a simple cyst, a complex cyst, or is spiculated, the BI-RADS classification is set to 2, 4A, and 5, respectively. The first rule applies to the orientation output of the multi-class classification network, suppressing it. The second rule only applies to the BI-RADS multinomial logistic regression model. Additional rules are used to generate a natural language description, as shown in Fig. 4, but they do not modify the model results.
Fig. 4 A rule-based module fine-tunes the output and generates natural language descriptions
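A minimal sketch of the two fine-tuning rules; the dictionary keys and the descriptor and category strings are illustrative, not the system's actual data model:

```python
def apply_rules(pred):
    """Fine-tune a prediction with the two BI-RADS rules described above.
    `pred` maps descriptor names to predicted values (illustrative schema)."""
    # Rule 1: a round nodule has no orientation (neither parallel nor anti-parallel).
    if pred.get("shape") == "round":
        pred["orientation"] = None
    # Rule 2: simple cyst -> BI-RADS 2; complex cyst -> 4A; spiculated -> 5.
    if pred.get("suggestivity") == "simple cyst":
        pred["birads"] = "2"
    elif pred.get("suggestivity") == "complex cyst":
        pred["birads"] = "4A"
    elif pred.get("margin") == "spiculated":
        pred["birads"] = "5"
    return pred
```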
Experiments

Detecting Nodules with YOLO

We first tested the ability of the YOLO preprocessing module to detect nodules. We randomly selected 600 of the 749 nodules and used their images for training, augmenting them with YOLOv8’s facilities, which consist of random scaling, color-space augmentations, and the mosaic data loader (merging more than one image into one); the remaining 149 nodules were used for testing. When training, the images were resized to 480\(\times\)480 pixels; YOLOv8 automatically applies zero-padding to maintain the image height-width ratio. We analyzed whether the performance of our multi-class classification network decreases when taking as input the ROIs trimmed by the YOLO module instead of those segmented manually.
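This setup can be sketched with the Ultralytics API as follows; the model size, dataset YAML, epoch count, and file names are assumptions for illustration:

```python
from ultralytics import YOLO

# Train a nodule detector; Ultralytics applies its default augmentations
# (random scaling, color-space changes, mosaic) during training.
model = YOLO("yolov8n.pt")                               # model size assumed
model.train(data="nodules.yaml", imgsz=480, epochs=100)  # paths/epochs illustrative

# Detect ROIs on a held-out image; boxes are returned in pixel coordinates.
results = model("test_image.png")
boxes = results[0].boxes.xyxy
```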
Describing and Classifying the Nodules

We compared the BI-RADS descriptors and the BI-RADS classification from our model with those of the expert, and the Boolean malignancy classification from our model with the ground truth recorded in the B and BCD datasets. We also compared our expert’s Boolean malignancy classification against this ground truth (considering tumors with BI-RADS 4A or lower as benign and the others as malignant). We analyzed the impact of each component of the system on the performance; in particular, we studied different versions of the system:
1. A model consisting only of a plain classifier pre-trained on ImageNet (namely, VGG16, ResNet [10], DenseNet [35], or MobileNet [36]),
2. The residual attention network [37] used as the base for the model in [21] and pre-trained on ImageNet,
3. A model that only has the attention layer and the Boolean malignancy classification layer (the one that outputs the label “benign” in Fig. 2); i.e., it dispenses with all the other layers in the classification network so that the output of the attention mechanism is the only input of the Boolean malignancy classification layer, and
4. Our model receiving the images directly, i.e., without preprocessing them with the YOLO detector.
We also tried out a DenseNet model pre-trained on the RadImageNet dataset [38], which we named RAD-DenseNet. More information about the architectures and hyperparameters of all models can be found in Appendix B.
Our experiments also tested VGG19 as an alternative to VGG16. The results were usually slightly worse, but the difference was very small.
We performed two series of experiments. In the first one, we did two repetitions of 10-fold cross-validation with the 600 manually selected ROIs used to train YOLO. One of the folds was never used for validation, since it contained images with extra information, such as color maps, delimited ROIs, and tumor segmentations. We augmented the images with the following transformations (a code sketch follows the list):
Random enlargement or reduction of the ROI size in the original image: \([-0.1:0.25]\) times the size of the ROI vertically and \([-0.1:0.15]\) horizontally.
Random zoom of the ROI: \([-0.3:0.3]\) times the size of the ROI.
Random contrast and brightness alterations: [0.8 : 1.2] for contrast and \([-25:25]\) for brightness.
Random horizontal flips.
Random rotations, limited to a maximum range of \(0.05*2\pi\) radians to preserve tumor orientation.
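A partial sketch of these augmentations with the Albumentations library; the random ROI enlargement/reduction needs the full original image and is omitted, and the limits are approximate conversions of the ranges above:

```python
import albumentations as A
import numpy as np

augment = A.Compose([
    A.Affine(scale=(0.7, 1.3)),            # zoom: [-0.3:0.3] times the ROI size
    A.RandomBrightnessContrast(
        brightness_limit=25 / 255,         # brightness: [-25:25] gray levels
        contrast_limit=0.2),               # contrast: [0.8:1.2]
    A.HorizontalFlip(p=0.5),               # random horizontal flips
    A.Rotate(limit=18),                    # 0.05*2*pi rad equals 18 degrees
])

roi = np.zeros((450, 450), dtype=np.uint8)  # placeholder ROI
augmented = augment(image=roi)["image"]
```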
In the second series of experiments, called “testing,” we used the 149 ROIs detected by YOLO and repeated the experiment 5 times.
Recall that for the BI-RADS category classification, we used the same multinomial logistic regression, trained with the descriptors and categories given by our expert; therefore, its weights were the same for all the models. The validation and test results for the BI-RADS category thus indicate the quality of the combination of descriptors produced by each model and serve as an additional assessment of its performance.