This study demonstrated that WSUnet learns to localise and segment lung tumours by comparing positive and negative images. Thus, the WSUnet architecture serves both as a weakly supervised segmentation method and as an explainable image-level classifier.
WSUnet yielded superior voxel-level discrimination to current model interpretation approaches, by both objective and subjective metrics. WSUnet's voxel-level output identified the voxels motivating the positive image-level prediction, revealing whether the model attended to the tumour or to other confounding features. WSUnet offered the distinct advantage of returning predictions in the domain and range of the voxel-level class probabilities, obviating the need for post hoc interpolations and transformations. Thus, WSUnet's voxel-level output could be interpreted directly as a voxel-probability heatmap.
Although WSUnet's voxel-level recall did not challenge the state of the art set by NSCLC segmentation models trained under full supervision [33], its high precision presents a plausible avenue for object localisation. The low recall of WSUnet's voxel-level predictions provides insight into its reasoning: the model may deduce that an image is positive by finding any tumour region, permitting image-level classification from a small discriminative region of interest. A positive image-level prediction may therefore be inferred without observing the whole tumour region, whereas the whole image must be considered to exclude a tumour. The model is consequently negatively biased at the voxel level, predisposing it to low recall. This is an important limitation of applying model interpretation methods to weakly supervised segmentation: the model may learn to classify the image from a small discriminative region, leading to under-segmentation. Concurrently, clinicians observed that the voxel-level tumour annotations provided in the Stanford/VA dataset included significant proportions of peritumoural lung parenchyma, which WSUnet did not segment, partially explaining the apparent under-segmentation.
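The precision/recall asymmetry described above can be made concrete with a small sketch. The function below computes voxel-level precision, recall and Dice for binary masks; the masks, shapes and function name are illustrative assumptions, not taken from the study. The example mimics under-segmentation: the predicted mask covers only a small discriminative sub-region inside the ground-truth tumour.

```python
import numpy as np

def voxel_metrics(pred: np.ndarray, truth: np.ndarray) -> dict:
    """Voxel-level precision, recall and Dice score for binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()   # true positive voxels
    fp = np.logical_and(pred, ~truth).sum()  # false positive voxels
    fn = np.logical_and(~pred, truth).sum()  # false negative voxels
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    dice = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {"precision": float(precision),
            "recall": float(recall),
            "dice": float(dice)}

# Hypothetical masks: the ground-truth tumour is a 4x4 region, but the
# model segments only a 2x2 discriminative sub-region inside it.
truth = np.zeros((8, 8), dtype=bool)
truth[2:6, 2:6] = True
pred = np.zeros((8, 8), dtype=bool)
pred[3:5, 3:5] = True

m = voxel_metrics(pred, truth)
print(m)  # precision 1.0, recall 0.25: precise but under-segmenting
```

Every predicted voxel lies inside the tumour (precision 1.0), yet most tumour voxels are missed (recall 0.25), the same pattern reported for WSUnet's voxel-level output.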
WSUnet's voxel-level performance was noted to vary between successive training epochs, despite stable image-level loss. Furthermore, voxel-level performance appeared sensitive to initialisation and early training conditions, as models fitted to different folds demonstrated different voxel-level metrics despite similar image-level performance. These findings demonstrate the limitations of image-level supervision for model selection.
As the saliency map aims to approximate model reasoning, false positive regions typically represent model misspecification, in which the model classified the image on the basis of non-tumoural objects. Alternatively, they may represent valid pathobiological associations such as atelectasis. In either case, inspection of the voxel-level predictions improves understanding of the model's reasoning. However, where the project objective is tumour segmentation, these extra-tumoural pathobiological associations may adversely affect performance by providing an alternative discriminative region.
Although GradCAM predictions localised moderately well to the tumour, their utility was limited by low resolution. Integrated gradient outputs were not locally consistent, such that adjacent voxels typically had dissimilar predictions. Occlusion sensitivity results demonstrated little variance between images. All of these methods were limited by producing an output which could not be interpreted directly as a voxel-probability map.

WSUnet is a CNN which returns both an image-level decision and the voxel-level segmentation which motivated that decision. This development facilitates model inspection, debugging, reliability testing, inference and pathobiological discovery. The approach differs from traditional model explainability methods in that the image-level prediction is simply the maximal voxel-level probability. Consequently, voxel-level predictions are interpretable as class probabilities, providing a causally verifiable explanation for the image-level decision. This simple relationship between voxel-level and image-level predictions allows for straightforward clinical interpretation.
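The relationship between the voxel-level and image-level predictions can be sketched in a few lines: the image-level probability is a global max pooling over the voxel-probability map. The function name and the example values below are illustrative assumptions for exposition, not the study's implementation.

```python
import numpy as np

def image_prediction(voxel_probs: np.ndarray) -> float:
    """Image-level tumour probability as the maximum voxel-level
    probability (global max pooling over the segmentation output)."""
    return float(voxel_probs.max())

# Hypothetical 4x4 voxel-probability map, e.g. the sigmoid output of a
# U-Net-style decoder, with a small cluster of confident tumour voxels.
voxel_probs = np.array([
    [0.02, 0.01, 0.03, 0.02],
    [0.01, 0.91, 0.88, 0.02],
    [0.02, 0.85, 0.90, 0.01],
    [0.01, 0.02, 0.01, 0.02],
])

p_image = image_prediction(voxel_probs)
print(p_image)  # 0.91: one confident voxel suffices for a positive image
```

Because the image-level output is exactly one voxel's probability, the explanation is causally verifiable: setting that voxel's probability to zero necessarily changes the image-level prediction. The same mechanism implies that image-level supervision rewards finding any single confident tumour voxel, consistent with the low-recall bias discussed above.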
Recent years have seen significant advances towards weakly supervised segmentation for lung CT data. Fruh et al. evaluated class-activation mapping for weakly supervised segmentation of tumours in PET-CT data, attaining a Dice score of 0.47 [34]. PET integration may have facilitated the segmentation task, as simple threshold-based segmentation achieved a Dice score of 0.29 [34]. Feng et al. applied a global average pooling method to the higher layers of an encoder network to perform weakly supervised segmentation on a lung cancer dataset, achieving Dice scores of 0.46 to 0.54 [35]. The resolution of their voxel-level predictions was limited by that of the convolutional layer used for global average pooling, as interpolation was required to upsample the predictions to the input resolution. Shen et al. proposed a two-stage semi-supervised approach for lung nodule segmentation, utilising adversarial learning to minimise the discriminability of unsupervised segmentation masks from supervised masks [36]. Laradji et al. proposed a consistency-based loss for weakly supervised segmentation of COVID-19-related pneumonitis, where point-level supervision was available [37].
This retrospective study included model evaluation on multi-centre data which was geographically distinct from the training data. Training and evaluation datasets included CT images from multiple scanner manufacturers. The study has some limitations. All participants in this study were diagnosed with lung cancer; consequently, some malignant changes may have been evident in images which did not contain any tumour voxels. In the test data, peritumoural regions were included in tumour segmentation labels, leading to an underestimation of the models' sensitivity to tumour tissue. Ground truth voxel-level segmentations were employed to identify positive images during construction of the weakly supervised dataset. The class distribution in this study was approximately balanced at the image level and moderately imbalanced at the voxel level; the convergence of weak learners may be less reliable in highly imbalanced data. In this study, data were labelled at the level of 128 × 128 axial image patches, whilst clinical applications ideally require tumour localisation in 3D volumes of 512 × 512 image slices. Consequently, further research on the scalability of the method to large, imbalanced datasets is required for clinical utility in typical applications.
In conclusion, this study demonstrated that weakly supervised segmentation is a valid approach by which explainable object detection models may be developed for medical imaging. WSUnet generates a full-resolution voxel-level explanation for its image-level decision, which clinicians found more useful than current model interpretation approaches when applied to lung tumour detection. Further research will investigate approaches to improve WSUnet's voxel-level recall and achieve stable convergence in highly imbalanced data [21,22,23,37].