Pneumonia is a prevalent respiratory disease, with cases surging since the COVID-19 pandemic. Infected lung regions typically exhibit congestion, edema, and inflammatory responses, which manifest in medical images as varying degrees of patchy shadows, ground-glass opacities, and consolidations. The lesions may also be diffusely distributed, leading to irregular boundaries between infected and healthy regions, which further increases diagnostic complexity and the workload of manual diagnosis. Computer-aided techniques can alleviate this burden by automatically identifying and segmenting infected lung regions, providing more accurate and efficient diagnostic support. With the rapid advancement of deep learning, medical image segmentation research based on U-Net (Ronneberger et al., 2015) or Vision Transformer (ViT) architectures (Dosovitskiy, 2020, Chen et al., 2022, Rahman et al., 2024, Cai et al., 2024, Zhang et al., 2023, Wang et al., 2023, Liu et al., 2024, Zhang et al., 2024) has achieved remarkable progress. However, the varying density and morphology of infected regions in medical images make it difficult for models to identify them accurately without expert guidance, so segmentation methods based solely on single-modality image data remain unsatisfactory. Since medical reports are typically generated alongside lung scan images and provide rich lesion-related information (Marini et al., 2024), leveraging textual descriptions from medical reports to guide lung infection region segmentation has emerged as a competitive solution (Zhang et al., 2022, Huang et al., 2021, Li et al., 2024, Zhong et al., 2023, Marini et al., 2024). Incorporating textual information enables the model to better understand and localize infection regions, thereby reducing the reliance on extensive annotations.
As shown in part (a) of Fig. 1, existing text-guided image segmentation models primarily involve two core steps: feature extraction from multimodal data (green and yellow boxes) and feature alignment and fusion (blue box). In the feature extraction phase, existing methods (Li et al., 2024, Zhong et al., 2023) typically encode the textual information from medical reports and the corresponding image data separately. In general, pre-trained text encoders extract semantic features from the text, while image encoders capture deep representations of the images. The primary task in the feature fusion phase is to identify the semantic correspondences between text features and image features. For instance, Li et al. (2024) adopt a straightforward feature-addition scheme, feeding the combined features into a ViT model for deep feature fusion. Furthermore, Zhong et al. (2023) integrate text and image features using a cross-attention mechanism, which allows semantic information from the text features to be transferred to the image features, resulting in fine-grained information fusion. Although these methods significantly outperform traditional single-modality approaches, they still have certain limitations.
First, existing text feature extraction methods (Tomar et al., 2022, Zhang et al., 2022, Radford et al., 2021, Huang et al., 2021, Li et al., 2024, Zhong et al., 2023) fail to fully exploit the high-value information contained in medical reports. As illustrated by i in Fig. 1(a), certain location clues in textual descriptions, such as whether the infection lies in the lower or upper left lung, can assist medical segmentation. Existing models, however, fail to extract these important location details. To address this, the innovative concept of a “text view” is proposed, which converts textual descriptions of infection locations into probabilistic location maps, introducing the spatial localization of infected regions into the model. The text view brings three significant advantages: (1) The text-view method, based on the probabilistic location of infected regions, provides an explicit way to mine textual information. Compared with implicit text feature extraction methods (Radford et al., 2021, Huang et al., 2021, Li et al., 2024, Zhong et al., 2023), it provides more precise information about the location of infected regions. This ensures that the model can focus effectively on subtle but critical information within the text while substantially reducing uncertainty in utilizing textual information. (2) Once textual information is converted into a text view, text and image data can be mapped into the same space, so an image encoder can extract features directly from the text view. This unified mapping eliminates the modality gap introduced by the pre-trained text encoders in existing methods, reducing the difficulty of fusing different types of information. (3) This approach makes maximal use of textual information for locating lesion regions: the text view serves as an additional supervisory signal that provides extra directional constraints on the model’s learning process, reducing the model’s reliance on annotated data and improving its performance in semi-supervised settings.
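To make the text-view idea more concrete, the following is a minimal sketch of one way such a probabilistic location map could be constructed, assuming a small vocabulary of location phrases parsed from the report and a Gaussian prior placed over the corresponding lung sub-region. The region boxes, the fixed spread, and the names (`REGIONS`, `text_view`) are illustrative assumptions, not the construction actually used in TVE-Net.

```python
# Hypothetical sketch: a location phrase from the report is mapped to a
# probabilistic location map over the image grid.
import torch

# Coarse anatomical regions as (x0, y0, x1, y1) in normalized image
# coordinates; the boxes below are rough placeholders.
REGIONS = {
    "left lung":  (0.55, 0.10, 0.95, 0.90),
    "right lung": (0.05, 0.10, 0.45, 0.90),
    "upper":      (0.00, 0.10, 1.00, 0.50),
    "lower":      (0.00, 0.50, 1.00, 0.90),
}

def text_view(phrases, size=224, sigma=0.15):
    """Convert location phrases (e.g. ["left lung", "lower"]) into a
    size x size probability map peaking at the described location."""
    # Intersect the boxes of all mentioned phrases.
    x0 = max(REGIONS[p][0] for p in phrases)
    y0 = max(REGIONS[p][1] for p in phrases)
    x1 = min(REGIONS[p][2] for p in phrases)
    y1 = min(REGIONS[p][3] for p in phrases)
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0  # centre of the intersection
    ys, xs = torch.meshgrid(torch.linspace(0, 1, size),
                            torch.linspace(0, 1, size), indexing="ij")
    # Isotropic Gaussian prior; sigma encodes how much spatial ambiguity the
    # textual description is assumed to carry (modelled with learnable
    # parameters in TVE-Net, fixed here for simplicity).
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

# "infection in the lower left lung" -> probability map over that region
prob_map = text_view(["left lung", "lower"])
```

Because the resulting map lives on the image grid, it can be processed by an ordinary image encoder, which is what eliminates the modality gap discussed in advantage (2).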
Second, in existing models (Zhang et al., 2022, Radford et al., 2021, Huang et al., 2021, Li et al., 2024, Zhong et al., 2023), text and image features are derived from different data sources and pre-trained models, resulting in semantic discrepancies between features from different modalities. Such discrepancies hinder the effective transfer of information across modalities, as illustrated by ii in Fig. 1(a). Some studies (Li et al., 2024, Zhong et al., 2023, Zhang et al., 2022, Radford et al., 2021, Huang et al., 2021) attempt to address this issue by aligning features from different modalities. For instance, LanGuide (Li et al., 2024) and LViT (Zhong et al., 2023) adopt a re-projection approach to align text and image dimensions. However, this simple projection often discards important information. In other works (Zhang et al., 2022, Radford et al., 2021, Huang et al., 2021), contrastive learning is used to align features from different modalities. Owing to the alignment objectives inherent in contrastive learning, these models struggle to perceive detailed information in lengthy texts and to recognize subtle differences between similar images. Yet these abilities to perceive textual and visual details are critical for fine-grained segmentation tasks and directly impact segmentation accuracy. Considering that the goal of alignment is to transfer information across modalities effectively, we transform the “fusion” problem of text and image features into a “knowledge transfer” problem from text-view features to image features. This approach differs from the above fusion methods in that it integrates textual information into the image features without loss of information, thus enriching the image features with textual data. As a result, it enables the model to leverage information-rich image features for high-precision lesion segmentation.
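One plausible reading of this knowledge-transfer formulation, sketched below in PyTorch, is a residual cross-attention step in which image tokens query the text-view tokens, so that textual cues enrich the image features rather than replace them. The module name, the single-stage form, and the dimensions are assumptions for illustration; the paper's actual module is multi-stage.

```python
import torch
import torch.nn as nn

class TextViewTransfer(nn.Module):
    """Illustrative sketch: transfer text-view features into image features
    via cross-attention. The residual connection keeps the original image
    features intact, so textual cues enrich rather than overwrite them."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feat, tv_feat):
        # img_feat: (B, N_img, C) image tokens; tv_feat: (B, N_tv, C) text-view tokens
        transferred, _ = self.attn(query=img_feat, key=tv_feat, value=tv_feat)
        return self.norm(img_feat + transferred)  # residual: image info preserved

# toy usage with hypothetical token counts and channel width
img_feat = torch.randn(2, 196, 256)
tv_feat = torch.randn(2, 196, 256)
fused = TextViewTransfer(256)(img_feat, tv_feat)
```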
This paper proposes an intuitive and effective Text-View Enhanced Knowledge Transfer Network (TVE-Net) for lung infection region segmentation. To quantify the ambiguity in textual descriptions, TVE-Net introduces a probabilistic location function that converts the textual information from medical reports into probabilistic representations of infection regions. Building upon this, a self-supervised constraint based on text-view overlap and feature consistency is proposed to improve feature quality, and a multi-stage knowledge transfer module injects the text-view features into the image features, enabling the model to leverage information-rich image features for high-precision lesion segmentation. This integrated design not only enhances the ability of TVE-Net to recognize and segment lung infection regions but also improves its generalization capability and its performance in semi-supervised settings through the self-supervised feature enhancement mechanism. Experimental results show that TVE-Net achieves excellent segmentation performance in both fully supervised and semi-supervised settings, notably attaining state-of-the-art results on the QaTa and MosMedData+ datasets with only a quarter of the training labels.
In summary, our main contributions are four-fold:
• A new TVE-Net is proposed that utilizes lesion location information from medical reports to construct a text view, enhancing the model’s ability to accurately localize lesions. This approach introduces an innovative method for text-guided image segmentation.
• A flexible probabilistic function is designed to establish a mapping between textual location descriptions and probabilistic maps of infection-region locations in the image. By generating adaptive probability maps through learnable parameters, this function enables the model to develop robust textual understanding capabilities.
• A self-supervised constraint based on text-view overlap and feature consistency is proposed to enhance feature extraction (see the sketch after this list). It leverages the supervisory role of textual information and the structural similarity between medical images to improve both feature extraction robustness and the model’s semi-supervised performance.
• A multi-stage knowledge transfer module is developed to transfer knowledge from text-view features to image features. This module integrates location and structural information of infection regions into the segmentation features while effectively suppressing irrelevant information.
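As a rough illustration only, the sketch below shows how a constraint based on text-view overlap and feature consistency might be formulated: a soft-Dice-style overlap term between the predicted mask and the text-view probability map (usable even without ground-truth labels), plus a cosine-consistency term between features of two augmented views of the same scan. The specific loss forms, shapes, and weights are assumptions, not the exact objective of TVE-Net.

```python
import torch
import torch.nn.functional as F

def text_view_overlap_loss(pred_mask, text_view_map, eps=1e-6):
    """Soft-Dice-style overlap between the predicted mask and the text-view
    probability map, both of shape (B, H, W); requires no ground truth."""
    inter = (pred_mask * text_view_map).sum(dim=(-2, -1))
    union = pred_mask.sum(dim=(-2, -1)) + text_view_map.sum(dim=(-2, -1))
    return 1 - (2 * inter + eps) / (union + eps)

def feature_consistency_loss(feat_a, feat_b):
    """Encourages features of two augmented views of the same scan to agree."""
    return 1 - F.cosine_similarity(feat_a.flatten(1), feat_b.flatten(1), dim=1)

def self_supervised_loss(pred_mask, text_view_map, feat_a, feat_b, w1=1.0, w2=0.5):
    # The weights w1 and w2 are placeholders for illustration.
    return (w1 * text_view_overlap_loss(pred_mask, text_view_map)
            + w2 * feature_consistency_loss(feat_a, feat_b)).mean()

# toy usage with hypothetical shapes
pred = torch.rand(4, 224, 224)     # predicted foreground probabilities
tv = torch.rand(4, 224, 224)       # text-view probability maps
fa, fb = torch.randn(4, 256), torch.randn(4, 256)
loss = self_supervised_loss(pred, tv, fa, fb)
```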