AdaptFRCNet: Semi-supervised adaptation of pre-trained model with frequency and region consistency for medical image segmentation

Medical image segmentation involves delineating anatomical structures, organs, or lesions from medical images to facilitate clinical analysis and diagnosis, supporting tasks such as skin disease diagnosis (Wu et al., 2021a) and fundus screening (Fu et al., 2018). With the emergence of convolutional neural networks (CNNs), medical image segmentation has advanced significantly (Ronneberger et al., 2015, Zhou et al., 2018, Li et al., 2018). However, existing approaches in the medical domain depend heavily on abundant, high-quality labeled data, whose annotation is labor-intensive and time-consuming. Collecting sufficient labeled data is therefore challenging in real-world applications, particularly for tasks involving pixel-wise segmentation.

Recently, Transformers pre-trained on large-scale datasets have demonstrated impressive capabilities in computer vision (Shen et al., 2023, Zou et al., 2023, Xie et al., 2021). Pre-training followed by fine-tuning has thus proved an effective approach for adapting these large pre-trained models (LPM) to downstream tasks (Hu et al., 2021, Jia et al., 2022). However, fine-tuning all parameters of an LPM incurs significant computational and storage costs across various downstream tasks. A straightforward alternative is to fine-tune only the task-specific layers, but this often yields unsatisfactory results because it fails to fully leverage the pre-trained features of the LPM. Hence, parameter-efficient fine-tuning (PEFT) strategies have been developed in computer vision (Jia et al., 2022, Lian et al., 2022, Liu et al., 2023b, Chen et al., 2022). With only a few trainable parameters, these strategies can achieve performance comparable to fine-tuning all parameters of the LPM, particularly when labeled data is limited.

Moreover, high-quality labeled data is often scarce in the medical domain, particularly for pixel-level segmentation tasks, and fine-tuning with such limited data can easily lead to overfitting (Kirkpatrick et al., 2016) and a drop in performance. In clinical practice, however, unlabeled data is abundant, and the reliance on labeled data can be mitigated if this unlabeled data is effectively utilized. Developing semi-supervised learning (SSL) methods based on LPM therefore holds promise for reducing the annotation burden in the medical domain. Numerous efforts have been made in SSL (Tarvainen and Valpola, 2017, Sohn et al., 2020): they introduce data-level consistency (Li et al., 2020b, Wu et al., 2021b) and model-level consistency (Tarvainen and Valpola, 2017, Ouali et al., 2020, Luo et al., 2021b), and generate pseudo labels from the model's predictions (Chen et al., 2021).
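The teacher–student consistency referenced above typically follows the Mean Teacher scheme (Tarvainen and Valpola, 2017), in which the teacher's weights are an exponential moving average (EMA) of the student's. A minimal sketch of this standard update (the function name and toy parameters are ours, not from the paper):

```python
import numpy as np

def ema_update(teacher_params, student_params, alpha=0.99):
    """Exponential moving average update of the teacher (Mean Teacher).

    Each teacher parameter moves a fraction (1 - alpha) toward the
    corresponding student parameter after every training step.
    """
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_params, student_params)]

# Toy example with a single 3-element "parameter" per model.
teacher = [np.zeros(3)]
student = [np.ones(3)]
teacher = ema_update(teacher, student, alpha=0.9)  # moves 10% toward student
```

Because the teacher averages over many student states, its predictions are smoother and serve as more stable consistency targets for the unlabeled data.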

A natural question arises regarding how to effectively and efficiently utilize both labeled and unlabeled data for SSL based on LPM, a topic not fully explored in previous works and one that faces the following challenges. Firstly, among PEFT methods for LPM, adapter-based approaches such as AdaptFormer (Chen et al., 2022) fail to consider feature calibration and therefore cannot effectively distinguish the importance of different feature channels. Furthermore, prompt-tuning techniques such as VPT (Jia et al., 2022) combine prompts with image features, which distracts the features and prevents them from being effectively aggregated into the prompts. Secondly, current SSL methods concentrate primarily on the RGB domain and neglect the frequency domain. Many lesions in medical images lie in regions of uniform image intensity, characterized by low-frequency signals; segmenting such lesions from RGB information alone is challenging (Zhong et al., 2022) without explicitly accounting for low-frequency information. Hence, our goal is to exploit the rich information present in both the frequency and RGB domains. Finally, most previous methods emphasize pixel-level consistency and overlook semantic consistency at the region level. Incorporating multiple region-level consistencies can effectively handle lesions of varying sizes and enhance the modeling of multi-scale features.

In this paper, we propose AdaptFRCNet, a semi-supervised Adaptation framework for pre-trained models with Frequency and Region Consistency, specifically designed for medical image segmentation. AdaptFRCNet comprises three key components: frequency domain consistency (FDC), multi-granularity region similarity consistency (MRSC), and an attention-based adapter, Att_Adapter, which can be seamlessly integrated into the frozen LPM for parameter-efficient fine-tuning. For FDC, we utilize the Discrete Cosine Transform (DCT) (Ahmed et al., 1974) to convert the pre-trained RGB domain features into the frequency domain. To discern cues related to lesion information in the frequency space, we devise a frequency enhancement module (FEM) with a Transformer (Vaswani et al., 2017) to encode features in the frequency domain. This enhances the capability of the LPM to process low- and high-frequency signals. As for MRSC, our aim is to provide multi-granularity regional context, rather than focusing solely on pixel-wise consistency. For SSL, we employ the Mean Teacher (MT) (Tarvainen and Valpola, 2017) framework with the LPM as the backbone, and integrate the three proposed key components to form AdaptFRCNet for semi-supervised medical image segmentation. Our work offers the following three main contributions.
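To make the DCT step concrete, the following sketch converts a feature map between the spatial and frequency domains with a separable 2-D DCT, and zeroes out high-frequency coefficients to isolate the low-frequency content that characterizes uniform-intensity lesion regions. This illustrates only the transform FDC builds on; the function names and the toy feature map are our own, not the paper's implementation.

```python
import numpy as np
from scipy.fft import dct, idct

def to_frequency(feat):
    """Separable 2-D DCT-II over the spatial axes of a (C, H, W) feature map."""
    f = dct(feat, type=2, axis=-1, norm="ortho")
    return dct(f, type=2, axis=-2, norm="ortho")

def to_spatial(freq):
    """Inverse 2-D DCT, mapping frequency coefficients back to the spatial domain."""
    f = idct(freq, type=2, axis=-2, norm="ortho")
    return idct(f, type=2, axis=-1, norm="ortho")

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8))   # toy (C, H, W) feature map
freq = to_frequency(feat)

# Keep only the low-frequency quadrant of coefficients.
low = freq.copy()
low[:, 4:, :] = 0.0
low[:, :, 4:] = 0.0
smoothed = to_spatial(low)              # spatially smoothed, low-frequency content
```

With the orthonormal DCT, the transform is lossless (the round trip recovers the input exactly), so any information the frequency branch emphasizes comes from how the coefficients are re-weighted, not from the transform itself.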

For PEFT, we propose a lightweight Att_Adapter with only a few trainable parameters to adapt the LPM to the medical field; it significantly reduces computational cost while leveraging the powerful feature representation capabilities of the LPM.
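The paper does not spell out the Att_Adapter internals in this section, so the sketch below is only a generic illustration of the idea named above: a bottleneck adapter whose channels are calibrated by an attention-style gate, inserted residually beside frozen LPM layers. All names and the zero-initialization choice are our assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BottleneckAdapterSketch:
    """Illustrative adapter: down-project, gate channels, up-project, add residually.

    The up-projection starts at zero, so the adapter is an identity at
    initialization and cannot disturb the frozen pre-trained features.
    """

    def __init__(self, dim, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = rng.standard_normal((dim, bottleneck)) * 0.02
        self.w_gate = rng.standard_normal((bottleneck, bottleneck)) * 0.02
        self.w_up = np.zeros((bottleneck, dim))  # zero-init: residual identity

    def __call__(self, x):                       # x: (tokens, dim)
        h = np.maximum(x @ self.w_down, 0.0)     # down-project + ReLU
        gate = sigmoid(h.mean(axis=0) @ self.w_gate)  # per-channel attention weights
        h = h * gate                             # calibrate channel importance
        return x + h @ self.w_up                 # residual connection

x = np.ones((5, 16))                             # 5 tokens, 16-dim features
adapter = BottleneckAdapterSketch(dim=16, bottleneck=4)
out = adapter(x)                                 # identical to x at initialization
```

The channel gate is what distinguishes this from a plain adapter such as AdaptFormer: it lets the module re-weight feature channels by importance, which is the calibration gap identified earlier.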

We introduce the FDC and MRSC consistency regularization strategies for SSL, enabling learning from unlabeled data to compensate for insufficient annotation. MRSC is advantageous for medical image segmentation with diverse shapes and scales, while FDC effectively exploits frequency signals to enhance segmentation performance.
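The region-level idea behind MRSC can be sketched as follows: pool student and teacher predictions into regions at several granularities and penalize their disagreement, so that consistency is enforced on small and large structures alike. This is our own minimal illustration under assumed pooling and loss choices, not the paper's exact MRSC formulation.

```python
import numpy as np

def region_pool(prob, size):
    """Average-pool a (C, H, W) probability map into a size x size grid of regions."""
    c, h, w = prob.shape
    return prob.reshape(c, size, h // size, size, w // size).mean(axis=(2, 4))

def multi_granularity_consistency(student, teacher, sizes=(1, 2, 4)):
    """Mean-squared disagreement between region-pooled predictions,
    averaged over several granularities (coarse 1x1 up to finer 4x4)."""
    return sum(
        np.mean((region_pool(student, s) - region_pool(teacher, s)) ** 2)
        for s in sizes
    ) / len(sizes)

rng = np.random.default_rng(0)
teacher_prob = rng.uniform(size=(2, 8, 8))        # toy (classes, H, W) predictions
student_prob = teacher_prob + 0.1                  # perturbed student output
loss = multi_granularity_consistency(student_prob, teacher_prob)
```

Coarse granularities capture the overall extent of large lesions, while finer grids remain sensitive to small ones; averaging the terms enforces agreement across scales.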

Our method is plug-and-play and can be seamlessly integrated with existing SSL methods. During inference, FDC and MRSC can be safely removed, thus further reducing inference complexity. Finally, our method outperforms the current state-of-the-art (SOTA) semi-supervised methods on three publicly available medical datasets.

This paper is an extended version of our MICCAI 2024 early accepted work (He et al., 2024). In particular, we extend the conference version in three aspects: (1) We propose a novel lightweight Att_Adapter with only a few trainable parameters to efficiently transfer the LPM to downstream medical tasks, and compare it with other parameter-efficient fine-tuning methods to verify its effectiveness. Compared with other methods, Att_Adapter combines the advantages of prompt tuning and adapters, and adopts an attention mechanism to calibrate the features. (2) We conduct more detailed analysis and more quantitative and qualitative ablation studies to comprehensively evaluate the proposed method. (3) Our original framework was called FRCNet in the conference version (He et al., 2024); in this extended version, we name the new network with Att_Adapter AdaptFRCNet. To evaluate the improved version, we add a challenging dataset for fundus lesion segmentation and 3D CT data for multi-organ segmentation, where the lesions are small and the classes and scales vary greatly, providing a strong test of AdaptFRCNet's segmentation ability. We also present a failure case analysis and discuss possible solutions, ensuring continuous improvement and optimal performance in the future.
