Optical implementation and robustness validation for multi-scale masked autoencoder

(i) Multi-scale patch sizes: The patch size, which sets the scale of the spatial correlations the model can exploit, may affect reconstruction quality. The model is therefore evaluated with patch sizes of 8 × 8, 12 × 12, 14 × 14, 16 × 16, 24 × 24, and 32 × 32 pixels, following the size range of previous studies [5,23]. The corrupted and blocked input images were resized to an integral multiple of the corresponding patch size, so the total input size remains approximately constant (224–256 pixels per side). The model is trained and tested on an NVIDIA RTX 3080 with 10 GB of GPU memory. As illustrated in Table I, the computational cost increases at an O(n²) rate as the patch size declines: the number of patches n grows quadratically as the patch size shrinks, and self-attention scales quadratically in n, so halving the patch size multiplies the attention cost by roughly 16. The cost became prohibitive (out of memory) at a patch size of 4 × 4. In addition, the model's computational complexity is significantly larger than that of other models in related fields [24–26], which usually ranges from 10 to 100 GFLOPS.
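The quadratic blow-up in (i) can be made concrete with a short estimate. The snippet below is a minimal sketch, not the authors' profiling code; the embedding dimension (768, as in ViT-Base) and the attention-only FLOP count are illustrative assumptions.

```python
# Rough sketch of why the cost in Table I grows so quickly: for side
# length L and patch size p, a ViT-style encoder sees n = (L/p)^2 tokens,
# and self-attention is O(n^2) in the token count.
import math

def attention_gflops(target_side: int, patch: int, dim: int = 768) -> tuple[int, float]:
    """Token count and rough per-layer attention FLOPs (QK^T plus attn @ V)."""
    side = math.ceil(target_side / patch) * patch   # integral multiple of the patch size
    n = (side // patch) ** 2                        # number of patches (tokens)
    return n, 2 * n * n * dim / 1e9                 # GFLOPs, order of magnitude only

for p in (32, 24, 16, 14, 12, 8, 4):
    n, g = attention_gflops(224, p)
    print(f"patch {p:2d}: {n:5d} tokens, ~{g:7.2f} GFLOPs per attention layer")
```

At 4 × 4 the token count reaches 3136, which is consistent with the out-of-memory failure reported above.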

(ii) Denoising ability: To enhance robustness and generality under real conditions and to examine the denoising ability, the input images are preprocessed by adding Gaussian noise. This degradation is widely used in conventional semi-supervised and self-supervised denoising. Under this setting, the modified MAE aims to reconstruct the whole image instead of merely the missing patches, as sketched below.
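A minimal sketch of this degradation step, assuming 8-bit images and an illustrative noise level of sigma = 25 (the text does not specify the value):

```python
import numpy as np

def degrade(image: np.ndarray, sigma: float = 25.0, seed=None) -> np.ndarray:
    """Add i.i.d. Gaussian noise to an 8-bit image and clip back to [0, 255]."""
    rng = np.random.default_rng(seed)
    noisy = image.astype(np.float32) + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

# Training pairs: input = degrade(masked image), target = the clean image,
# so the loss covers every pixel rather than only the masked patches.
```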

(iii) Evaluations: For the simulated datasets, the reconstructed images were evaluated with the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) [27,28], two well-established criteria for image denoising and reconstruction. To measure performance, DAVIS 2017 [29], a relatively small dataset for unsupervised and semi-supervised video object segmentation, is chosen for testing, and the average PSNR and SSIM are reported. Its 480p test-challenge split contains 2180 sequential images of 854 × 480 pixels, covering a variety of objects, details, and actions. Owing to the fixed input size of the network, a sliding window is adopted, with details shown in Fig. 3: at each step, a slice of the image is fed to the model, and the outputs are stitched together to form the reconstructed image. Adjacent slices overlap by 23 pixels to smooth the sharp boundaries (see the sketches following item (iv)).

(iv) Adjustment for real conditions: Unlike the MAE, which passes the mask to the decoder directly, the optical MAE cannot align each mask patch when confronted with real imaging: it is difficult to register the captured pictures so that the input mask matches the predefined mask. To tackle this problem, an additional detection module and several imaging adjustments are introduced to detect the mask accurately. First, mask patches larger than the model input are imposed by the DMD; the optical system captures a larger area, and the photograph is then downscaled to the required size, which largely eliminates the distortion introduced by skewed images. Second, because complex illumination and interference fringes disturb the mask-detection module when figures and objects are presented, blank plates carrying the masked patches are captured in advance; each masked patch of every masked blank plate is then easy to localize and map to the corresponding measurement with conventional adaptive local filters and thresholding methods. The process is illustrated in Fig. 4.
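For the metrics in (iii), scikit-image is one common implementation choice; the text does not name its toolchain, so the following is illustrative only:

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(reference, reconstructed):
    """PSNR/SSIM for one 8-bit RGB frame; averaged over all test frames."""
    psnr = peak_signal_noise_ratio(reference, reconstructed, data_range=255)
    ssim = structural_similarity(reference, reconstructed, data_range=255, channel_axis=-1)
    return psnr, ssim
```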
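The sliding-window stitching of (iii) can be sketched as follows. Beyond the stated 23-pixel overlap, the 224-pixel window, the simple averaging in overlapping regions, and the `model` callable are assumptions:

```python
import numpy as np

def _starts(total: int, win: int, step: int) -> list:
    """Window start positions; the last window sits flush with the edge."""
    s = list(range(0, total - win, step))
    s.append(total - win)
    return s

def reconstruct(frame: np.ndarray, model, win: int = 224, overlap: int = 23) -> np.ndarray:
    """Run the model on overlapping slices and average the overlaps."""
    h, w = frame.shape[:2]
    out = np.zeros(frame.shape, dtype=np.float32)
    hits = np.zeros((h, w), dtype=np.float32)
    for y in _starts(h, win, win - overlap):
        for x in _starts(w, win, win - overlap):
            out[y:y + win, x:x + win] += model(frame[y:y + win, x:x + win])
            hits[y:y + win, x:x + win] += 1.0
    if frame.ndim == 3:
        hits = hits[..., None]
    return out / hits
```

Averaging the shared 23 pixels is one simple way to realize the boundary smoothing described above; a feathered (weighted) blend would serve equally well.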
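For the calibration described in (iv), the sketch below localizes the masked patches on a pre-captured blank plate with an adaptive local filter and thresholding, as the text describes. The OpenCV routines and all parameter values are illustrative assumptions, not the paper's settings:

```python
import cv2
import numpy as np

def locate_mask_patches(blank_plate: np.ndarray, min_area: int = 50):
    """Return bounding boxes (x, y, w, h) of dark masked patches on an 8-bit blank plate."""
    gray = cv2.cvtColor(blank_plate, cv2.COLOR_BGR2GRAY) if blank_plate.ndim == 3 else blank_plate
    gray = cv2.medianBlur(gray, 5)                     # local filter against fringe noise
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                   cv2.THRESH_BINARY_INV, 31, 10)
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary)
    # Each surviving component maps one DMD mask patch to its camera pixels.
    return [tuple(stats[i, :4]) for i in range(1, n) if stats[i, cv2.CC_STAT_AREA] >= min_area]
```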

References

5. M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (IEEE, 2021), pp. 9650–9660.
23. R. Strudel, R. Garcia, I. Laptev, and C. Schmid, “Segmenter: Transformer for semantic segmentation,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (IEEE, Montreal, QC, Canada, 2021), pp. 7242–7252.
24. Y. Cai, J. Lin, X. Hu, H. Wang, X. Yuan, Y. Zhang, R. Timofte, and L. Van Gool, “Coarse-to-fine sparse transformer for hyperspectral image reconstruction,” in Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022, Proceedings, Part XVII (Springer, 2022), pp. 686–704.
25. Y. Cai, J. Lin, X. Hu, H. Wang, X. Yuan, Y. Zhang, R. Timofte, and L. Van Gool, “Mask-guided spectral-wise transformer for efficient hyperspectral image reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (IEEE, 2022), pp. 17502–17511.
26. Z. Meng, M. Qiao, J. Ma, Z. Yu, K. Xu, and X. Yuan, “Snapshot multispectral endomicroscopy,” Opt. Lett. 45, 3897–3900 (2020). https://doi.org/10.1364/ol.393213
27. Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Trans. Image Process. 13, 600–612 (2004). https://doi.org/10.1109/tip.2003.819861
28. A. Horé and D. Ziou, “Image quality metrics: PSNR vs. SSIM,” in 20th International Conference on Pattern Recognition (ICPR 2010), Istanbul, Turkey (IEEE, 2010).
29. J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool, “The 2017 DAVIS challenge on video object segmentation,” arXiv:1704.00675 (2017).