HyFormer: a hybrid transformer-CNN architecture for retinal OCT image segmentation

Qingxin Jiang; Ying Fan; Menghan Li; Sheng Fang; Weifang Zhu; Dehui Xiang; Tao Peng; Xinjian Chen; Xun Xu; Fei Shi

doi:10.1364/BOE.538959

Biomed Opt Express. 2024 Oct 2;15(11):6156-6170. doi: 10.1364/BOE.538959. eCollection 2024 Nov 1.

Authors

Qingxin Jiang¹, Ying Fan², Menghan Li², Sheng Fang¹, Weifang Zhu¹, Dehui Xiang¹, Tao Peng³, Xinjian Chen¹,⁴, Xun Xu², Fei Shi¹

Affiliations

¹ MIPAV Lab, School of Electronic and Information Engineering, Soochow University, Suzhou 215006, China.
² Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200080, China.
³ School of Future Science and Engineering, Soochow University, Suzhou 215222, China.
⁴ State Key Laboratory of Radiation Medicine and Protection, Soochow University, Suzhou 215123, China.

Abstract

Optical coherence tomography (OCT) has become the leading imaging technique for the diagnosis and treatment planning of retinal diseases. Retinal OCT image segmentation extracts lesions and/or tissue structures to support ophthalmologists' decisions, and multi-class segmentation is commonly needed. Because the target regions often spread widely inside the retina, and the intensities and locations of different categories can be similar, a good segmentation network must combine global modeling capability with the ability to capture fine details. To address the challenge of capturing global and local features simultaneously, we propose HyFormer, an efficient, lightweight, and robust hybrid network architecture. The architecture uses parallel Transformer and convolutional encoders that capture features independently. A multi-scale gated attention block and a group positional embedding block are introduced within the Transformer encoder to enhance feature extraction. Feature integration is performed in the decoder, which is composed of the proposed three-path fusion modules. A class activation map-based cross-entropy loss function is also proposed to improve segmentation results. Evaluations are performed on a private dataset with myopic traction maculopathy lesions and on the public AROI dataset for retinal layer and lesion segmentation in age-related macular degeneration. The results demonstrate HyFormer's superior segmentation performance and robustness compared to existing methods, showing promise for accurate and efficient OCT image segmentation.
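The abstract does not say how the class activation map enters the cross-entropy loss. As a rough illustration only, the sketch below shows a generic pixel-weighted cross-entropy in which each pixel's loss is scaled by a weight map. The weight map is a hypothetical stand-in for a map derived from class activations, and the names `softmax` and `weighted_cross_entropy` are assumptions made for this example, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(logits, labels, weights):
    """Pixel-weighted cross-entropy for a multi-class segmentation map.

    logits  : H x W x C nested lists of raw class scores
    labels  : H x W nested lists of integer class indices
    weights : H x W nested lists of non-negative per-pixel weights
              (hypothetical stand-in for a CAM-derived map)
    Returns the weight-normalized mean loss.
    """
    total_loss = 0.0
    total_weight = 0.0
    for row_logits, row_labels, row_weights in zip(logits, labels, weights):
        for pixel_logits, label, w in zip(row_logits, row_labels, row_weights):
            prob = softmax(pixel_logits)[label]
            total_loss += -w * math.log(prob)
            total_weight += w
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return total_loss / total_weight


# Toy 1 x 2 image with 2 classes: the first pixel is predicted correctly,
# the second is predicted incorrectly.
logits = [[[2.0, 0.0], [2.0, 0.0]]]
labels = [[0, 1]]

uniform = weighted_cross_entropy(logits, labels, [[1.0, 1.0]])
emphasized = weighted_cross_entropy(logits, labels, [[1.0, 3.0]])
print(uniform, emphasized)
```

With uniform weights this reduces to the ordinary mean cross-entropy. Raising the weight on the misclassified pixel increases the loss, which is the general effect a map-guided weighting would have on the regions it highlights.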