ECVA | European Computer Vision Association

A Transformer-Based Decoder for Semantic Segmentation with Multi-level Context Mining

Bowen Shi, Dongsheng Jiang, Xiaopeng Zhang, Han Li, Wenrui Dai, Junni Zou, Hongkai Xiong, Qi Tian

Abstract

"Transformers have recently shown superior performance to CNNs on semantic segmentation. However, previous works mostly focus on the deliberate design of the encoder and seldom consider the decoder. In this paper, we find that a lightweight decoder matters for segmentation, and propose a pure transformer-based segmentation decoder, named SegDeformer, that can be seamlessly incorporated into a variety of current transformer-based encoders. The highlight is that SegDeformer can conveniently use the tokenized input and the attention mechanism of the transformer for effective context mining. This is achieved by two key component designs, i.e., the internal and external context mining modules. The former is equipped with internal attention within an image to better capture global-local context, while the latter introduces external tokens from other images to enhance the current representation. To make SegDeformer scalable, we further provide performance/efficiency optimization modules for flexible deployment. Experiments on the widely used benchmarks ADE20K, COCO-Stuff and Cityscapes with different transformer encoders (e.g., ViT, MiT and Swin) demonstrate that SegDeformer brings consistent performance gains."
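
The distinction between internal and external context mining can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the abstract does not say how the two modules are built or combined, so the function names (`attend`, `mine_context`), the single-head attention without learned projections, and the residual sum are all assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return out


def mine_context(image_tokens, external_tokens):
    """Hypothetical combination of the two kinds of context.

    Internal: each token attends to the tokens of its own image.
    External: each token attends to tokens taken from other images.
    The residual sum below is an assumption, not the paper's design.
    """
    internal = attend(image_tokens, image_tokens, image_tokens)
    external = attend(image_tokens, external_tokens, external_tokens)
    return [
        [t + i + e for t, i, e in zip(tok, intern, extern)]
        for tok, intern, extern in zip(image_tokens, internal, external)
    ]
```

In this sketch the only difference between the two branches is where the keys and values come from: the same image for internal context, and a separate set of tokens for external context. The enhanced tokens keep the shape of the input tokens, which is what would let such a decoder sit on top of any encoder that emits a token sequence.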
