Efficient Decoder-Free Object Detection with Transformers

Peixian Chen, Mengdan Zhang, Yunhang Shen, Kekai Sheng, Yuting Gao, Xing Sun, Ke Li, Chunhua Shen


"Vision transformers (ViTs) are changing the landscape of object detection tasks. A natural usage of ViTs in detection is to replace the CNN-based backbone with a transformer-based backbone, which is simple yet brings an enormous computation burden during inference. A more subtle usage is the DETR family, which eliminates the need for many hand-designed components in object detection but introduces a decoder that takes an extra-long time to converge. As a result, transformer-based object detection has not prevailed in large-scale applications. To overcome these issues, we propose a novel decoder-free fully transformer-based (DFFT) object detector, achieving high efficiency in both training and inference stages for the first time. We simplify object detection to an encoder-only single-level anchor-based dense prediction problem by centering around two entry points: 1) Eliminate the training-inefficient decoder and leverage two strong encoders to preserve the accuracy of single-level feature map prediction; 2) Explore low-level semantic features for the detection task with limited computational resources. In particular, we design a novel lightweight detection-oriented transformer backbone that efficiently captures low-level features with rich semantics based on a well-conceived ablation study. Extensive experiments on the MS COCO benchmark demonstrate that DFFT_SMALL outperforms DETR by 2.5% AP with 28% computation cost reduction and more than 10× fewer training epochs. Compared with the cutting-edge anchor-based detector RetinaNet, DFFT_SMALL obtains over 5.5% AP gain while cutting down 70% computation cost."
